JP3707121B2 - Pitch detection device

Info

Publication number: JP3707121B2
Application number: JP00525296A
Authority: JP
Inventors: 健大聖寺; 康男若森; 俊彦鈴木; 裕介山本
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1996-01-16
Filing date: 1996-01-16
Publication date: 2005-10-19
Anticipated expiration: 2016-01-16
Also published as: JPH09198097A

Abstract

PROBLEM TO BE SOLVED: To find the pitch period accurately and quickly with an inexpensive configuration, even when the speech waveform is a complex waveform containing overtone components. SOLUTION: A digital speech signal is oversampled by a four-times oversampling part 7, and the result is binarized by a binarization part 8. A timer 9 measures the inversion intervals of the binarized signal to obtain the successive zero-crossing intervals of the digital speech signal, and these intervals are stored in a RAM 10. For each n = 1 to 4, a pitch arithmetic part 11 assumes that the pitch period is the sum of 2n zero-crossing intervals, calculates a reproduction rate expressing how well the zero-crossing intervals that make up one pitch period match from one pitch period to the next, and adopts the assumption that yields the highest reproduction rate to find the pitch period.

Description

【０００１】
【発明の属する技術分野】
この発明は、音声波形のピッチ周期またはピッチ周波数を検出するピッチ検出装置に関する。
【０００２】
【従来の技術】
音声波形を特徴付けるパラメータの１つとしてピッチ周期（あるいはピッチ周波数）があり、この音声波形のピッチ周期を検出する技術が音声分析・合成システム、音声符号化システム等において一般的に使用されている。また、最近では、カラオケシステムにも、歌唱者の音声のピッチ周期の検出を行うものがあり、歌唱の採点等に利用されている。
【０００３】
従来、音声のピッチ周期を検出する方法として以下のものがあった。
（１）零クロス法
音声波形が正弦波に非常に近いものと仮定すると、音声波形は零レベル線を負方向から正方向に横切り、次いで正方向から負方向に横切り、再び負方向から正方向に横切るという単調な変化を繰り返すため、零レベル線を同一方向に横切る時間間隔によってピッチ周期が与えられる。零クロス法は、この考えに従い、単純に２つの零クロス間隔を計測してピッチ周期とする方法である。また、これと同様な発想として、音声波形の瞬時値が極大値または極小値となるタイミングの間隔を計測してピッチ周期とする方法もある。
【０００４】
（２）自己相関法
この自己相関法においては、音声波形を一定のサンプリング周期毎にサンプリングすることによって得られる時系列サンプルｘ（１），ｘ（２），…を用い、以下の自己相関関数Ｒ（ｒ）の演算を行うことにより、ピッチ周期を求める。
Ｒ（ｒ）＝１／Ｎ・Σ ｛ｘ（ｎ）・ｘ（ｎ＋ｒ）｝
（ただし、上記式において、Σはｎ＝１〜Ｎ・ｒの範囲で｛｝内の総和を求める演算子である。）
すなわち、ｒを各種変化させ、各ｒについて自己相関関数Ｒ（ｒ）を求め、Ｒ（ｒ）が最大（すなわち、自己相関が最大）になるときのｒから音声波形のピッチ周期を算出する。
【０００５】
【発明が解決しようとする課題】
ところで、上述した零クロス法は、比較的安価にしかも高速にピッチ周期を検出することができる反面、人間の音声は多くの倍音成分を多く含んでいるため正確なピッチ周期を検出することができないという問題があった。また、上述した自己相関法は、ある程度正確にピッチ周期を検出することが可能であるが、計算量が膨大であるとともに、検出時間が多くかかる。また、コスト的にも高くなる。
【０００６】
この発明は、上記問題点を克服し、基本的には零クロス間隔を測定することによってピッチ周期を求める方法を採用し、かつ、かかる方法を採用したことによって生じる弊害を防止する手段を講じ、安価な構成で、正確かつ高速にピッチ周期を検出することが可能なピッチ検出装置を提供することを目的とする。
【０００７】
【課題を解決するための手段】
この発明は、デジタル音声信号をサンプリング周波数を所定数倍にして出力するオーバーサンプリング手段と、前記オーバーサンプリング手段によって出力されるデジタル音声信号を所定のレベルと比較し、２値信号に変換する２値化手段と、前記２値信号に基づき、前記デジタル音声信号の連続する零クロス間隔ｔ_１，ｔ_２，…を計測する零クロス間隔計測手段と、ｎ（ｎは１以上の整数）を各種変化させ、各ｎについて、２ｎ個の零クロス間隔の総和Ｔ＝（ｔ_１＋ｔ_２＋・・ｔ_２ｎ）をピッチ周期と仮定し、隣接するｍ周期（ｍは２以上の整数）分の前記仮定したピッチ周期の各々に含まれる互いに対応する前記零クロス間隔の間の一致の程度を前記デジタル音声信号の波形の一致の程度として算出し、波形の一致の程度が最も高いｎを選択することによりピッチ周期を求めるピッチ演算手段とを具備することを特徴とするピッチ検出装置を要旨とする。
【０００８】
【発明の実施の形態】
以下、本発明を更に理解しやすくするため、実施の形態について説明する。
かかる実施の形態は、本発明の一態様を示すものであり、この発明を限定するものではなく、本発明の範囲で任意に変更可能である。
【０００９】
Ａ．実施形態の構成
図１はこの発明をカラオケシステムに適用した実施形態の構成を示すブロック図である。本実施形態は、カラオケシステムの構成部分のうち歌唱者の歌の採点をする部分に関するものである。図１において、１はデジタル音楽信号が記録されたＣＤ（コンパクトディスク）である。このＣＤ１に記録されたデジタル音楽信号はサンプリング周波数ｆｓ＝４４．１ｋＨｚのクロックに同期して順次再生される。２はボーカル抽出部であり、ＣＤ１から再生されたデジタル音楽信号からボーカル音に相当する信号（以下、デジタルお手本信号という。）を抽出する。一例としてＣＤ１から再生されたデジタル音楽信号の音声帯域を含む周波数帯域の信号をバンドパスフィルタにより抽出するという処理によりデジタルお手本信号を得ることができる。また、ボーカル音のみを記録したメディアを利用可能な場合は、そのメディアから再生されたデジタル音楽信号をそのままデジタルお手本信号として使用すればよい。３はマイクロホンであり、ＣＤ１の再生に合わせて歌う歌唱者の歌声を採取し、アナログ音声信号として出力する。４はＡ／Ｄ変換器であり、マイクロホン１からアナログ音声信号を、ＣＤ１の再生の場合と同様なサンプリング周波数ｆｓ＝４４．１ｋＨｚのクロックに同期してサンプリングし、デジタル音声信号に変換する。
【００１０】
５はＤＣ除去部であり、順次供給されるデジタル音声信号およびデジタルお手本信号に対してＤＣ除去処理を施し、ＤＣとみなせる低い周波数帯域、例えば０Ｈｚ〜５０Ｈｚの帯域の成分の除去されたデジタル音声信号およびデジタルお手本信号を各々出力する。６はＬＰＦ（ローパスフィルタ）であり、ＤＣ除去部５によって出力されたデジタル音声信号およびデジタルお手本信号の各々から例えば５００Ｈｚ以上の周波数の成分を除去して出力する。これらのＤＣ除去部５およびＬＰＦ６により、デジタル音声信号およびデジタルお手本信号の各々について、５０〜５００Ｈｚの帯域内の成分のみが選択され、出力される。
【００１１】
７は４倍オーバーサンプリング部であり、ＬＰＦ６を通過したデジタル音声信号およびデジタルお手本信号（いずれもサンプリング周波数ｆｓ＝４４．１ｋＨｚ）に対して補間演算を施し、４倍のサンプリング周波数の信号に変換して出力する。
【００１２】
図２はこの４倍オーバーサンプリング部７のうちデジタル音声信号またはデジタルお手本信号の一方（以下、入力デジタル信号という。）の処理を行うのに必要な回路構成を例示したものである。この図において、ラッチ７１は、サンプリング周波数ｆｓに対応したクロックが与えられることにより、入力デジタル信号を取り込んで保持する。遅延器７２，７２，…は図示の通りラッチ７１の後段にカスケード接続されている。これらの遅延器７２は、各々サンプリング周波数ｆｓの４倍の周波数のクロックが与えられることにより、ラッチ７１に保持された入力信号を順次シフトし、該入力信号を１クロック周期ずつ順次遅延させた遅延信号を各々出力する。７３，７３，…は乗算器、７４，７４，…は加算器であり、これらによりラッチ７１および遅延器７２，７２，…の各出力信号に所定の補間係数列を畳み込む補間演算が実行される。以上の構成により、サンプリング周波数ｆｓの４倍の周波数のクロックに同期して補間演算が実行され、補間のなされたデジタル信号が最終段の加算器７４から順次出力される。
【００１３】
この４倍オーバーサンプリング部７は、ピッチ周期を求める際の精度を高めるために設けられた手段である。すなわち、本実施形態においては、デジタル音声信号およびデジタルお手本信号の各々の零クロス点の時間間隔を測定することにより各デジタル信号のピッチ周期を求める。このため、ピッチ周期の測定精度を高めるためには、時間軸上における零クロス点の位置の検出精度を高める必要がある。そこで、この４倍オーバーサンプリング部７を介挿することにより、デジタル音声信号およびデジタルお手本信号の各々のサンプルの時間密度を４倍にし、各々の零クロス点の位置の検出精度を高めている。この例では曲線補間によりオーバーサンプリングを行っているが、コストの問題に鑑みて、ある程度の精度が得られる直線補間を用いることもできる。
【００１４】
８は２値化部であり、４倍オーバーサンプリング部７から出力されるデジタル音声信号およびデジタルお手本信号のレベルの２値化を行う。この２値化は、基本的には、零レベルを基準として入力デジタル信号の正負判定を行い、入力デジタル信号が正の場合は“１”を、負の場合は“０”を出力するものである。すなわち、この２値化部８は入力デジタル信号が零レベルを横切る毎に“０”／“１”が反転する２値信号を出力する手段である。ただし、本実施形態においては２値化を行う際に零レベルを中心に±Δの範囲をマスキング帯とし、入力デジタル信号にこの±Δのマスキング帯内の微小な振動があったとしても、かかる微小な振動によっては２値信号を反転させないようにしている。
【００１５】
図３はこの２値化部８のうちデジタル音声信号またはデジタルお手本信号の一方（以下、入力デジタル信号という。）の処理を行うのに必要な回路構成を例示したものである。この図において、８１は入力デジタル信号の絶対値を検出する絶対値検出部である。８２は比較部であり、絶対値検出部８１によって検出された入力デジタル信号の絶対値を所定値Δと比較し、絶対値がΔを越えている場合には“１”を、越えていない場合には“０”を出力する。８３はサンプルホールド部であり、比較部８２から“１”が出力されている期間は入力デジタル信号をそのまま出力し（サンプル状態）、比較部８２から“０”が出力されている期間は比較部８２の出力信号が“１”から“０”に変化する直前の入力デジタル信号を保持し出力する（ホールド状態）。８４は比較部であり、零レベルを基準としてサンプルホールド部８３の出力信号の正負判定を行い、正の場合は“１”を、負の場合は“０”の２値信号を出力する。
【００１６】
以上の構成によれば、入力デジタル信号が±Δの範囲外にある場合にはサンプルホールド部８３を介してそのまま出力される。また、入力デジタル信号が零レベル±Δのマスキング帯内に入った場合には、その直前の入力デジタル信号の値がサンプルホールド部８３によって保持され、この保持動作が行われている期間中は比較部８４が出力する２値信号が反転することはない。従って、入力デジタル信号が零レベル±Δのマスキング帯を横切って変化する場合はマスキング帯を横切り終えた時点で２値信号が反転することとなる。一方、入力デジタル信号が零レベル±Δのマスキング帯に入ったがこれを横切ることなくマスキング帯内を上下動するような場合には、たとえ入力デジタル信号が零レベルを横切ったとしてもサンプルホールド部８３の出力信号値が零レベルを横切ることはないため、２値信号の反転は起こらない。
【００１７】
図３において比較部８４よりも前段にある回路は、図４に示すものに置き換えてもよい。この図４において、８５および８６は比較部であり、各々、入力デジタル信号を基準レベルと比較し、入力デジタル信号が基準レベルより高いときには“１”を、基準レベルより低いときには“０”を出力する。比較部８５に対しては基準レベルとして＋Δが与えられ、比較部８６に対しては基準レベルとして−Δが与えられる。８７は入力デジタル信号を保持するラッチ、８８は入力デジタル信号またはラッチ８７の出力信号を選択して出力するセレクタである。８９は制御部であり、比較部８５および８６の各出力信号に基づいてラッチ８７およびセレクタ８８の制御を行う。すなわち、次の通りである。
【００１８】
ａ．比較部８５および８６の出力信号がいずれも“１”、あるいはいずれも“０”である場合
入力デジタル信号が零レベル±Δのマスキング帯の外側にある場合である。この場合、制御部８９は、ラッチ８７をサンプル状態とし、セレクタ８８には入力デジタル信号を出力させる。
ｂ．比較部８５の出力信号が“０”であり、かつ、比較部８６の出力信号が“１”である場合
入力デジタル信号が零レベル±Δのマスキング帯の内側にある場合である。この場合、制御部８９は、入力デジタル信号がマスキング帯内に入った時点でラッチ８７をホールド状態とし、セレクタ８８にはラッチ８７の出力信号を出力させる。
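[0019]
In FIG. 1, reference numeral 9 denotes a timer that measures the time interval at which the output signals of the binarization unit 8 corresponding to the digital audio signal and the digital model signal invert, that is, the time interval between zero-cross points of each of these digital signals. Reference numeral 10 denotes a RAM that stores the timing results of the timer 9.
[0020]
FIG. 5 is a block diagram showing the timer 9 and the RAM 10 together with their control system. The figure shows only the portion needed to process one of the digital audio signal and the digital model signal. In FIG. 5, reference numeral 91 denotes a delay element and 92 an exclusive-OR circuit. Together they form a differentiating circuit 90 that differentiates the binary signal output by the binarization unit 8 and outputs a pulse each time the binary signal inverts. The timer 9 is reset each time it receives an output pulse from the differentiating circuit 90, and from that reset until the next reset it counts a clock of constant frequency 4fs.
[0021]
The count value of the timer 9 is supplied to a latch 93 as input data. On receiving an output pulse from the differentiating circuit 90, the latch 93 captures and holds the count value of the timer 9 immediately before the reset. The count value held in the latch 93 is the number of 4fs clock pulses output between the previously detected inversion of the binary signal and the current one, so it represents the time interval between zero-cross points. The data held in the latch 93 is therefore referred to below as zero-cross interval data.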
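The differentiating circuit 90, timer 9, and latch 93 of paragraphs [0020] and [0021] can be mimicked in software. The following is a minimal Python sketch under stated assumptions, not the circuit of FIG. 5: the function names `zero_cross_intervals` and `counts_to_seconds` are hypothetical, the binary signal is taken as a list of 0/1 values with one entry per 4fs clock tick, and the count accumulated before the first inversion is dropped because it does not span a complete interval.

```python
def zero_cross_intervals(bits):
    """Return the clock counts between successive inversions of a 0/1 sequence."""
    intervals = []
    count = 0           # plays the role of timer 9: ticks since the last reset
    seen_edge = False   # assumption: ignore the partial count before the first edge
    prev = bits[0]      # plays the role of delay element 91
    for cur in bits[1:]:
        count += 1
        if cur ^ prev:                   # exclusive-OR 92: true when the signal inverts
            if seen_edge:
                intervals.append(count)  # latch 93 captures the count before the reset
            seen_edge = True
            count = 0                    # timer reset
        prev = cur
    return intervals


def counts_to_seconds(count, fs=44100, oversampling=4):
    """Convert a count of 4fs clock ticks to seconds (fs = 44.1 kHz in the text)."""
    return count / (fs * oversampling)


print(zero_cross_intervals([0, 0, 1, 1, 1, 0, 0, 1]))
```

In this example the signal stays at 1 for three ticks and then at 0 for two ticks, so the two complete intervals are 3 and 2 counts.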
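[0022]
Each time it receives an output pulse from the differentiating circuit 90, the write control unit 94 reads the zero-cross interval data from the latch 93. When the data is at or above a predetermined value (the count of the timer 9 is large), the unit limits the data and writes it to the RAM 10. When the data is below a predetermined value (the count of the timer 9 is small), the unit discards the data without writing it to the RAM 10. Only zero-cross interval data within a fixed range is written to the RAM 10 so that interval data that is implausible as a zero-cross interval of a speech signal is not used in the calculation, which would otherwise produce an incorrect pitch period.
[0023]
The pitch calculation unit 11 in FIG. 1 calculates the pitch period of each of the digital audio signal and the digital model signal by referring to the zero-cross interval data accumulated in the RAM 10.
[0024]
If the digital audio signal or the like is a sine wave, one period of the sine wave crosses the zero level line at its start and end points and crosses it exactly once more midway between them. The pitch period can therefore be obtained by adding two consecutive items of zero-cross interval data.
[0025]
However, a digital audio signal representing a human speech waveform contains many overtone components, so the waveform of one pitch period may contain three or more zero-cross points between the start and end of that period. In that case, adding two consecutive items of zero-cross interval data does not give the correct pitch period.
[0026]
In the present embodiment, therefore, for each of several integers n, one pitch period is assumed to have a length equal to the sum of 2n items of zero-cross interval data. The pitch period is obtained under each assumption, and the degree to which the timing of each zero-cross point within one pitch period agrees from one pitch period to the next is determined. The details of how this agreement is detected are described later. The pitch period with the highest degree of agreement is then selected as the true pitch period. This relies on the property of speech signals that the waveform does not change greatly within a short time.
[0027]
Next, in FIG. 1, reference numeral 12 denotes a level detection unit, which detects the level of each of the digital audio signal output by the A/D converter 4 and the digital model signal output by the vocal extraction unit 2, and outputs a signal representing each level.
[0028]
Reference numeral 13 denotes a scoring unit, which scores the singer's performance by comprehensively evaluating the deviation between the pitch periods of the digital audio signal and the digital model signal obtained by the pitch calculation unit 11 and the deviation between the two signal levels obtained by the level detection unit 12. The score is displayed on a display unit 14.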
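[0029]
B. Operation of the embodiment
The operation of the present embodiment is described below. When the singer selects a song, the digital music signal is sequentially reproduced from the CD 1 corresponding to that song. The vocal extraction unit 2 extracts the digital model signal from the digital music signal and outputs it to the DC removal unit 5 and the level detection unit 12. Meanwhile, the singer starts singing along with the reproduction of the CD 1, and the microphone 3 picks up the singing voice and outputs it as an analog audio signal. The A/D converter 4 converts this analog audio signal into a digital audio signal, which is output to the DC removal unit 5 and the level detection unit 12.
[0030]
Passing through the DC removal unit 5 and the LPF 6 in turn removes the unwanted frequency bands from the digital audio signal and the digital model signal, leaving digital signals that represent waveforms consisting only of components within the frequency band of the human voice. These signals are output to the 4× oversampling unit 7.
[0031]
The 4× oversampling unit 7 interpolates the digital audio signal and the digital model signal on the time axis, converts each into a signal with four times the sampling frequency, and outputs it. The binarization unit 8 then converts each into a binary signal.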
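[0032]
FIG. 6 illustrates the operation of the 4× oversampling unit 7. In FIG. 6(a), the horizontal straight line is the zero level line. Circles (○) are plotted along a sinusoidal signal waveform. The circles represent the individual original samples that make up the digital audio signal (digital model signal), and the waveform represents the underlying signal from which those samples were taken. Between adjacent circles, three crosses (×) are inserted; these represent the interpolated samples obtained by the 4× oversampling unit 7.
[0033]
FIG. 6(b) shows the binary signal obtained when 4× oversampling is not performed and only the original samples (○) are supplied to the binarization unit 8. FIG. 6(c) shows the binary signal obtained when 4× oversampling is performed and both the original samples (○) and the interpolated samples (×) are supplied to the binarization unit 8. For convenience of explanation, these figures show an example in which the digital audio signal (digital model signal) contains no vibration smaller than the masking band of the binarization unit 8.
[0034]
The digital audio signal and the like are sampled at a fixed sampling period, regardless of the signal waveform. Consequently, when the signal repeats the same waveform, the instants at which it is sampled differ from one repetition to the next, as shown in FIG. 6(a). If the sampling period is coarse, a different binary waveform may be obtained from one pitch period to the next even though the underlying waveform is the same, as shown in FIG. 6(b). When binarization is performed after 4× oversampling, as in the present embodiment, a binary signal that inverts close to the true zero-cross points is obtained, as shown in FIG. 6(c), and the problem shown in FIG. 6(b) is avoided.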
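[0035]
FIGS. 7(a) to 7(d) illustrate the operation of the binarization unit 8. In FIG. 7(a), the sinusoidal signal waveform represents the digital audio signal (digital model signal) output from the 4× oversampling unit 7, and the horizontal line represents the zero level line. FIG. 7(b) shows the operation of the sample hold unit 83 in FIG. 3. As shown, the sample hold unit 83 is placed in the sample state (marked “S” in the figure) when the input digital audio signal (digital model signal) is outside the masking band of zero level ±Δ, and in the hold state (marked “H”) when it is inside the masking band. As a result of this control, the signal waveform input to the comparison unit 84 is as illustrated in FIG. 7(c), and the binary signal obtained from the comparison unit 84 is as illustrated in FIG. 7(d). Thus, when the signal changes so as to cross the masking band, the binary signal inverts at the moment the signal finishes crossing the band. Even if the signal contains small vibrations of amplitude ±Δ or less, the sample hold unit 83 holds the previous value while the signal is inside the masking band, so the binary signal does not invert during those vibrations.
[0036]
In the present embodiment the pitch period is calculated from the zero-cross intervals, so if too many zero-cross intervals are detected in the input digital signal waveform corresponding to one pitch period, the burden of calculating the pitch period becomes heavy. However, because the binary signal is generated by the binarization unit 8 with its masking band, small movements near the zero level that are unimportant for calculating the pitch period are ignored. The resulting binary signal does not contain more “0”/“1” inversions than necessary, and a suitable number of zero-cross intervals for calculating the pitch period can be detected.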
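[0037]
Binary signals are thus generated from the digital audio signal and the digital model signal. For each binary signal, the timer 9 successively measures the time interval between “1”/“0” inversions, and the resulting zero-cross interval data is successively held in the latch 93 shown in FIG. 5. The zero-cross interval data held in the latch 93 is successively written to the RAM 10 under the control of the write control unit 94. Specifically, in response to a pulse output from the differentiating circuit 90 on inversion of the binary signal, the write control unit 94 executes the write control routine whose flow is shown in FIG. 8. First, the write control unit 94 fetches the zero-cross interval data t from the latch 93 (step S1) and determines whether t is greater than or equal to the lower limit “8” (step S2). If the result is “NO”, the routine ends without writing t. If the result of step S2 is “YES”, the routine proceeds to step S3 and determines whether t is greater than the upper limit “8192”. If the result is “NO”, t is written to the RAM 10 (step S4) and the routine ends. If the result of step S3 is “YES”, “8192” is written to the RAM 10 in place of the fetched t (step S5) and the routine ends. With this control, only zero-cross interval data in the range “8” to “8192” is written to the RAM 10, which prevents interval data that is implausible as a zero-cross interval of a speech signal from being used in the calculation and producing an incorrect pitch period.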
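The write control routine of FIG. 8 (steps S1 to S5 in paragraph [0037]) is small enough to express directly. The Python sketch below is illustrative only: the function name `write_interval` is hypothetical and a Python list stands in for the RAM 10. The limits 8 and 8192 are the values given in the text.

```python
LOWER_LIMIT = 8      # lower limit stated in the text (counts of the 4fs clock)
UPPER_LIMIT = 8192   # upper limit stated in the text


def write_interval(ram, t):
    """Apply the S2/S3 checks to one zero-cross interval t and store it in ram.

    Returns True if something was written, False if t was discarded.
    """
    if t < LOWER_LIMIT:          # step S2 is NO: discard without writing
        return False
    if t > UPPER_LIMIT:          # step S3 is YES: write the limit instead (step S5)
        ram.append(UPPER_LIMIT)
    else:                        # step S3 is NO: write t as is (step S4)
        ram.append(t)
    return True


ram = []
for t in [3, 8, 100, 8192, 9000]:
    write_interval(ram, t)
print(ram)
```

With the 4fs clock at 4 × 44.1 kHz = 176.4 kHz, a count of 8 corresponds to about 45 µs and a count of 8192 to about 46 ms.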
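[0038]
The pitch calculation unit 11 refers to the zero-cross interval data accumulated in the RAM 10 in this way and obtains the pitch period of each of the digital audio signal and the digital model signal. An outline is given here with reference to FIG. 9, taking the calculation of the pitch period of the digital audio signal as an example. If a digital audio signal such as that illustrated in FIG. 9(a) is supplied to the binarization unit 8, the zero-cross interval data t₁, t₂, … generated up to the present is accumulated in the RAM 10. The pitch calculation unit 11 makes the following four assumptions about the relationship between the zero-cross interval data t₁, t₂, … and the pitch period of the digital audio signal, and obtains the pitch period by examining the validity of each.
[0039]
(1) Assumption 1
The pitch period of the digital audio signal has a length T₁ equal to the sum of two items of zero-cross interval data, t₁ and t₂. That is, the times T₁₁, T₁₂, … shown in FIG. 9(b1) are pitch periods of the digital audio signal.
(2) Assumption 2
The pitch period of the digital audio signal has a length T₂ equal to the sum of four items of zero-cross interval data, t₁ to t₄. That is, the times T₂₁, T₂₂, … shown in FIG. 9(b2) are pitch periods of the digital audio signal.
(3) Assumption 3
The pitch period of the digital audio signal has a length T₃ equal to the sum of six items of zero-cross interval data, t₁ to t₆. That is, the times T₃₁, T₃₂, … shown in FIG. 9(b3) are pitch periods of the digital audio signal.
(4) Assumption 4
The pitch period of the digital audio signal has a length T₄ equal to the sum of eight items of zero-cross interval data, t₁ to t₈. That is, the times T₄₁, T₄₂, … shown in FIG. 9(b4) are pitch periods of the digital audio signal.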
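[0040]
The validity of each assumption is examined, and the pitch period is calculated from the result, according to the flow shown in FIG. 10. First, the pitch calculation unit 11 calculates the reproduction rate CR1 of the waveform of the digital audio signal under Assumption 1 (step S101). The reproduction rate is a numerical value expressing how closely the waveforms of the digital audio signal in successive pitch periods agree under a given assumption. In the present embodiment it is calculated from the zero-cross interval data t₁, t₂, ….
[0041]
The procedure for calculating the reproduction rate CR1 in step S101 is described with reference to the flowchart of FIG. 11. First, in step S201, a counter CNT and a control variable i are initialized to “0” and “1”, respectively.
[0042]
Next, in step S202, the control variable i is increased by “2”, giving i = “3”. In step S203 it is determined whether the condition 0.9t₁ − tᵢ < 0 is satisfied, that is, whether the zero-cross interval data t₃ is greater than 90% of the zero-cross interval data t₁. If the result is “YES”, the counter CNT is incremented by “1” (step S204) before proceeding to step S205; if “NO”, the routine proceeds to step S205 without passing through step S204. In step S205 it is determined whether the condition −1.1t₁ + tᵢ < 0 is satisfied, that is, whether t₃ is less than 110% of t₁. If the result is “YES”, the counter CNT is incremented by “1” (step S206) before proceeding to step S207; if “NO”, the routine proceeds to step S207 without passing through step S206.
[0043]
In step S207 it is determined whether the control variable i has reached “7”; if the result is “NO”, the routine returns to step S202. Steps S202 to S207 are then executed twice more, so that the determinations of steps S203 and S205 are made for the zero-cross interval data t₅ and t₇, and the counter CNT is incremented (steps S204 and S206) whenever the data is greater than 90% of t₁ and whenever it is less than 110% of t₁.
[0044]
When i = “7”, the result of step S207 is “YES” and the routine proceeds to step S208, where the control variable i is set to “2”.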
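[0045]
Next, in step S209, the control variable i is increased by “2”, giving i = “4”. In step S210 it is determined whether the condition 0.9t₂ − tᵢ < 0 is satisfied, that is, whether the zero-cross interval data t₄ is greater than 90% of the zero-cross interval data t₂. If the result is “YES”, the counter CNT is incremented by “1” (step S211) before proceeding to step S212; if “NO”, the routine proceeds to step S212 without passing through step S211. In step S212 it is determined whether the condition −1.1t₂ + tᵢ < 0 is satisfied, that is, whether t₄ is less than 110% of t₂. If the result is “YES”, the counter CNT is incremented by “1” (step S213) before proceeding to step S214; if “NO”, the routine proceeds to step S214 without passing through step S213.
[0046]
In step S214 it is determined whether the control variable i has reached “8”; if the result is “NO”, the routine returns to step S209. Steps S209 to S214 are then executed twice more, so that the determinations of steps S210 and S212 are made for the zero-cross interval data t₆ and t₈, and the counter CNT is incremented (steps S211 and S213) whenever the data is greater than 90% of t₂ and whenever it is less than 110% of t₂.
[0047]
When i = “8”, the result of step S214 is “YES” and the routine proceeds to step S215, where the value of the counter CNT is normalized by the number of determinations made on the zero-cross interval data, and the result is taken as the reproduction rate CR1. In this flow twelve determinations are made, so CR1 = CNT/12.
[0048]
If the assumption that the pitch period is the sum T₁ of two items of zero-cross interval data is correct, then in the ideal state where the waveform of the digital audio signal does not change over four successive pitch periods, t₁ = t₃ = t₅ = t₇ and t₂ = t₄ = t₆ = t₈. In this case the reproduction rate CR1 obtained by the above processing is 100%. Even if the zero-cross interval data contains some error, CR1 is 100% as long as t₃, t₅, and t₇ lie within t₁ ± 10% and t₄, t₆, and t₈ lie within t₂ ± 10%. If the assumption is wrong, on the other hand, large differences arise between corresponding items of zero-cross interval data from one pitch period to the next. Negative results then become more likely in step S203 and the other determination steps, and CR1 falls as the number of negative results increases.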
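The counting procedure of FIG. 11 (paragraphs [0041] to [0048]) generalizes naturally from Assumption 1 to any n. The following Python sketch is one possible reading of the flowchart, not a definitive implementation: the function name `reproduction_rate` and the parameters `n`, `m`, and `tol` are hypothetical, the list `t` holds the intervals t₁, t₂, … starting at index 0, and each of the two threshold tests is counted as a separate determination, which gives the 12 determinations stated in the text for n = 1 and m = 4.

```python
def reproduction_rate(t, n, m, tol=0.10):
    """Reproduction rate for the assumption 'pitch period = 2n intervals'.

    t   : list of zero-cross interval counts (t[0] is t1 in the text)
    n   : assumption index, so one period spans 2*n intervals
    m   : number of adjacent pitch periods compared (4, or 3 for n = 4, in the text)
    tol : relative tolerance; 0.10 reproduces the 90% / 110% tests
    """
    width = 2 * n
    if len(t) < width * m:
        raise ValueError("not enough zero-cross intervals for this assumption")
    count = 0        # counter CNT
    judgments = 0    # total number of determinations, used for normalization
    for k in range(width):              # each interval of the first period is a reference
        ref = t[k]
        for j in range(1, m):           # matching interval in each later period
            ti = t[k + j * width]
            if (1 - tol) * ref - ti < 0:    # test of steps S203 / S210
                count += 1
            if -(1 + tol) * ref + ti < 0:   # test of steps S205 / S212
                count += 1
            judgments += 2
    return count / judgments


print(reproduction_rate([10, 20] * 4, n=1, m=4))          # ideal case for Assumption 1
print(reproduction_rate([10, 20, 10, 40] * 2, n=1, m=4))  # Assumption 1 is wrong here
```

Under this literal reading, an interval that falls outside the ±10% band still passes one of the two tests, so the rate cannot drop below 0.5; the second example above yields 10/12 rather than a value near zero.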
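[0049]
When the calculation of CR1 is complete, the processing returns to the flow of FIG. 10 and proceeds to step S102, where the reproduction rate CR2 of the waveform of the digital audio signal under Assumption 2 is calculated. That is, the pitch period is assumed to have a length T₂ equal to the sum of four items of zero-cross interval data. Taking the zero-cross interval data t₁ to t₄ of the first pitch period as references, it is determined whether each item of the zero-cross interval data t₅ to t₈, t₉ to t₁₂, and t₁₃ to t₁₆ of the second, third, and fourth pitch periods agrees with its reference within a predetermined error range. The number of affirmative results is counted and normalized by the total number of determinations to give CR2.
[0050]
If the assumption that the pitch period is the sum of four items of zero-cross interval data is correct, then in the ideal state where the waveform of the digital audio signal does not change over four successive pitch periods, the conditions
t₁ = t₅ = t₉ = t₁₃
t₂ = t₆ = t₁₀ = t₁₄
t₃ = t₇ = t₁₁ = t₁₅
t₄ = t₈ = t₁₂ = t₁₆
are all satisfied and CR2 is 100%. Even if the zero-cross interval data contains some error, CR2 is 100% as long as the error is within ±10%. When moving from one pitch period to the next produces zero-cross interval data that deviates greatly from the reference (that is, from the zero-cross interval data of the first pitch period), CR2 falls according to the number of such items.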
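[0051]
Next, in step S103, the reproduction rate CR3 of the waveform of the digital audio signal under Assumption 3 is calculated. That is, the pitch period is assumed to have a length T₃ equal to the sum of six items of zero-cross interval data. Taking the zero-cross interval data t₁ to t₆ of the first pitch period as references, it is determined whether each item of the zero-cross interval data t₇ to t₁₂, t₁₃ to t₁₈, and t₁₉ to t₂₄ of the second, third, and fourth pitch periods agrees with its reference within a predetermined error range. The number of affirmative results is counted and normalized by the total number of determinations to give CR3.
[0052]
The reproduction rate CR3 is 100% when the conditions
t₁ = t₇ = t₁₃ = t₁₉
t₂ = t₈ = t₁₄ = t₂₀
t₃ = t₉ = t₁₅ = t₂₁
t₄ = t₁₀ = t₁₆ = t₂₂
t₅ = t₁₁ = t₁₇ = t₂₃
t₆ = t₁₂ = t₁₈ = t₂₄
are all satisfied, or when the zero-cross interval data contains some error but the error is within ±10%. When zero-cross interval data with a large error occurs, CR3 falls according to the number of such items.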
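[0053]
Next, in step S104, the reproduction rate CR4 of the waveform of the digital audio signal under Assumption 4 is calculated. That is, the pitch period is assumed to have a length T₄ equal to the sum of eight items of zero-cross interval data. Taking the zero-cross interval data t₁ to t₈ of the first pitch period as references, it is determined whether each item of the zero-cross interval data t₉ to t₁₆ and t₁₇ to t₂₄ of the second and third pitch periods agrees with its reference within a predetermined error range. The number of affirmative results is counted and normalized by the total number of determinations to give CR4.
[0054]
Steps S101 to S103 each process four pitch periods, but step S104 processes three pitch periods (T₄₁ to T₄₃ in FIG. 9(b4)). The reason is as follows. Step S104 assumes a long pitch period corresponding to eight items of zero-cross interval data. If step S104 processed four pitch periods, then even if Assumption 4 were correct, CR4 would fall unless the waveform of the digital audio signal remained stable over the very long time spanned by four such periods. The waveform of a digital audio signal can keep the same shape for a short time but changes after a while. With four pitch periods, therefore, there is a high likelihood that an unduly low CR4 would be calculated because of the change of the waveform over time, even if Assumption 4 were correct. For this reason step S104 processes three pitch periods, as stated above.
[0055]
In step S104, the reproduction rate CR4 is 100% when the conditions
t₁ = t₉ = t₁₇
t₂ = t₁₀ = t₁₈
t₃ = t₁₁ = t₁₉
t₄ = t₁₂ = t₂₀
t₅ = t₁₃ = t₂₁
t₆ = t₁₄ = t₂₂
t₇ = t₁₅ = t₂₃
t₈ = t₁₆ = t₂₄
are all satisfied, or when the zero-cross interval data contains some error but the error is within ±10%. When zero-cross interval data with a large error occurs, CR4 falls according to the number of such items.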
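[0056]
Next, in step S105, it is determined from the reproduction rates CR1 to CR4 obtained as above which of Assumptions 1 to 4 is valid. The detailed flow of this determination is shown in FIG. 12. First, in step S301, it is determined which of CR1 to CR4 is the largest. If CR1 is the largest, it is determined whether CR1 is greater than a predetermined reference value ref (step S302); if the result is “YES”, Assumption 1 is adopted, that is, the pitch period is obtained from the length T₁ of two items of zero-cross interval data. The same applies when one of CR2 to CR4 is the largest: it is determined whether that reproduction rate is greater than the reference value ref (steps S303 to S305), and if the result is “YES”, the pitch period is obtained, according to the assumption on which that reproduction rate was calculated, from the length T₂ of four items, the length T₃ of six items, or the length T₄ of eight items of zero-cross interval data. Should reproduction rates be equal, the order of priority is CR1 > CR2 > CR3 > CR4 (CR1 has the highest priority).
[0057]
If, on the other hand, the largest of CR1 to CR4 is less than or equal to the reference value ref, the result is “NO” whichever of steps S302 to S305 is reached. In this case no conclusion can be drawn as to which of Assumptions 1 to 4 is valid, and the result of the determination is “not applicable”.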
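The decision of FIG. 12 (paragraphs [0056] and [0057]) combines a maximum search, a tie-break in favor of the smaller n, and a threshold test. A minimal Python sketch follows. The function name `select_hypothesis` is hypothetical, and the numeric value used for `ref` in the example is an assumption, since the text does not state the reference value.

```python
def select_hypothesis(rates, ref):
    """Pick the assumption with the highest reproduction rate, or None.

    rates : [CR1, CR2, CR3, CR4]
    ref   : reference value; the winner must be strictly greater than it
    Returns the assumption number n (1-based), or None for 'not applicable'.
    """
    best_n = None
    best = float("-inf")
    for n, cr in enumerate(rates, start=1):
        if cr > best:            # strict '>' keeps the earlier assumption on ties,
            best, best_n = cr, n  # matching the priority CR1 > CR2 > CR3 > CR4
    if best > ref:               # steps S302 to S305
        return best_n
    return None                  # 'not applicable'


print(select_hypothesis([0.8, 1.0, 1.0, 0.9], ref=0.9))
print(select_hypothesis([0.6, 0.7, 0.65, 0.6], ref=0.9))
```

The first call returns Assumption 2 because CR2 and CR3 tie and the lower-numbered assumption has priority; the second returns `None` because even the best rate does not exceed `ref`.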
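[0058]
When the above determination is complete, the processing returns to the flow of FIG. 10 and proceeds to the step corresponding to the result. If it was decided to obtain the pitch period from the length T₁ of two items of zero-cross interval data, the processing proceeds to step S106, where four pitch periods, each consisting of two items of zero-cross interval data, are obtained (corresponding to T₁₁ to T₁₄ in FIG. 9(b1)) and their average is taken as the pitch period of the digital audio signal. If it was decided to obtain the pitch period from the length T₂ of four items of zero-cross interval data, the processing proceeds to step S107, where four pitch periods are obtained accordingly (corresponding to T₂₁ to T₂₄ in FIG. 9(b2)) and their average is taken as the pitch period of the digital audio signal. If it was decided to obtain the pitch period from the length T₃ of six items of zero-cross interval data, the processing proceeds to step S108, where four pitch periods are obtained accordingly (corresponding to T₃₁ to T₃₄ in FIG. 9(b3)) and their average is taken as the pitch period of the digital audio signal. If it was decided to obtain the pitch period from the length T₄ of eight items of zero-cross interval data, the processing proceeds to step S109, where three pitch periods are obtained accordingly (corresponding to T₄₁ to T₄₃ in FIG. 9(b4)) and their average is taken as the pitch period of the digital audio signal.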
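Steps S106 to S109 in paragraph [0058] all perform the same operation with different n and m: sum 2n intervals per period, then average over m periods. The Python sketch below illustrates this under stated assumptions. The names `averaged_pitch_period` and `period_to_hz` are hypothetical, and the conversion to a frequency in hertz is an added convenience that relies only on the 4fs = 4 × 44.1 kHz counting clock given in the text.

```python
def averaged_pitch_period(t, n, m):
    """Average of m consecutive pitch periods, each the sum of 2*n intervals.

    t is the list of zero-cross interval counts; the result is in clock counts.
    """
    width = 2 * n
    if len(t) < width * m:
        raise ValueError("not enough zero-cross intervals")
    periods = [sum(t[j * width:(j + 1) * width]) for j in range(m)]
    return sum(periods) / m


def period_to_hz(period_counts, fs=44100, oversampling=4):
    """Convert a period in 4fs clock counts to a pitch frequency in Hz."""
    return fs * oversampling / period_counts


period = averaged_pitch_period([100, 300] * 4, n=1, m=4)
print(period, period_to_hz(period))
```

Here each assumed period is 100 + 300 = 400 counts, which at 176.4 kHz corresponds to 441 Hz.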
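[0059]
When the above processing is complete, the flow returns to step S101 and the same processing is repeated. In this way the pitch period of the digital audio signal is output continuously. If the determination of FIG. 12 concludes “not applicable”, the pitch period is not calculated; a signal indicating that it was not calculated is output, and the flow returns to step S101. The calculation has been described for the digital audio signal, but the pitch period of the digital model signal is calculated in exactly the same way.
[0060]
As described above, the present embodiment follows a cautious procedure: it obtains the reproduction rate for all of Assumptions 1 to 4, selects the assumption with the highest reproduction rate, and performs the pitch calculation based on the selected assumption only when that reproduction rate is within the allowable range, and not otherwise. The reasons for adopting this cautious procedure are as follows.
[0061]
a. One conceivable alternative is to calculate the reproduction rates for Assumptions 1 to 4 in turn, stop as soon as a reproduction rate within the allowable range is obtained, and select the corresponding assumption to obtain the pitch period. Depending on the speech waveform, however, the reproduction rates for Assumptions 1 and 3, for example, may both be within the allowable range, with the rate for Assumption 3 higher than that for Assumption 1. Following this alternative would then select Assumption 1 and yield an incorrect pitch period. The allowable range could be narrowed so that the assumption is selected correctly, but then “not applicable” results might occur one after another.
[0062]
b. Another conceivable alternative is to calculate all the reproduction rates for Assumptions 1 to 4 and unconditionally adopt the assumption with the highest rate to obtain the pitch period. However, the reproduction rates for all assumptions may be uniformly low, with the rate for one assumption only slightly better than the others. Forcing a pitch period from that assumption in such a case gives no guarantee of an accurate result. For example, when the waveform of the digital audio signal changes abruptly, the reproduction rate is likely to be low under every assumption.
[0063]
c. In the present embodiment, therefore, the pitch period is calculated according to the procedure described above, which prevents inappropriate pitch periods from being output.
[0064]
The pitch periods of the digital audio signal and the digital model signal obtained as above are reported in turn to the scoring unit 13. The singer's performance is scored by comprehensively evaluating the deviation between the pitch periods of the two signals and the deviation between the two signal levels obtained by the level detection unit 12, and the score is displayed on the display unit 14.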
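[0065]
C. Evaluation results of the device according to the present embodiment
The operating conditions of each part of the pitch period detection device described above were varied, and the detection time and detection error of the pitch period were evaluated. FIGS. 13 to 16 show the results. FIG. 13 shows the detection error of the pitch period in the practical range, measured with a linear-interpolation circuit used as the 4× oversampling unit 7 and with the oversampling frequency of this circuit varied. The results show that interpolation of about 4× oversampling makes the detection error in the practical range sufficiently small. FIG. 14 shows, for each input frequency, the delay until the pitch period is detected when the pitch period is obtained from the correlation between three periods (m = 3) and between four periods (m = 4). As these results show, with m of about 3 or 4 the detection delay can be kept within an acceptable range. FIG. 15 shows the relationship between the number of averaging operations and the extraction error of the pitch period. FIG. 16 shows the results of an experiment to determine how many past periods (pitch periods) of the waveform must be compared for the pitch period to be extracted accurately. The results show that comparing only about the past two periods gives many errors, while comparing the input waveform over the past five or more periods uses a waveform that is too old and instead leads to a wrong pitch period. Comparing the input waveform over the past three to four periods is therefore best for accurate pitch extraction.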
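[0066]
D. Modifications
(1) In the above embodiment, whether the assumption that the pitch period is the sum of 2n items of zero-cross interval data is valid was judged by how closely the zero-cross interval data making up one pitch period agrees from one pitch period to the next. Instead, for each n, a predetermined number of pitch periods may be obtained by summing 2n items of zero-cross interval data, and the n for which these pitch periods vary least may be selected to determine the pitch period. That is, in FIGS. 9(b1) to 9(b4), if T₁₁ to T₁₄ vary least, their average is taken as the pitch period; if T₂₁ to T₂₄ vary least, their average is taken as the pitch period; and so on. The determination method based on zero-cross interval data disclosed in the above embodiment may also be combined with this method based on the variation of the pitch periods, so that the pitch period is selected by comprehensively evaluating the period-to-period variation of both the zero-cross interval data and the pitch period length.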
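Modification (1) replaces the interval-by-interval comparison with a comparison of whole period lengths. The Python sketch below shows one way this could look. Several choices are assumptions rather than anything the text specifies: the function name `pick_n_by_period_spread`, the use of the standard deviation divided by the mean as the measure of variation, the (n, m) pairs borrowed from the main embodiment, and the tie-break toward the smaller n.

```python
from statistics import mean, pstdev


def pick_n_by_period_spread(t, candidates=((1, 4), (2, 4), (3, 4), (4, 3))):
    """Choose n whose assumed pitch periods vary least; return (n, mean period).

    t          : list of zero-cross interval counts
    candidates : (n, m) pairs; m is the number of periods examined for that n
    Returns None if t is too short for every candidate.
    """
    best = None  # (spread, n, mean_period)
    for n, m in candidates:
        width = 2 * n
        if len(t) < width * m:
            continue
        periods = [sum(t[j * width:(j + 1) * width]) for j in range(m)]
        spread = pstdev(periods) / mean(periods)   # assumed variation measure
        if best is None or spread < best[0]:       # strict '<' prefers smaller n on ties
            best = (spread, n, mean(periods))
    return None if best is None else (best[1], best[2])


print(pick_n_by_period_spread([10, 20, 10, 40] * 6))
```

For a perfectly repeating pattern, every multiple of the true n also gives zero variation, which is why some tie-break toward the smaller n is needed in this sketch.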
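[0067]
(2) In the above embodiment, the width Δ of the masking band of the binarization unit 8 was fixed. However, the amplitude of the small up-and-down movements of the speech waveform near the zero level depends on the amplitude of the speech waveform as a whole, so it can be difficult to choose a suitable Δ. It is therefore preferable to control the width Δ of the masking band of the binarization unit 8, for example by detecting the amplitude of the digital audio signal or the digital model signal, multiplying this amplitude by a predetermined coefficient, and using the result as Δ.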
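The amplitude-dependent Δ of modification (2) amounts to a single scaling step. The Python sketch below is only an illustration: the function name `adaptive_delta`, the use of the peak absolute value over a block of samples as the detected amplitude, and the default coefficient of 0.05 are all assumptions, since the text specifies neither the amplitude detector nor the coefficient.

```python
def adaptive_delta(samples, coeff=0.05):
    """Return a masking-band half-width proportional to the block's peak amplitude.

    samples : one block of the (oversampled) input signal
    coeff   : hypothetical scaling coefficient, not given in the text
    """
    peak = max(abs(s) for s in samples)   # assumed amplitude detector: block peak
    return coeff * peak


print(adaptive_delta([0.2, -1.0, 0.5], coeff=0.1))
```

A louder block then gets a proportionally wider masking band, so small ripples near zero are ignored regardless of the overall signal level.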
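[0068]
[Effect of the invention]
As described above, according to the present invention, a digital audio signal is oversampled, the oversampled digital audio signal is converted into a binary signal, and the successive zero-cross intervals of the speech waveform are measured from this binary signal. For various values of n, the pitch period is assumed to be the sum of 2n items of zero-cross interval data, the degree of waveform agreement between pitch periods is obtained, and the assumption with the best agreement is adopted to obtain the pitch period. The zero-cross intervals can therefore be measured accurately, and the pitch period can be obtained quickly and accurately with an inexpensive configuration even when the speech waveform is a complex waveform containing overtone components.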
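[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of an embodiment of the present invention.
[FIG. 2] A block diagram showing a configuration example of the 4× oversampling unit in the embodiment.
[FIG. 3] A block diagram illustrating a configuration of the binarization unit in the embodiment.
[FIG. 4] A block diagram illustrating a configuration of the binarization unit in the embodiment.
[FIG. 5] A block diagram showing the timer, the RAM, and their control system in the embodiment.
[FIG. 6] A diagram showing the operation of the 4× oversampling unit in the embodiment.
[FIG. 7] A diagram showing the operation of the binarization unit in the embodiment.
[FIG. 8] A diagram showing the operation of the write control unit in the embodiment.
[FIG. 9] A diagram outlining the pitch period calculation in the embodiment.
[FIG. 10] A flowchart showing the pitch period calculation in the embodiment.
[FIG. 11] A flowchart showing the pitch period calculation in the embodiment.
[FIG. 12] A flowchart showing the pitch period calculation in the embodiment.
[FIG. 13] A diagram showing performance evaluation results of the embodiment.
[FIG. 14] A diagram showing performance evaluation results of the embodiment.
[FIG. 15] A diagram showing performance evaluation results of the embodiment.
[FIG. 16] A diagram showing performance evaluation results of the embodiment.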
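[Explanation of reference signs]
1: CD; 2: vocal extraction unit; 3: microphone;
4: A/D converter; 5: DC removal unit; 6: LPF;
7: 4× oversampling unit; 8: binarization unit;
9: timer (zero-cross interval measuring means);
10: RAM (zero-cross interval measuring means);
11: pitch calculation unit (pitch calculation means);
12: level detection unit; 13: scoring unit; 14: display unit.
[0001]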
BACKGROUND OF THE INVENTION
The present invention relates to a pitch detection device that detects a pitch period or a pitch frequency of a speech waveform.
[0002]
[Prior art]
One of the parameters characterizing a speech waveform is the pitch period (or pitch frequency), and techniques for detecting the pitch period of a speech waveform are widely used in speech analysis and synthesis systems, speech coding systems, and the like. Recently, some karaoke systems also detect the pitch period of the singer's voice and use it for purposes such as scoring the singing.
[0003]
Conventionally, there are the following methods for detecting the pitch period of speech.
(1) Zero cross method
If the speech waveform is assumed to be very close to a sine wave, it repeats a monotonous pattern: it crosses the zero level line from negative to positive, then from positive to negative, and then from negative to positive again. The pitch period is therefore given by the time interval between crossings of the zero level line in the same direction. Following this idea, the zero cross method simply measures two zero-cross intervals and takes their sum as the pitch period. A similar approach measures the interval between the instants at which the instantaneous value of the speech waveform reaches a local maximum or a local minimum and takes that as the pitch period.
[0004]
(2) Autocorrelation method
In the autocorrelation method, the pitch period is obtained by evaluating the following autocorrelation function R(r) over the time-series samples x(1), x(2), … obtained by sampling the speech waveform at a constant sampling period.
$$R(r) = \frac{1}{N} \sum_{n=1}^{N-r} x(n)\, x(n+r)$$
(Here Σ denotes the sum of the bracketed terms over n = 1 to N − r.)
That is, r is varied, the autocorrelation function R(r) is obtained for each r, and the pitch period of the speech waveform is calculated from the r at which R(r) is largest (that is, at which the autocorrelation is largest).
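For comparison with the zero-cross approach, the autocorrelation search described above can be written in a few lines. This is a minimal Python sketch, not code from the patent: the function names `autocorrelation` and `pitch_lag` and the search bounds `r_min` and `r_max` are hypothetical, and restricting the search to a lag range is an added assumption that keeps the trivial maximum at r = 0 out of the result.

```python
import math


def autocorrelation(x, r):
    """R(r) = (1/N) * sum over n of x(n) * x(n + r), for the N samples in x."""
    n_samples = len(x)
    return sum(x[i] * x[i + r] for i in range(n_samples - r)) / n_samples


def pitch_lag(x, r_min, r_max):
    """Return the lag r in [r_min, r_max] that maximizes R(r)."""
    return max(range(r_min, r_max + 1), key=lambda r: autocorrelation(x, r))


# A sine wave with a period of 50 samples, 400 samples long.
signal = [math.sin(2 * math.pi * i / 50) for i in range(400)]
print(pitch_lag(signal, 20, 80))
```

Each candidate lag costs on the order of N multiplications, which illustrates the computational load that the next section attributes to this method.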
[0005]
[Problems to be solved by the invention]
The zero cross method described above can detect the pitch period relatively inexpensively and at high speed, but because human speech contains many overtone components, it cannot detect the pitch period accurately. The autocorrelation method can detect the pitch period with reasonable accuracy, but it requires an enormous amount of computation, takes a long time to produce a result, and is costly.
[0006]
An object of the present invention is to overcome these problems and provide a pitch detection device that can detect the pitch period accurately and at high speed with an inexpensive configuration. The device basically obtains the pitch period by measuring zero-cross intervals, and it includes means for preventing the adverse effects that arise from adopting that method.
[0007]
[Means for Solving the Problems]
The gist of the present invention is a pitch detection device comprising: oversampling means for outputting a digital audio signal with its sampling frequency multiplied by a predetermined factor; binarization means for comparing the digital audio signal output by the oversampling means with a predetermined level and converting it into a binary signal; zero-cross interval measuring means for measuring successive zero-cross intervals t₁, t₂, … of the digital audio signal based on the binary signal; and pitch calculation means that varies n (n is an integer of 1 or more), assumes for each n that the sum T = (t₁ + t₂ + … + t₂ₙ) of 2n zero-cross intervals is the pitch period, calculates, as the degree of coincidence of the waveform of the digital audio signal, the degree of coincidence between the mutually corresponding zero-cross intervals included in each of m adjacent assumed pitch periods (m is an integer of 2 or more), and obtains the pitch period by selecting the n with the highest degree of waveform coincidence.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments will be described below for easier understanding of the present invention.
Such an embodiment shows one aspect of the present invention, and is not intended to limit the present invention, and can be arbitrarily changed within the scope of the present invention.
[0009]
A. Configuration of the embodiment
FIG. 1 is a block diagram showing the configuration of an embodiment in which the present invention is applied to a karaoke system. The present embodiment concerns the part of the karaoke system that scores the singer's performance. In FIG. 1, reference numeral 1 denotes a CD (compact disc) on which a digital music signal is recorded. The digital music signal recorded on the CD 1 is sequentially reproduced in synchronization with a clock having a sampling frequency fs = 44.1 kHz. Reference numeral 2 denotes a vocal extraction unit, which extracts a signal corresponding to the vocal sound (hereinafter referred to as the digital model signal) from the digital music signal reproduced from the CD 1. As an example, the digital model signal can be obtained by using a band-pass filter to extract, from the digital music signal reproduced from the CD 1, the signal in a frequency band that includes the voice band. When a medium on which only the vocal sound is recorded is available, the digital music signal reproduced from that medium may be used as the digital model signal as it is. Reference numeral 3 denotes a microphone, which picks up the voice of a singer singing along with the reproduction of the CD 1 and outputs it as an analog audio signal. Reference numeral 4 denotes an A/D converter, which samples the analog audio signal from the microphone 3 in synchronization with a clock having the same sampling frequency fs = 44.1 kHz as is used for reproducing the CD 1, and converts it into a digital audio signal.
[0010]
Reference numeral 5 denotes a DC removal unit, which performs DC removal on the sequentially supplied digital audio signal and digital model signal and outputs each signal with the components of a low frequency band that can be regarded as DC, for example 0 Hz to 50 Hz, removed. Reference numeral 6 denotes an LPF (low-pass filter), which removes components at frequencies of, for example, 500 Hz and above from each of the digital audio signal and the digital model signal output by the DC removal unit 5, and outputs the result. The DC removal unit 5 and the LPF 6 together select and output only the components within the 50 Hz to 500 Hz band of each of the digital audio signal and the digital model signal.
[0011]
Reference numeral 7 denotes a 4× oversampling unit, which performs an interpolation operation on the digital audio signal and the digital model signal that have passed through the LPF 6 (both at a sampling frequency fs = 44.1 kHz), converts each into a signal with four times the sampling frequency, and outputs it.
[0012]
FIG. 2 illustrates the circuit configuration of the 4× oversampling unit 7 needed to process one of the digital audio signal and the digital model signal (hereinafter referred to as the input digital signal). In this figure, a latch 71 captures and holds the input digital signal when it receives a clock corresponding to the sampling frequency fs. Delay units 72, 72, … are cascade-connected after the latch 71 as shown. Each delay unit 72 receives a clock with a frequency four times the sampling frequency fs, sequentially shifts the input signal held in the latch 71, and outputs a delayed signal obtained by delaying the input signal by one clock period at a time. Reference numerals 73, 73, … denote multipliers and 74, 74, … denote adders, which together perform an interpolation operation that convolves a predetermined sequence of interpolation coefficients with the output signals of the latch 71 and the delay units 72, 72, …. With this configuration, the interpolation operation is executed in synchronization with a clock with a frequency four times the sampling frequency fs, and the interpolated digital signal is sequentially output from the adder 74 at the final stage.
[0013]
The 4× oversampling unit 7 is provided to increase the accuracy with which the pitch period is obtained. In the present embodiment, the pitch period of each digital signal is obtained by measuring the time intervals between the zero-cross points of the digital audio signal and of the digital model signal. To increase the measurement accuracy of the pitch period, it is therefore necessary to increase the accuracy with which the position of each zero-cross point on the time axis is detected. Inserting the 4× oversampling unit 7 quadruples the time density of the samples of the digital audio signal and the digital model signal and so increases the detection accuracy of the position of each zero-cross point. In this example oversampling is performed by curve interpolation, but in view of cost, linear interpolation, which still gives a reasonable degree of accuracy, can also be used.
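The linear-interpolation alternative mentioned at the end of this paragraph is simple to sketch. The Python example below is an illustration under stated assumptions, not the FIR structure of FIG. 2: the function name `oversample_linear` is hypothetical, and the final input sample is appended once so that the output ends on an original sample.

```python
def oversample_linear(samples, factor=4):
    """Insert factor-1 linearly interpolated values between neighboring samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)   # k = 0 reproduces the original sample
    out.append(samples[-1])                        # keep the last original sample
    return out


print(oversample_linear([0.0, 4.0, 0.0]))
```

Each original sample is followed by three interpolated values, matching the three crosses between circles in FIG. 6(a), so a sign change can be located to within a quarter of the original sampling period.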
[0014]
Reference numeral 8 denotes a binarization unit, which binarizes the levels of the digital audio signal and the digital model signal output from the 4× oversampling unit 7. This binarization basically determines whether the input digital signal is positive or negative with reference to the zero level, and outputs “1” if the input digital signal is positive and “0” if it is negative. That is, the binarization unit 8 outputs a binary signal whose “0”/“1” value inverts each time the input digital signal crosses the zero level. In the present embodiment, however, a range of ±Δ centered on the zero level is treated as a masking band during binarization, so that even if the input digital signal contains minute vibrations within the ±Δ masking band, those vibrations do not invert the binary signal.
[0015]
FIG. 3 illustrates the circuit configuration of the binarization unit 8 needed to process one of the digital audio signal and the digital model signal (hereinafter referred to as the input digital signal). In this figure, reference numeral 81 denotes an absolute value detection unit that detects the absolute value of the input digital signal. Reference numeral 82 denotes a comparison unit, which compares the absolute value detected by the absolute value detection unit 81 with a predetermined value Δ and outputs “1” when the absolute value exceeds Δ and “0” when it does not. Reference numeral 83 denotes a sample hold unit. While the comparison unit 82 outputs “1”, it outputs the input digital signal as it is (sample state); while the comparison unit 82 outputs “0”, it holds and outputs the input digital signal from immediately before the output of the comparison unit 82 changed from “1” to “0” (hold state). Reference numeral 84 denotes a comparison unit that determines whether the output signal of the sample hold unit 83 is positive or negative with reference to the zero level, and outputs a binary signal of “1” if positive and “0” if negative.
[0016]
With the above configuration, when the input digital signal is outside the range of ±Δ, it is passed through the sample hold unit 83 unchanged. When the input digital signal falls within the masking band of zero level ±Δ, the sample hold unit 83 holds the immediately preceding value of the input digital signal, and the binary signal output from the comparison unit 84 is not inverted during this holding operation. Therefore, when the input digital signal changes so as to cross the masking band of zero level ±Δ, the binary signal is inverted at the point where the masking band has been crossed. On the other hand, if the input digital signal enters the masking band but only moves up and down within it without crossing it, the output signal of the sample hold unit 83 does not cross the zero level even if the input digital signal does, so the binary signal is not inverted.
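The behavior of the FIG. 3 circuit can be imitated in software. The following is a minimal Python sketch under stated assumptions, not the circuit of the embodiment: the function name `binarize_with_mask` and the `initial` start-up value of the sample hold unit are hypothetical, and samples are plain floats.

```python
def binarize_with_mask(samples, delta, initial=1.0):
    """Binarize samples with a masking band of +/-delta around zero.

    `initial` is an assumed start-up value for the sample hold unit;
    the source does not say what is held before the first sample.
    """
    held = initial
    bits = []
    for x in samples:
        if abs(x) > delta:
            # Comparison unit 82 outputs "1": sample state, pass the input through.
            held = x
        # Otherwise hold state: keep the last value seen outside the band.
        # Comparison unit 84: "1" if the held value is positive, else "0".
        bits.append(1 if held > 0 else 0)
    return bits


# Small ripples inside the band do not invert the output;
# the inversion happens only once the band has been crossed.
example = binarize_with_mask(
    [0.5, 0.05, -0.05, 0.05, -0.5, -0.05, 0.05, 0.5], delta=0.1
)
```

In this example the three small samples after 0.5 change sign twice, yet the output stays at 1 until the input reaches -0.5.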
[0017]
The circuit of FIG. 3 may be replaced with the circuit shown in FIG. 4. In FIG. 4, reference numerals 85 and 86 denote comparison units, each of which compares the input digital signal with a reference level and outputs “1” when the input digital signal is higher than the reference level and “0” when it is lower. The comparison unit 85 is given +Δ as its reference level, and the comparison unit 86 is given −Δ. Reference numeral 87 denotes a latch that holds the input digital signal, and reference numeral 88 denotes a selector that selects and outputs either the input digital signal or the output signal of the latch 87. A control unit 89 controls the latch 87 and the selector 88 on the basis of the output signals of the comparison units 85 and 86, as follows.
[0018]
a. When the output signals of the comparison units 85 and 86 are both “1” or both “0”
This is the case when the input digital signal is outside the masking band of zero level ±Δ. In this case, the control unit 89 sets the latch 87 to the sample state and causes the selector 88 to output the input digital signal.
b. When the output signal of the comparison unit 85 is “0” and the output signal of the comparison unit 86 is “1”
This is the case when the input digital signal is inside the masking band of zero level ± Δ. In this case, the control unit 89 puts the latch 87 in the hold state when the input digital signal enters the masking band, and causes the selector 88 to output the output signal of the latch 87.
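Cases a and b above can be sketched as follows. This is an illustrative Python model only. The name `binarize_two_comparators` and the `initial` latch value are hypothetical, and it is an assumption (not stated for FIG. 4) that the selector output is finally compared with the zero level in the same way as by the comparison unit 84 of FIG. 3.

```python
def binarize_two_comparators(samples, delta, initial=1.0):
    """Model of the FIG. 4 arrangement: two comparators, a latch and a selector."""
    latch = initial  # assumed start-up content of latch 87
    bits = []
    for x in samples:
        out85 = 1 if x > delta else 0    # comparison unit 85, reference +delta
        out86 = 1 if x > -delta else 0   # comparison unit 86, reference -delta
        if out85 == out86:
            # Case a: both "1" or both "0", input is outside the masking band.
            latch = x        # latch 87 in the sample state
            selected = x     # selector 88 passes the input
        else:
            # Case b: 85 gives "0" and 86 gives "1", input is inside the band.
            selected = latch  # selector 88 passes the held value
        # Assumed final stage: sign decision against the zero level.
        bits.append(1 if selected > 0 else 0)
    return bits
```

For inputs that never sit exactly on ±Δ, this model gives the same output as a sample-and-hold model of FIG. 3, which is consistent with the text presenting the two circuits as interchangeable.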
[0019]
In FIG. 1, reference numeral 9 denotes a timer that measures the time interval at which inversions occur in each output signal of the binarization unit 8 corresponding to the digital audio signal and the digital model signal, that is, the time interval between zero cross points of each digital signal. Reference numeral 10 denotes a RAM that stores the measurement results of the timer 9.
[0020]
FIG. 5 is a block diagram showing the timer 9 and the RAM 10 together with their control system. This figure shows only the portion needed to process one of the digital audio signal and the digital model signal. In FIG. 5, 91 is a delay device and 92 is an exclusive OR circuit. Together these constitute a differentiating circuit 90, which differentiates the binary signal output from the binarization unit 8 and outputs a pulse every time the binary signal is inverted. The timer 9 is reset every time an output pulse is given from the differentiating circuit 90, and after each reset it counts a clock of constant frequency 4 fs until the next reset.
[0021]
The count value of the timer 9 is given as input data to the latch 93. When an output pulse is given from the differentiating circuit 90, the latch 93 takes in and holds the count value of the timer 9 from immediately before the reset. The count value held in the latch 93 is the number of clocks of frequency 4 fs output between detection of the previous inversion of the binary signal and detection of the current inversion, and can therefore be said to represent the time interval between occurrences of zero cross points. Hereinafter, the data held in the latch 93 is referred to as zero cross interval data.
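The differentiating circuit 90, the timer 9 and the latch 93 can be modelled in a few lines. The following Python sketch is an illustration under assumptions: the name `zero_cross_intervals` is hypothetical, each list element stands for one tick of the 4 fs clock, and the interval before the first detected inversion is dropped because its starting point is unknown (the source does not discuss start-up).

```python
def zero_cross_intervals(bits):
    """Return clock counts between successive inversions of a binary signal."""
    if not bits:
        return []
    intervals = []
    count = 0          # timer 9
    started = False    # becomes True once the first inversion has been seen
    prev = bits[0]     # output of delay device 91
    for b in bits[1:]:
        count += 1
        if b ^ prev:   # exclusive OR circuit 92: pulse on every inversion
            if started:
                intervals.append(count)  # latch 93 takes the count before reset
            started = True
            count = 0  # timer reset
        prev = b
    return intervals
```

For the input `[1, 1, 0, 0, 0, 1, 1, 0]`, inversions are detected at positions 2, 5 and 7, so the zero cross interval data are 3 and 2.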
[0022]
The write control unit 94 reads the zero cross interval data from the latch 93 each time an output pulse is given from the differentiating circuit 90, and writes only zero cross interval data within a certain range to the RAM 10. When the data exceeds a predetermined upper value (the count value of the timer 9 is large), it is limited to that value and written to the RAM 10; when it is less than a predetermined lower value (the count value of the timer 9 is small), it is discarded without being written to the RAM 10. Only zero cross interval data within a certain range is written to the RAM 10 in order to prevent zero cross interval data that is not valid as a time interval between zero cross points of an audio signal from being used in the calculation and producing an incorrect pitch period.
[0023]
The pitch calculation unit 11 in FIG. 1 calculates the pitch period of each of the digital audio signal and the digital model signal by referring to the zero cross interval data stored in the RAM 10.
[0024]
Here, if the digital audio signal or the like is a sine wave, it crosses the zero level line at the start point and end point of one cycle and crosses it only once between these two zero cross points. Therefore, the pitch period can be obtained by adding two consecutive zero cross interval data.
[0025]
However, since a digital audio signal or the like representing a human audio waveform contains many overtone components, the waveform for one pitch period has three or more zero cross points between the start and end points of the pitch period. In such a case, even if two consecutive zero cross interval data are added, a correct pitch period cannot be obtained.
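The difference between a sine wave and a waveform rich in overtones can be checked numerically. The sketch below is illustrative only: the helper `sign_change_intervals`, the period of 60 samples and the test waveform sin(x) + 1.5 sin(3x) are assumptions chosen for the example and do not appear in the source. No masking band is applied here.

```python
import math

PERIOD = 60  # samples per pitch period (illustrative value)


def sign_change_intervals(wave, n_samples):
    """Sample counts between successive sign changes of wave(k)."""
    intervals = []
    last = None              # index of the previous sign change
    prev = wave(0) > 0
    for k in range(1, n_samples):
        cur = wave(k) > 0
        if cur != prev:
            if last is not None:
                intervals.append(k - last)
            last = k
        prev = cur
    return intervals


def pure(k):
    # Plain sine wave; the 0.5 offset keeps samples off the exact zeros.
    return math.sin(2 * math.pi * (k + 0.5) / PERIOD)


def rich(k):
    # Fundamental plus a strong third harmonic: six zero crossings per period.
    x = 2 * math.pi * (k + 0.5) / PERIOD
    return math.sin(x) + 1.5 * math.sin(3 * x)


pure_iv = sign_change_intervals(pure, 4 * PERIOD)
rich_iv = sign_change_intervals(rich, 4 * PERIOD)
```

For the pure sine, any two consecutive intervals add up to the period. For the overtone-rich waveform, two consecutive intervals fall short of the period, and six consecutive intervals are needed to span it.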
[0026]
Therefore, in the present embodiment, it is assumed, for each of a plurality of integers n, that one pitch period has a length corresponding to the sum of 2n zero cross interval data. The pitch period is obtained under each assumption, and the degree to which the occurrence timing of each zero cross point within one pitch period coincides from one pitch period to the next is detected. Details of how this degree of coincidence is detected will be described later. The pitch period obtained under the assumption with the highest degree of coincidence is then selected as the true pitch period. This relies on the property of audio signals that large waveform changes do not occur within a short time.
[0027]
Next, in FIG. 1, reference numeral 12 denotes a level detection unit that detects the levels of the digital audio signal output by the A/D converter 4 and the digital model signal output by the vocal extraction unit 2, and outputs signals representing those levels.
[0028]
Reference numeral 13 denotes a scoring unit, which scores the singer's singing by comprehensively evaluating the deviation between the pitch periods of the digital audio signal and the digital model signal obtained by the pitch calculation unit 11 and the deviation between the two signal levels obtained by the level detection unit 12. The scoring result is displayed on the display unit 14.
[0029]
B. Operation of the embodiment
The operation of this embodiment will be described below. When the singer selects a song, the digital music signal corresponding to the song is reproduced sequentially from the CD 1. The vocal extraction unit 2 extracts a digital model signal from the digital music signal and outputs it to the DC removal unit 5 and the level detection unit 12. Meanwhile, the singer starts singing along with the playback of the CD 1, and the singing voice is picked up by the microphone 3 and output as an analog audio signal. This analog audio signal is converted into a digital audio signal by the A/D converter 4 and output to the DC removal unit 5 and the level detection unit 12.
[0030]
The digital audio signal and the digital model signal pass in turn through the DC removal unit 5 and the LPF 6, which remove signals in unnecessary frequency bands, and are each output to the 4× oversampling unit 7 as a digital signal representing a waveform composed only of components in the frequency band of the human voice.
[0031]
The digital audio signal and the digital model signal are then each interpolated on the time axis by the 4× oversampling unit 7, converted into signals with four times the sampling frequency, and converted into binary signals by the binarization unit 8.
[0032]
FIG. 6 illustrates the operation of the 4× oversampling unit 7. In FIG. 6A, the horizontal straight line is the zero level line. A sinusoidal signal waveform is shown with ◯ marks plotted along it. The ◯ marks represent the individual original samples that make up the digital audio signal (digital model signal), and the sinusoidal waveform represents the original signal from which these samples were taken. Three × marks are inserted between adjacent ◯ marks; these represent the interpolated samples obtained by the 4× oversampling unit 7.
[0033]
FIG. 6B shows the binary signal obtained when 4× oversampling is not performed and only the original samples (◯ marks) are given to the binarization unit 8. FIG. 6C shows the binary signal obtained when 4× oversampling is performed and both the original samples (◯ marks) and the interpolated samples (× marks) are given to the binarization unit 8. For convenience of explanation, these drawings show an example in which the digital audio signal (digital model signal) contains no vibration smaller than the masking band of the binarization unit 8.
[0034]
Here, the digital audio signal or the like is sampled at a constant sampling period regardless of the signal waveform. Therefore, even when the digital audio signal or the like repeats the same waveform, the timing at which the instantaneous value is sampled differs from one repetition to the next, as shown in FIG. 6A. For this reason, when the sampling period is coarse, binary signals with different waveforms may be obtained from one pitch period to the next even though the underlying waveform is the same, as shown in FIG. 6B. In contrast, when binarization is performed after 4× oversampling of the digital audio signal or the like, as in the present embodiment, a binary signal that inverts at timings close to the original zero cross points is obtained, as shown in FIG. 6C, and the problem shown in FIG. 6B is prevented.
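As an illustration of the interpolation step, the following Python sketch inserts three samples between each pair of original samples. Linear interpolation is an assumption here: the source mentions a linear-interpolation circuit only in the evaluation section and does not state which interpolation the embodiment itself uses. The name `oversample_linear` is hypothetical.

```python
def oversample_linear(samples, factor=4):
    """Insert factor - 1 linearly interpolated samples between neighbours."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for j in range(factor):
            # j = 0 reproduces the original sample a; j = 1..factor-1 are new.
            out.append(a + (b - a) * j / factor)
    if samples:
        out.append(samples[-1])  # keep the final original sample
    return out


dense = oversample_linear([0.0, 4.0, 0.0])
```

With four times as many samples, the sign change nearest each true zero cross point can be located to within a quarter of the original sampling period, which is the effect described for FIG. 6.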
[0035]
FIGS. 7A to 7D illustrate the operation of the binarization unit 8. In FIG. 7A, the sinusoidal signal waveform represents the digital audio signal (digital model signal) output from the 4× oversampling unit 7, and the horizontal line represents the zero level line. FIG. 7B shows the operation of the sample hold unit 83 in FIG. 3. As shown in this figure, the sample hold unit 83 is in the sample state (indicated as “S” in the figure) when the input digital audio signal (digital model signal) is outside the masking band of zero level ±Δ, and in the hold state (indicated as “H” in the figure) when it is inside the masking band. As a result of this control of the sample hold unit 83, the signal waveform input to the comparison unit 84 is as illustrated in FIG. 7C, and the binary signal obtained from the comparison unit 84 is as illustrated in FIG. 7D. Thus, when the digital audio signal (digital model signal) changes so as to cross the masking band of zero level ±Δ, the binary signal is inverted at the point where the masking band has been crossed. Even if the digital audio signal (digital model signal) includes a minute vibration with an amplitude of ±Δ or less, the sample hold unit 83 holds the previous value while the signal is within the masking band, so the binary signal is not inverted by that vibration.
[0036]
In this embodiment, the pitch period is calculated from zero cross intervals, so if too many zero cross intervals are detected in the portion of the input digital signal waveform corresponding to one pitch period, the burden of calculating the pitch period increases. However, because the binary signal is generated by the binarization unit 8 with the masking band described above, fine movements of the input digital signal near the zero level that are not important for calculating the pitch period are ignored, and a binary signal without unnecessary “0” / “1” inversions is obtained. This makes it possible to detect a number of zero cross intervals appropriate for calculating the pitch period.
[0037]
As described above, a binary signal is generated from each of the digital audio signal and the digital model signal. For each binary signal, the timer 9 successively measures the time intervals at which “1” / “0” inversions occur, and the resulting zero cross interval data are held one after another in the latch 93 shown in FIG. 5. The zero cross interval data held in the latch 93 are written sequentially to the RAM 10 under the control of the write control unit 94. That is, the write control unit 94 executes the write control routine whose flow is shown in FIG. 8 each time the differentiating circuit 90 outputs a pulse in response to an inversion of the binary signal. First, the write control unit 94 takes in the zero cross interval data t from the latch 93 (step S1) and determines whether the zero cross interval data t is equal to or greater than the lower limit “8” (step S2). If this determination is “NO”, the routine ends without writing the zero cross interval data t. If the determination result in step S2 is “YES”, the process proceeds to step S3, where it is determined whether the zero cross interval data t is larger than the upper limit “8192”. If the determination result is “NO”, the zero cross interval data t is written to the RAM 10 (step S4) and the routine ends. If the determination result in step S3 is “YES”, “8192” is written to the RAM 10 in place of the fetched zero cross interval data t (step S5) and the routine ends. With this control, only zero cross interval data in the range “8” to “8192” is written to the RAM 10, which prevents zero cross interval data that is not valid as a time interval between zero cross points of an audio signal from being used in the calculation and producing an erroneous pitch period.
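The write control routine of steps S1 to S5 can be sketched as follows. This is a minimal Python illustration. The limits 8 and 8192 come from the text, while the function name `write_control` and the use of a Python list in place of the RAM 10 are assumptions.

```python
LOWER_LIMIT = 8      # lower limit given in the text
UPPER_LIMIT = 8192   # upper limit given in the text


def write_control(t, ram):
    """Apply steps S2 to S5 to one zero cross interval value t.

    `ram` is a plain list standing in for RAM 10.
    Returns True if a value was written, False if t was discarded.
    """
    if t < LOWER_LIMIT:          # step S2 is "NO": discard without writing
        return False
    if t > UPPER_LIMIT:          # step S3 is "YES"
        ram.append(UPPER_LIMIT)  # step S5: write the limit value instead
    else:
        ram.append(t)            # step S4: write t as it is
    return True


ram = []
for value in [3, 8, 100, 9000, 8192, 7]:
    write_control(value, ram)
```

Here the values 3 and 7 are discarded, 9000 is clamped to 8192, and the rest are stored unchanged.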
[0038]
The zero cross interval data stored in the RAM 10 in this way is referred to by the pitch calculation unit 11, and the pitch periods of the digital audio signal and the digital model signal are obtained. Here, the calculation of the pitch period is outlined with reference to FIG. 9, taking the digital audio signal as an example. If a digital audio signal such as that illustrated in FIG. 9A is given to the binarization unit 8, the zero cross interval data t₁, t₂, … generated up to the present time are stored in the RAM 10. The pitch calculation unit 11 makes the following four assumptions about the relationship between the zero cross interval data t₁, t₂, … and the pitch period of the digital audio signal, and obtains the pitch period by examining the validity of each.
[0039]
(1) Assumption 1
The pitch period of the digital audio signal has a length T₁ corresponding to the sum of the two zero cross interval data t₁ and t₂. That is, the times T₁₁, T₁₂, … shown in FIG. 9 (b1) are the pitch periods of the digital audio signal.
(2) Assumption 2
The pitch period of the digital audio signal has a length T₂ corresponding to the sum of the four zero cross interval data t₁ to t₄. That is, the times T₂₁, T₂₂, … shown in FIG. 9 (b2) are the pitch periods of the digital audio signal.
(3) Assumption 3
The pitch period of the digital audio signal has a length T₃ corresponding to the sum of the six zero cross interval data t₁ to t₆. That is, the times T₃₁, T₃₂, … shown in FIG. 9 (b3) are the pitch periods of the digital audio signal.
(4) Assumption 4
The pitch period of the digital audio signal has a length T₄ corresponding to the sum of the eight zero cross interval data t₁ to t₈. That is, the times T₄₁, T₄₂, … shown in FIG. 9 (b4) are the pitch periods of the digital audio signal.
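The four assumptions amount to grouping the stored zero cross interval data into blocks of 2n values and summing each block. The following Python sketch shows this grouping. The function name `candidate_periods` and the sample data are hypothetical, and list index 0 is taken to hold t₁.

```python
def candidate_periods(t, n, count):
    """Return `count` successive pitch periods, each the sum of 2n intervals.

    t[0] corresponds to t1 in the text; n is the assumption number (1 to 4).
    """
    width = 2 * n
    if len(t) < width * count:
        raise ValueError("not enough zero cross interval data")
    return [sum(t[k * width:(k + 1) * width]) for k in range(count)]


# Illustrative data: a waveform with six zero crossings per pitch period.
t = [6, 12, 12, 6, 12, 12] * 4

under_assumption_1 = candidate_periods(t, 1, 4)  # T11..T14
under_assumption_3 = candidate_periods(t, 3, 4)  # T31..T34
under_assumption_4 = candidate_periods(t, 4, 3)  # T41..T43
```

For this data only assumption 3 yields candidate periods that are all equal, which is the kind of regularity the later steps test for at the level of individual intervals.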
[0040]
The validity of each of the above assumptions is examined, and the pitch period is calculated from the results, according to the flow shown in FIG. 10. First, the pitch calculation unit 11 calculates the reproduction rate CR1 of the waveform of the digital audio signal under assumption 1 (step S101). The reproduction rate is a numerical value indicating how closely the digital audio signal waveforms in successive pitch periods match one another when the pitch periods are delimited according to the assumption. In this embodiment, it is calculated from the zero cross interval data t₁, t₂, ….
[0041]
The procedure for calculating the reproduction rate CR1 in step S101 will now be described with reference to the flowchart of FIG. 11. First, in step S201, “0” and “1” are set as the initial values of the counter CNT and the control variable i, respectively.
[0042]
In step S202, the control variable i is increased by “2”, giving i = “3”. Next, in step S203, it is determined whether 0.9t₁ − tᵢ < 0 is satisfied, that is, whether the zero cross interval data t₃ is larger than 90% of the zero cross interval data t₁. If the determination result is “YES”, the counter CNT is incremented by “1” (step S204) and the process proceeds to step S205; if “NO”, the process proceeds to step S205 without going through step S204. Next, in step S205, it is determined whether −1.1t₁ + tᵢ < 0 is satisfied, that is, whether the zero cross interval data t₃ is smaller than 110% of the zero cross interval data t₁. If the determination result is “YES”, the counter CNT is incremented by “1” (step S206) and the process proceeds to step S207; if “NO”, the process proceeds to step S207 without going through step S206.
[0043]
Next, in step S207, it is determined whether or not the control variable i is “7”. If the determination result is “NO”, the process returns to step S202. Steps S202 to S207 are thus executed two more times, with the determinations of steps S203 and S205 applied to the zero cross interval data t₅ and t₇, and the counter CNT is incremented each time the data is larger than 90% of t₁ and each time it is smaller than 110% of t₁ (steps S204 and S206).
[0044]
When i = “7”, the determination result in step S207 is “YES”, the process proceeds to step S208, and “2” is set to the control variable i.
[0045]
Next, in step S209, the control variable i is increased by “2”, giving i = “4”. Next, in step S210, it is determined whether 0.9t₂ − tᵢ < 0 is satisfied, that is, whether the zero cross interval data t₄ is larger than 90% of the zero cross interval data t₂. If the determination result is “YES”, the counter CNT is incremented by “1” (step S211) and the process proceeds to step S212; if “NO”, the process proceeds to step S212 without going through step S211. Next, in step S212, it is determined whether −1.1t₂ + tᵢ < 0 is satisfied, that is, whether the zero cross interval data t₄ is smaller than 110% of the zero cross interval data t₂. If the determination result is “YES”, the counter CNT is incremented by “1” (step S213) and the process proceeds to step S214; if “NO”, the process proceeds to step S214 without going through step S213.
[0046]
Next, in step S214, it is determined whether or not the control variable i is “8”. If the determination result is “NO”, the process returns to step S209. Steps S209 to S214 are thus executed two more times, with the determinations of steps S210 and S212 applied to the zero cross interval data t₆ and t₈, and the counter CNT is incremented each time the data is larger than 90% of t₂ and each time it is smaller than 110% of t₂ (steps S211 and S213).
[0047]
When i = “8”, the determination result in step S214 is “YES” and the process proceeds to step S215, where the value of the counter CNT is normalized by the number of determinations made on the zero cross interval data and the result is taken as the reproduction rate CR1. In this flow the determination is performed 12 times, so CNT/12 is the reproduction rate CR1.
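The flow of steps S201 to S215 can be written compactly as two nested loops. The following Python sketch follows the determinations described above. The function name `reproduction_rate_cr1` is hypothetical, and list index 0 is taken to hold t₁.

```python
def reproduction_rate_cr1(t):
    """Compute CR1 from t1..t8 (t[0] is t1), following steps S201 to S215."""
    cnt = 0      # counter CNT
    checks = 0   # number of determinations made (12 in total)
    for ref_index in (0, 1):                 # references t1, then t2
        ref = t[ref_index]
        # t3, t5, t7 against t1; then t4, t6, t8 against t2
        for i in range(ref_index + 2, 8, 2):
            if 0.9 * ref - t[i] < 0:         # steps S203 / S210: above 90%
                cnt += 1
            if -1.1 * ref + t[i] < 0:        # steps S205 / S212: below 110%
                cnt += 1
            checks += 2
    return cnt / checks                      # step S215: CNT / 12


steady = reproduction_rate_cr1([10, 20, 10, 20, 10, 20, 10, 20])
one_outlier = reproduction_rate_cr1([10, 20, 10, 20, 10, 20, 30, 20])
```

Note that a value far from its reference still passes one of the two determinations, so a single outlier lowers CR1 by 1/12 rather than 2/12.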
[0048]
Here, in an ideal state in which the assumption that the length of the pitch period is the sum T₁ of two zero cross interval data is correct and the waveform of the digital audio signal does not change over four successive pitch periods, t₁ = t₃ = t₅ = t₇ and t₂ = t₄ = t₆ = t₈. In this case, the reproduction rate CR1 obtained by the above processing is 100%. Even if there is some error in the zero cross interval data, the reproduction rate CR1 is 100% as long as t₃, t₅ and t₇ are within ±10% of t₁ and t₄, t₆ and t₈ are within ±10% of t₂. On the other hand, if the assumption is incorrect, large differences arise between the zero cross interval data that correspond to one another in successive pitch periods. Negative determinations then become more likely in step S203 and the like, and the reproduction rate CR1 decreases as the number of negative determinations increases.
[0049]
When the calculation of the reproduction rate CR1 is completed in this way, the process returns to the flow of FIG. 10 and proceeds to step S102, where the reproduction rate CR2 of the digital audio signal waveform under assumption 2 is calculated. That is, the pitch period is assumed to have a length T₂ corresponding to the sum of four zero cross interval data. Taking the zero cross interval data t₁ to t₄ corresponding to the first pitch period as the reference, it is determined whether each of the zero cross interval data t₅ to t₈, t₉ to t₁₂ and t₁₃ to t₁₆ corresponding to the second, third and fourth pitch periods matches the reference within a predetermined error range. The number of positive determinations is counted and normalized by the total number of determinations to obtain the reproduction rate CR2.
[0050]
In an ideal state in which the assumption that the length of the pitch period is the sum of four zero cross interval data is correct and the waveform of the digital audio signal does not change over four successive pitch periods, the conditions
t₁ = t₅ = t₉ = t₁₃
t₂ = t₆ = t₁₀ = t₁₄
t₃ = t₇ = t₁₁ = t₁₅
t₄ = t₈ = t₁₂ = t₁₆
are all satisfied, and the reproduction rate CR2 is 100%. Even if there is some error in the zero cross interval data, the reproduction rate CR2 is 100% as long as the error is within ±10%. If zero cross interval data that deviate greatly from the reference (that is, from the zero cross interval data corresponding to the first pitch period) occur in later pitch periods, the reproduction rate CR2 decreases according to the number of such data.
[0051]
In step S103, the reproduction rate CR3 of the digital audio signal waveform under assumption 3 is calculated. That is, the pitch period is assumed to have a length T₃ corresponding to the sum of six zero cross interval data. Taking the zero cross interval data t₁ to t₆ corresponding to the first pitch period as the reference, it is determined whether each of the zero cross interval data t₇ to t₁₂, t₁₃ to t₁₈ and t₁₉ to t₂₄ corresponding to the second, third and fourth pitch periods matches the reference within a predetermined error range. The number of positive determinations is counted and normalized by the total number of determinations to obtain the reproduction rate CR3.
[0052]
The reproduction rate CR3 is 100% if the conditions
t₁ = t₇ = t₁₃ = t₁₉
t₂ = t₈ = t₁₄ = t₂₀
t₃ = t₉ = t₁₅ = t₂₁
t₄ = t₁₀ = t₁₆ = t₂₂
t₅ = t₁₁ = t₁₇ = t₂₃
t₆ = t₁₂ = t₁₈ = t₂₄
are all satisfied, or if there is some error in the zero cross interval data but the error is within ±10%. If zero cross interval data with a large error occur, the reproduction rate CR3 decreases according to the number of such data.
[0053]
In step S104, the reproduction rate CR4 of the digital audio signal waveform under assumption 4 is calculated. That is, the pitch period is assumed to have a length T₄ corresponding to the sum of eight zero cross interval data. Taking the zero cross interval data t₁ to t₈ corresponding to the first pitch period as the reference, it is determined whether each of the zero cross interval data t₉ to t₁₆ and t₁₇ to t₂₄ corresponding to the second and third pitch periods matches the reference within a predetermined error range. The number of positive determinations is counted and normalized by the total number of determinations to obtain the reproduction rate CR4.
[0054]
In each of steps S101 to S103, four pitch periods are processed, but in step S104 three pitch periods (T₄₁ to T₄₃ in FIG. 9 (b4)) are processed, for the following reason. In step S104, a long time corresponding to eight zero cross interval data is assumed as the pitch period. If four pitch periods were processed in step S104, the reproduction rate CR4 would be low, even when assumption 4 is correct, unless the digital audio signal waveform remained stable over the very long time spanned by four such pitch periods. The waveform of a digital audio signal can keep the same shape for a certain short time, but it changes once that time has passed. Consequently, if four pitch periods were processed, there would be a high probability of an unreasonably low reproduction rate CR4 being calculated because of the change in the waveform over time, even when assumption 4 is correct. For this reason, three pitch periods are processed in step S104, as described above.
[0055]
In step S104, the reproduction rate CR4 is 100% if the conditions
t₁ = t₉ = t₁₇
t₂ = t₁₀ = t₁₈
t₃ = t₁₁ = t₁₉
t₄ = t₁₂ = t₂₀
t₅ = t₁₃ = t₂₁
t₆ = t₁₄ = t₂₂
t₇ = t₁₅ = t₂₃
t₈ = t₁₆ = t₂₄
are all satisfied, or if there is some error in the zero cross interval data but the error is within ±10%. If zero cross interval data with a large error occur, the reproduction rate CR4 decreases according to the number of such data.
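The calculations of CR1 to CR4 share one pattern: each interval in the first pitch period is the reference for the intervals at the same position in the later pitch periods. The following Python sketch generalizes that pattern. It assumes that CR2 to CR4 use the same two determinations (above 90%, below 110%) that the text spells out for CR1. The function name `reproduction_rate` and the sample data are hypothetical.

```python
def reproduction_rate(t, n, periods):
    """Reproduction rate under the assumption that a pitch period is 2n intervals.

    t[0] is t1. `periods` is the number of pitch periods examined
    (4 for assumptions 1 to 3, 3 for assumption 4 in the text).
    """
    width = 2 * n
    if len(t) < width * periods:
        raise ValueError("not enough zero cross interval data")
    cnt = 0
    checks = 0
    for j in range(width):              # position within the pitch period
        ref = t[j]                      # reference from the first pitch period
        for p in range(1, periods):     # second and later pitch periods
            v = t[j + p * width]
            # Two separate determinations, as in the CR1 flow (assumed for all CRn).
            cnt += (v > 0.9 * ref) + (v < 1.1 * ref)
            checks += 2
    return cnt / checks


# Illustrative data: six zero crossings per pitch period, 24 values in all.
t = [6, 12, 12, 6, 12, 12] * 4
cr1 = reproduction_rate(t, 1, 4)
cr3 = reproduction_rate(t, 3, 4)
cr4 = reproduction_rate(t, 4, 3)
```

With this data the rate is 100% only under assumption 3, which matches the true structure of the example.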
[0056]
Next, the process proceeds to step S105, where it is determined which of assumptions 1 to 4 is appropriate on the basis of the reproduction rates CR1 to CR4 obtained as described above. The detailed flow of this determination is shown in FIG. 12. First, in step S301, it is determined which of the reproduction rates CR1 to CR4 is the maximum. If the reproduction rate CR1 is the maximum, it is determined whether CR1 is greater than a predetermined reference value ref (step S302). If this determination result is “YES”, assumption 1 is followed; that is, it is decided that the pitch period is to be obtained from the length T₁ of two zero cross interval data. The same applies when one of the other reproduction rates CR2 to CR4 is the maximum: it is determined whether CR2 or the like is greater than the predetermined reference value ref (steps S303 to S305), and if the determination result is “YES”, it is decided, in accordance with the assumption used to calculate that reproduction rate, that the pitch period is to be obtained from the length T₂ of four zero cross interval data, the length T₃ of six zero cross interval data, or the length T₄ of eight zero cross interval data. If reproduction rates are equal, the order of priority is CR1 > CR2 > CR3 > CR4 (CR1 has the highest priority).
[0057]
On the other hand, when the largest of the reproduction rates CR1 to CR4 is equal to or less than the reference value ref, the determination result is “NO” whichever of steps S302 to S305 is executed. In this case, no conclusion can be drawn as to which of assumptions 1 to 4 is valid, and the determination result is “not applicable”.
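The determination of steps S301 to S305 can be sketched as follows. The function name `select_assumption` is hypothetical, and the value of the reference value ref is not given in the source, so it is left as a parameter. `None` stands for the result “not applicable”.

```python
def select_assumption(rates, ref):
    """Choose assumption 1 to 4 from [CR1, CR2, CR3, CR4], or None.

    Ties go to the lowest-numbered assumption (CR1 > CR2 > CR3 > CR4),
    and the winner is accepted only if it is strictly greater than ref.
    """
    best = max(rates)               # step S301
    n = rates.index(best) + 1       # index() returns the first maximum
    return n if best > ref else None  # steps S302 to S305
```

For example, with `ref = 0.8` (an arbitrary illustrative value), the rates `[0.75, 0.9, 1.0, 0.6]` select assumption 3, while `[0.5, 0.6, 0.7, 0.4]` give “not applicable”.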
[0058]
When the above determination is completed, the process returns to the flow shown in FIG. 10 and proceeds to the step corresponding to the determination result. If it has been decided that the pitch period is to be obtained from the length T₁ of two zero cross interval data, the process proceeds to step S106, where four pitch periods each consisting of two zero cross interval data (T₁₁ to T₁₄ in FIG. 9 (b1)) are obtained and their average is taken as the pitch period of the digital audio signal. If it has been decided that the pitch period is to be obtained from the length T₂ of four zero cross interval data, the process proceeds to step S107, where four pitch periods (T₂₁ to T₂₄ in FIG. 9 (b2)) are obtained accordingly and their average is taken as the pitch period of the digital audio signal. If it has been decided that the pitch period is to be obtained from the length T₃ of six zero cross interval data, the process proceeds to step S108, where four pitch periods (T₃₁ to T₃₄ in FIG. 9 (b3)) are obtained accordingly and their average is taken as the pitch period of the digital audio signal. If it has been decided that the pitch period is to be obtained from the length T₄ of eight zero cross interval data, the process proceeds to step S109, where three pitch periods (T₄₁ to T₄₃ in FIG. 9 (b4)) are obtained accordingly and their average is taken as the pitch period of the digital audio signal.
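Steps S106 to S109 average the candidate pitch periods of the selected assumption. The following Python sketch shows the averaging and a conversion to frequency. The conversion follows from the timer counting a clock of frequency 4 fs, so a period of P counts lasts P / (4 fs) seconds. The function names and the value fs = 44100 Hz used in the example are assumptions, since the source does not state the sampling frequency.

```python
def average_pitch_period(t, n):
    """Average pitch period, in 4 fs clock counts, under assumption n.

    t[0] is t1. Four pitch periods are averaged for n = 1 to 3 and
    three for n = 4, as in steps S106 to S109.
    """
    periods = 3 if n == 4 else 4
    width = 2 * n
    if len(t) < width * periods:
        raise ValueError("not enough zero cross interval data")
    sums = [sum(t[k * width:(k + 1) * width]) for k in range(periods)]
    return sum(sums) / periods


def pitch_frequency_hz(period_counts, fs_hz):
    """Convert a period measured in 4 fs clock counts to a frequency in Hz."""
    return 4 * fs_hz / period_counts


t = [60, 120, 120, 60, 120, 120] * 4   # illustrative data
period = average_pitch_period(t, 3)
frequency = pitch_frequency_hz(period, 44100)  # fs is an assumed value
```

Averaging over several pitch periods smooths out the one-count quantization error of each individual interval measurement.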
[0059]
When the above process ends, the process returns to step S101 and the same process is repeated. In this way, the pitch period of the digital audio signal is output continuously. If, on the other hand, the conclusion “not applicable” is reached in the determination of FIG. 12, the pitch period is not calculated, a signal indicating that the pitch period has not been calculated is output, and the process returns to step S101. The pitch period calculation has been described above taking the digital audio signal as an example, but the pitch period of the digital model signal is calculated by exactly the same process.
[0060]
As described above, in the present embodiment, reproduction rates are obtained for all of assumptions 1 to 4, the assumption with the highest reproduction rate is selected, and the pitch calculation based on the selected assumption is carried out only when that reproduction rate is within the allowable range and is not carried out when it is outside the allowable range. The reason for taking such a careful procedure is as follows.
[0061]
a. As an alternative to the above procedure, one could, for example, calculate the reproduction rates corresponding to assumptions 1 to 4 one after another, stop the calculation as soon as a reproduction rate within the allowable range is obtained, and determine the pitch period by selecting the assumption that gave it. However, depending on the speech waveform, the reproduction rates corresponding to assumptions 1 and 3 may both be within the allowable range, with that of assumption 3 higher than that of assumption 1. If the alternative were followed in such a case, assumption 1 would be selected and an incorrect pitch period would be obtained. The allowable range could be set narrowly so that the assumption is selected correctly, but in that case determinations of “not applicable” may occur.
[0062]
b. An alternative is also conceivable in which all the recall rates corresponding to assumptions 1 to 4 are calculated, the assumption that gives the maximum recall rate is adopted unconditionally, and the pitch period is obtained from it. However, there may be cases where the recall rates for all assumptions are uniformly low and the recall rate for one particular assumption is only slightly better than the others. Even if the pitch period is forcibly obtained by adopting that assumption, there is no guarantee that an accurate pitch period will be obtained. For example, when the waveform of the digital audio signal changes abruptly in its pitch period, the recall rate is likely to be low under every one of the above assumptions.
[0063]
c. Therefore, in the present embodiment, the pitch period is calculated according to the above-described procedure, thereby preventing an inappropriate pitch period from being output.
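The difference between the procedure of the embodiment and alternative a can be illustrated with a minimal sketch. The function names, the lower bound `min_recall` standing in for the allowable range, and the recall rate values are all hypothetical and are not taken from the embodiment; the calculation of the recall rates themselves is assumed to have been done elsewhere.

```python
from typing import Optional, Sequence


def select_assumption(recall_rates: Sequence[float], min_recall: float) -> Optional[int]:
    """Return the 1-based assumption number to adopt, or None for "not applicable".

    recall_rates[i] is the recall rate for assumption i + 1.
    min_recall is a hypothetical lower bound standing in for the allowable range.
    """
    # Evaluate every assumption and take the one with the highest recall rate.
    best = max(range(len(recall_rates)), key=lambda i: recall_rates[i])
    # Adopt it only if its recall rate lies within the allowable range.
    if recall_rates[best] < min_recall:
        return None
    return best + 1


def first_acceptable(recall_rates: Sequence[float], min_recall: float) -> Optional[int]:
    """Alternative a: stop at the first assumption whose recall rate is acceptable."""
    for i, rate in enumerate(recall_rates):
        if rate >= min_recall:
            return i + 1
    return None


rates = [0.82, 0.40, 0.95, 0.60]  # made-up recall rates for assumptions 1 to 4
print(select_assumption(rates, 0.8))                      # 3
print(first_acceptable(rates, 0.8))                       # 1 (the case criticized in a.)
print(select_assumption([0.30, 0.35, 0.32, 0.31], 0.8))   # None (the case in b.)
```

With these made-up values, alternative a stops at assumption 1 even though assumption 3 matches better, while the uniformly low second example is rejected instead of being forced to a result.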
[0064]
The pitch periods of the digital audio signal and the digital model signal obtained as described above are sequentially reported to the scoring unit 13. The scoring unit 13 comprehensively evaluates the deviation between the pitch periods of the two signals and the deviation between the two signal levels obtained by the level detection unit 12, scores the singer's singing, and displays the scoring result on the display unit 14.
[0065]
C. Evaluation result of the apparatus according to this embodiment
Various operating conditions were set for each part of the pitch period detection device described above, and the pitch period detection time and detection error were evaluated. FIGS. 13 to 16 show the results. First, FIG. 13 shows the result of measuring the pitch period detection error in the practical range while using a circuit that performs linear interpolation as the 4-times oversampling unit 7 and varying the oversampling frequency of that circuit. This result shows that the detection error in the practical range can be made sufficiently small by interpolating at about 4-times oversampling. Next, FIG. 14 shows the result of measuring the pitch period detection delay time at each input frequency, both for the case where the pitch period is obtained from the correlation among three periods (m = 3) and for the case where it is obtained from the correlation among four periods (m = 4). As this result shows, with m = 3 or 4 the detection delay can be kept within a range that causes no problem. FIG. 15 shows the relationship between the number of averaging operations and the pitch period extraction error. FIG. 16 shows the result of an experiment on how many past periods (pitch periods) of the waveform should be compared in order to extract the pitch period accurately. This result shows that errors are frequent when the past two periods are compared, and that when input waveforms over the past five periods are compared, the waveform is too old and the pitch period is detected incorrectly. Comparing input waveforms over four periods is therefore optimal for accurate pitch extraction.
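As an illustration of why oversampling by linear interpolation reduces the error of the zero cross interval measurement, the following minimal sketch inserts interpolated points between adjacent samples. The function name and the sample values are hypothetical; the actual circuit of the 4-times oversampling unit 7 is not reproduced here.

```python
from typing import List, Sequence


def oversample_linear(samples: Sequence[float], factor: int = 4) -> List[float]:
    """Insert factor - 1 linearly interpolated points between adjacent samples."""
    out: List[float] = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            # k = 0 reproduces the original sample a; k > 0 are interpolated points.
            out.append(a + (b - a) * k / factor)
    if samples:
        out.append(samples[-1])  # keep the final original sample
    return out


# The original samples only reveal that a zero cross lies somewhere between them;
# on the 4-times grid the crossing is located to a quarter of a sample period.
print(oversample_linear([-1.0, 3.0]))  # [-1.0, 0.0, 1.0, 2.0, 3.0]
```

In this sketch the time resolution of each detected zero cross improves from one sample period to one quarter of it, which is the effect evaluated in FIG. 13.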
[0066]
D. Modified example
(1) In the above embodiment, whether the assumption that the pitch period is the sum of 2n pieces of zero cross interval data is appropriate is judged from how well the zero cross interval data constituting one pitch period match between pitch periods. Instead of this method, for each n, a predetermined number of pitch periods may be obtained by calculating sums of 2n pieces of zero cross interval data, and the n for which these pitch periods vary least may be selected to determine the pitch period. That is, in FIGS. 9B1 to 9B4, if the variation among T₁₁ to T₁₄ is the smallest, T₁₁ to T₁₄ are taken as the pitch period; if the variation among T₂₁ to T₂₄ is the smallest, T₂₁ to T₂₄ are taken as the pitch period; and so on. In addition, the determination method based on the zero cross interval data disclosed in the above embodiment and the determination method based on the variation of the pitch period may be used together, so that the match of the zero cross interval data and the variation in pitch period length are comprehensively evaluated to select the pitch period.
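The variation-based selection of modified example (1) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the variation is measured as the standard deviation divided by the mean (the text does not specify a measure), and ties are resolved in favor of the smaller n.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def assumed_periods(intervals: Sequence[float], n: int, m: int) -> List[float]:
    """Sum the most recent m groups of 2n zero cross intervals (oldest group first)."""
    size = 2 * n
    if len(intervals) < size * m:
        raise ValueError("not enough zero cross intervals")
    recent = intervals[-size * m:]
    return [sum(recent[k * size:(k + 1) * size]) for k in range(m)]


def select_n_by_variation(intervals: Sequence[float],
                          n_values: Sequence[int] = (1, 2, 3, 4),
                          m: int = 4) -> Tuple[int, float]:
    """Return (n, pitch period) for the n whose assumed pitch periods vary least."""
    best = None
    for n in n_values:
        periods = assumed_periods(intervals, n, m)
        avg = mean(periods)
        spread = pstdev(periods) / avg  # relative variation (assumed measure)
        # Strict comparison: on a tie the smaller n, tested first, is kept.
        if best is None or spread < best[0]:
            best = (spread, n, avg)
    return best[1], best[2]


# Made-up waveform with four zero cross intervals (3, 1, 2, 4) per pitch period.
intervals = [3, 1, 2, 4] * 8
print(select_n_by_variation(intervals))  # (2, 10)
```

For this made-up input, n = 2 and n = 4 both give zero variation because two pitch periods also repeat exactly; the tie rule keeps n = 2, which corresponds to the true period of 10.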
[0067]
(2) In the above embodiment, the width Δ of the masking band of the binarization unit 8 is fixed. However, since the amplitude of the minute up-and-down movement of the speech waveform near the zero level depends on the amplitude of the speech waveform as a whole, it may be difficult to determine an appropriate Δ. It is therefore preferable to detect the amplitude of the digital audio signal or the digital model signal, multiply the amplitude value by a predetermined coefficient, and use the result as Δ to control the masking band width of the binarization unit 8.
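A minimal sketch of modified example (2) is given below. Several points are assumptions made only for illustration: the amplitude is taken as the peak absolute value of the block of samples, the masking band is treated as symmetric about the zero level with half-width Δ, and the binary output holds its previous value while the signal is inside the band. The function name and the coefficient value are hypothetical.

```python
from typing import List, Sequence


def binarize_adaptive(samples: Sequence[float], coeff: float = 0.1) -> List[int]:
    """Binarize with a masking band whose half-width delta tracks the amplitude."""
    if not samples:
        return []
    # delta = predetermined coefficient x detected amplitude (peak value assumed).
    delta = coeff * max(abs(s) for s in samples)
    out: List[int] = []
    state = 0
    for s in samples:
        if s > delta:
            state = 1
        elif s < -delta:
            state = 0
        # Inside the band (-delta .. +delta) the previous state is held, so
        # minute movements near the zero level do not flip the output.
        out.append(state)
    return out


samples = [0.0, 1.0, 0.05, -0.05, 0.05, -1.0, -0.05, 0.05, 1.0]
print(binarize_adaptive(samples))  # [0, 1, 1, 1, 1, 0, 0, 0, 1]
```

Because Δ scales with the detected amplitude, the same coefficient masks the small fluctuations of both loud and quiet input in this sketch.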
[0068]
[Effect of the invention]
As described above, according to the present invention, a digital audio signal is oversampled, the oversampled digital audio signal is converted into a binary signal, and the successive zero cross intervals of the audio waveform are measured based on the binary signal. Then, for each n, the pitch period is assumed to be the sum of 2n pieces of zero cross interval data, the degree of coincidence of the waveform between pitch periods is obtained, and the pitch period is calculated by adopting the assumption with the best coincidence. Consequently, the zero cross intervals can be measured accurately, and even when the speech waveform is a complex waveform containing harmonic components, the pitch period can be obtained accurately and quickly with an inexpensive configuration.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of a 4-times oversampling unit in the same embodiment.
FIG. 3 is a block diagram illustrating a configuration of a binarization unit in the embodiment.
FIG. 4 is a block diagram illustrating a configuration of a binarization unit in the embodiment.
FIG. 5 is a block diagram showing a timer, a RAM, and a control system thereof in the same embodiment.
FIG. 6 is a diagram showing an operation of a 4-times oversampling unit in the same embodiment.
FIG. 7 is a diagram showing an operation of a binarization unit in the same embodiment.
FIG. 8 is a diagram showing an operation of a write control unit in the same embodiment.
FIG. 9 is a diagram illustrating an outline of a pitch cycle calculation process in the embodiment.
FIG. 10 is a flowchart showing a pitch period calculation process in the embodiment.
FIG. 11 is a flowchart showing a pitch period calculation process in the embodiment.
FIG. 12 is a flowchart showing a pitch period calculation process in the embodiment.
FIG. 13 is a diagram showing a performance evaluation result of the embodiment.
FIG. 14 is a diagram showing a performance evaluation result of the embodiment.
FIG. 15 is a diagram showing a performance evaluation result of the embodiment.
FIG. 16 is a diagram showing a performance evaluation result of the embodiment.
[Explanation of symbols]
1 ... CD, 2 ... Vocal extractor, 3 ... Microphone,
4 ... A/D converter, 5 ... DC removal unit, 6 ... LPF,
7 ... 4-times oversampling unit, 8 ... Binarization unit,
9 ... Timer (zero cross interval measuring means),
10 ... RAM (zero cross interval measuring means),
11 ... Pitch calculation unit (pitch calculation means),
12 ... Level detection unit, 13 ... Scoring unit, 14 ... Display unit.

Claims

Oversampling means for outputting a digital audio signal with a sampling frequency multiplied by a predetermined number;
Binarizing means for comparing the digital audio signal output by the oversampling means with a predetermined level and converting it to a binary signal;
Zero cross interval measuring means for measuring successive zero cross intervals t₁, t₂, ... of the digital audio signal based on the binary signal;
And pitch calculation means for varying n (n being an integer of 1 or more) and, for each n, assuming that the sum T = (t₁ + t₂ + ... + t₂ₙ) of 2n zero cross intervals is the pitch period, calculating, as the degree of coincidence of the waveform of the digital audio signal, the degree of coincidence between the corresponding zero cross intervals included in each of m adjacent assumed pitch periods (m being an integer of 2 or more), and obtaining the pitch period by selecting the n having the highest degree of coincidence.