JP4302837B2

JP4302837B2 - Audio signal processing apparatus and audio signal processing method

Info

Publication number: JP4302837B2
Application number: JP30027299A
Authority: JP
Inventors: 竜児中川; ケイノペドロ; ロスコスアレックス
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1999-10-21
Filing date: 1999-10-21
Publication date: 2009-07-29
Anticipated expiration: 2019-10-21
Also published as: JP2001117580A

Description

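[0001]
[Technical Field of the Invention]
The present invention relates to an audio signal processing apparatus and an audio signal processing method that associate an input sound with a pre-stored note sequence in time series.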
[0002]
[Prior Art]
Methods have conventionally been proposed for tracking the current position in an instrumental score by detecting the pitch and sounding time of the instrument sound (for example, the methods described in "An On-Line Algorithm for Real-time Accompaniment" (R. Dannenberg, Proceedings of the ICMC 1984) and "The Synthetic performer in the Context of Live Musical Performance" (B. Vercoe, Proceedings of the ICMC 1984)).
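[0003]
[Problems to Be Solved by the Invention]
However, an actual performance or singing voice does not always proceed exactly as written in the score. Uncertain factors such as subtle deviations and fluctuations in tempo, timing, and pitch have an adverse effect, so the conventional methods described above sometimes cannot accurately detect where in the score the instrument sound or singing voice is.
[0004]
The present invention has been made in view of the above circumstances, and its object is to provide an audio signal processing apparatus and an audio signal processing method that can more accurately detect the position in the score of an input sound such as a singing voice or an instrument sound.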
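[0005]
[Means for Solving the Problems]
To solve the above problem, the audio signal processing apparatus according to claim 1 of the present invention is an audio signal processing apparatus that associates an input sound with one of the notes of a note sequence, comprising: note sequence storage means for storing note sequence information described in time series; parameter acquisition means for acquiring, from an audio signal input in units of frames, feature parameters including at least a parameter representing pitch; recognition note information storage means for storing a codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors, together with, for each note, the number of states, the state transition probabilities, and the observation probabilities of the symbols; observation probability acquisition means for, by referring to the recognition note information storage means, acquiring an observation symbol of the input sound from the feature parameters acquired by the parameter acquisition means and acquiring the observation probability of that observation symbol; state forming means for forming each state of the note sequence stored in the note sequence information storage means as a hidden Markov model on a finite state network, based on the number of states and the state transition probabilities stored in the recognition note information storage means; state transition determination means for determining state transitions according to the observation probabilities acquired by the observation probability acquisition means and the hidden Markov model formed by the state forming means; and association means for associating each frame of the input audio signal with the note sequence information based on the state transitions determined by the state transition determination means, wherein the state forming means forms three types of left-to-right hidden Markov models, for pitched sound, unpitched sound, and silence, forming the pitched sound and the unpitched sound as three-state models and the silence as a one-state model.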
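[0006]
The audio signal processing apparatus according to claim 2 is the apparatus according to claim 1, further comprising display means for displaying, based on the result of the association by the association means, which part of the note sequence information the current input sound corresponds to.
[0007]
The audio signal processing apparatus according to claim 3 is the apparatus according to claim 1 or 2, wherein the parameter acquisition means acquires at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error as feature parameters from the input audio signal.
[0008]
The audio signal processing apparatus according to claim 4 is the apparatus according to claim 3, wherein the observation probabilities of the five parameters energy, delta energy, zero crossings, delta pitch, and pitch error stored in the recognition note information storage means are calculated using an observation function based on a Gaussian distribution,
and the observation probability of the pitch stored in the recognition note information storage means is calculated using an observation function based on a Gaussian distribution and a step observation probability function, the Gaussian observation function and the step observation function being used selectively according to the presence or absence of pitch when the pitch observation probability is calculated.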
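[0010]
The audio signal processing apparatus according to claim 5 is the apparatus according to any one of claims 1 to 4, wherein, when forming the hidden Markov model of the pitched sound, the state forming means forms a note connected to the preceding note by a slur and a single note as separate models.
[0011]
The audio signal processing apparatus according to claim 6 is the apparatus according to any one of claims 1 to 5, further comprising: input means for inputting training musical sound waveform data and training note sequence data obtained by transcribing that waveform into notes; training model forming means for forming a hidden Markov model on a finite network for each note of the training note sequence data input from the input means; and parameter estimation means for estimating, during training, the parameters that maximize the likelihood of the model formed by the training model forming means by a k-means algorithm, wherein the recognition note information storage means stores the state transition probabilities and observation probabilities of the feature vectors for each note obtained from the parameters estimated by the parameter estimation means.
[0012]
The audio signal processing apparatus according to claim 7 is the apparatus according to any one of claims 1 to 6, wherein the state transition determination means determines state transitions by the Viterbi algorithm.
[0013]
The audio signal processing apparatus according to claim 8 is the apparatus according to claim 7, wherein the note sequence storage means stores duration data corresponding to the note sequence, and the state transition determination means determines state transitions by a Viterbi algorithm that uses the duration data stored in the note sequence storage means.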
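[0014]
The audio signal processing method according to claim 9 is an audio signal processing method that associates an input sound with one of the notes of a pre-stored note sequence, comprising: a parameter acquisition step of acquiring, from an audio signal input in units of frames, feature parameters including at least a parameter representing pitch; an observation probability acquisition step of, by referring to a pre-stored codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors and to the number of states, the state transition probabilities, and the observation probabilities of the symbols for each note, acquiring an observation symbol of the input sound from the feature parameters acquired in the parameter acquisition step and acquiring the observation probability of that observation symbol; a state forming step of forming each state of the pre-stored note sequence as a hidden Markov model on a finite state network, based on the pre-stored number of states and state transition probabilities; a state transition determination step of determining state transitions according to the observation probabilities acquired in the observation probability acquisition step and the hidden Markov model formed in the state forming step; and an association step of associating each frame of the input audio signal with the note sequence information based on the state transitions determined in the state transition determination step, wherein the state forming step forms three types of left-to-right hidden Markov models, for pitched sound, unpitched sound, and silence, forming the pitched sound and the unpitched sound as three-state models and the silence as a one-state model.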
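[0015]
The audio signal processing method according to claim 10 is the method according to claim 9, further comprising a display step of displaying, based on the result of the association in the association step, which part of the note sequence information the current input sound corresponds to.
[0016]
The audio signal processing method according to claim 11 is the method according to claim 9 or 10, wherein the parameter acquisition step acquires at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error as feature parameters from the input audio signal.
[0017]
The audio signal processing method according to claim 12 is the method according to claim 11, further comprising: a first observation probability calculation step of calculating and storing the observation probabilities of the five parameters energy, delta energy, zero crossings, delta pitch, and pitch error using an observation function based on a Gaussian distribution; and a second observation probability calculation step of calculating and storing the observation probability of the pitch using an observation function based on a Gaussian distribution and a step observation probability function, selecting between the Gaussian observation function and the step observation function according to the presence or absence of pitch, wherein the observation probability acquisition step acquires observation probabilities by referring to the observation probabilities stored in the first and second observation probability calculation steps.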
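[0019]
The audio signal processing method according to claim 13 is the method according to any one of claims 9 to 12, wherein, when the hidden Markov model of the pitched sound is formed in the state forming step, a note connected to the preceding note by a slur and a single note are formed as separate models.
[0020]
The audio signal processing method according to claim 14 is the method according to any one of claims 9 to 13, comprising: an input step of inputting training musical sound waveform data and training note sequence data obtained by transcribing that waveform into notes; a training model forming step of forming a hidden Markov model on a finite network for each note of the training note sequence data input in the input step; a parameter estimation step of estimating, during training, the parameters that maximize the likelihood of the model formed by the training model forming means by a k-means algorithm; and a probability storage step of storing the state transition probabilities and observation probabilities of the feature vectors for each note obtained from the parameters estimated in the parameter estimation step, wherein the observation probability acquisition step acquires observation probabilities by referring to the observation probabilities stored in the probability storage step, and the state forming step forms each state of the pre-stored note sequence as a hidden Markov model on a finite state network based on the state transition probabilities stored in the probability storage step.
[0021]
The audio signal processing method according to claim 15 is the method according to any one of claims 9 to 14, wherein the state transition determination step determines state transitions by the Viterbi algorithm.
[0022]
The audio signal processing method according to claim 16 is the method according to claim 15, wherein the state transition determination step determines state transitions by a Viterbi algorithm that uses duration data corresponding to the pre-stored note sequence.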
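[0023]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
A. Configuration of the embodiment
A-1. Overall configuration
FIG. 1 shows the configuration of an audio signal processing apparatus according to an embodiment of the present invention. In the figure, reference numeral 1 denotes a microphone, which picks up sound such as a singer's voice or an instrument sound and outputs it to the input audio signal extraction unit 3 as the input audio signal Sv. Reference numeral 2 denotes an analysis window generation unit, which generates an analysis window AW (for example, a Hamming window) whose period is a fixed multiple of the pitch period detected in the previous frame and outputs it to the input audio signal extraction unit 3. In the initial state, or when the previous frame was silent, an analysis window with a preset fixed period is output to the input audio signal extraction unit 3 as the analysis window AW.
[0024]
The input audio signal extraction unit 3 multiplies the input audio signal Sv by the analysis window AW, cuts out the signal in units of frames, and outputs each frame to the fast Fourier transform unit 4 as the frame audio signal FSv. The fast Fourier transform unit 4 obtains a frequency spectrum from the frame audio signal FSv and outputs it to the feature parameter analysis unit 5.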
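As a rough illustration of the windowing and spectrum steps described above, the following Python sketch picks a window length from the pitch of the previous frame, applies a Hamming window, and computes a magnitude spectrum. The function names, the multiple of four pitch periods, the 1024-sample default, and the naive DFT that stands in for a fast Fourier transform are all assumptions made for illustration; the text does not specify them.

```python
import cmath
import math


def analysis_window_length(prev_f0_hz, sample_rate, periods=4, default_len=1024):
    """Window length in samples: a fixed multiple of the previous pitch period,
    or a preset length when no pitch was detected (prev_f0_hz <= 0)."""
    if prev_f0_hz <= 0:
        return default_len
    return max(2, round(periods * sample_rate / prev_f0_hz))


def hamming(n):
    """Symmetric Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_spectrum(signal, start, length):
    """Window one frame and return its DFT magnitudes for bins 0..length/2.
    A naive O(n^2) DFT is used here only to keep the sketch dependency-free."""
    frame = signal[start:start + length]
    frame = frame + [0.0] * (length - len(frame))  # zero-pad a short last frame
    windowed = [s * w for s, w in zip(frame, hamming(length))]
    return [
        abs(sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / length)
                for n in range(length)))
        for k in range(length // 2 + 1)
    ]
```

With a sample rate of 8000 Hz and a previous pitch of 200 Hz, this sketch would choose a 160-sample window, so a 200 Hz tone falls on bin 4 of the spectrum.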
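[0025]
The feature parameter analysis unit 5 extracts feature parameters that characterize the spectral properties of the input sound and outputs them to the symbol quantization unit 7. This embodiment uses six kinds of feature vectors, described later, as feature parameters: energy, delta energy, zero-crossing rate, pitch frequency, delta pitch, and total error. As described in detail later, the recognition note information storage unit 6 comprises a codebook 6a and a note dictionary 6b, which stores, for each note, the number of states of the feature vectors and probability data indicating the state transition probabilities and symbol observation probabilities.
[0026]
The symbol quantization unit 7 refers to the codebook 6a stored in the recognition note information storage unit 6, selects the feature symbol with the maximum likelihood for the frame, and outputs the observation probability of the selected feature symbol to the note sequence state forming unit 8.
[0027]
The note sequence state forming unit 8 refers to the recognition note information storage unit 6 and to the note sequence information storage unit 11, described later, and forms the states of the note sequence described in the note sequence information storage unit 11 as hidden Markov models (HMMs). The state transition determination unit 9 determines state transitions according to the Viterbi algorithm, described later, using the frame-by-frame observation probabilities of the feature symbols obtained from the input sound and output by the symbol quantization unit 7. This identifies the note of the input sound at each frame time.
[0028]
The matching unit 10 uses the note identified for each frame of the input sound to determine which position in the note sequence stored in the note sequence information storage unit 11 each frame time corresponds to, and thereby associates the input sound with the note sequence. The display device 12 displays the result of this association.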
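[0029]
A-2. Recognition note information storage unit
Next, the codebook 6a and the note dictionary 6b stored in the recognition note information storage unit 6 are described. In the codebook 6a, representative feature parameters of audio signals are clustered, as feature vectors, into a predetermined number of symbols.
[0030]
A-2-1. Feature vectors
Before the codebook 6a is described, the six kinds of feature vectors used in this embodiment are described.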
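[0031]
(1) Energy
Energy is a coefficient representing the intensity of the sound and is parameterized into a feature vector by the following equation.
[Equation 1]

(2) Delta energy
Delta energy is a coefficient representing the intensity of the sound as a difference and is parameterized into a feature vector by the following equation.
[Equation 2]

(3) Zero-crossing rate
The zero-crossing rate is lower the more voiced the sound is, and is parameterized into a feature vector by the following equation.
[Equation 3]

(4) Pitch frequency
The pitch frequency can be obtained by the Two-Way Mismatch method described in "Fundamental Frequency Estimation in the SMS Analysis" (P. Cano, DAFX Proceedings 1998).
[0032]
(5) Delta pitch
Delta pitch is parameterized into a feature vector by the following equation.
[Equation 4]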
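The equations for these features are not reproduced in this text, so the following Python sketch falls back on common textbook definitions as a stand-in: log mean-square energy, the fraction of sign changes for the zero-crossing rate, and a first difference between consecutive frames for the delta features. These definitions and all function names are assumptions, not the patent's own formulas.

```python
import math


def frame_energy_db(frame, floor=1e-12):
    """Log energy of one frame (assumed: 10*log10 of the mean square)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, floor))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def delta(values):
    """First difference between consecutive frames; the first frame gets 0.
    Applies equally to a per-frame energy track or a per-frame pitch track."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]
```

Under these definitions a strongly voiced frame, whose waveform changes sign only a few times per pitch period, yields a low zero-crossing rate, which matches the behavior the text attributes to voiced sound.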

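(6) Total error
The total error indicates how voiced the sound is by computing the mismatch in two directions: the error of the predicted pitch against the measured pitch, and the error of the measured pitch against the predicted pitch.
First, the pitch error from the predicted pitch (p) to the measured pitch (m) is given by the following equation.
[Equation 5]

In the above equation, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum amplitude.
Meanwhile, the pitch error from the measured pitch (m) to the predicted pitch (p) is given by the following equation.
[Equation 6]

In the above equation, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum amplitude.
[0033]
The total error is therefore given by the following equation.
[Equation 7]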

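A-2-2. Note dictionary
The note dictionary 6b stores a left-to-right hidden Markov model for each note. Three kinds of models are provided according to the physical characteristics of the input sound: pitched sound, unpitched sound, and silence. For pitched sound, a model, that is, a number of states and state transition probabilities, is stored for each pitch (C0, C#0, D0, D#0, E0, F0, F#0, G0, G#0, ...) (see FIGS. 2 and 3). As described later, the observation probability values for the symbols of each feature vector, calculated in advance according to a Gaussian distribution or the like, are also stored. For pitched sound the number of states is three; the states approximately represent the onset (attack), the steady state (steady), and the release of the sound.
[0034]
In this embodiment, one model is provided for each unpitched sound, such as plosives (/p/, /b/, ...) and fricatives (/s/, /sh/, ...). The unpitched-sound models also have three states (attack, steady, release), which makes it possible to capture the fine nuances of breath sounds, plosives, fricatives, and the like. For silence (SILENCE), the number of states is one. For pitched sound, providing a model for a note connected to the preceding note by a slur that differs from the model for a single note allows the input sound to be associated with notes more accurately.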
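One way to picture the note dictionary is as a table of small left-to-right models, three states for pitched and unpitched sounds and one state for silence. The Python sketch below builds such a table; the `NoteModel` class, the `left_to_right` helper, and the self-transition value 0.8 are hypothetical placeholders, since the stored probabilities in the described apparatus come from FIGS. 2 and 3 or from training.

```python
from dataclasses import dataclass


@dataclass
class NoteModel:
    name: str
    states: list  # state labels, in left-to-right order
    trans: list   # trans[i][j] = P(state j at t+1 | state i at t)


def left_to_right(name, labels, stay=0.8):
    """Build a left-to-right model: each state either stays or moves one step right.
    The last state only loops; leaving the note is handled by the note network."""
    n = len(labels)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i + 1 < n:
            trans[i][i] = stay
            trans[i][i + 1] = 1.0 - stay
        else:
            trans[i][i] = 1.0
    return NoteModel(name, list(labels), trans)


ASR = ["attack", "steady", "release"]
note_dictionary = {
    "E3": left_to_right("E3", ASR),                    # pitched sound: 3 states
    "/s/": left_to_right("/s/", ASR),                  # unpitched sound: 3 states
    "SILENCE": left_to_right("SILENCE", ["silence"]),  # silence: 1 state
}
```

Because every entry below the diagonal of `trans` is zero, a model can never move back from steady to attack, which is the defining property of the left-to-right topology.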
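A-2-3. Codebook
The codebook 6a is generated from five of the six kinds of feature vectors, namely (1) energy, (2) delta energy, (3) zero-crossing rate, (5) delta pitch, and (6) total error, and is divided into clusters. Each cluster contains a set of typical vectors representing each symbol (see FIG. 4). In this embodiment, the well-known LBG algorithm is used to create the codebook 6a.
The codebook 6a is created according to the continuous density function of the Gaussian distribution given by the following equation.
[Equation 8]

Here, C_jm denotes the mixture weight of component m in state j. N(j; μ, Σ_j) denotes a multidimensional Gaussian distribution with mean vector μ and covariance matrix Σ; because its dimensionality would make the number of parameters to be learned enormous, this embodiment treats N(j; μ, Σ_j) as a fixed function.
[0035]
To create a codebook 6a that follows such a Gaussian distribution, the observation sequence y_t~_l of quantized vectors N_l in a frame becomes the subset M_l. Here, M_l is the number of mixture components of the codebook being created. When the codebook 6a is created, the parameters are estimated by the following operations.
[Equation 9]

[Equation 10]

The mixture weight C_jm is used when the quantized vector N_lm is mixed into the m-th component of the codebook, and is given by the following equation.
[Equation 11]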

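Next, in this embodiment, the function used to calculate the observation probability of (4) pitch frequency, the one feature vector outside the five above, is split into two cases: pitched sound, and unpitched sound or silence. FIG. 5(a) shows the step function for calculating the observation probability b(y) for pitched sound, and FIG. 5(b) shows the step function for calculating the observation probability b(y) for unpitched sound or silence. As the figure shows, for pitched sound, when the pitch F₀ = 0, that is, when no pitch is detected, the calculated observation probability is a constant. When a pitch is detected, that is, when F₀ ≠ 0, the function is replaced by a continuous density function following a Gaussian distribution, which calculates the observation probability b'(y) (see the graph at the upper right of FIG. 5(a)). The pitch deviation F₀Dev used in this calculation is given by the following equation.
[Equation 12]

In the above equation, F_Perfect0 denotes the pitch frequency that should ideally be played. For example, with equal temperament and A4 = 440 Hz, the values of F_Perfect0 used in calculating the observation probability for each pitch are as follows.
C3 ......... 261.626 Hz
D3 ......... 293.66 Hz
E3 ......... 329.628 Hz
In this way, the observation probabilities for the six kinds of feature vectors are calculated according to the continuous density function of the Gaussian distribution and related functions and are stored in the note dictionary 6b, and the codebook 6a is likewise created according to this Gaussian distribution.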
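The switch between a constant and a Gaussian for the pitch observation can be sketched as follows. Because the deviation formula itself is not reproduced in this text, the sketch assumes the deviation is measured in cents from the target frequency; the constants `p_const`, `p_match`, `p_mismatch`, and `sigma_cents` are hypothetical values chosen only to make the example concrete.

```python
import math


def equal_tempered_hz(semitones_from_a, a_hz=440.0):
    """Equal-tempered frequency a given number of semitones away from the reference A."""
    return a_hz * 2.0 ** (semitones_from_a / 12.0)


def pitch_observation(f0_hz, model_kind, target_hz=None,
                      p_const=0.05, p_match=0.9, p_mismatch=0.1, sigma_cents=50.0):
    """Observation value for the pitch feature.

    Pitched model: a constant when no pitch was detected (f0_hz == 0), otherwise
    a Gaussian density of the deviation from the target pitch (assumed in cents).
    Unpitched or silence model: a two-level step on whether a pitch was detected.
    """
    if model_kind == "pitched":
        if f0_hz == 0:
            return p_const
        dev_cents = 1200.0 * math.log2(f0_hz / target_hz)
        norm = sigma_cents * math.sqrt(2.0 * math.pi)
        return math.exp(-0.5 * (dev_cents / sigma_cents) ** 2) / norm
    return p_match if f0_hz == 0 else p_mismatch
```

The helper `equal_tempered_hz` gives 261.626 Hz at nine semitones below the 440 Hz reference and 329.628 Hz at five semitones below, which agrees with the listed values for the first and third entries.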
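[0036]
A-3. Note sequence information storage unit
Next, the note sequence information storage unit 11 is described. As shown in FIG. 6, the note sequence information storage unit 11 holds the note sequence of a piece of music or the like described in time series. In this embodiment, duration information is also stored for each note of the note sequence. Thus, to store the note sequence of the piece represented by the score shown in FIG. 6(a), the note sequence information and duration information shown in FIG. 6(b) are stored. The duration information is expressed as follows.
[0037]
For pitched or unpitched sound:
S1: whole note
S2: half note
S3: quarter note
S4: eighth note
and so on.
[0038]
For silence (rests):
U1: whole rest
U2: half rest
U3: quarter rest
U4: eighth rest
and so on.
[0039]
The real-time length of each duration is therefore determined by the tempo marking on the score or by the set tempo.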
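The conversion from a duration symbol to real time can be sketched in a few lines of Python. The table name `BEATS` and the function `duration_ms` are hypothetical, and the sketch assumes the tempo is given in quarter notes per minute.

```python
# Length of each duration symbol in quarter-note beats (S = sounding note, U = rest).
BEATS = {
    "S1": 4.0, "S2": 2.0, "S3": 1.0, "S4": 0.5,
    "U1": 4.0, "U2": 2.0, "U3": 1.0, "U4": 0.5,
}


def duration_ms(symbol, tempo_bpm):
    """Real-time length in milliseconds of a duration symbol at the given tempo."""
    return BEATS[symbol] * 60000.0 / tempo_bpm
```

For example, `duration_ms("S3", 120)` gives 500.0, a quarter note lasting half a second at 120 beats per minute.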
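[0040]
B. Operation of the embodiment
Next, the operation of the audio signal processing apparatus configured as above is described.
B-1. Overview of operation
First, an overview of the operation is given with reference to the flowchart in FIG. 7. When the microphone 1 generates an input audio signal, the signal is fast-Fourier-transformed in units of frames to obtain a frequency spectrum. Feature parameter analysis is then performed on the spectrum to obtain the six kinds of feature parameters described above (step S1).
[0041]
Next, the symbol quantization unit 7 symbol-quantizes the six acquired feature parameters by referring to the recognition note information storage unit 6 (step S2). The symbol quantization unit 7 then obtains the observation probabilities of the quantized symbols by referring to the note dictionary 6b.
[0042]
After that, the note sequence state forming unit 8 constructs the states of the note sequence as HMMs by referring to the note sequence information stored in the note sequence information storage unit 11 and to the note dictionary 6b (step S3). Based on the symbol observation probabilities obtained by the symbol quantization unit 7 and the HMMs formed by the note sequence state forming unit 8, the state transition determination unit 9 determines the state transitions using the Viterbi algorithm (step S4). The HMM and the Viterbi algorithm are described later. Based on the state transitions determined by the state transition determination unit 9, the matching unit 10 associates the input sound with the note sequence in time (step S5), and the result of this association is displayed on the display device 12 (step S6).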
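[0043]
B-2. Details of operation
Next, each process mentioned in the overview is described in detail.
[0044]
B-2-1. Feature parameter analysis and symbol quantization
FIG. 8 illustrates the process of obtaining feature parameters from the input audio signal generated by the microphone 1 and symbol-quantizing them. As shown in the figure, the input audio signal is converted into a frequency spectrum by a fast Fourier transform in units of frames, and feature parameter analysis is performed on this spectrum. Then, for each feature vector, the maximum-likelihood symbol is found in the codebook 6a of the recognition note information storage unit 6, and the observation probability of that symbol is obtained by referring to the note dictionary 6b.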
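The lookup described here can be sketched as a nearest-codeword search followed by a table read. Choosing the codeword with the smallest Euclidean distance is an assumption that stands in for the maximum-likelihood choice (the two agree when every cluster is modeled with the same spherical Gaussian); the function names and the dictionary layout are hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the codeword closest to the feature vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def observation_probs(symbol, note_dictionary):
    """Read b_j(symbol) for every model state from a per-state probability table."""
    return {state: probs[symbol] for state, probs in note_dictionary.items()}
```

A frame would then be processed by calling `quantize` once per feature vector and passing each resulting symbol index to `observation_probs`.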
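[0045]
B-2-2. Hidden Markov model
Next, the hidden Markov model (HMM) is described with reference to FIG. 9. Because the state of a sound transitions in one direction, this embodiment uses a left-to-right model as mentioned above.
[0046]
The probability of a transition from state i to state j at time t (the state transition probability) is written a_ij. In the example in FIG. 9, the probability of remaining in state (1) is written a₁₁ and the probability of a transition from state (1) to state (2) is written a₁₂. As described above, these state transition probabilities are stored in the note dictionary 6b for each note (see FIG. 3). In this embodiment, for pitched and unpitched sounds, state (1) represents attack, state (2) steady, and state (3) release.
[0047]
Each state has its own feature vectors, each with different observation symbols. These are written Y = {y₁, y₂, ..., y_t}.
The probability of emitting symbol y_t of a feature vector when the state is j at time t (the symbol observation probability) is written b_j(y_t).
Let M be the model shown in the figure. When the observation symbol sequence Y passes through the state sequence X of states (1), (2), and (3), the probability that X and Y occur together is given by the following equation.
[Equation 13]

In this embodiment, an FSN (finite state network) like that shown in FIG. 9 is formed note by note based on the note sequence information stored in the note sequence information storage unit 11. For example, when the information shown in FIG. 6(b) is stored, hidden Markov models are formed based on the number of states and the state transition probabilities of the pitches "E3", "G3", "silence", and so on stored in the note dictionary 6b.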
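[0048]
B-2-3. State transition determination
In this embodiment, the state transitions of the input sound are determined by the Viterbi algorithm from the hidden Markov models formed as above from the note sequence information stored in the note sequence information storage unit 11, the feature symbols obtained by the symbol quantization unit 7, and the observation probabilities of those symbols. An outline follows.
The Viterbi algorithm derives the most likely state sequence when the model M outputs the observation symbol sequence y.
Let Φ_j(t) be the highest probability of having observed the vectors y₁ to y_t and being in state j at time t. The partial likelihood is then given by the following recursion.
[Equation 14]

In the above equation, a_ij is the state transition probability from state i to state j (that is, it is determined by the note sequence stored in the note sequence information storage unit 11), and b_j(y_t) is the symbol observation probability of each feature vector at time t, which is determined from the feature vectors of the input sound and the recognition note information storage unit 6 and related data described above.
[0049]
With Φ₁(1) = 1 and Φ_j(1) = a_1j b_j(y₁) for 1 < j < N, the maximum likelihood probability P'(Y|M) is given by the following equation.
[Equation 15]

By recursively finding the maximum probability of reaching state j in this way, the optimal path, that is, the state sequence with the highest probability, is derived and the state transitions are determined.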
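The recursion outlined above can be sketched as a textbook Viterbi decoder. The version below works in the log domain to avoid underflow on long inputs and returns one state index per frame; the function name, the argument layout, and the use of an explicit initial distribution are assumptions, and the duration term used in the embodiment is not included.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Logarithm that maps probability 0 to minus infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(init, trans, obs):
    """Most probable state sequence.

    init[j]     : probability of starting in state j
    trans[i][j] : state transition probability a_ij
    obs[t][j]   : observation probability b_j(y_t) for frame t
    """
    n = len(init)
    score = [_log(init[j]) + _log(obs[0][j]) for j in range(n)]
    back = []
    for t in range(1, len(obs)):
        prev = score
        score, ptr = [], []
        for j in range(n):
            # Best predecessor i for state j at frame t.
            best_i = max(range(n), key=lambda i: prev[i] + _log(trans[i][j]))
            ptr.append(best_i)
            score.append(prev[best_i] + _log(trans[best_i][j]) + _log(obs[t][j]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

In a score follower the states of all note models would be chained into one long left-to-right network, so the returned path directly gives the note and sub-state active in each frame.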
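[0050]
In addition, in this embodiment the note sequence information storage unit 11 stores duration information for each note of the note sequence, and the following algorithm, which incorporates this duration information into the Viterbi algorithm for computing the optimal path, is used.
[0051]
This algorithm keeps track of the elapsed duration for each note n at time t and introduces a penalty function P that is applied on the transition from the state at time t to state j at time t+1.
[Equation 16]

Details of introducing such a penalty function P are given in "Robust Parametric Modeling of Durations in HMMs" (D. Burshtein, ICASSP Proceedings 1995).
[0052]
In the penalty function P, ΔD(t) denotes the difference between the actual duration of the sound and the duration with which it should ideally be played; that is, ΔD(t) = D(t) − Dn. Here D(t) is the actual duration of the sound and Dn is the duration with which the sound should ideally be played, which is represented in the note sequence by a duration symbol (S1, S2, and so on; see FIG. 6(b)). As mentioned above, the real-time length of a duration is determined by the tempo marking on the score or by the set tempo. For example, in the score shown in FIG. 6, when the duration information is "S3", that is, a quarter note, Dn = 60/120 s = 500 msec.
In the penalty function, l(u) = log p(u).
Here p(u) is the probability density of ΔD, modeled by a Gaussian mixture density.
[0053]
To incorporate the penalty function P into the Viterbi algorithm, the logarithms of the model parameters are taken, and the recursion becomes as follows.
[Equation 17]

This equation makes it possible to determine the optimal path with sound durations taken into account.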
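The log-density term l(u) = log p(u) can be sketched on its own. In the Python fragment below the Gaussian mixture is passed in as a list of (weight, mean, standard deviation) triples in milliseconds; the function name and this parameter layout are hypothetical, and how the term enters the recursion is left to the equation above, which is not reproduced in this text.

```python
import math


def log_duration_penalty(actual_ms, nominal_ms, mixture):
    """l(u) = log p(u) for u = actual - nominal, with p a Gaussian mixture density.

    mixture: list of (weight, mean_ms, std_ms) components whose weights sum to 1.
    """
    u = actual_ms - nominal_ms
    density = sum(
        w * math.exp(-0.5 * ((u - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in mixture
    )
    return math.log(density) if density > 0 else float("-inf")
```

With a single zero-mean component, a note held for exactly its nominal length receives the largest value, and the value falls off as the held length drifts away from the score.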
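[0054]
In the example shown in FIG. 10, the probabilities calculated by the above equation are shown as ○ or △ (○ > △). For example, given the observations from time tm1 to time tm3 (tm1 and so on denote frame times), the probability of forming a path from state "Silence" to state "C₁" is higher than that of forming a path from state "Silence" to state "Silence"; it is the best probability at time tm3, and the state transition is determined as shown by the bold line in the figure. By performing this computation at each time corresponding to a frame of the input sound (tm1, tm2, tm3, ...), the transitions in the example of FIG. 10 are determined as shown by the bold line. This identifies the note of the input sound at each frame time. In the case shown in FIG. 10, times tm1 and tm2 are identified as "Silence", tm3 to tm10 as "C1", tm11 onward as "D0", and so on.
[0055]
The notes of the input sound identified at frame times in this way can be associated with the notes that make up the note sequence stored in the note sequence information storage unit 11, and the result of associating the input sound with each note of the note sequence can be displayed on the display device 12. The display method is arbitrary: as shown in FIG. 10, the score may be displayed with an arrow or the like pointing to the current position of the input sound, or the note corresponding to the current input sound may be displayed in a different color from the other notes to show where in the note sequence the current input sound is.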
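[0056]
C. Modifications
In the embodiment described above, transition probabilities, symbol observation probabilities, and so on calculated in advance are written in the note dictionary 6b of the recognition note information storage unit 6. Alternatively, training data (sets of waveform data of musical sounds or singing voices and data indicating the corresponding note sequences) may be input as needed, and these parameters estimated and rewritten. In this case, the data describing the note sequence of the training data is used to generate, for each note, a hidden Markov model expanded into an FSN. The parameters of each note model are then estimated so as to maximize the likelihood of the input training data.
Here, a method of estimating the parameters using the well-known K-means method is briefly described.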
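[0057]
(1) Initialization
First, the waveform data of the training data is divided note by note according to the note sequence notation.
[0058]
(2) Estimation
Next, the time taken by each transition is counted and divided by the transition time of the state to calculate the transition probability. That is, the transition probability is calculated by the following equation.
[Equation 18]

In this process, counters must be maintained to track each state transition and output symbol during the training time.
[0059]
The mixture weights in the continuous density function of the Gaussian distribution used for the observation probabilities of the five kinds of feature vectors other than pitch frequency are estimated for each state i by the following equation.
[Equation 19]

For the remaining feature vector, pitch frequency, the following probability function is used.
[Equation 20]

As described above, for pitch frequency in the pitched case, a continuous density function of the Gaussian distribution is used as for the other five kinds of feature vectors, so in that case it is estimated in the same way as those five.
[0060]
(3) Segmentation
The parameters estimated in step (2) are used to perform segmentation again.
[0061]
(4) Iteration
Steps (2) and (3) are repeated until convergence.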
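The counting in step (2) can be sketched as a relative-frequency estimate over segmented training data. The sketch assumes each training utterance has already been labeled with one state index per frame, and it uses the common estimate "transitions from i to j divided by all transitions out of i", which is an assumption since the equation itself is not reproduced in this text. The function name is hypothetical.

```python
from collections import Counter


def estimate_transitions(state_paths, n_states):
    """Estimate a_ij as count(i -> j) / count(i -> any) over labeled frame sequences."""
    counts = Counter()
    for path in state_paths:
        counts.update(zip(path, path[1:]))  # one count per consecutive frame pair
    trans = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        total = sum(counts[(i, j)] for j in range(n_states))
        for j in range(n_states):
            trans[i][j] = counts[(i, j)] / total if total else 0.0
    return trans
```

In the loop formed by steps (2) to (4), the labeled paths would come from re-segmenting the training data with the current parameters, and the estimate would be recomputed until the segmentation stops changing.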
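[0062]
Training in this way allows more accurate parameters to be estimated and written in the note dictionary 6b. This improves the accuracy of the state transition decisions made by referring to the note dictionary 6b, so the position of the input sound in the note sequence can be detected more accurately.
[0063]
[Effects of the Invention]
As described above, the present invention makes it possible to more accurately detect the position in the score of an input sound such as a singing voice or an instrument sound.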
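[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of an audio signal processing apparatus according to an embodiment of the present invention.
[FIG. 2] A diagram for explaining the note dictionary, a component of the audio signal processing apparatus.
[FIG. 3] A diagram for explaining the note dictionary, a component of the audio signal processing apparatus.
[FIG. 4] A diagram for explaining the codebook, a component of the audio signal processing apparatus.
[FIG. 5] A diagram for explaining the functions used to calculate the observation probabilities of the symbols described in the note dictionary.
[FIG. 6] A diagram for explaining the note sequence information storage unit, a component of the audio signal processing apparatus.
[FIG. 7] A flowchart for explaining the operation of the audio signal processing apparatus.
[FIG. 8] A diagram explaining the process of obtaining feature vectors from the input sound.
[FIG. 9] A diagram for explaining the hidden Markov model used in the audio signal processing apparatus.
[FIG. 10] A diagram for explaining the association between the input sound and the note sequence.
[FIG. 11] A diagram for explaining a display example of the result of associating the input sound with the note sequence.
[Description of Reference Numerals]
1: microphone, 2: analysis window generation unit, 4: fast Fourier transform unit, 5: feature parameter analysis unit, 6: recognition note information storage unit, 6a: codebook, 6b: note dictionary, 8: note sequence state forming unit, 9: state transition determination unit, 11: note sequence information storage unit, 12: display device
[0001]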
BACKGROUND OF THE INVENTION
The present invention relates to a sound signal processing apparatus and a sound signal processing method for associating a pre-stored musical note string and input sound in time series.
[0002]
[Prior art]
Methods have conventionally been proposed for tracking the current position in an instrumental score by detecting the pitch and sounding time of the instrument sound (for example, the methods described in "An On-Line Algorithm for Real-time Accompaniment" (R. Dannenberg, Proceedings of the ICMC 1984) and "The Synthetic performer in the Context of Live Musical Performance" (B. Vercoe, Proceedings of the ICMC 1984)).
[0003]
[Problems to be solved by the invention]
However, an actual performance or singing voice does not always proceed exactly as written in the score. Uncertain factors such as subtle deviations and fluctuations in tempo, timing, and pitch have an adverse effect, so the conventional methods described above sometimes cannot accurately detect where in the score the instrument sound or singing voice is.
[0004]
The present invention has been made in view of the above circumstances, and its object is to provide an audio signal processing apparatus and an audio signal processing method that can more accurately detect the position in the score of an input sound such as a singing voice or an instrument sound.
[0005]
[Means for Solving the Problems]
To solve the above problem, the audio signal processing apparatus according to claim 1 of the present invention is an audio signal processing apparatus that associates an input sound with one of the notes of a note sequence, comprising: note sequence storage means for storing note sequence information described in time series; parameter acquisition means for acquiring, from an audio signal input in units of frames, feature parameters including at least a parameter representing pitch; recognition note information storage means for storing a codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors, together with, for each note, the number of states, the state transition probabilities, and the observation probabilities of the symbols; observation probability acquisition means for, by referring to the recognition note information storage means, acquiring an observation symbol of the input sound from the feature parameters acquired by the parameter acquisition means and acquiring the observation probability of that observation symbol; state forming means for forming each state of the note sequence stored in the note sequence information storage means as a hidden Markov model on a finite state network, based on the number of states and the state transition probabilities stored in the recognition note information storage means; state transition determination means for determining state transitions according to the observation probabilities acquired by the observation probability acquisition means and the hidden Markov model formed by the state forming means; and association means for associating each frame of the input audio signal with the note sequence information based on the state transitions determined by the state transition determination means, wherein the state forming means forms three types of left-to-right hidden Markov models, for pitched sound, unpitched sound, and silence, forming the pitched sound and the unpitched sound as three-state models and the silence as a one-state model.
[0006]
The audio signal processing apparatus according to claim 2 is the apparatus according to claim 1, further comprising display means for displaying, based on the result of the association by the association means, which part of the note sequence information the current input sound corresponds to.
[0007]
The audio signal processing apparatus according to claim 3 is the apparatus according to claim 1 or 2, wherein the parameter acquisition means acquires at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error as feature parameters from the input audio signal.
[0008]
The audio signal processing apparatus according to claim 4 is the apparatus according to claim 3, wherein the observation probabilities of the five parameters energy, delta energy, zero crossings, delta pitch, and pitch error stored in the recognition note information storage means are calculated using an observation function based on a Gaussian distribution,
and the observation probability of the pitch stored in the recognition note information storage means is calculated using an observation function based on a Gaussian distribution and a step observation probability function, the Gaussian observation function and the step observation function being used selectively according to the presence or absence of pitch when the pitch observation probability is calculated.
[0010]

The audio signal processing apparatus according to claim 5 is the apparatus according to any one of claims 1 to 4, wherein, when forming the hidden Markov model of the pitched sound, the state forming means forms a note connected to the preceding note by a slur and a single note as separate models.
[0011]
The audio signal processing apparatus according to claim 6 is the apparatus according to any one of claims 1 to 5, further comprising: input means for inputting training musical sound waveform data and training note sequence data obtained by transcribing that waveform into notes; training model forming means for forming a hidden Markov model on a finite network for each note of the training note sequence data input from the input means; and parameter estimation means for estimating, during training, the parameters that maximize the likelihood of the model formed by the training model forming means by a k-means algorithm, wherein the recognition note information storage means stores the state transition probabilities and observation probabilities of the feature vectors for each note obtained from the parameters estimated by the parameter estimation means.
[0012]
The audio signal processing apparatus according to claim 7 is the apparatus according to any one of claims 1 to 6, wherein the state transition determination means determines state transitions by the Viterbi algorithm.
[0013]
The audio signal processing apparatus according to claim 8 is the apparatus according to claim 7, wherein the note sequence storage means stores duration data corresponding to the note sequence, and the state transition determination means determines state transitions by a Viterbi algorithm that uses the duration data stored in the note sequence storage means.
[0014]

The audio signal processing method according to claim 9 is an audio signal processing method that associates an input sound with one of the notes of a pre-stored note sequence, comprising: a parameter acquisition step of acquiring, from an audio signal input in units of frames, feature parameters including at least a parameter representing pitch; an observation probability acquisition step of, by referring to a pre-stored codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors and to the number of states, the state transition probabilities, and the observation probabilities of the symbols for each note, acquiring an observation symbol of the input sound from the feature parameters acquired in the parameter acquisition step and acquiring the observation probability of that observation symbol; a state forming step of forming each state of the pre-stored note sequence as a hidden Markov model on a finite state network, based on the pre-stored number of states and state transition probabilities; a state transition determination step of determining state transitions according to the observation probabilities acquired in the observation probability acquisition step and the hidden Markov model formed in the state forming step; and an association step of associating each frame of the input audio signal with the note sequence information based on the state transitions determined in the state transition determination step, wherein the state forming step forms three types of left-to-right hidden Markov models, for pitched sound, unpitched sound, and silence, forming the pitched sound and the unpitched sound as three-state models and the silence as a one-state model.
[0015]

The audio signal processing method according to claim 10 is the method according to claim 9, further comprising a display step of displaying, based on the result of the association in the association step, which part of the note sequence information the current input sound corresponds to.
[0016]
The audio signal processing method according to claim 11 is the method according to claim 9 or 10, wherein the parameter acquisition step acquires at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error as feature parameters from the input audio signal.
[0017]
The audio signal processing method according to claim 12 is the method according to claim 11, further comprising: a first observation probability calculation step of calculating and storing the observation probabilities of the five parameters energy, delta energy, zero crossings, delta pitch, and pitch error using an observation function based on a Gaussian distribution; and a second observation probability calculation step of calculating and storing the observation probability of the pitch using an observation function based on a Gaussian distribution and a step observation probability function, selecting between the Gaussian observation function and the step observation function according to the presence or absence of pitch, wherein the observation probability acquisition step acquires observation probabilities by referring to the observation probabilities stored in the first and second observation probability calculation steps.
[0019]

The audio signal processing method according to claim 13 is the method according to any one of claims 9 to 12, wherein, when the hidden Markov model of the pitched sound is formed in the state forming step, a note connected to the preceding note by a slur and a single note are formed as separate models.
[0020]
The audio signal processing method according to claim 14 is the method according to any one of claims 9 to 13, comprising: an input step of inputting training musical sound waveform data and training note sequence data obtained by transcribing that waveform into notes; a training model forming step of forming a hidden Markov model on a finite network for each note of the training note sequence data input in the input step; a parameter estimation step of estimating, during training, the parameters that maximize the likelihood of the model formed by the training model forming means by a k-means algorithm; and a probability storage step of storing the state transition probabilities and observation probabilities of the feature vectors for each note obtained from the parameters estimated in the parameter estimation step, wherein the observation probability acquisition step acquires observation probabilities by referring to the observation probabilities stored in the probability storage step, and the state forming step forms each state of the pre-stored note sequence as a hidden Markov model on a finite state network based on the state transition probabilities stored in the probability storage step.
[0021]
The audio signal processing method according to claim 15 is the method according to any one of claims 9 to 14, wherein the state transition determination step determines state transitions by the Viterbi algorithm.
[0022]
The audio signal processing method according to claim 16 is the method according to claim 15, wherein the state transition determination step determines state transitions by a Viterbi algorithm that uses duration data corresponding to the pre-stored note sequence.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A. Configuration of the embodiment
A-1. Overall configuration
FIG. 1 shows the configuration of an audio signal processing apparatus according to an embodiment of the present invention. In the figure, reference numeral 1 denotes a microphone, which picks up sound such as a singer's voice or an instrument sound and outputs it to the input audio signal extraction unit 3 as the input audio signal Sv. Reference numeral 2 denotes an analysis window generation unit, which generates an analysis window AW (for example, a Hamming window) whose period is a fixed multiple of the pitch period detected in the previous frame and outputs it to the input audio signal extraction unit 3. In the initial state, or when the previous frame was silent, an analysis window with a preset fixed period is output to the input audio signal extraction unit 3 as the analysis window AW.
[0024]
The input audio signal extraction unit 3 multiplies the input audio signal Sv by the analysis window AW, cuts out the signal in units of frames, and outputs each frame to the fast Fourier transform unit 4 as the frame audio signal FSv. The fast Fourier transform unit 4 obtains a frequency spectrum from the frame audio signal FSv and outputs it to the feature parameter analysis unit 5.
[0025]
The feature parameter analysis unit 5 extracts feature parameters that characterize the spectral properties of the input sound and outputs them to the symbol quantization unit 7. This embodiment uses six kinds of feature vectors, described later, as feature parameters: energy, delta energy, zero-crossing rate, pitch frequency, delta pitch, and total error. As described in detail later, the recognition note information storage unit 6 comprises a codebook 6a and a note dictionary 6b, which stores, for each note, the number of states of the feature vectors and probability data indicating the state transition probabilities and symbol observation probabilities.
[0026]
The symbol quantization unit 7 refers to the codebook 6a stored in the recognition note information storage unit 6, selects the feature symbol with the maximum likelihood for the frame, and outputs the observation probability of the selected feature symbol to the note sequence state forming unit 8.
[0027]
The note sequence state forming unit 8 refers to the recognition note information storage unit 6 and to the note sequence information storage unit 11, described later, and forms the states of the note sequence described in the note sequence information storage unit 11 as hidden Markov models (HMMs). The state transition determination unit 9 determines state transitions according to the Viterbi algorithm, described later, using the frame-by-frame observation probabilities of the feature symbols obtained from the input sound and output by the symbol quantization unit 7. This identifies the note of the input sound at each frame time.
[0028]
The matching unit 10 uses the note identified for each frame of the input sound to determine which position in the note sequence stored in the note sequence information storage unit 11 each frame time corresponds to, and thereby associates the input sound with the note sequence. The display device 12 displays the result of this association.
[0029]
A-2. Recognition note information storage unit
Next, the codebook 6a and the note dictionary 6b stored in the recognition note information storage unit 6 are described. In the codebook 6a, representative feature parameters of audio signals are clustered, as feature vectors, into a predetermined number of symbols.
[0030]
A-2-1. Feature vectors
Before the codebook 6a is described, the six kinds of feature vectors used in this embodiment are described.
[0031]
▲ 1 ▼ Energy
Energy is a coefficient representing the intensity of sound, and is parameterized into a feature vector by the following equation.
[Expression 1]

(2) Delta energy
Delta energy is a coefficient representing the intensity of sound as a difference, and is parameterized into a feature vector by the following equation.
[Expression 2]

(3) Zero cross rate
The zero cross rate tends to be lower for voiced sounds, and is parameterized into a feature vector by the following equation.
[Equation 3]

(4) Pitch frequency
The pitch frequency can be obtained by a two-way mismatch method described in “Fundamental Frequency Estimation in the SMS Analysis” (P. Cano. DAFX Proceedings 1998.).
[0032]
(5) Delta pitch
The delta pitch is parameterized into a feature vector by the following equation.
[Expression 4]

(6) Total error
The total error indicates the likelihood of a voiced sound. It is obtained from the mismatch in two directions: the error from the predicted pitch to the measured pitch, and the error from the measured pitch to the predicted pitch.
First, the pitch error between the predicted pitch (p) and the measured pitch (m) is expressed by the following equation.
[Equation 5]

In the above formula, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum value of the amplitude.
On the other hand, the pitch error from the measured pitch (m) to the predicted pitch (p) is expressed by the following equation.
[Formula 6]

In the above formula, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum value of the amplitude.
[0033]
Therefore, the total error is as follows:
[Expression 7]

A-2-2. Note dictionary
The note dictionary 6b stores a left-to-right hidden Markov model for each note. Three types of models are prepared in the note dictionary 6b according to the physical characteristics of the input speech: pitched sound, pitchless sound, and silence. For pitched sounds, each pitch (C0, C#0, D0, D#0, E0, F0, F#0, G0, G#0, ...) is modeled, that is, the number of states and the state transition probabilities are stored for each pitch (see FIGS. 2 and 3). Further, as will be described later, an observation probability value for each feature vector symbol, calculated in advance according to a Gaussian distribution or the like, is stored. In the case of a pitched sound, the number of states is 3, and the three states approximate the rise of the sound (attack), the steady state (steady), and the release state (release).
[0034]
In the present embodiment, one model is prepared for each plosive (/p/, /b/, ...) and each fricative (/s/, /sh/, ...) as a pitchless sound. The pitchless sound model also has three states (attack, steady, and release), so that fine nuances such as breath sounds, plosive bursts, and fricative sounds can be represented. In the case of silence, the number of states is set to 1. In the case of a pitched sound, if a model separate from that of a single note is prepared for a note connected to the preceding note by a slur, the input speech can be associated with the notes with higher accuracy.
A-2-3. Code book
The codebook 6a is generated by clustering five of the above six types of feature vectors: (1) energy, (2) delta energy, (3) zero cross rate, (5) delta pitch, and (6) total error. Each cluster contains a set of typical vectors representing each symbol (see FIG. 4). In the present embodiment, the known LBG algorithm is used to create the codebook 6a.
The code book 6a is created according to a continuous density function having a Gaussian distribution represented by the following equation.
[Equation 8]

Here, C_jm is the mixing weight coefficient of component m in state j, and N(j; μ, Σ_j) is the multidimensional Gaussian distribution with mean vector μ and covariance matrix Σ. However, because the multidimensionality would make the number of learning parameters enormous if used as is, N(j; μ, Σ_j) is treated as a constant function in this embodiment.
[0035]
In order to create the codebook 6a according to such a Gaussian distribution, the observation sequence y_t of the N_l quantization vectors in the frames forms M_l subsets, where M_l is the number of mixture components of the codebook to be created. When the codebook 6a is created, the parameters are estimated by the following calculations.
[Equation 9]

[Expression 10]

The mixing weight coefficient C_jm is obtained from N_lm, the number of quantization vectors belonging to mixture component m of the codebook, and is expressed by the following equation.
[Expression 11]

Next, in this embodiment, the function used to calculate the observation probability for the pitch frequency (4), the feature vector other than the above five types, differs between two cases: a sound with pitch, and a sound without pitch or silence. FIG. 5A shows the step function for calculating the observation probability b(y) in the case of a sound with pitch, and FIG. 5B shows the step function for calculating the observation probability b(y) in the case of a sound without pitch or silence. As shown in the figure, in the case of a sound with pitch, when the pitch F₀ = 0, that is, when no pitch is detected, the calculated observation probability is a constant. On the other hand, when a pitch is detected, that is, when F₀ ≠ 0, the observation probability b′(y) is calculated with a continuous density function according to a Gaussian distribution (see the graph at the upper right of FIG. 5A). The value F₀Dev used for this calculation is calculated by the following formula.
[Expression 12]

In the above formula, F_Perfect0 is the pitch frequency that should originally be played. For example, in equal temperament with A4 = 440 Hz, the values of F_Perfect0 used for calculating the observation probability at each pitch are as follows.
C3 ......... 261.626Hz
D3 ......... 293.66Hz
E3 ......... 329.628Hz
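The listed values are consistent with twelve-tone equal temperament relative to a 440 Hz reference. The following Python sketch reproduces them under that assumption. The function name `perfect_pitch_hz` and the semitone offsets are illustrative: the offsets are inferred from the listed frequencies and are not stated in the text.

```python
REFERENCE_HZ = 440.0  # reference pitch stated in the text

# Semitone offsets from the 440 Hz reference. These are inferred from the
# listed frequencies (an assumption), using the note labels of the list above.
SEMITONES_FROM_REFERENCE = {"C3": -9, "D3": -7, "E3": -5}


def perfect_pitch_hz(note_name: str) -> float:
    """Equal-temperament frequency of a note, playing the role of F_Perfect0."""
    n = SEMITONES_FROM_REFERENCE[note_name]
    return REFERENCE_HZ * 2.0 ** (n / 12.0)


for name in ("C3", "D3", "E3"):
    print(f"{name}: {perfect_pitch_hz(name):.3f} Hz")
```

Each semitone multiplies the frequency by the twelfth root of 2, so a full table for all pitches in the note dictionary could be generated the same way.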
As described above, the observation probabilities for the six types of feature vectors are calculated according to a continuous density function of a Gaussian distribution or the like and stored in the note dictionary 6b, and the codebook 6a is likewise created in a form that follows the Gaussian distribution.
[0036]
A-3. Note string information storage unit
Next, the note string information storage unit 11 will be described. As shown in FIG. 6, in the note string information storage unit 11, note strings such as music are described in time series. In the present embodiment, the duration information is stored for each note of the note string. Therefore, when storing the musical note string shown in the musical score as shown in FIG. 6A, the musical note string information and the duration information as shown in FIG. 6B are stored. Here, the duration information is expressed as follows.
[0037]
For sounds with or without pitch, the duration is expressed as follows.
S1: Whole note
S2: Half note
S3: Quarter note
S4: Eighth note
[0038]
On the other hand, in the case of silence (rest), the duration is expressed as follows.
U1: Whole rest
U2: Half rest
U3: Quarter rest
U4: Eighth rest
[0039]
The actual time of each duration is therefore determined by the tempo marking on the musical score and the set tempo.
[0040]
B. Operation of the embodiment
Next, the operation of the audio signal processing apparatus having the above configuration will be described.
B-1. Outline of operation
First, the outline of the operation of the audio signal processing apparatus will be described with reference to the flowchart shown in FIG. 7. When an input audio signal is generated by the microphone 1, the audio signal is subjected to a fast Fourier transform for each frame to obtain a frequency spectrum. Feature parameter analysis is then performed on the acquired frequency spectrum to acquire the six types of feature parameters described above (step S1).
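The frame-by-frame analysis of step S1 can be illustrated with a small Python sketch. Because Expressions 1 to 3 are not reproduced in this text, the definitions of energy, delta energy, and zero cross rate below are common textbook forms and are assumptions, as are the function names and the frame length and hop size. The fast Fourier transform and the pitch-related parameters are omitted.

```python
import math


def frames(samples, frame_len, hop):
    """Split a sample list into overlapping frames (sizes are illustrative)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def energy(frame):
    """Mean squared amplitude of a frame (assumed definition)."""
    return sum(x * x for x in frame) / len(frame)


def zero_cross_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def frame_features(samples, frame_len=256, hop=128):
    """Return (energy, delta energy, zero cross rate) for each frame."""
    feats, prev = [], None
    for fr in frames(samples, frame_len, hop):
        e = energy(fr)
        delta = 0.0 if prev is None else e - prev  # first frame has no predecessor
        feats.append((e, delta, zero_cross_rate(fr)))
        prev = e
    return feats


# A 440 Hz sine sampled at 8 kHz stands in for a pitched input.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
features = frame_features(tone)
```

For a steady tone such as this one, the energy stays close to constant from frame to frame and the zero cross rate stays low, which matches the qualitative behavior described for voiced sounds.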
[0041]
Next, by referring to the recognition note information storage unit 6, the symbol quantization unit 7 performs symbol quantization of the six types of acquired feature parameters (step S2). The symbol quantization unit 7 then refers to the note dictionary 6b to obtain the observation probability of each quantized symbol.
[0042]
Thereafter, by referring to the note string information stored in the note string information storage unit 11 and to the note dictionary 6b, the note string state forming unit 8 forms the states of the note string as an HMM (step S3). Based on the symbol observation probabilities acquired by the symbol quantization unit 7 and the HMM formed by the note sequence state forming unit 8, the state transition determining unit 9 determines the state transitions using the Viterbi algorithm (step S4). The HMM and the Viterbi algorithm will be described later. Then, based on the state transitions determined by the state transition determination unit 9, the matching unit 10 performs the temporal association between the input speech and the note sequence (step S5), and the association result is displayed on the display device 12 (step S6).
[0043]
B-2. Details of operation
Next, each process referred to in the outline operation will be described in detail.
[0044]
B-2-1. Feature parameter analysis and symbol quantization
FIG. 8 is a diagram for explaining a process of acquiring feature parameters from the input audio signal generated by the microphone 1 and performing symbol quantization. As shown in the figure, the input audio signal is converted into a frequency spectrum by fast Fourier transform in units of frames. A characteristic parameter analysis is performed on the frequency spectrum. For each feature vector, the maximum likelihood symbol is found from the code book 6a of the recognition note information storage unit 6, and the observation probability for the found symbol is obtained by referring to the note dictionary 6b.
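The quantization and lookup described above can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the maximum likelihood symbol is approximated here by the codeword at the smallest Euclidean distance (an assumption), and the codebook entries, state names, and probability values are hypothetical.

```python
import math


def quantize(feature_vec, codebook):
    """Return the index of the codeword closest to feature_vec (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(feature_vec, codebook[k]))


def observation_probs(symbol, note_dictionary):
    """Look up the observation probability of `symbol` for every stored state."""
    return {state: probs[symbol] for state, probs in note_dictionary.items()}


# Hypothetical 3-symbol codebook over (energy, zero cross rate).
codebook = [(0.0, 0.05), (0.5, 0.10), (0.1, 0.60)]
# Hypothetical observation probabilities per state, indexed by symbol.
note_dictionary = {
    "silence": [0.90, 0.05, 0.05],
    "E3/steady": [0.05, 0.85, 0.10],
}

symbol = quantize((0.45, 0.12), codebook)
probs = observation_probs(symbol, note_dictionary)
```

Because the observation probabilities are precomputed per symbol, the per-frame cost at recognition time is one nearest-codeword search plus table lookups.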
[0045]
B-2-2. Hidden Markov model
Next, the hidden Markov model (HMM) will be described with reference to FIG. 9. Because the state of a voice transitions in one direction only, the left-to-right model is used in this embodiment, as described above.
[0046]
The probability of a transition from state i to state j at time t (state transition probability) is written a_ij. In the example shown in FIG. 9, the probability of staying in state (1) is a_11, and the probability of a transition from state (1) to state (2) is a_12. These state transition probabilities are stored for each note in the note dictionary 6b as described above (see FIG. 3). In the present embodiment, for both the sound with pitch and the sound without pitch, state (1) represents the attack, state (2) the steady state, and state (3) the release.
[0047]
Each state has its own feature vectors, each with different observation symbols. The observation symbol sequence is written Y = {y_1, y_2, ..., y_t}.
When the state at time t is j, the probability that the feature vector symbol y_t is generated (symbol observation probability) is written b_j(y_t).
Here, letting M be the model shown in the figure and X be the sequence of states (1), (2), and (3) through which the observed symbol sequence Y passes, the probability that X and Y occur simultaneously can be expressed as follows.
[Formula 13]

In the present embodiment, an FSN (finite state network) as shown in FIG. 9 is formed for each note based on the note string information stored in the note string information storage unit 11. For example, when information as shown in FIG. 6B is stored, a hidden Markov model is formed based on the number of states and the state transition probabilities stored in the note dictionary 6b for "E3", "G3", "silence", and so on.
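As a rough illustration of how a note string expands into a chain of states, the Python sketch below uses three states for sounds with or without pitch and one state for silence, as stated in the text. The data layout, the names, and the topology (each state may only stay or advance to the next state) are assumptions; the transition probabilities read from the note dictionary 6b are omitted.

```python
# Number of states per model type, as described in the text.
STATES_PER_KIND = {"pitched": 3, "unpitched": 3, "silence": 1}
STATE_LABELS = ("attack", "steady", "release")


def build_state_chain(note_string):
    """Expand a note string into a left-to-right list of (note_index, note, label)."""
    chain = []
    for idx, (note, kind) in enumerate(note_string):
        n_states = STATES_PER_KIND[kind]
        for s in range(n_states):
            label = STATE_LABELS[s] if n_states == 3 else "silence"
            chain.append((idx, note, label))
    return chain


def allowed_transitions(chain):
    """Assumed left-to-right topology: stay in a state or advance to the next one."""
    n = len(chain)
    return {i: [i] + ([i + 1] if i + 1 < n else []) for i in range(n)}


# Hypothetical encoding of the start of the FIG. 6B example.
note_string = [("E3", "pitched"), ("G3", "pitched"), ("silence", "silence")]
chain = build_state_chain(note_string)
transitions = allowed_transitions(chain)
```

Keeping the note index alongside each state makes it straightforward to map a decoded state back to its position in the note string.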
[0048]
B-2-3. State transition decision
In the present embodiment, the state transitions of the input speech are determined according to the Viterbi algorithm, using the hidden Markov model formed as described above from the note sequence information stored in the note sequence information storage unit 11, the feature symbols acquired by the symbol quantization unit 7, and the observation probabilities of those symbols. The algorithm is outlined briefly below.
The Viterbi algorithm is for deriving the most likely observation state sequence when the model M outputs the observation symbol sequence y.
Let Φ_j(t) be the highest probability of being in state j at time t after the observation vectors y_1 to y_t have been observed. The partial likelihood is then expressed by the following recurrence equation.
[Expression 14]

In the above formula, a_ij is the state transition probability from state i to state j (determined according to the note string stored in the note string information storage unit 11), and b_j(y_t) is the symbol observation probability of each feature vector at time t, determined based on the feature vector of the input speech and the recognition note information storage unit 6 described above.
[0049]
With Φ_1(1) = 1 and Φ_j(1) = a_1j b_j(y_1) for 1 < j < N, the maximum likelihood probability P′(Y | M) is expressed by the following equation.
[Expression 15]

Thus, by recursively obtaining the maximum probability for the state j, the optimum path, that is, the observation state sequence having the highest probability is derived, and the state transition is determined.
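The recursive search for the optimum path can be sketched in Python as below. Since Expressions 14 and 15 are not reproduced in this text, the sketch uses the standard textbook form of the Viterbi recurrence in the log domain, and it assumes that the path starts in the first state. The function names and the example probabilities are hypothetical.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising an error."""
    return math.log(p) if p > 0 else -math.inf


def viterbi(log_a, log_b):
    """Return the most likely state path.

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log observation probability of frame t in state j.
    The path is assumed to start in state 0.
    """
    n_frames, n_states = len(log_b), len(log_a)
    phi = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    phi[0][0] = log_b[0][0]
    for t in range(1, n_frames):
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: phi[t - 1][i] + log_a[i][j])
            phi[t][j] = phi[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
            back[t][j] = best_i
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: phi[-1][j])
    path = [state]
    for t in range(n_frames - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state left-to-right model and four frames of observations.
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9]]
log_a = [[safe_log(p) for p in row] for row in a]
log_b = [[safe_log(p) for p in row] for row in b]
path = viterbi(log_a, log_b)
```

Working with log probabilities turns the products into sums, which avoids numerical underflow over long frame sequences.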
[0050]
Furthermore, in this embodiment, the duration information corresponding to each note of the note sequence is stored in the note sequence information storage unit 11, and the following algorithm is used to incorporate this duration information into the Viterbi algorithm that calculates the optimum path described above.
[0051]
This algorithm keeps track of the elapsed duration for each note n at time t, and introduces a penalty function P for the transition from the state at time t to state j at time t + 1.
[Expression 16]

Details regarding the introduction of such a penalty function P are described in “Robust Parametric Modeling of Durations in HMMs” (D. Burshtein. ICASSP Proceedings 1995).
[0052]
In the penalty function P, ΔD(t) is the difference between the actual duration of the sound and the duration of the sound that should be played, that is, ΔD(t) = D(t) − Dn. Here, D(t) is the duration of the actual sound, and Dn is the duration of the sound that should be played, which is represented by the symbol indicating the duration in the note string (S1, S2, etc.; see FIG. 6(b)). As described above, the actual time of the duration is determined by the tempo marking on the score and the set tempo. For example, in the musical score shown in FIG. 6, when the duration information is "S3", that is, a quarter note, Dn = 60/120 sec = 500 msec.
In the penalty function, l(u) = log p(u).
Here, p (u) is a probability density of ΔD, which is modeled by a Gaussian mixture density.
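The relation between the duration symbols, the tempo, and ΔD(t) can be checked with a short Python sketch. It assumes that the tempo counts quarter notes per minute, which is consistent with the 500 msec example above at a tempo of 120. The table and function names are illustrative, and the Gaussian mixture density p(u) is not modeled here.

```python
# Length of each duration symbol in quarter-note beats.
BEATS = {"S1": 4.0, "S2": 2.0, "S3": 1.0, "S4": 0.5,
         "U1": 4.0, "U2": 2.0, "U3": 1.0, "U4": 0.5}


def nominal_duration_ms(symbol, tempo_bpm):
    """Dn in milliseconds; tempo_bpm is assumed to count quarter notes per minute."""
    return BEATS[symbol] * 60_000.0 / tempo_bpm


def delta_d_ms(actual_ms, symbol, tempo_bpm):
    """Duration difference: actual duration D(t) minus nominal duration Dn."""
    return actual_ms - nominal_duration_ms(symbol, tempo_bpm)


print(nominal_duration_ms("S3", 120))  # quarter note at tempo 120
print(delta_d_ms(560.0, "S3", 120))    # a note held 60 ms too long
```

A positive difference means the sound was held longer than notated, and a negative one means it ended early.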
[0053]
To include this penalty function P in the above Viterbi algorithm, the logarithm of the model parameters is taken, and the recurrence equation is expressed as follows.
[Expression 17]

By this formula, an optimal path in consideration of the phoneme duration can be determined.
[0054]
In the example shown in FIG. 10, the probability calculated by the above equation is indicated by ◯ or Δ (◯ > Δ). For example, based on the observations from time tm1 to time tm3 (time tm1 and the like indicate times in frame units), the probability that the path from state "Silence" to state "C1" is formed is higher than the probability that the path from state "Silence" to state "Silence" is formed, and is the best probability at time tm3, so the state transition is determined as indicated by the bold line in the figure. By performing this calculation for each time (tm1, tm2, tm3, ...) corresponding to each frame of the input speech, the transitions indicated by the thick line in FIG. 10 are determined. As a result, the note of the input voice can be identified at each time in frame units. In the case illustrated in FIG. 10, times tm1 and tm2 are identified as "Silence", tm3 to tm10 as "C1", and tm11 onward as "D0".
[0055]
In this way, the note of the input voice identified at each time in frame units can be associated with each note constituting the note string stored in the note string information storage unit 11. As a result, the result of associating the input voice with each note of the note sequence can be displayed on the display device 12. As a display method for the display device 12, a score may be displayed as shown in FIG. 11, with an arrow or the like indicating the position of the current input sound on the score. Alternatively, the note corresponding to the current input voice may be displayed in a different color from the other notes to show the position of the current input voice in the note string.
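The step from per-frame note labels to note-level segments can be sketched in Python as follows. The frame labels reproduce the FIG. 10 example described above (tm1 and tm2 "Silence", tm3 to tm10 "C1", tm11 "D0"). The function name and the output layout are illustrative assumptions.

```python
from itertools import groupby


def note_segments(frame_notes):
    """Collapse a per-frame note list into (note, first_frame, last_frame) tuples."""
    segments, start = [], 0
    for note, run in groupby(frame_notes):
        length = len(list(run))
        segments.append((note, start, start + length - 1))
        start += length
    return segments


# Frames tm1..tm11 of the FIG. 10 example, renumbered from 1 for readability.
frame_notes = ["Silence"] * 2 + ["C1"] * 8 + ["D0"]
segments = [(n, a + 1, b + 1) for n, a, b in note_segments(frame_notes)]
print(segments)
```

The last segment gives the note to highlight on the display for the current input frame.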
[0056]
C. Modified example
In the above-described embodiment, the note dictionary 6b of the recognition note information storage unit 6 describes transition probabilities, symbol observation probabilities, and the like that have been calculated in advance. However, learning data (sets of musical tone or singing waveform data and data indicating the corresponding note strings) may be supplied as needed, and these parameters may be estimated and rewritten. In this case, a hidden Markov model extended to an FSN is generated for each note by using the data describing the note sequence of the learning data. The parameters of each note model are then estimated so as to maximize the likelihood of the input learning data.
Here, a method for estimating the parameters using the known k-means method will be briefly described.
[0057]
(1) Initialization
First, the waveform data of the learning data is divided for each note represented by a note string.
[0058]
(2) Estimate
Next, the transition probability is calculated by counting the number of times each transition occurs and dividing it by the number of transitions from that state. That is, the transition probability is calculated by the following equation.
[Formula 18]

In this process, it is necessary to manage the counter in order to track each transition state and output symbol during the learning time.
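The counter-based estimate can be illustrated with a short Python sketch. Expression 18 is not reproduced in this text, so the sketch assumes the usual counting estimate: the number of observed transitions from state i to state j divided by the number of transitions leaving state i, with self-transitions included. The function name and the example state paths are hypothetical.

```python
from collections import Counter


def estimate_transitions(state_paths):
    """Estimate a_ij as count(i -> j) / count(transitions leaving i)."""
    pair_counts, from_counts = Counter(), Counter()
    for path in state_paths:
        for i, j in zip(path, path[1:]):
            pair_counts[(i, j)] += 1
            from_counts[i] += 1
    return {(i, j): c / from_counts[i] for (i, j), c in pair_counts.items()}


# Hypothetical frame-by-frame state paths obtained from segmented learning data.
paths = [[0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2]]
a = estimate_transitions(paths)
```

With this estimate, the probabilities of all transitions leaving a given state sum to 1, as required for a state transition distribution.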
[0059]
Then, the mixing weight coefficient in the continuous density function of the Gaussian distribution used for the observation probabilities of five types of feature vectors other than the pitch frequency is estimated by the following equation for each state i.
[Equation 19]

The probability function for the pitch frequency, which is the remaining feature vector, uses the following.
[Expression 20]

Further, as described above, for a sound with pitch a continuous density function of a Gaussian distribution is used for the pitch frequency, in the same manner as for the other five types of feature vectors. In this case, it is estimated in the same way as for the above five types of feature vectors.
[0060]
(3) Segmentation
Segmentation is performed again using the parameters estimated in the estimation process shown in (2) above.
[0061]
(4) Repeat
Repeat (2) and (3) above until convergence.
[0062]
By performing learning in this way, more accurate parameters can be estimated and described in the note dictionary 6b. That is, it is possible to improve the accuracy of the determination of the state transition performed with reference to the note dictionary 6b, and more accurately detect the position of the input sound in the note string.
[0063]
[Effect of the invention]
As described above, according to the present invention, it is possible to more accurately detect the position of the input voice such as a singing voice or instrument sound on the musical score.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an audio signal processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a note dictionary which is a component of the audio signal processing apparatus.
FIG. 3 is a diagram for explaining a note dictionary which is a component of the audio signal processing apparatus.
FIG. 4 is a diagram for explaining a codebook that is a component of the audio signal processing apparatus.
FIG. 5 is a diagram for explaining a function for calculating an observation probability of a symbol described in the note dictionary.
FIG. 6 is a diagram for explaining a note string information storage unit that is a component of the audio signal processing apparatus.
FIG. 7 is a flowchart for explaining the operation of the audio signal processing apparatus.
FIG. 8 is a diagram illustrating a process of acquiring a feature vector from input speech.
FIG. 9 is a diagram for explaining a hidden Markov model used in the audio signal processing apparatus.
FIG. 10 is a diagram for explaining an association between an input voice and a note string.
FIG. 11 is a diagram for explaining a display example of a result of associating an input voice with a note string.
[Explanation of symbols]
1 ... Microphone, 2 ... Analysis window generation unit, 4 ... Fast Fourier transform unit, 5 ... Feature parameter analysis unit, 6 ... Recognition note information storage unit, 6a ... Codebook, 6b ... Note dictionary, 8 ... Note sequence state forming unit, 9 ... State transition determination unit, 11 ... Note string information storage unit.

Claims

An audio signal processing apparatus for associating an input voice with any note of a note string, comprising:
note string storage means for storing note string information described in time series;
parameter acquisition means for acquiring feature parameters, including at least a parameter representing a pitch, from an audio signal input in units of frames;
recognition note information storage means for storing a codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors, and, for each note, the number of states, the state transition probabilities, and the observation probabilities of the symbols;
observation probability acquisition means for acquiring, by referring to the recognition note information storage means, an observation symbol of the input voice from the feature parameters acquired by the parameter acquisition means, and for acquiring the observation probability of the observation symbol;
state forming means for forming each state of the note string stored in the note string storage means by a hidden Markov model on a finite state network, based on the number of states and the state transition probabilities stored in the recognition note information storage means;
state transition determination means for determining a state transition according to the observation probability acquired by the observation probability acquisition means and the hidden Markov model formed by the state forming means; and
association means for associating each frame of the input audio signal with the note string information based on the state transition determined by the state transition determination means,
wherein the state forming means forms three types of left-to-right hidden Markov models according to sound with pitch, sound without pitch, and silence, forming the sound with pitch and the sound without pitch as three-state models and the silence as a one-state model.

The audio signal processing apparatus according to claim 1, further comprising display means for displaying which part of the note string information the current input voice corresponds to, based on the association result of the association means.

The audio signal processing apparatus according to claim 1 or 2, wherein the parameter acquisition means acquires at least energy, delta energy, zero cross, pitch, delta pitch, and pitch error as feature parameters from the input audio signal.

The audio signal processing apparatus according to claim 3, wherein the five types of observation probabilities of energy, delta energy, zero cross, delta pitch, and pitch error stored in the recognition note information storage means are calculated using an observation function based on a Gaussian distribution, and
the observation probability of the pitch stored in the recognition note information storage means is calculated using an observation function based on a Gaussian distribution and a step observation probability function, the observation function based on the Gaussian distribution and the step observation function being used selectively according to the presence or absence of a pitch when the observation probability of the pitch is calculated.

The audio signal processing apparatus according to any one of claims 1 to 4, wherein the state forming means, when forming the hidden Markov model of the sound with pitch, forms a note connected to the preceding note by a slur and a single note as separate models.

The audio signal processing apparatus according to any one of claims 1 to 5, further comprising:
input means for inputting learning musical tone waveform data and learning note string data for the notes of the musical tone waveform;
learning model forming means for forming a hidden Markov model on a finite network for each note of the learning note string data input from the input means; and
parameter estimating means for estimating, by a k-means algorithm, the parameters that maximize the likelihood of the model formed by the learning model forming means during learning,
wherein the recognition note information storage means stores the state transition probabilities and the observation probabilities of the feature vectors of each note determined from the parameters estimated by the parameter estimating means.

The audio signal processing apparatus according to any one of claims 1 to 6, wherein the state transition determination means determines the state transition by a Viterbi algorithm.

The audio signal processing apparatus according to claim 7, wherein the note string storage means stores duration data corresponding to the note string, and
the state transition determination means determines the state transition by a Viterbi algorithm using the duration data stored in the note string storage means.

An audio signal processing method for associating an input voice with any note of a note string stored in advance, comprising:
a parameter acquisition step of acquiring feature parameters, including at least a parameter representing a pitch, from an audio signal input in units of frames;
an observation probability acquisition step of acquiring an observation symbol of the input voice from the feature parameters acquired in the parameter acquisition step, and acquiring the observation probability of the observation symbol, by referring to a codebook stored in advance in which representative feature parameters of audio signals are clustered into symbols as feature vectors, and to the number of states, the state transition probabilities, and the observation probabilities of the symbols stored for each note;
a state forming step of forming each state of the note string stored in advance by a hidden Markov model on a finite state network, based on the number of states and the state transition probabilities stored in advance;
a state transition determination step of determining a state transition according to the observation probability acquired in the observation probability acquisition step and the hidden Markov model formed in the state forming step; and
an association step of associating each frame of the input audio signal with the note string information based on the state transition determined in the state transition determination step,
wherein, in the state forming step, three types of left-to-right hidden Markov models are formed according to sound with pitch, sound without pitch, and silence, the sound with pitch and the sound without pitch being formed as three-state models and the silence being formed as a one-state model.

The audio signal processing method according to claim 9, further comprising a display step of displaying which part of the note string information the current input voice corresponds to, based on the association result of the association step.

The audio signal processing method according to claim 9 or 10, wherein, in the parameter acquisition step, at least energy, delta energy, zero cross, pitch, delta pitch, and pitch error are acquired as feature parameters from the input audio signal.

The audio signal processing method according to claim 11, further comprising:
a first observation probability calculation step of calculating and storing the five types of observation probabilities of the energy, delta energy, zero cross, delta pitch, and pitch error using an observation function based on a Gaussian distribution; and
a second observation probability calculation step of calculating and storing the observation probability of the pitch using an observation function based on a Gaussian distribution and a step observation probability function, the observation function based on the Gaussian distribution and the step observation function being used selectively according to the presence or absence of the pitch,
wherein, in the observation probability acquisition step, the observation probability is acquired by referring to the observation probabilities stored in the first and second observation probability calculation steps.

The audio signal processing method according to any one of claims 9 to 12, wherein, in the state forming step, when the hidden Markov model of the sound with pitch is formed, a note connected to the preceding note by a slur and a single note are formed as separate models.

The audio signal processing method according to any one of claims 9 to 13, further comprising:
an input step of inputting learning musical tone waveform data and learning note string data for the notes of the musical tone waveform;
a learning model formation step of forming a hidden Markov model on a finite network for each note of the learning note string data input in the input step;
a parameter estimation step of estimating, by a k-means algorithm, the parameters that maximize the likelihood of the model formed by the learning model forming means during learning; and
a probability storage step of storing the state transition probabilities and the observation probabilities of the feature vectors of each note obtained from the parameters estimated in the parameter estimation step,
wherein, in the observation probability acquisition step, the observation probability is acquired by referring to the observation probabilities stored in the probability storage step, and
in the state forming step, each state of the note string stored in advance is formed by a hidden Markov model on a finite state network based on the state transition probabilities stored in the probability storage step.

The audio signal processing method according to any one of claims 9 to 14, wherein, in the state transition determination step, the state transition is determined by a Viterbi algorithm.

The audio signal processing method according to claim 15, wherein, in the state transition determination step, the state transition is determined by a Viterbi algorithm using duration data corresponding to the note string stored in advance.