JP4323029B2

JP4323029B2 - Voice processing apparatus and karaoke apparatus

Info

Publication number: JP4323029B2
Application number: JP30027699A
Authority: JP
Inventors: 隆宏川嶋; ケイノペドロ; ロスコスアレックス
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1999-10-21
Filing date: 1999-10-21
Publication date: 2009-09-02
Anticipated expiration: 2019-10-21
Also published as: JP2001117582A

Description

【０００１】
【発明の属する技術分野】
この発明は、アライメント対象となる対象音声と入力音声とを時系列で対応づける音声処理装置および当該音声処理装置を備えるカラオケ装置に関する。
【０００２】
【従来の技術】
従来より、カラオケ等の技術分野において、歌唱者の歌声を歌手などの特定の歌唱者の歌声に似せて変換するといった音声処理技術が提案されている。
通常、このような音声処理においては、二つの音声信号を時系列で対応づけるアライメントを行う必要がある。例えば図１に示すように、対象者が「なきながら」と発声した音声に歌唱者が「なきながら」と発声した音声を似せるように合成する場合であっても、対象者が「き」の音を発声するタイミングと歌唱者が「き」の音を発声するタイミングは異なる場合がある。
このように人間が同じ言語を発音しても、その継続時間が異なる上に非線形に伸縮することが多いので、二つの音声を比較する場合には、同じ音素同士が対応するように、時間軸を非線形に伸縮する時間正規化するＤＰマッチング手法（Dynamic Time Warping:ＤＴＷ）という技術が知られている。ＤＰマッチング手法では、単語や音素に関して標準的な時系列を標準パターンとして用いているので、時系列パターンの時間的構造変化に対して音素単位で一致させることができる。
またスペクトル変動に対して優れている隠れマルコフモデル（Hidden Markov Model：ＨＭＭ）を用いる技術も知られている。隠れマルコフモデルでは、スペクトル時系列の統計的変動をモデルのパラメータに反映するので、話者の個人差などに起因するスペクトル変動に対して音素単位で一致させることができる。
【０００３】
【発明が解決しようとする課題】
しかしながら、上述したＤＰマッチング手法を用いた場合には、スペクトル変動に対しては精度が悪く、従来の隠れマルコフモデルを用いる技術では記憶容量や計算量が膨大になるので、カラオケ装置における物まねなどのリアルタイム性が必要な処理には不向きだった。
【０００４】
本発明は、上述した課題を解決するためになされたものであり、アライメント対象となる対象音声と入力音声とを時系列で対応づける音声処理を、少ない記憶容量でリアルタイム処理が可能な音声処理装置および当該音声処理装置を備えるカラオケ装置を提供することを目的としている。
【０００５】
【課題を解決するための手段】
上述した課題を解決するために、請求項１に記載の発明は、アライメント対象となる対象音声と入力音声とを時系列で対応づける音声処理装置であって、
前記対象音声をフレーム単位で分析し、前記対象音声の音素の時系列情報である音素列をフレーム単位で記憶する対象音声情報記憶手段と、
音声信号の代表的な特徴パラメータを特徴ベクトルとして所定数のシンボルにクラスタ化した符号帳と、各音素毎に状態遷移確率および前記各シンボルの観測確率とを記憶する音素情報記憶手段と、
入力音声信号をフレーム単位で特徴パラメータ分析し、前記音素情報記憶手段に記憶された符号帳に基づいて前記入力音声の特徴パラメータをシンボル量子化して前記入力音声の観測シンボルとする入力音声量子化手段と、
前記音素情報記憶手段に記憶された状態遷移確率および観測確率に基づいて、前記各音素の状態を有限状態ネットワーク上で隠れマルコフモデルによって形成する状態形成手段と、
前記入力音声量子化手段によってシンボル量子化された観測シンボルと、前記状態形成手段によって形成された各音素の状態に従って、ビタービアルゴリズムによって状態遷移を決定する状態遷移決定手段と、
前記状態遷移決定手段によって決定された状態遷移に基づいて、前記入力音声の各フレームと音素が一致する前記対象音声のフレームを特定するアライメント手段と
を備えることを特徴とする。
【０００６】
請求項２に記載の発明は、請求項１記載の音声処理装置において、
前記特徴ベクトルは、音声のスペクトル特性をメルケプストラム係数で特徴づけるベクトルを含むことを特徴とする。
請求項３に記載の発明は、請求項１記載の音声処理装置において、
前記特徴ベクトルは、音声のスペクトル特性を差分メルケプストラム係数で特徴づけるベクトルを含むことを特徴とする。
請求項４に記載の発明は、請求項１記載の音声処理装置において、
前記特徴ベクトルは、音声を差分エネルギーで特徴づけるベクトルを含むことを特徴とする。
請求項５に記載の発明は、請求項１記載の音声処理装置において、
前記特徴ベクトルは、音声をエネルギーで特徴づけるベクトルを含むことを特徴とする。
請求項６に記載の発明は、請求項１記載の音声処理装置において、
前記特徴ベクトルは、音声の有声音らしさをゼロクロス率およびピッチエラーで特徴づけるベクトルを含むことを特徴とする。
【０００７】
請求項７に記載の発明は、請求項１記載の音声処理装置において、
前記音素情報記憶手段は、大量の学習セットの予測ベクトルからクラスタ化アルゴリズムによってベクトル量子化することによって生成された符号帳を記憶することを特徴とする。
請求項８に記載の発明は、請求項１記載の音声処理装置において、
前記音素情報記憶手段は、学習データに対するモデル尤度を最大にするパラメータを推定することによって求められた各音素における特徴ベクトルに対する状態遷移確率および観測シンボル確率を記憶することを特徴とする。
【０００８】
請求項９に記載の発明は、請求項１記載の音声処理装置において、
前記状態遷移決定手段は、前記状態形成手段によって形成された状態から数状態前後の範囲から最適状態を検索して当該最適状態へ飛び越しを行うことを特徴とする。
請求項１０に記載の発明は、請求項１記載の音声処理装置において、
前記状態形成手段は、前記対象音声の音素列にかかわらず、音素から無音状態あるいは息継ぎ状態への飛び越しを認めるパスを有する状態を形成することを特徴とする。
請求項１１に記載の発明は、請求項１記載の音声処理装置において、
前記状態形成手段は、前記対象音声の音素列のうち近似する音素を有するグループについては、遷移確率を等価としたパスを有する状態を形成することを特徴とする。
【０００９】
請求項１２に記載の発明は、請求項１記載の音声処理装置において、
前記対象音声情報記憶手段は、前記音素列を、連続した同一音素の前記フレームからなるリージョンと対応付けて記憶し、
前記アライメント手段は、前記状態遷移決定手段が決定した前記入力音声の、連続した同一音素に対応するフレームのフレーム数が、当該音素と一致する前記対象音声の音素に対応する前記リージョンを構成するフレーム数よりも多い場合には、予め記憶した所定フレームを用いて前記対象音声のフレーム数の不足フレーム数を補間することを特徴とする。
請求項１３に記載の発明は、請求項１記載の音声処理装置において、
前記対象音声情報記憶手段は、前記音素列を、連続した同一音素の前記フレームからなるリージョンと対応付けて記憶し、
前記アライメント手段は、前記状態遷移決定手段が決定した前記入力音声の、連続した同一音素に対応するフレームのフレーム数が、当該音素と一致する前記対象音声の音素に対応する前記リージョンを構成するフレーム数よりも少ない場合には、当該連続した同一音素に対応するフレームの次のフレームと音素が一致する前記対象音声のフレームを、当該リージョンの次のリージョンにおいて特定することを特徴とする。
請求項１４に記載の発明は、請求項１記載の音声処理装置において、
前記アライメント手段による対応に基づいて、前記対象音声と前記入力音声とを対応つけたフレーム単位で合成する合成手段を備えることを特徴とする。
【００１０】
請求項１５に記載の発明は、請求項１４記載の音声処理装置において、
前記入力音声信号をフレーム単位で周波数分析して正弦波成分および残差成分を抽出する周波数分析手段を備え、
前記対象音声情報記憶手段は、前記対象音声を予めフレーム単位で周波数分析して抽出した正弦波成分および残差成分を記憶し、
前記合成手段は、前記入力音声の正弦波成分および残差成分と前記対象音声の正弦波成分および残差成分とを所定の割合で合成することを特徴とする音声処理装置。
請求項１６に記載の発明は、請求項１５記載の音声処理装置において、
前記合成手段によって合成された正弦波成分および残差成分から逆周波数変換によって合成音声波形を生成する波形生成手段を備えることを特徴とする。
【００１１】
請求項１７に記載の発明は、請求項１記載の音声処理装置を備えるカラオケ装置であって、
楽曲を構成する曲データを記憶する曲データ記憶手段と、
前記曲データに基づいて楽曲を再生する再生手段と、
前記対象音声の音素列および前記対象音声を予めフレーム単位で周波数分析して抽出したフレームデータを前記曲データの時系列と同期させる同期手段と、
前記アライメント手段による対応に基づいて、前記対象音声と前記入力音声とを対応つけたフレーム単位で合成する合成手段と、
前記同期手段による同期に基づいて、前記再生手段による楽曲の再生と前記合成手段によって合成された音声とを同期して出力する出力手段と
を備えることを特徴とする。
請求項１８に記載の発明は、請求項１７記載のカラオケ装置において、
前記状態遷移決定手段は、前記曲データの時系列と同期して前記状態遷移確率への重み付け関数を付与することを特徴とする。
【００１２】
【発明の実施の形態】
以下、図面を参照しながら、本発明の実施の形態について説明する。
【００１３】
[1. Configuration of the embodiment]
[1-1. Overall configuration]
図２は、本実施形態の構成を示す図である。実施形態は、本発明を物まね機能付きのカラオケ装置に適用したものであり、歌唱者（Me）のマイクからの入力音声を、例えば歌手などの物まね対象者（ターゲット：Target）の音声に似せるように音声変換を行って出力するように構成されている。
より具体的には、所定の時間単位で区切ったフレーム単位で対象音声を分析したデータを記憶しておき、入力音声も同様の時間単位で区切ったフレーム単位で分析することにより、入力音声のフレームの時間に対応する対象者のフレームを特定できれば、時間関係を一致させることができるようになっている。そして、音素単位で入力音声と対象音声とを一致させたフレームデータを合成することによって音声変換を行うように構成されている。
【００１４】
図２において、マイク１は、ものまねをしようとする歌唱者の声を収集し、入力音声信号Ｓｖとして入力音声信号切出部３に出力する。分析窓生成部２は、前回のフレームで検出したピッチの周期の固定倍の周期を有する分析窓（例えば、ハミング窓）ＡＷを生成し、入力音声信号切出部３に出力する。なお、初期状態あるいは前回のフレームが無声音（含む無音）の場合には、予め設定した固定周期の分析窓を分析窓ＡＷとして入力音声信号切出部３に出力する。
入力音声信号切出部３は、入力された分析窓ＡＷと入力音声信号Ｓvとを掛け合わせ、入力音声信号Ｓvをフレーム単位で切り出し、フレーム音声信号ＦＳvとして高速フーリエ変換部４に出力する。高速フーリエ変換部４は、フレーム音声信号ＦＳvから周波数スペクトルを求め、周波数分析部５ｓおよび特徴パラメータ分析部５ｐを備えた入力音声分析部５に出力する。
【００１５】
周波数分析部５ｓは、後述するＳＭＳ（Spectral Modeling Synthesis）分析を行って正弦波成分および残差成分を抽出し、分析した当該フレームの歌唱者の周波数成分情報として保持する。
特徴パラメータ分析部５ｐは、入力音声のスペクトル特性を特徴づける特徴パラメータを抽出し、シンボル量子化部７に出力する。本実施形態では、特徴パラメータとして後に説明する５種類（メルケプストラム係数、差分メルケプストラム係数、差分エネルギー係数、エネルギー、ボイスネス）の特徴ベクトルを用いる。
【００１６】
音素辞書記憶部６は、後に詳しく説明するように、符号帳および各音素における特徴ベクトルの状態遷移確率とシンボル発生確率とを示す確率データを含む音素辞書を記憶している。
シンボル量子化部７は、音素辞書記憶部６に記憶された符号帳を参照して、そのフレームにおける特徴シンボルを選び出し、状態遷移決定部９に出力する。
音素列状態形成部８は、隠れマルコフモデル（ＨＭＭ）によって音素列状態を形成し、状態遷移決定部９は、入力音声から得られたフレーム単位の特徴シンボルを用いて、後述するビタービアルゴリズムに従って状態遷移を決定する。
【００１７】
アライメント部１０は、決定された状態遷移から入力音声の時間ポインタを決定し、当該時間ポインタに対応するターゲットフレームを特定し、周波数分析部に保持された入力音声の周波数成分と、ターゲットフレーム情報保持部１１に保持された対象者の周波数成分とを合成部１２に出力する。
ターゲットフレーム情報保持部１１には、予めフレーム単位で周波数分析された周波数分析データおよび、いくつかのフレームで構成される時間リージョン（region）単位で記述された音素列が記憶されている。
【００１８】
合成部１２は、入力音声の周波数成分と対象者の周波数成分とを所定の割合で合成した新たな周波数成分を生成して逆高速フーリエ変換部１３に出力し、逆高速フーリエ変換部１３は新たな周波数成分を逆高速フーリエ変換して新たな音声信号を生成する。
ところで、本実施形態は物まね機能を備えたカラオケ装置であり、曲データ記憶部１４には、ＭＩＤＩデータや時間データ、歌詞データなどによって示されるカラオケ曲データが記憶されており、ＭＩＤＩデータを時間データに従って再生するシーケンサ１５およびシーケンサ１５の出力データから楽音信号を生成する音源１６を備えている。
ミキサ１７は、逆高速フーリエ変換部１３から出力された音声信号と音源１６から出力された楽音信号とを合成してスピーカ１８から出力する。
このように、歌唱者がマイク１に向かって歌唱すると、歌唱者の音声が対象者の音声に似せて変換された新たな音声と、カラオケの伴奏楽音とがスピーカ１８から出力されるように構成されている。
【００１９】
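[1-2. Phoneme dictionary]
Next, the phoneme dictionary used in this embodiment is described. The phoneme dictionary consists of a codebook, in which representative feature parameters of speech signals are clustered as feature vectors into a predetermined number of symbols, together with, for each phoneme, the state transition probabilities and the observation probability of each symbol.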
【００２０】
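[1-2-1. Feature vectors]
Before the codebook is described, the feature vectors used in this embodiment are explained.
① Mel-cepstrum coefficients (b_MEL)
The mel-cepstrum coefficients represent the spectral characteristics of a sound with a small number of coefficients. In this embodiment they form a 12-dimensional vector clustered into 128 symbols.
② Delta mel-cepstrum coefficients (b_deltaMEL)
The delta mel-cepstrum coefficients represent the time difference of the mel-cepstrum coefficients. In this embodiment they form a 12-dimensional vector clustered into 128 symbols.
③ Delta energy coefficient (b_deltaENERGY)
The delta energy coefficient represents the time difference of the sound intensity. In this embodiment it is a 1-dimensional vector clustered into 32 symbols.
④ Energy (b_ENERGY)
The energy represents the sound intensity. In this embodiment it is a 1-dimensional vector clustered into 32 symbols.
⑤ Voiceness (b_VOICENESS)
Voiceness is a feature vector that expresses how voiced a sound is. It is a 2-dimensional vector that characterizes the speech by its zero-cross rate and pitch error, clustered into 32 symbols. The zero-cross rate and the pitch error are described in turn below.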
【００２１】
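(1) Zero-cross rate
The more strongly voiced a sound is, the lower its zero-cross rate. The zero-cross rate is defined by the following equation.
[Equation 1]

where sgn{s(n)} = +1 for s(n) >= 0 and -1 for s(n) < 0,
N: number of samples in the frame
W: frame window
s: input signal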
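The image of Equation 1 is not available in this text, so the following is only a minimal sketch of a zero-cross rate that is consistent with the symbols listed above. It assumes the common form Z = (1/2N) Σ |sgn{s(n)} − sgn{s(n−1)}| with a rectangular frame window; the function names are illustrative and do not come from the patent.

```python
def sign(x):
    """sgn as defined in the text: +1 for x >= 0, -1 for x < 0."""
    return 1 if x >= 0 else -1


def zero_cross_rate(frame):
    """Zero-cross rate of one frame (rectangular window assumed).

    Assumed form: Z = (1 / (2N)) * sum_{n=1}^{N-1} |sgn(s[n]) - sgn(s[n-1])|
    """
    n_samples = len(frame)
    if n_samples < 2:
        return 0.0
    # Each sign change contributes |(+1) - (-1)| = 2 to the sum.
    changes = sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                  for n in range(1, n_samples))
    return changes / (2.0 * n_samples)
```

A frame that never changes sign gives 0, while a frame that alternates on every sample approaches 1, which matches the statement that voiced sounds have a low zero-cross rate.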
【００２２】
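(2) Pitch error
The pitch error indicates how voiced a sound is by measuring the mismatch in two directions: the error from the predicted pitch to the measured pitch, and the error from the measured pitch to the predicted pitch. It is described in detail as the Two-Way Mismatch method in "Fundamental Frequency Estimation in the SMS Analysis" (P. Cano, Proceedings of the Digital Audio Effects Workshop, 1998).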
【００２３】
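First, the pitch error from the predicted pitch (p) to the measured pitch (m) is given by the following equation.
[Equation 2]

f_n: n-th predicted peak frequency
Δf_n: difference between the n-th predicted peak frequency and the nearest measured peak frequency
a_n: n-th measured amplitude
A_max: maximum amplitude
【００２４】
The pitch error from the measured pitch (m) to the predicted pitch (p) is given by the following equation.
[Equation 3]

f_k: k-th predicted peak frequency
Δf_k: difference between the k-th predicted peak frequency and the nearest measured peak frequency
a_k: k-th measured amplitude
A_max: maximum amplitude
【００２５】
The total error is therefore given by the following equation.
[Equation 4]

The constants p = 0.5, q = 1.4 and r = 0.5 have been reported to be experimentally optimal for most speech.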
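Equations 2 to 4 are missing from this text. The sketch below assumes the form in which the Two-Way Mismatch error is commonly stated in the literature: each term is Δf·f^(−p) + (a/A_max)·(q·Δf·f^(−p) − r), and the total is Err_p→m/N + ρ·Err_m→p/K. The weight `rho`, the harmonic limit `max_harmonics` and every name in the code are assumptions of this sketch, not values taken from the patent.

```python
def twm_error(f0, peaks, p=0.5, q=1.4, r=0.5, rho=0.33, max_harmonics=10):
    """Two-way mismatch error of a candidate pitch f0 (Hz).

    peaks: list of (frequency_hz, amplitude) measured spectral peaks.
    p, q, r are the constants quoted in the text; rho and max_harmonics
    are assumptions of this sketch.
    """
    if not peaks or f0 <= 0:
        raise ValueError("need at least one peak and a positive f0")
    a_max = max(a for _, a in peaks)
    top = max(f for f, _ in peaks)
    n_harm = max(1, min(max_harmonics, int(top / f0)))
    predicted = [f0 * h for h in range(1, n_harm + 1)]

    def term(delta_f, freq, amp):
        scaled = delta_f * freq ** (-p)
        return scaled + (amp / a_max) * (q * scaled - r)

    # Predicted -> measured: each harmonic looks for its nearest measured peak.
    err_pm = 0.0
    for fn in predicted:
        fm, am = min(peaks, key=lambda pk: abs(pk[0] - fn))
        err_pm += term(abs(fm - fn), fn, am)

    # Measured -> predicted: each measured peak looks for its nearest harmonic.
    err_mp = 0.0
    for fk, ak in peaks:
        nearest = min(predicted, key=lambda fn: abs(fn - fk))
        err_mp += term(abs(fk - nearest), fk, ak)

    return err_pm / len(predicted) + rho * err_mp / len(peaks)
```

A candidate whose harmonics coincide with the measured peaks yields a lower error than a mistuned candidate, so a low total error can serve as the "voiced" indicator described above.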
【００２６】
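[1-2-2. Codebook]
For each feature vector, the codebook stores the vectors obtained by clustering into the corresponding number of symbols (see FIG. 3).
The codebook is created by finding, among all the prediction vectors in a large training set, a set of K prediction vectors (codes) that quantizes the set with minimum distortion. In this embodiment, the LBG (Linde-Buzo-Gray) algorithm is used for clustering.
【００２７】
The LBG algorithm is as follows.
① Initialization
First, the centroid of all the vectors is found and taken as the initial code vector.
② Iteration
Let I be the total number of iterations; 2^I code vectors are required. For each iteration i = 1, 2, ..., I, the following computation is performed.
1) Each existing code vector x is split into two codes, x(1 + e) and x(1 − e), where e is a small value such as 0.001.
This yields 2^i new code vectors x^i_k (k = 1, 2, ..., 2^i).
2) Each prediction vector x in the training set is quantized to a code x^i_k:
k' = argmin_k d(x, x^i_k)
where d(x, x^i_k) is the distortion distance in the prediction space.
3) During the iteration, for each k, x^i_k is recomputed as the centroid of all vectors x for which Q(x) = x^i_k.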
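The splitting procedure above (commonly known as the LBG, or Linde-Buzo-Gray, algorithm) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: squared Euclidean distance is assumed for d, a fixed number of refinement passes (`refine_passes`) stands in for a convergence test that the text does not specify, and all names are hypothetical.

```python
def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def lbg(training, n_splits, eps=0.001, refine_passes=10):
    """Return a codebook of 2**n_splits code vectors."""
    codebook = [centroid(training)]            # initialization
    for _ in range(n_splits):                  # iteration i = 1..I
        # 1) Split every code vector x into x(1+e) and x(1-e).
        codebook = [[c * (1 + s * eps) for c in code]
                    for code in codebook for s in (+1, -1)]
        for _ in range(refine_passes):
            # 2) Quantize each training vector to its nearest code.
            cells = [[] for _ in codebook]
            for vec in training:
                cells[nearest(vec, codebook)].append(vec)
            # 3) Move each code to the centroid of its cell
            #    (an empty cell keeps its previous code).
            codebook = [centroid(cell) if cell else code
                        for cell, code in zip(cells, codebook)]
    return codebook
```

Under these assumptions, `lbg(vectors, 7)` would give the 128 codes used for the 12-dimensional mel-cepstrum stream and `lbg(vectors, 5)` the 32 codes used for the energy streams. Note that the multiplicative split does nothing to a code vector that is exactly zero.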
【００２８】
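[1-2-3. Probability data]
Next, the probability data is described. In this embodiment, the PLU (pseudo-phoneme unit) is used as the subword unit for modeling speech. More specifically, as shown in FIG. 4, Japanese is handled as 27 phoneme units, and each phoneme is assigned a number of states.
The number of states is the minimum number of frames for which a subword unit lasts. For example, the phoneme "a" has three states, so "a" lasts at least three frames.
The three states approximate the onset, the steady state and the release of a sound. Plosives such as "b" and "g" are inherently short and are therefore given two states; aspiration (ASPIRATION) is also given two states. Silence (SILENCE) has no temporal variation and is given one state.
【００２９】
As shown in FIG. 5, the probability data in the phoneme dictionary describes, for each of the 27 phonemes expressed as subword units, the transition probability of each state and the observation symbol probability for the symbols of each feature vector. Although FIG. 5 is abbreviated, the observation symbol probabilities for each feature vector sum to 1.
These parameters are obtained by estimating the subword-unit model parameters that maximize the likelihood of the model for the training data. The segmental k-means training algorithm is used here.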
【００３０】
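The segmental k-means training algorithm is as follows.
① Initialization
For initial estimation data that has been segmented into phonemes in advance, each phoneme segment is divided linearly (evenly) among its HMM states.
② Estimation
As the following equation shows, a transition probability is obtained by counting the number of transitions (in frames) that use the transition and dividing it by the count of all transitions (in frames) leaving that state.
[Equation 5]

【００３１】
Likewise, as the following equation shows, a symbol occurrence probability is obtained by counting the number of times each feature symbol occurs in each state and dividing it by the count of all occurrences in that state.
[Equation 6]

【００３２】
③ Segmentation
The training set is re-segmented with the Viterbi algorithm, using the parameters estimated in step ②.
④ Iteration
Steps ② and ③ are repeated until convergence.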
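The estimation step (② above) amounts to relative-frequency counting over a frame-level state labeling. Equations 5 and 6 are not available in this text, so the sketch below assumes the form given in the prose: a_ij = (frames going from i to j) / (frames leaving i) and b_j(k) = (frames in j emitting k) / (frames in j). The data layout and names are hypothetical.

```python
from collections import Counter, defaultdict


def estimate(state_seqs, symbol_seqs):
    """Count-based re-estimation of a_ij and b_j(k).

    state_seqs:  per-utterance lists of state labels, one per frame
    symbol_seqs: per-utterance lists of observed symbols, one per frame
    """
    trans = defaultdict(Counter)   # trans[i][j] = frames going i -> j
    emit = defaultdict(Counter)    # emit[j][k]  = frames in j emitting k
    for states, symbols in zip(state_seqs, symbol_seqs):
        for t, (state, symbol) in enumerate(zip(states, symbols)):
            emit[state][symbol] += 1
            if t + 1 < len(states):
                trans[state][states[t + 1]] += 1
    a = {i: {j: c / sum(row.values()) for j, c in row.items()}
         for i, row in trans.items()}
    b = {j: {k: c / sum(row.values()) for k, c in row.items()}
         for j, row in emit.items()}
    return a, b
```

In a full training loop, the state labels would come from the linear split in step ① on the first pass and from Viterbi re-segmentation (step ③) on later passes.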
【００３３】
[１−３．ターゲットフレーム情報]
ターゲットフレーム情報保持部１１には、予め対象者の音声がＳＭＳ分析されてフレーム単位で記憶されている。
まず、図６を参照しながら、ＳＭＳ分析について説明する。ＳＭＳ分析では、まず標本化された音声波形に窓関数を乗じた音声波形（Frame）を切り出し、高速フーリエ変換（FFT）を行って得られる周波数スペクトルから、正弦波成分と残差成分とを抽出する。
【００３４】
正弦波成分とは、基本周波数（Pitch）および基本周波数の倍数にあたる周波数（倍音）の成分をいう。本実施形態では、基本周波数を“Fi”として保持し、各成分の平均アンプリチュードを“Ai”として保持し、スペクトル包絡をエンベロープとして保持する。
残差成分とは、入力信号から正弦波成分を除いた成分であり、本実施形態では、図６に示すように周波数領域のデータとして保持する。
図６に示すように得られた正弦波成分および残差成分で示される周波数分析データは、図７に示すようにフレーム単位で記憶される。本実施形態では、フレーム間の時間間隔は５ｍｓとし、フレームをカウントすることによって時間を特定することができるようになっている。各フレームには曲の冒頭からの経過時間に相当するタイムスタンプが付されている（ｔｔ１、ｔｔ２、……）。
【００３５】
ところで、先に説明したように、各音素は、少なくとも音素毎に設定されている状態数分のフレームが続くから、ターゲットフレーム情報においても、各音素情報は複数のフレームから構成される。この複数フレームのまとまりをリージョン（region）とする。
ターゲットフレーム情報保持部には、対象者が歌唱したときの音素列が記憶されるが、各音素とリージョンとを対応つけて記述している。図７に示す例では、フレームｔｔ１〜ｔｔ５から構成されるリージョンが音素“ｎ”に対応し、フレームｔｔ６〜ｔｔ１０から構成されるリージョンが音素“ａ”に対応している。このように、ターゲットフレーム情報を保持し、同様のフレーム分析を入力音声に対して行えば、音素単位で両者を一致させた際に、フレームで時間を特定することができ、周波数分析データで合成処理ができるようになる。
【００３６】
[２．実施形態の動作]
次に、実施形態の動作について説明する。
【００３７】
[２−１．概要動作]
最初に、概要動作について図８に示すフローチャートを参照しながら説明する。
まず、マイク入力音声分析が行われる（Ｓ１）。具体的には、フレーム単位で高速フーリエ変換し、周波数スペクトルからＳＭＳ分析を行った周波数分析データを保持する。また、周波数スペクトルから特徴パラメータ解析を行って、音素辞書に基づいてシンボル量子化を行う。
【００３８】
次に、音素辞書および音素記述列に基づいて、ＨＭＭモデルによる音素の状態決定を行い（Ｓ２）、シンボル量子化された特徴パラメータおよび決定された音素状態に基づいて１パスビタービアルゴリズムによって状態遷移を決定する（Ｓ３）。ＨＭＭモデルおよび１パスビタービアルゴリズムについては後に詳しく説明する。
そして、決定した状態遷移により入力音声の時間ポインタを決定し（Ｓ４）、当該時間が新たな音素状態になったか否かを判定する（Ｓ５）。時間ポインタとは、入力音声および対象音声の時系列において、当該処理時刻におけるフレームを特定するものである。本実施形態では、入力音声および対象音声はフレーム単位で周波数分析され、各フレームは、入力音声および対象音声の時系列と対応付けられている。以後、入力音声の時系列を時刻ｔｍ１、ｔｍ２……と表記し、対象音声の時系列をｔｔ１、ｔｔ２……と表記する。
【００３９】
ステップＳ５の判定において、新たな音素状態になったと判定した場合は（Ｓ５；Ｙｅｓ）、フレームカウントを開始し（Ｓ６）、時間ポインタを音素列の先頭へ移動する（Ｓ７）。フレームカウントとは、当該音素状態として処理したフレーム数をいい、先に説明したように、各音素は複数のフレームが続くので、すでに何フレーム続いたかを示す値となる。
そして、入力音声フレームの周波数分析データと対象者音声フレームの周波数分析データとを周波数領域で合成し（Ｓ８）、逆高速フーリエ変換することによって（Ｓ９）新たな音声信号を生成して出力する。
【００４０】
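If step S5 determines that no transition to a new phoneme state has occurred (S5: No), the frame count is incremented (S10), the time pointer is advanced by one frame interval (S11), and processing moves to step S8.
As a concrete example, in FIG. 7, while the phoneme state stays at "n" the frame count is incremented and the time pointer moves through tt1, tt2, and so on. If, however, the state changes to "a" at the time after frame tt3 was processed as "n", the time pointer moves to tt6, the first frame of the region for phoneme "a".
In this way, times can be matched phoneme by phoneme even when the target and the singer utter sounds at different timings.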
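The time pointer logic of steps S5 to S7, S10 and S11 can be sketched as follows. This is an illustration under stated assumptions: the input is the already decoded phoneme label per frame, target regions are given as `(phoneme, first_frame, last_frame)` tuples, decoding starts in the first region, and holding the pointer on the last frame of a region is a simple stand-in for the loop frames described later. None of these names appear in the patent.

```python
def align(input_phonemes, target_regions):
    """Map each input frame to a target frame index.

    input_phonemes: decoded phoneme label for each input frame
    target_regions: list of (phoneme, first_frame, last_frame), in song order
    Returns one target frame index per input frame.
    """
    pointers = []
    region = 0
    pointer = target_regions[0][1]
    previous = None
    for label in input_phonemes:
        if previous is not None and label != previous:
            # S5 Yes: new phoneme, so jump to the head of the next
            # region that carries this phoneme (S6, S7).
            for idx in range(region + 1, len(target_regions)):
                if target_regions[idx][0] == label:
                    region = idx
                    break
            pointer = target_regions[region][1]
        elif previous is not None:
            # S5 No: same phoneme, so advance one frame (S10, S11),
            # holding on the last frame if the region is exhausted.
            pointer = min(pointer + 1, target_regions[region][2])
        pointers.append(pointer)
        previous = label
    return pointers
```

With the regions of FIG. 7, `align(["n", "n", "n", "a", "a"], [("n", 1, 5), ("a", 6, 10)])` returns `[1, 2, 3, 6, 7]`: the pointer walks tt1 to tt3 and then jumps to tt6 when the singer moves on to "a" early.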
【００４１】
[２−２．動作の詳細]
次に、概要動作においてふれた各処理について詳細に説明する。
【００４２】
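[2-2-1. Input speech analysis]
FIG. 9 illustrates the input speech analysis in detail. As FIG. 9 shows, the speech signal cut out of the input waveform frame by frame is converted into a frequency spectrum by the fast Fourier transform.
The frequency spectrum is subjected to the SMS analysis described above and held as frequency component data.
The frequency spectrum is also subjected to feature parameter analysis. More specifically, for each feature vector, the maximum-likelihood symbol is found in the phoneme dictionary, and the result of this symbol quantization is used as the observation symbol.
The observation symbols obtained for each frame are then used to determine the state transitions, as described in detail later.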
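Symbol quantization can be illustrated with a nearest-codeword lookup per feature stream. The text speaks of finding the maximum-likelihood symbol; the sketch below assumes that this corresponds to the code vector with the smallest squared Euclidean distance, which is an assumption of this example, as are the function names and the dictionary layout.

```python
def quantize(feature, codebook):
    """Return the index of the code vector closest to `feature`."""
    def sq_dist(code):
        return sum((x - y) ** 2 for x, y in zip(feature, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def observation_symbols(features, codebooks):
    """Quantize one frame: one symbol index per feature stream."""
    return {name: quantize(features[name], codebooks[name])
            for name in codebooks}
```

For example, with a three-entry energy codebook `[[0.0], [1.0], [2.0]]`, the feature `[1.2]` maps to symbol 1. Each frame therefore yields one symbol for each of the five streams (b_MEL, b_deltaMEL, b_deltaENERGY, b_ENERGY, b_VOICENESS).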
【００４３】
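[2-2-2. Hidden Markov model]
Next, the hidden Markov model (HMM) is described with reference to FIG. 10. Because speech states progress in one direction, a left-to-right model is used.
【００４４】
The probability of a transition from state i to state j at time t (the state transition probability) is written a_ij. In the example of FIG. 10, the probability of staying in state ① is a_11 and the probability of moving from state ① to state ② is a_12.
Each state has its own feature vectors, each with different observation symbols. These are written X = {x_1, x_2, ..., x_T}.
The probability of generating symbol x_t of a feature vector when the state is j at time t (the discrete observation symbol probability) is written b_j(x_t).
If the state sequence up to time T in model λ is Q = {q_1, q_2, ..., q_T}, the joint probability of the observation symbol sequence X and the state sequence Q is given by the following equation.
[Equation 7]

Such a model is called a hidden Markov model (HMM) because the observation symbol sequence is known but the state sequence cannot be observed. In this embodiment, an FSN (finite state network) such as that shown in FIG. 10 is formed phoneme by phoneme from the phoneme description string stored in the target frame information holding unit 11.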
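The image of Equation 7 is not available in this text. For reference, the standard joint probability of a discrete HMM, written with the symbols defined above, has the following form; treating a_{q_0 q_1} as the probability of starting in state q_1 is a convention assumed here, not something the patent states.

```latex
P(X, Q \mid \lambda) \;=\; \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(x_t)
```

With several feature streams per frame, b_j(x_t) would typically be taken as the product of the per-stream symbol probabilities; that is likewise an assumption, since the text does not say how the five streams are combined.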
【００４５】
[２−２−３．アライメント]
次に、図１１および図１２を参照しながら、本実施形態におけるアライメントについて説明する。
本実施形態では、音素記述列に基づいて形成された上述の隠れマルコフモデルと、入力音声から抽出したフレーム単位の特徴シンボルを用いて、１パスビタービアルゴリズムに従って入力音声の状態遷移を決定する。そして、入力音声の音素と対象音声の音素とをフレーム単位で対応づける処理を行う。また、本実施形態では、二つの音声信号のアライメントをカラオケ装置において用いているので、曲データに従った楽曲の時系列と、音声信号の時系列とを同期させる処理も行う。
以下、これらの処理について順次説明する。
【００４６】
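[2-2-3-1. One-pass Viterbi algorithm]
The Viterbi algorithm computes, for every observation symbol of the observation symbol sequence, all the probabilities with which each HMM produces it, and afterwards selects the path with the maximum probability as the state transition result. Because the result is obtained only after the observation symbol sequence has ended, it is not suited to real-time processing.
This embodiment therefore uses the one-pass Viterbi algorithm described below to determine the phoneme state up to the current time.
In the equation below, Ψ_t(j) selects the state that maximizes δ_t(j), the best probability at frame t obtained along a single path given the observations up to frame t. The phoneme state moves according to Ψ_t(j).
The initial value is δ_1(i) = 1, and the recursion
[Equation 8]

is then executed. Here a_ij is the state transition probability from state i to state j, and N is the maximum number of states that i and j can take, determined by the number of phonemes in the song. b_j(O_t) is the symbol occurrence probability of the feature vector at time t. Because each observation symbol is a feature vector extracted from the input speech, the observation symbols, and hence the transitions, depend on how the singer vocalizes.
In the example of FIG. 11, the probabilities computed with the above equation are marked ○ or △ (○ > △). For example, given the observations from time tm1 to tm3, the probability of the path from state "Silence" to state "n1" is higher than that of the path from "Silence" to "Silence"; it is the best probability at time tm3, and the state transition is determined as shown by the thick arrow in the figure.
By performing this computation at each time (tm1, tm2, ...) corresponding to a frame of the input speech, the example of FIG. 11 yields transitions from "Silence" to "n1" at tm3, from "n1" to "n2" at tm5, from "n2" to "n3" at tm9, and from "n3" to "a1" at tm11.
The phoneme of the input speech can thus be identified at each frame time.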
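A frame-synchronous decision of this kind can be sketched as below. Equation 8 is missing from this text, so the sketch assumes the standard recursion δ_t(j) = max_i[δ_{t−1}(i)·a_ij]·b_j(O_t) and reports, at each frame, the state with the largest δ_t(j). Working in the log domain, applying the emission term on the first frame, and all names are choices of this example rather than details from the patent.

```python
import math


def to_log(p):
    """Log of a probability, with log(0) mapped to -inf."""
    return math.log(p) if p > 0 else float("-inf")


def one_pass_viterbi(observations, log_a, log_b):
    """Yield the chosen state index for each frame as soon as it arrives.

    observations: iterable of symbols O_t
    log_a[i][j]:  log transition probability (-inf for forbidden moves)
    log_b[j][o]:  log probability of symbol o in state j
    """
    n_states = len(log_a)
    # delta_1(i) = 1 for every state, i.e. log delta = 0, as in the text.
    delta = [0.0] * n_states
    for t, obs in enumerate(observations):
        if t == 0:
            delta = [delta[j] + log_b[j][obs] for j in range(n_states)]
        else:
            delta = [max(delta[i] + log_a[i][j] for i in range(n_states))
                     + log_b[j][obs] for j in range(n_states)]
        # Decide immediately, without waiting for the end of the sequence.
        yield max(range(n_states), key=lambda j: delta[j])
```

Because each decision uses only the frames seen so far, the state is available with no look-ahead, which is what makes the method usable in real time; the price is that an early decision cannot be revised by later evidence, as a full backtracking Viterbi pass would allow.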
【００４７】
[２−２−３−２．フレーム単位の対応]
上述したように状態遷移を決定し、入力音声の音素がフレーム単位で特定されると、次に、特定された音素に対応する対象音声のフレームを特定する。
上述したように、隠れマルコフモデルの各状態はターゲットフレーム情報保持部１１に記憶された対象音声の音素列記述に基づいて形成されているので、各状態に対応する対象音声の音素毎のフレームを特定することができるようになっている。
本実施形態では、アライメントとして、対象音声と入力音声の対応する音素が同じフレーム同士を、各フレーム毎に時系列で一致させる処理を行う。
【００４８】
図１１に示す例では、対象音声の時刻ｔｔ１〜ｔｔ３のフレームが音素“Ｓｉｌｅｎｃｅ”に対応し、時刻ｔｔ４〜ｔｔ９のフレームが音素“ｎ”に対応し、時刻ｔｔ１０〜のフレームが音素“ａ”に対応している。
一方、１パスビタービアルゴリズムによって入力音声の状態遷移が決定され、入力音声の時刻ｔｍ１〜ｔｍ２のフレームが音素“Ｓｉｌｅｎｃｅ”に対応し、時刻ｔｍ３〜ｔｍ１０のフレームが音素“ｎ”に対応し、時刻ｔｍ１１〜のフレームが音素“ａ”に対応している。
そして、音素“Ｓｉｌｅｎｃｅ”に対応するフレームとして、入力音声の時刻ｔｍ１のフレームと対象音声の時刻ｔｔ１のフレームを一致させ、入力音声の時刻ｔｍ２のフレームと対象音声の時刻ｔｔ２のフレームを一致させる。
入力音声の時刻ｔｍ３において状態“Ｓｉｌｅｎｃｅ”から状態“ｎ１”に遷移しているので、音素“ｎ”に対応するフレームとしては、入力音声の時刻ｔｍ３のフレームが最初のフレームになる。
一方、対象音声のフレームは、音素“ｎ”に対応するフレームは、音素列記述によれば時刻ｔｔ４のフレームからであるので、音素“ｎ”発声開始時の対象音声の時間ポインタは時刻ｔｔ４となる（図８：ステップＳ５〜Ｓ７参照）。
次に、入力音声の時刻ｔｍ４においては、新たな音素状態に遷移していないので、フレームカウントをインクリメントするとともに、対象音声の時間ポインタをフレーム時間間隔分進めて（図８：ステップＳ５〜Ｓ１１参照）、時刻ｔｔ５のフレームを入力音声の時刻ｔｍ４のフレームと一致させる。
このようにして、入力音声の時刻ｔｍ５〜ｔｍ７と、対象音声の時刻ｔｔ６〜ｔｔ８とを順次一致させていく。
【００４９】
ところで、図１１に示す例では、入力音声の時刻ｔｍ３〜ｔｍ１０までの８フレーム分が音素“ｎ”に対応しているのに対して、対象音声の音素“ｎ”に対応しているフレームは時刻ｔｔ４〜ｔｔ９までのフレームである。
このように、歌唱者が対象者よりも同じ音素を長い時間発声してしまう場合が生じるので、本実施形態では、予め用意したループフレームを用いて対象音声が入力音声よりも短い場合の補間を行う。
ループフレームは、音をのばして発音する場合のピッチの変化やアンプリチュードの変化を擬似的に再現するためのデータを数フレーム分記憶しており、例えば、基本周波数の差分（ΔPitchi）やアンプリチュードの差分（ΔAmp）などから構成される。
そして、ターゲットフレームデータ中には、音素列における各音素の最終フレームにループフレームの呼び出しを指示するデータを記述しておく。これにより、歌唱者が対象者よりも同じ音素を長い時間発声してしまった場合でも、良好にアライメントを行うことができるようになる。
【００５０】
[２−２−３−３．曲データとの同期]
ところで、本実施形態では、カラオケ装置に音声変換を適用しており、カラオケ装置はＭＩＤＩデータに基づいて楽曲の演奏を行うので、音声の進行と楽曲の進行が同期していることが望ましい。
そこで、本実施形態では、アライメント部１０は、曲データで示される時系列と対象音声の音素列とを同期させるように構成している。より具体的には、図１２に例示するように、シーケンサ１５は曲データに記述された時間情報（例えば、ＭＩＤＩデータの再生時間間隔を示すΔタイムやテンポ情報）などに基づいて、楽曲の進行情報を生成してアライメント部１０に出力する。
アライメント部１０は、シーケンサ１５から出力された時間情報とターゲットフレーム情報保持部１１に記憶されている音素記述列とを比較して、曲進行の時系列と対象音声の時系列とを対応づける。
【００５１】
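In addition, a weighting function f(|t_m − t_t|) such as that shown in FIG. 12 can be used to weight the state transition probabilities in synchronization with the music. The weighting function is a window function by which each state transition probability a_ij is multiplied.
In the figure, a and b are elements that depend on the tempo of the music, and α is set to a value as close to 0 as possible.
As described above, the time pointer of the target voice advances in synchronization with the tempo of the music, so introducing such a weighting function makes the synchronization between the singing voice and the target voice more accurate.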
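FIG. 12 is not reproduced in this text, so the exact shape of the window is unknown. Purely as an illustration, the sketch below assumes a trapezoid: weight 1 while |t_m − t_t| is at most a, a linear fall to α between a and b, and α beyond b. The shape, the default value of `alpha` and the function name are all assumptions.

```python
def sync_weight(t_m, t_t, a, b, alpha=1e-6):
    """Hypothetical trapezoidal window f(|t_m - t_t|); requires b > a."""
    d = abs(t_m - t_t)
    if d <= a:
        return 1.0
    if d >= b:
        return alpha
    # Linear fall from 1 at d == a to alpha at d == b.
    return 1.0 + (alpha - 1.0) * (d - a) / (b - a)
```

Multiplying each a_ij by such a weight leaves transitions near the musical position untouched and makes transitions to states far from it nearly impossible, which is one way to keep the decoded state from drifting away from the accompaniment.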
【００５２】
[３．変形例]
本発明は、上述した実施形態に限定されるものではなく、以下に説明するような各種の変形が可能である。
【００５３】
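[3-1. Phoneme skipping]
In the embodiment above, the state transitions are determined with the one-pass Viterbi algorithm, which copes poorly when the singer gets the lyrics wrong, for example by singing lyrics several phrases ahead or several phrases back.
In such cases, as shown in FIG. 13, the range searched for the optimal state can be widened to several states before and after, and a jump made only when a state is judged optimal.
More specifically, at time tm4 the input speech is in a state corresponding to phoneme "a", so under the one-pass Viterbi algorithm described above the frame at tm5 takes the larger of the probability of staying in "a" and the probability of moving to "Silence", which follows "a" in the phoneme description string.
The singer, however, starts uttering phoneme "k" without a silent interval, so it is desirable to skip "Silence" in the target's phoneme description when aligning.
When the singer departs from the target's phoneme description in this way, the state with the maximum probability may be searched for among the states several positions before and after. In the example of FIG. 13, the three states on either side of the previous frame's state are searched, and phoneme "k", two states ahead, has the maximum probability. The transition to "k" is thus determined, skipping "Silence".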
【００５４】
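The positions of silences and breaths also often differ between singers. In that case the phoneme positions go out of step in the embodiment above.
Therefore, as shown in FIG. 13, the probabilities of jumping from a sounded phoneme unit to "Silence", to "Aspiration" or to another sounded phoneme unit are set equal.
For example, in the target's phoneme description, no "Aspiration" is described within several states of phoneme "i". Nevertheless, the probability of moving to phoneme "n", which follows "i" in the description, and the probability of jumping to the undescribed "Silence" or "Aspiration" are set equal, and the probability of returning from "Silence" or "Aspiration" to a phoneme in the description is set equal as well. Then, as in the example of FIG. 13, alignment remains flexible even if the singer takes a breath at time tm7 that is not in the target's phoneme description.
In addition, regardless of the target's phoneme description, one fricative may be followed by another, so while a fricative is being aligned the maximum probability may be searched for among the fricatives and the next phoneme in the target's description.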
【００５５】
[３−２．似通った音素]
日本語では、同じ言葉でも歌唱者によって異なる音素で発音する場合がある。たとえば、図１４に示すように、音素記述では“nagara”であっても、“nakara”“nagala”“nakala”などと発音される場合がある。
このように、似通った音素については、グループ化したパスを持つ隠れマルコフモデルを用いることにより、柔軟性のあるアライメントを実現することができる。
【００５６】
[３−３．その他]
上記実施形態においては、アライメント対象となる対象音声と入力音声とを時系列で対応づける音声処理装置を、物まね機能を有するカラオケ装置に適用しているが、これに限らず、カラオケ装置であれば例えば採点に用いてもよいし、歌唱を補正するために用いても良い。
また、音素単位で時系列を一致させる技術はカラオケ装置に限らず、他の音声認識に関する装置にも適用することが可能である。
【００５７】
上記実施形態では、音声信号の代表的な特徴パラメータを特徴ベクトルとして所定数のシンボルにクラスタ化した符号帳と、各音素毎に状態遷移確率および前記各シンボルの観測確率とを記憶する音素辞書について説明しているが、上述した５種類の特徴ベクトルに限らず他のパラメータを用いてもよい。
【００５８】
上記実施形態では、対象音声および入力音声をフレーム単位で周波数分析しているが、分析の手法は上述したＳＭＳに限定されるものではないし、時間領域の波形データとして分析しても構わない。あるいは、周波数と波形とを併用した分析をおこなっても構わない。
【００５９】
【発明の効果】
以上説明したように、本発明によれば、アライメント対象となる対象音声と入力音声とを時系列で対応づける音声処理を少ない記憶容量でリアルタイムで処理可能となる。
【図面の簡単な説明】
【図１】本発明の概要を説明する図である。
【図２】実施形態の構成を示すブロック図である。
【図３】符号帳を説明する図である。
【図４】音素を説明する図である。
【図５】音素辞書を説明する図である。
【図６】ＳＭＳ分析を説明する図である。
【図７】対象音声のデータについて説明する図である。
【図８】実施形態の動作を説明するフローチャートである。
【図９】入力音声の分析について説明する図である。
【図１０】隠れマルコフモデルを説明する図である。
【図１１】アライメントについて具体例を示した図である。
【図１２】楽曲との同期について説明する図である。
【図１３】音素の飛び越しを行う場合について説明する図である。
【図１４】似通った音素が発声される場合について説明する図である。
【符号の説明】
１……マイク、
２……分析窓生成部、
３……入力音声信号切出部、
４……高速フーリエ変換部、
５……入力音声分析部、
５ｓ……周波数分析部、
５ｐ……特徴パラメータ分析部、
６……音素辞書記憶部、
７……シンボル量子化部、
８……音素列状態形成部、
９……状態遷移決定部、
１０……アライメント部、
１１……ターゲットフレーム情報保持部、
１２……合成部、
１３……逆高速フーリエ変換部、
１４……曲データ記憶部、
１５……シーケンサ、
１６……音源、
１７……ミキサ、
１８……スピーカ。
[0001]
[Technical field of the invention]
The present invention relates to a voice processing device that associates a target voice to be aligned with an input voice in time series, and a karaoke apparatus including the voice processing device.
[0002]
[Prior art]
Conventionally, in technical fields such as karaoke, voice processing techniques have been proposed that convert a singer's voice so that it resembles the voice of a specific person, such as a professional singer.
Such processing usually requires alignment, which associates two audio signals in time series. For example, as shown in FIG. 1, even when the voice of a singer uttering "nakinagara" is synthesized so as to resemble the voice of a target uttering "nakinagara", the timing at which the target utters the sound "ki" may differ from the timing at which the singer utters it.
Even when people pronounce the same words, the durations differ and often stretch and shrink nonlinearly. For comparing two voices, a technique called DP matching (Dynamic Time Warping: DTW) is therefore known, which normalizes time by nonlinearly stretching and shrinking the time axis so that the same phonemes correspond to each other. Because DP matching uses a standard time series for each word or phoneme as a reference pattern, it can match phoneme by phoneme despite changes in the temporal structure of the time-series pattern.
A technique using a Hidden Markov Model (HMM) that is excellent with respect to spectral fluctuation is also known. In the hidden Markov model, the statistical variation of the spectrum time series is reflected in the parameters of the model, so that it can be matched in phoneme units to the spectral variation caused by individual differences among speakers.
[0003]
[Problems to be solved by the invention]
However, DP matching is inaccurate with respect to spectral variation, and conventional techniques using hidden Markov models require enormous storage capacity and computation. Neither was therefore suited to processing that requires real-time operation, such as voice imitation in a karaoke apparatus.
[0004]
The present invention has been made to solve the above problems. Its object is to provide a voice processing device that can associate a target voice to be aligned with an input voice in time series, in real time and with a small storage capacity, and a karaoke apparatus including the voice processing device.
[0005]
[Means for Solving the Problems]
  To solve the above problems, the invention according to claim 1 is a voice processing device that associates a target voice to be aligned with an input voice in time series, comprising:
  target voice information storage means that analyzes the target voice frame by frame and stores, frame by frame, a phoneme string that is time-series information on the phonemes of the target voice;
  phoneme information storage means that stores a codebook in which representative feature parameters of speech signals are clustered as feature vectors into a predetermined number of symbols, and, for each phoneme, state transition probabilities and the observation probability of each symbol;
  input voice quantization means that analyzes the feature parameters of the input voice signal frame by frame and symbol-quantizes them on the basis of the codebook stored in the phoneme information storage means to obtain observation symbols of the input voice;
  state forming means that forms the states of each phoneme on a finite state network with a hidden Markov model, on the basis of the state transition probabilities and observation probabilities stored in the phoneme information storage means;
  state transition determining means that determines state transitions with the Viterbi algorithm according to the observation symbols quantized by the input voice quantization means and the phoneme states formed by the state forming means; and
  alignment means that, on the basis of the state transitions determined by the state transition determining means, identifies the frame of the target voice whose phoneme matches each frame of the input voice.
[0006]
The invention according to claim 2 is the speech processing apparatus according to claim 1,
The feature vector includes a vector that characterizes a spectral characteristic of speech with a mel cepstrum coefficient.
The invention according to claim 3 is the speech processing apparatus according to claim 1,
The feature vector includes a vector that characterizes a spectral characteristic of speech with a differential mel cepstrum coefficient.
The invention according to claim 4 is the speech processing apparatus according to claim 1,
The feature vector includes a vector that characterizes speech with differential energy.
The invention according to claim 5 is the speech processing apparatus according to claim 1,
The feature vector includes a vector that characterizes speech with energy.
The invention according to claim 6 is the speech processing apparatus according to claim 1,
The feature vector includes a vector that characterizes voiced sound likeness with a zero cross rate and a pitch error.
[0007]
The invention according to claim 7 is the speech processing apparatus according to claim 1,
The phoneme information storage unit stores a codebook generated by vector quantization using a clustering algorithm from prediction vectors of a large number of learning sets.
The invention according to claim 8 is the speech processing apparatus according to claim 1,
The phoneme information storage means stores a state transition probability and an observed symbol probability for a feature vector in each phoneme obtained by estimating a parameter that maximizes a model likelihood for learning data.
[0008]
The invention according to claim 9 is the speech processing apparatus according to claim 1,
The state transition determining unit searches for an optimal state from a range of about several states from the state formed by the state forming unit, and jumps to the optimal state.
The invention according to claim 10 is the speech processing apparatus according to claim 1,
The state forming means forms a state having a path that allows a jump from a phoneme to a silent state or a breathing state regardless of the phoneme string of the target speech.
The invention according to claim 11 is the speech processing apparatus according to claim 1,
The state forming means forms a state having a path with an equivalent transition probability for a group having approximate phonemes in the phoneme string of the target speech.
[0009]
  The invention according to claim 12 is the speech processing apparatus according to claim 1, wherein
  the target speech information storage means stores the phoneme string in association with regions, each consisting of the frames of consecutive identical phonemes, and
  when the number of frames of the input speech that the state transition determining means has determined to correspond to consecutive identical phonemes is greater than the number of frames constituting the region corresponding to the matching phoneme of the target speech, the alignment means interpolates the missing frames of the target speech using predetermined frames stored in advance.
  The invention according to claim 13 is the speech processing apparatus according to claim 1, wherein
  the target speech information storage means stores the phoneme string in association with regions, each consisting of the frames of consecutive identical phonemes, and
  when the number of frames of the input speech that the state transition determining means has determined to correspond to consecutive identical phonemes is smaller than the number of frames constituting the region corresponding to the matching phoneme of the target speech, the alignment means identifies, in the region following that region, the frame of the target speech whose phoneme matches the frame following the frames corresponding to the consecutive identical phonemes.
  The invention according to claim 14 is the speech processing apparatus according to claim 1,
further comprising synthesis means that synthesizes the target speech and the input speech frame by frame, as associated by the alignment means.
[0010]
  The invention according to claim 15 is the speech processing apparatus according to claim 14, comprising
  Frequency analysis means for extracting a sine wave component and a residual component by performing frequency analysis on the input audio signal in units of frames,
  The target speech information storage means stores a sine wave component and a residual component extracted by frequency analysis of the target speech in advance in units of frames,
  The speech processing apparatus characterized in that the synthesizing unit synthesizes the sine wave component and residual component of the input speech with the sine wave component and residual component of the target speech at a predetermined ratio.
  The invention according to claim 16 is the speech processing apparatus according to claim 15,
  Waveform generation means for generating a synthesized speech waveform from the sine wave component and residual component synthesized by the synthesis means by inverse frequency conversion is provided.
[0011]
  The invention according to claim 17 is a karaoke apparatus comprising the voice processing device according to claim 1, and further comprising
  Song data storage means for storing song data constituting the song;
  Playback means for playing back music based on the song data;
  Synchronization means for synchronizing the phoneme sequence of the target speech and the frame data extracted by frequency analysis of the target speech in advance in units of frames with the time series of the song data;
  A synthesizing unit that synthesizes the target speech and the input speech in units of frames based on correspondence by the alignment unit;
  An output means for synchronously outputting the reproduction of music by the reproduction means and the sound synthesized by the synthesis means based on the synchronization by the synchronization means;
  It is characterized by providing.
  The invention according to claim 18 is the karaoke apparatus according to claim 17, wherein
  The state transition determination means is characterized in that a weighting function is given to the state transition probability in synchronization with the time series of the music data.
[0012]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0013]
[1. Configuration of the embodiment]
[1-1. Overall configuration]
FIG. 2 shows the configuration of this embodiment. The embodiment applies the present invention to a karaoke apparatus with an imitation function: the voice input from the microphone of the singer (Me) is converted so that it resembles the voice of the person being imitated (Target), such as a professional singer, and is then output.
More specifically, data obtained by analyzing the target voice in frames of a predetermined duration is stored in advance, and the input voice is analyzed in frames of the same duration. If the target frame corresponding to the time of an input frame can be identified, the two voices can be brought into the same time relationship. Voice conversion is then performed by synthesizing frame data in which the input voice and the target voice have been matched phoneme by phoneme.
[0014]
In FIG. 2, the microphone 1 picks up the voice of the singer attempting the imitation and outputs it to the input voice signal cutout unit 3 as the input voice signal Sv. The analysis window generation unit 2 generates an analysis window AW (for example, a Hamming window) whose period is a fixed multiple of the pitch period detected in the previous frame, and outputs it to the input voice signal cutout unit 3. In the initial state, or when the previous frame was unvoiced (including silence), an analysis window with a preset fixed period is output to the input voice signal cutout unit 3 as the analysis window AW.
The input voice signal cutout unit 3 multiplies the input analysis window AW and the input voice signal Sv, cuts out the input voice signal Sv in units of frames, and outputs it as a frame voice signal FSv to the fast Fourier transform unit 4. The fast Fourier transform unit 4 obtains a frequency spectrum from the frame sound signal FSv and outputs it to the input sound analysis unit 5 including the frequency analysis unit 5s and the feature parameter analysis unit 5p.
[0015]
The frequency analysis unit 5s performs the SMS (Spectral Modeling Synthesis) analysis described later to extract a sine wave component and a residual component, and holds them as the singer's frequency component information for the analyzed frame.
The feature parameter analysis unit 5 p extracts feature parameters that characterize the spectral characteristics of the input speech and outputs the feature parameters to the symbol quantization unit 7. In this embodiment, feature vectors of five types (mel cepstrum coefficient, difference mel cepstrum coefficient, difference energy coefficient, energy, and voiceness) described later are used as feature parameters.
[0016]
As will be described in detail later, the phoneme dictionary storage unit 6 stores a phoneme dictionary consisting of a codebook of feature vectors and probability data giving the state transition probabilities and symbol occurrence probabilities for each phoneme.
The symbol quantization unit 7 refers to the codebook stored in the phoneme dictionary storage unit 6, selects the feature symbols for the frame, and outputs them to the state transition determination unit 9.
The phoneme sequence state formation unit 8 forms the phoneme sequence states by a hidden Markov model (HMM), and the state transition determination unit 9 determines the state transition according to the Viterbi algorithm described later, using the frame-by-frame feature symbols obtained from the input speech.
[0017]
The alignment unit 10 determines the time pointer of the input voice from the determined state transition, specifies the target frame corresponding to the time pointer, and outputs to the synthesis unit 12 the frequency component of the input voice held in the frequency analysis unit 5s and the frequency component of the target person held in the target frame information holding unit 11.
The target frame information holding unit 11 stores frequency analysis data obtained in advance by frame-by-frame frequency analysis, together with a phoneme string described in units of time regions each consisting of several frames.
[0018]
The synthesizing unit 12 generates a new frequency component obtained by synthesizing the frequency component of the input speech and the frequency component of the subject at a predetermined ratio, and outputs the new frequency component to the inverse fast Fourier transform unit 13. A new audio signal is generated by performing inverse fast Fourier transform on the frequency components.
This embodiment is a karaoke apparatus having an imitation function. The song data storage unit 14 stores karaoke song data consisting of MIDI data, time data, lyric data, and the like. The apparatus also includes a sequencer 15 that advances the song according to the song data, and a sound source 16 that generates a musical sound signal from the output data of the sequencer 15.
The mixer 17 mixes the sound signal output from the inverse fast Fourier transform unit 13 with the musical sound signal output from the sound source 16 and outputs the result from the speaker 18.
In this way, when the singer sings into the microphone 1, a new voice in which the singer's voice has been converted to resemble the voice of the target person is output from the speaker 18 together with the karaoke accompaniment.
[0019]
[1-2. Phoneme dictionary]
Next, the phoneme dictionary used in this embodiment will be described. The phoneme dictionary consists of a codebook, in which representative feature parameters of a speech signal are clustered as feature vectors into a predetermined number of symbols, and, for each phoneme, the state transition probabilities and the observation probability of each symbol.
[0020]
[1-2-1. Feature vector]
Before describing the codebook, first, feature vectors used in the present embodiment will be described.
(1) Mel cepstrum coefficient (b_MEL)
The mel cepstrum coefficient is a coefficient that represents the spectral characteristics of sound with a small order, and is clustered into 128 symbols as a 12-dimensional vector in this embodiment.
(2) Differential Mel cepstrum coefficient (b_deltaMEL)
The difference mel cepstrum coefficient is a coefficient representing the time difference of the mel cepstrum coefficient, and is clustered into 128 symbols as a 12-dimensional vector in this embodiment.
(3) Differential energy coefficient (b_deltaENERGY)
The difference energy coefficient is a coefficient representing a time difference in sound intensity, and is clustered into 32 symbols as a one-dimensional vector in this embodiment.
(4) Energy (b_ENERGY)
Energy is a coefficient representing the intensity of sound, and is clustered into 32 symbols as a one-dimensional vector in this embodiment.
(5) Voiceness (b_VOICENESS)
Voiceness is a feature vector that represents the likelihood of a voiced sound, and is clustered into 32 symbols as a two-dimensional vector that characterizes speech with a zero-cross rate and pitch error. Hereinafter, the zero cross rate and the pitch error will be described.
[0021]
(1) Zero cross rate
The zero-cross rate becomes lower the more voiced the sound is, and is defined by the following equation.
[Expression 1]

where sgn{s(n)} = +1 if s(n) >= 0 and -1 if s(n) < 0,
N: number of samples in the frame
W: frame window
s: input signal
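The body of Expression 1 is not recoverable from this text, so the sketch below assumes the commonly used form Z = (1 / 2N) * sum over n of |sgn s(n) - sgn s(n-1)| * W(n), using the sgn definition and the symbols N, W, and s given above. The function names and the rectangular default window are assumptions made for illustration.

```python
from typing import List, Optional


def sgn(value: float) -> int:
    """Sign function as defined in the text: +1 for value >= 0, -1 otherwise."""
    return 1 if value >= 0 else -1


def zero_cross_rate(samples: List[float],
                    window: Optional[List[float]] = None) -> float:
    """Zero-cross rate of one frame under the assumed normalization 1 / (2N)."""
    n = len(samples)
    if n < 2:
        return 0.0
    if window is None:
        window = [1.0] * n  # rectangular window (assumption)
    total = 0.0
    for i in range(1, n):
        # Each sign change contributes |(+1) - (-1)| = 2, hence the factor 1/2.
        total += abs(sgn(samples[i]) - sgn(samples[i - 1])) * window[i]
    return total / (2.0 * n)
```

A frame that alternates sign on every sample gives a value close to 1, while a frame that never changes sign gives 0, which matches the statement that voiced sounds have a lower zero-cross rate.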
[0022]
(2) Pitch error
The pitch error indicates the likelihood of a voiced sound by measuring the mismatch in two directions: the error from the predicted pitch to the measured pitch, and the error from the measured pitch to the predicted pitch. Details are given in "Fundamental Frequency Estimation in the SMS Analysis" (P. Cano, Proceedings of the Digital Audio Effects Workshop, 1998), where it is described as the Two-Way Mismatch method.
[0023]
First, the pitch error from the predicted pitch (p) to the measured pitch (m) is expressed by the following equation.
[Expression 2]

f_n: n-th predicted peak frequency
Δf_n: difference between the n-th predicted peak frequency and the measured peak frequency closest to it
a_n: n-th measured amplitude
A_max: maximum amplitude
[0024]
On the other hand, the pitch error from the measured pitch (m) to the predicted pitch (p) is expressed by the following equation.
[Equation 3]

f_k: k-th predicted peak frequency
Δf_k: difference between the k-th predicted peak frequency and the measured peak frequency closest to it
a_k: k-th measured amplitude
A_max: maximum amplitude
[0025]
Therefore, the total error is as follows:
[Expression 4]

As constants, p = 0.5, q = 1.4, and r = 0.5 have been experimentally reported to be optimal for most speech.
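The bodies of Expressions 2 to 4 are not recoverable from this text. For reference, the two-way mismatch error is commonly written in the form below, using the symbols f_n, Δf_n, a_n, f_k, Δf_k, a_k, A_max, p, q, and r defined above. The upper limits N and K (the number of predicted and measured peaks) and the weighting factor ρ are assumptions taken from the usual statement of the method and do not appear in the text; whether the original expressions match this form term by term cannot be confirmed here.

```latex
\begin{align}
Err_{p \to m} &= \sum_{n=1}^{N} \left[ \Delta f_n \, f_n^{-p}
  + \frac{a_n}{A_{max}} \left( q \, \Delta f_n \, f_n^{-p} - r \right) \right] \\
Err_{m \to p} &= \sum_{k=1}^{K} \left[ \Delta f_k \, f_k^{-p}
  + \frac{a_k}{A_{max}} \left( q \, \Delta f_k \, f_k^{-p} - r \right) \right] \\
Err_{total} &= \frac{Err_{p \to m}}{N} + \rho \, \frac{Err_{m \to p}}{K}
\end{align}
```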
[0026]
[1-2-2. Codebook]
The codebook stores vector information clustered in the number of symbols for each feature vector (see FIG. 3).
The codebook is created by finding, from all the prediction vectors in a large learning set, a set of K prediction vectors (codes) that minimizes the quantization distortion. In this embodiment, the LBG algorithm is used as the clustering algorithm.
[0027]
The LBG algorithm is shown below.
(1) Initialization
First, the centroid of all the vectors is found. This is used as the initial code vector.
(2) Repeat
If I is the total number of iterations, 2^I code vectors are required. Therefore, for each iteration i = 1, 2, ..., I, the following calculation is performed.
1) Each existing code vector x is split into two codes, x(1 + e) and x(1 - e), where e is a small value such as 0.001.
This yields 2^i new code vectors x_k^i (k = 1, 2, ..., 2^i).
2) Each prediction vector x in the learning set is quantized to a code x_k^i:
k' = argmin_k d(x, x_k^i)
where d(x, x_k^i) is the distortion distance in the prediction space.
3) Within each iteration, for each k, x_k^i is recalculated as the centroid of all vectors x for which x_k^i = Q(x).
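The splitting procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance is used as the distortion d, steps 2) and 3) are repeated a fixed number of times per split (`refine_passes`, a parameter introduced here), and an empty cell keeps its previous code vector. None of these choices is specified in the text.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, used here as the distortion d (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector: Vector, codes: List[Vector]) -> int:
    """Index k' = argmin_k d(vector, codes[k])."""
    return min(range(len(codes)), key=lambda k: sq_dist(vector, codes[k]))


def lbg(training: List[Vector], iterations: int, eps: float = 0.001,
        refine_passes: int = 10) -> List[Vector]:
    """Return 2**iterations code vectors obtained by repeated splitting."""
    codes = [centroid(training)]                      # (1) initialization
    for _ in range(iterations):                       # (2) repeat
        # 1) split every code x into x(1 + e) and x(1 - e)
        codes = [[c * s for c in code]
                 for code in codes for s in (1.0 + eps, 1.0 - eps)]
        for _ in range(refine_passes):
            # 2) quantize each training vector to its nearest code
            cells: List[List[Vector]] = [[] for _ in codes]
            for v in training:
                cells[nearest(v, codes)].append(v)
            # 3) move each code to the centroid of its cell (keep it if empty)
            codes = [centroid(cell) if cell else code
                     for cell, code in zip(cells, codes)]
    return codes
```

With the symbol counts given for this embodiment, 128 symbols would correspond to 7 iterations and 32 symbols to 5 iterations.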
[0028]
[1-2-3. Probability data]
Next, probability data will be described. In this embodiment, PLU (pseudophoneme unit) is used as a subword unit for modeling speech. More specifically, as shown in FIG. 4, Japanese is handled in units of 27 phonemes, and each phoneme is assigned a number of states.
The number of states means the shortest number of frames that are maintained in units of subwords. For example, the number of states of phoneme “a” is “3”, which means that phoneme “a” continues for at least three frames.
The three states are a pseudo representation of the onset, the steady state, and the release of the sound. Plosives such as the phonemes “b” and “g” are given two states because these phonemes are inherently short, and breathing (ASPIRATION) is also given two states. Silence (SILENCE) is given one state because it has no temporal variation.
[0029]
As shown in FIG. 5, the probability data in the phoneme dictionary describes, for each of the 27 phonemes expressed in subword units, the transition probability of each state and the observation symbol occurrence probability for each symbol of each feature vector. Although omitted in FIG. 5, the observation symbol occurrence probabilities for each feature vector sum to 1.
These parameters are obtained by estimating the parameters of the subword unit model that maximize the likelihood of the model for the learning data. Here, the segmental k-means training algorithm is used.
[0030]
The segmental k-means training algorithm is shown below.
(1) Initialization
First, for initial estimation data that has been segmented into phonemes in advance, each phoneme segment is divided linearly among the HMM states.
(2) Estimate
As shown in the following equation, the transition probability is obtained by counting the number of transitions (in frame units) from state i to state j and dividing it by the number of all transitions (in frame units) out of state i.
[Equation 5]
a_ij = (number of transitions from state i to state j) / (number of all transitions from state i)
[0031]
On the other hand, the symbol occurrence probability is obtained by counting the number of occurrences of each feature symbol in each state and dividing it by the number of all symbol occurrences in that state, as shown in the following equation.
[Formula 6]
b_j(k) = (number of occurrences of symbol k in state j) / (number of all symbol occurrences in state j)
[0032]
(3) Segmentation
The learning set is re-segmented through the Viterbi algorithm using the estimation parameter obtained in step (2).
(4) Repeat
Repeat steps (2) and (3) until convergence.
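Steps (1) and (2) above can be sketched as follows. This is a minimal illustration, assuming one symbol stream and integer state labels; the function names `linear_segment` and `estimate` are introduced here and are not from the text. The re-segmentation of step (3) by the Viterbi algorithm is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def linear_segment(num_frames: int, num_states: int) -> List[int]:
    """Step (1): spread the frames of one phoneme segment evenly over its states."""
    return [frame * num_states // num_frames for frame in range(num_frames)]


def estimate(state_seqs: List[List[int]], symbol_seqs: List[List[int]]
             ) -> Tuple[Dict[int, Dict[int, float]], Dict[int, Dict[int, float]]]:
    """Step (2): relative-frequency estimates of a_ij and b_j(k)."""
    trans: Dict[int, Counter] = defaultdict(Counter)
    emit: Dict[int, Counter] = defaultdict(Counter)
    for states, symbols in zip(state_seqs, symbol_seqs):
        for state, symbol in zip(states, symbols):
            emit[state][symbol] += 1            # occurrences of symbol in state
        for prev, nxt in zip(states, states[1:]):
            trans[prev][nxt] += 1               # frame-level transition prev -> nxt
    a = {i: {j: c / sum(row.values()) for j, c in row.items()}
         for i, row in trans.items()}
    b = {j: {k: c / sum(row.values()) for k, c in row.items()}
         for j, row in emit.items()}
    return a, b
```

In a full training loop, `estimate` would be called on the current segmentation, the learning set would be re-segmented with the new parameters, and the two steps would alternate until the segmentation stops changing.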
[0033]
[1-3. Target frame information]
In the target frame information holding unit 11, the subject's voice is SMS analyzed and stored in advance in units of frames.
First, the SMS analysis will be described with reference to FIG. 6. In SMS analysis, a speech waveform (frame) is cut out by multiplying the sampled speech waveform by a window function, and a sine wave component and a residual component are extracted from the frequency spectrum obtained by performing a fast Fourier transform (FFT).
[0034]
The sine wave component is a component of a fundamental frequency (Pitch) and a frequency (overtone) that is a multiple of the fundamental frequency. In this embodiment, the fundamental frequency is held as “Fi”, the average amplitude of each component is held as “Ai”, and the spectrum envelope is held as an envelope.
The residual component is the component obtained by removing the sine wave component from the input signal. In this embodiment, the residual component is held as frequency domain data as shown in FIG. 6.
The frequency analysis data consisting of the sine wave component and residual component obtained as shown in FIG. 6 is stored in units of frames as shown in FIG. 7. In this embodiment, the time interval between frames is 5 ms, so the time can be specified by counting frames. Each frame has a time stamp corresponding to the elapsed time from the beginning of the song (tt1, tt2, ...).
[0035]
As described above, each phoneme lasts for at least as many frames as the number of states set for it, so in the target frame information each phoneme consists of a plurality of frames. Such a group of frames is defined as a region.
The target frame information holding unit 11 stores the phoneme string sung by the target person, with each phoneme described in association with its region. In the example shown in FIG. 7, the region composed of the frames tt1 to tt5 corresponds to the phoneme “n”, and the region composed of the frames tt6 to tt10 corresponds to the phoneme “a”. If the target frame information is held in this way and the same frame analysis is performed on the input speech, then once the two are matched in units of phonemes, the time can be specified by the frame and the synthesis can be performed using the frequency analysis data.
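The association between phonemes and regions can be pictured with a small data structure. This is a hypothetical sketch: the `Region` class, its field names, and the assumption that frame tt1 lies at 0 ms are introduced here for illustration. Only the 5 ms frame interval and the two example regions come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

FRAME_INTERVAL_MS = 5  # frame spacing stated in the text


@dataclass
class Region:
    phoneme: str  # phoneme label, e.g. "n"
    first: int    # 1-based index of the first frame (tt1 -> 1)
    last: int     # 1-based index of the last frame, inclusive


def find_region(regions: List[Region], frame: int) -> Optional[Region]:
    """Return the region that contains the given target frame, if any."""
    for region in regions:
        if region.first <= frame <= region.last:
            return region
    return None


def frame_offset_ms(frame: int) -> int:
    """Offset of frame tt<frame> from tt1, assuming tt1 is at 0 ms."""
    return (frame - 1) * FRAME_INTERVAL_MS


# Regions described for FIG. 7: tt1 to tt5 -> "n", tt6 to tt10 -> "a".
regions = [Region("n", 1, 5), Region("a", 6, 10)]
```

Looking up a frame returns its phoneme, and counting frames gives the elapsed time, which is how the time can be specified by the frame once the two voices are matched in units of phonemes.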
[0036]
[2. Operation of the embodiment]
Next, the operation of the embodiment will be described.
[0037]
[2-1. Overview operation]
First, the outline operation will be described with reference to the flowchart shown in FIG. 8.
First, microphone input voice analysis is performed (S1). Specifically, fast Fourier transform is performed in frame units, and frequency analysis data obtained by performing SMS analysis from the frequency spectrum is held. Also, feature parameter analysis is performed from the frequency spectrum, and symbol quantization is performed based on the phoneme dictionary.
[0038]
Next, the phoneme states are formed by the HMM based on the phoneme dictionary and the phoneme description sequence (S2), and the state transition is determined by the one-pass Viterbi algorithm based on the symbol-quantized feature parameters and the formed phoneme states (S3). The HMM and the one-pass Viterbi algorithm will be described in detail later.
Then, the time pointer of the input voice is determined based on the determined state transition (S4), and it is determined whether or not the time is in a new phoneme state (S5). The time pointer specifies a frame at the processing time in the time series of the input sound and the target sound. In this embodiment, the input voice and the target voice are subjected to frequency analysis in units of frames, and each frame is associated with a time series of the input voice and the target voice. Hereinafter, the time series of the input speech is expressed as times tm1, tm2,..., And the time series of the target speech is expressed as tt1, tt2.
[0039]
If it is determined in step S5 that a new phoneme state has been reached (S5; Yes), frame counting is started (S6), and the time pointer is moved to the head of the phoneme string (S7). The frame count is the number of frames processed so far in the current phoneme state. Since each phoneme continues over a plurality of frames as described above, this value indicates how many frames the phoneme has already continued.
Then, the frequency analysis data of the input speech frame and the frequency analysis data of the subject speech frame are synthesized in the frequency domain (S8), and inverse fast Fourier transform is performed (S9) to generate and output a new speech signal.
[0040]
If it is determined in step S5 that the state has not changed to a new phoneme state (S5; No), the frame count is incremented (S10), the time pointer is advanced by the frame time interval (S11), and the process proceeds to step S8.
As a specific example, in FIG. 7, while the phoneme state remains “n”, the frame count is incremented and the time pointer advances through tt1, tt2, and so on. However, if the phoneme state transitions to “a” at the time following the processing of frame tt3 as “n”, the time pointer is moved to tt6, the first frame of the phoneme “a”.
In this way, even if the subject person and the singer are at different pronunciation timings, time matching can be performed in units of phonemes.
[0041]
[2-2. Details of operation]
Next, each process referred to in the outline operation will be described in detail.
[0042]
[2-2-1. Input speech analysis]
FIG. 9 is a diagram for explaining in detail processing for analyzing input speech. As shown in FIG. 9, an audio signal cut out in units of frames from an input audio waveform is converted into a frequency spectrum by fast Fourier transform.
The frequency spectrum is retained as frequency component data by the SMS analysis described above.
The frequency spectrum is also subjected to feature parameter analysis. More specifically, for each feature vector, the symbol with the maximum likelihood is found from the phoneme dictionary, and symbol quantization is performed to obtain an observation symbol.
Using the observation symbol for each frame obtained in this way, the state transition is determined as will be described in detail later.
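Symbol quantization against a codebook can be sketched as a nearest-neighbor search. This is a minimal illustration that assumes the best-matching symbol is the code vector with the smallest squared Euclidean distance to the feature vector; the text does not state the distance measure, and the names `quantize` and `quantize_frame` are introduced here.

```python
from typing import Dict, List

Vector = List[float]


def quantize(feature: Vector, codebook: List[Vector]) -> int:
    """Return the index (symbol) of the code vector nearest to `feature`."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_frame(features: Dict[str, Vector],
                   codebooks: Dict[str, List[Vector]]) -> Dict[str, int]:
    """Quantize each feature vector of one frame with its own codebook."""
    return {name: quantize(vector, codebooks[name])
            for name, vector in features.items()}
```

Each frame then yields one observation symbol per feature vector type, for example one symbol for b_MEL and one for b_ENERGY.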
[0043]
[2-2-2. Hidden Markov Model]
Next, the hidden Markov model (HMM) will be described with reference to FIG. 10. Since the state of speech transitions in one direction, a left-to-right model is used.
[0044]
The probability that the state transitions from i to j at time t (state transition probability) is expressed as a_ij. In the example shown in FIG. 10, the probability of staying in state (1) is expressed as a_11, and the probability of transition from state (1) to state (2) is expressed as a_12.
Each state has feature vectors, each with different observation symbols. These are expressed as X = {x_1, x_2, ..., x_T}.
When the state is j at time t, the probability of generating the feature vector symbol x_t (observation symbol discrete probability) is expressed as b_j(x_t).
In the model λ, if the state sequence up to time T is Q = {q_1, q_2, ..., q_T}, the joint probability of the observation symbol series X and the state series Q can be expressed by the following equation.
[Expression 7]

Such a model is called a hidden Markov model (HMM) because the observation symbol series is known but the state series cannot be observed. In the present embodiment, an FSN (finite state network) as shown in FIG. 10 is formed for each phoneme based on the phoneme description sequence stored in the target frame information holding unit 11.
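The body of Expression 7 is not recoverable from this text. For reference, the joint probability of an observation series X and a state series Q under a discrete HMM λ is conventionally written as below, using a_ij and b_j(x_t) as defined above. The initial state probability π is an assumption of the textbook form and is not mentioned in the text; in a left-to-right model that always starts in its first state, π equals 1 for that state.

```latex
P(X, Q \mid \lambda) = \pi_{q_1} \, b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} \, b_{q_t}(x_t)
```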
[0045]
[2-2-3. Alignment]
Next, alignment in the present embodiment will be described with reference to FIGS. 11 and 12.
In the present embodiment, the state transition of the input speech is determined according to the one-pass Viterbi algorithm, using the above-described hidden Markov model formed based on the phoneme description sequence and the frame-by-frame feature symbols extracted from the input speech. The phonemes of the input speech and the phonemes of the target speech are then matched frame by frame. Since the alignment of the two audio signals is used in a karaoke apparatus in this embodiment, the time series of the music given by the song data is also synchronized with the time series of the audio signal.
Hereinafter, these processes will be sequentially described.
[0046]
[2-2-3-1. One-pass Viterbi algorithm]
The Viterbi algorithm calculates all probabilities that each observation symbol of an observation symbol series appears by each HMM model, and selects a path that gives the maximum probability later as a state transition result. However, since the state transition result is obtained after the observation symbol series is terminated, it is not suitable for real-time processing.
Therefore, in this embodiment, the phoneme state is determined up to that point using a one-pass Viterbi algorithm described below.
In the following formula, Ψ_t(j) selects the path that maximizes δ_t(j), the best probability at the frame of time t obtained in one pass and calculated from the observations up to the frame of time t. That is, the phoneme state transitions according to Ψ_t(j).
As the initial calculation, δ_1(i) = 1 is set, and then
[Equation 8]

is executed. Here, a_ij is the state transition probability from state i to state j, and N is the maximum number of states i and j, which is determined by the number of phonemes in the song to be sung. Also, b_j(O_t) is the symbol occurrence probability of the feature vector at time t. Since each observation symbol is a feature vector extracted from the input speech, the observation symbols differ depending on how the singer utters, and the manner of transition differs accordingly.
In the example shown in FIG. 11, the probabilities calculated by the above formula are indicated by ○ or △ (○ > △). For example, based on the observations from time tm1 to time tm3, the probability that a path from the state “Silence” to the state “n1” is formed is higher than the probability that a path from the state “Silence” to the state “Silence” is formed. The former is therefore the best probability at time tm3, and the state transition is determined as shown by the thick arrow in the figure.
By performing this calculation at each time (tm1, tm2, ...) corresponding to each frame of the input speech, in the example shown in FIG. 11 it is determined that the state transitions from “Silence” to “n1” at time tm3, from “n1” to “n2” at time tm5, from “n2” to “n3” at time tm9, and from “n3” to “a1” at time tm11.
Thereby, the phoneme of the input speech can be specified at each time in frame units.
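A frame-synchronous decision of this kind can be sketched as follows. The sketch assumes the standard Viterbi recursion δ_t(j) = max_i [δ_{t-1}(i) a_ij] b_j(O_t), with the state at each frame taken as the j that maximizes δ_t(j); the body of Equation 8 is not recoverable from this text, so this is an assumed form. It also assumes a single observation symbol per frame, decoding that starts in the first state of the network, and rescaling of δ at every frame to avoid underflow. The function name is introduced here.

```python
from typing import Dict, List


def one_pass_viterbi(observations: List[int],
                     trans: List[List[float]],
                     emit: List[Dict[int, float]]) -> List[int]:
    """Decide a state for every frame as soon as the frame is observed.

    trans[i][j] is a_ij and emit[j][o] is b_j(o). The decision at each
    frame is local: it uses only the observations up to that frame.
    """
    n = len(trans)
    delta = [1.0] + [0.0] * (n - 1)  # start in the first state (assumption)
    decided = []
    for t, obs in enumerate(observations):
        if t == 0:
            scores = [delta[j] * emit[j].get(obs, 0.0) for j in range(n)]
        else:
            scores = [max(delta[i] * trans[i][j] for i in range(n))
                      * emit[j].get(obs, 0.0) for j in range(n)]
        total = sum(scores)
        if total > 0.0:
            delta = [s / total for s in scores]  # rescale to avoid underflow
        # If every score is zero, the previous delta is kept unchanged.
        decided.append(max(range(n), key=lambda j: delta[j]))
    return decided
```

Because no backtracking over the whole utterance is needed, only the current δ vector has to be kept in memory, which is consistent with the stated aim of real-time processing with a small storage capacity.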
[0047]
[2-2-3-2. Frame unit correspondence]
When the state transition is determined as described above and the phoneme of the input speech is specified in units of frames, the target speech frame corresponding to the specified phoneme is then specified.
As described above, each state of the hidden Markov model is formed based on the phoneme string description of the target speech stored in the target frame information holding unit 11, so the frames of the target speech for the phoneme corresponding to each state can be specified.
In the present embodiment, as alignment, a process is performed in which frames having the same phoneme corresponding to the target voice and the input voice are matched in time series for each frame.
[0048]
In the example illustrated in FIG. 11, the frames of the target speech from time tt1 to tt3 correspond to the phoneme “Silence”, the frames from time tt4 to tt9 correspond to the phoneme “n”, and the frames from time tt10 onward correspond to the phoneme “a”.
On the other hand, the state transition of the input speech is determined by the one-pass Viterbi algorithm: the frames of the input speech from time tm1 to tm2 correspond to the phoneme “Silence”, the frames from time tm3 to tm10 correspond to the phoneme “n”, and the frames from time tm11 onward correspond to the phoneme “a”.
Then, as a frame corresponding to the phoneme “Silence”, the frame at the time tm1 of the input voice and the frame at the time tt1 of the target voice are matched, and the frame at the time tm2 of the input voice and the frame at the time tt2 of the target voice are matched.
Since the state “Silence” transitions to the state “n1” at the time tm3 of the input speech, the frame at the time tm3 of the input speech is the first frame as the frame corresponding to the phoneme “n”.
On the other hand, according to the phoneme string description, the frames of the target speech corresponding to the phoneme “n” start from the frame at time tt4, so the time pointer of the target speech at the start of utterance of the phoneme “n” is time tt4 (see steps S5 to S7 in FIG. 8).
Next, at time tm4 of the input voice, no transition to a new phoneme state has occurred, so the frame count is incremented and the time pointer of the target voice is advanced by the frame time interval (see steps S5 to S11 in FIG. 8), and the frame at time tt5 is matched with the frame at time tm4 of the input voice.
In this way, the time tm5 to tm7 of the input sound and the time tt6 to tt8 of the target sound are sequentially matched.
[0049]
In the example shown in FIG. 11, eight frames of the input speech, from time tm3 to tm10, correspond to the phoneme “n”, whereas the frames of the target speech corresponding to the phoneme “n” are those from time tt4 to tt9.
Since the singer may thus utter the same phoneme for longer than the target person, in this embodiment, when the target voice is shorter than the input voice, interpolation is performed using a loop frame prepared in advance.
The loop frame stores several frames of data for artificially reproducing the changes in pitch and amplitude that occur when a sound is sustained, consisting of, for example, the difference in fundamental frequency (ΔPitchi) and the difference in amplitude (ΔAmp).
In the target frame data, data instructing to call a loop frame is described in the last frame of each phoneme in the phoneme string. Thereby, even when the singer utters the same phoneme for a longer time than the subject, alignment can be performed satisfactorily.
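The time pointer update of steps S5 to S11, applied to the frame numbering given for FIG. 11, can be sketched as follows. This is a minimal illustration with one loud simplification: when the input phoneme outlasts the target region, the pointer is simply held at the last frame of the region, which stands in for the loop-frame interpolation and does not reproduce it. The function name and the end frame tt14 chosen for the phoneme “a” are assumptions.

```python
from typing import List, Tuple


def align_frames(input_phonemes: List[int],
                 regions: List[Tuple[int, int]]) -> List[int]:
    """Map each input frame to a target frame number.

    input_phonemes[t] is the position in the phoneme string decided for
    input frame t; regions[p] is the (first, last) target frame of phoneme p.
    """
    pointers = []
    current = None
    frame_count = 0
    for phoneme in input_phonemes:
        first, last = regions[phoneme]
        if phoneme != current:       # S5: a new phoneme state was reached
            current = phoneme
            frame_count = 1          # S6: start frame counting
        else:
            frame_count += 1         # S10: increment the frame count
        # S7 / S11: head of the region plus the frames already elapsed,
        # held at the last frame where a loop frame would be called.
        pointers.append(min(first + frame_count - 1, last))
    return pointers


# FIG. 11 as described: Silence tt1-tt3, "n" tt4-tt9, "a" from tt10
# (tt14 is an arbitrary end chosen for this example).
regions = [(1, 3), (4, 9), (10, 14)]
# Input: Silence at tm1-tm2, "n" at tm3-tm10, "a" from tm11.
input_phonemes = [0, 0] + [1] * 8 + [2] * 2
```

With these inputs, tm3 maps to tt4 and tm4 to tt5 as in the walk-through above, and tm9 and tm10 both map to tt9, the point at which the embodiment would switch to the loop frame.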
[0050]
[2-2-3-3. Sync with song data]
By the way, in this embodiment, voice conversion is applied to the karaoke apparatus, and the karaoke apparatus performs music based on the MIDI data. Therefore, it is desirable that the progress of the sound and the progress of the music are synchronized.
Therefore, in the present embodiment, the alignment unit 10 is configured to synchronize the time series indicated by the song data with the phoneme string of the target speech. More specifically, as illustrated in FIG. 12, the sequencer 15 generates music progress information based on the time information described in the music data (for example, Δ time indicating the reproduction time interval of the MIDI data, or tempo information) and outputs it to the alignment unit 10.
The alignment unit 10 compares the time information output from the sequencer 15 with the phoneme description sequence stored in the target frame information holding unit 11, and associates the time series of the song progression with the time series of the target speech.
[0051]
Also, a weighting function f(|t_m - t_t|) as shown in the figure is used to weight the state transition probabilities in synchronization with the music. This weighting function is a window function by which each state transition probability a_ij is multiplied.
In the figure, a and b are elements corresponding to the tempo of the music, and α is set to a value as close to 0 as possible.
As described above, since the time pointer of the target sound proceeds in synchronization with the tempo of the music, the introduction of such a weighting function results in accurate synchronization between the singing sound and the target sound.
[0052]
[3. Modified example]
The present invention is not limited to the above-described embodiments, and various modifications as described below are possible.
[0053]
[3-1. Phoneme jump]
In the above embodiment, the state transition is determined using the one-pass Viterbi algorithm, but this is not suitable when the singer gets the lyrics wrong, for example when the singer sings lyrics from a few phrases ahead or lyrics from a few frames before.
In such a case, as shown in FIG. 13, the search range for the optimum state may be expanded to several states before and after, and a jump may be performed only when the optimum state is determined.
More specifically, the input voice is in the state corresponding to the phoneme “a” at time tm4. Under the one-pass Viterbi algorithm described above, the phoneme at time tm5 is therefore chosen as whichever has the higher probability: remaining in the phoneme “a”, or transitioning to “Silence”, which follows the phoneme “a” in the phoneme string description.
However, since the singer has started uttering the phoneme “k” without a period of silence, it is desirable that “Silence” in the phoneme string description of the target person be skipped and aligned.
Therefore, when such a singer utters without following the phoneme string description of the target person, a state having the maximum probability may be searched up to a state around several states. In the example shown in FIG. 13, the range of three states before and after the immediately preceding frame state is searched, and the phoneme “k” that is two states ahead is set as the maximum probability. In this way, the state transition to the phoneme “k” is determined by skipping “Silence”.
[0054]
In many cases, the positions of silence and breathing also differ between the singer and the target person. In such cases, the phoneme positions do not match in the above embodiment.
Therefore, as shown in FIG. 13, the probability of jumping from a phoneme unit to “Silence” or “Aspiration” and the probability of transitioning to the next phoneme unit are set equal.
For example, in the phoneme string description of the target person, “Aspiration” is not described before or after the phoneme “i”. However, the probability of transitioning to the phoneme “n”, which is described next to the phoneme “i” in the phoneme description string, and the probability of jumping to “Silence” or “Aspiration”, which are not described, may be set equal. The probability of returning to a phoneme in the phoneme description string after jumping to “Silence” or “Aspiration” may also be set equal. In this way, as in the example shown in FIG. 13, alignment can be performed flexibly even when the singer takes a breath at time tm7 without following the phoneme description string of the target person.
Also, a transition from one fricative to another may occur regardless of the phoneme string description of the target person, so when aligning a fricative, the maximum probability may be searched for among the fricatives and the next phoneme in the phoneme description of the target speech.
[0055]
[3-2. Similar phonemes]
In Japanese, the same word may be pronounced with different phonemes depending on the singer. For example, as shown in FIG. 14, even when the phoneme description is “nagara”, it may be pronounced as “nakara”, “nagala”, “nakala”, or the like.
In this way, for similar phonemes, flexible alignment can be realized by using a hidden Markov model having a grouped path.
[0056]
[3-3. Other]
In the above embodiment, the speech processing device that associates the target speech to be aligned with the input speech in time series is applied to a karaoke device having an imitation function, but its use is not limited to this. For example, it may be used for scoring or for correcting a song.
In addition, the technique for matching the time series in units of phonemes is not limited to the karaoke device, and can be applied to other devices related to speech recognition.
[0057]
In the above-described embodiment, the phoneme dictionary has been described as storing a codebook in which representative feature parameters of a speech signal are clustered as feature vectors into a predetermined number of symbols, together with the state transition probabilities and the observation probability of each symbol for each phoneme. However, the present invention is not limited to the five types of feature vectors described above, and other parameters may be used.
[0058]
In the above embodiment, the target voice and the input voice are frequency-analyzed in units of frames. However, the analysis method is not limited to the above-described SMS, and may be analyzed as time domain waveform data. Or you may perform the analysis which used a frequency and a waveform together.
[0059]
[Effect of the invention]
As described above, according to the present invention, audio processing for associating target audio to be aligned with input audio in time series can be processed in real time with a small storage capacity.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an outline of the present invention.
FIG. 2 is a block diagram showing a configuration of an embodiment.
FIG. 3 is a diagram illustrating a code book.
FIG. 4 is a diagram illustrating phonemes.
FIG. 5 is a diagram illustrating a phoneme dictionary.
FIG. 6 is a diagram illustrating SMS analysis.
FIG. 7 is a diagram for explaining target audio data;
FIG. 8 is a flowchart illustrating the operation of the embodiment.
FIG. 9 is a diagram for explaining analysis of input speech.
FIG. 10 is a diagram illustrating a hidden Markov model.
FIG. 11 is a diagram showing a specific example of alignment.
FIG. 12 is a diagram for explaining synchronization with music.
FIG. 13 is a diagram illustrating a case where phonemes are skipped.
FIG. 14 is a diagram illustrating a case where similar phonemes are uttered.
[Explanation of symbols]
1 ... Microphone,
2 ... Analysis window generation unit,
3 ... Input voice signal cutout unit,
4 ... Fast Fourier transform unit,
5 ... Input speech analysis unit,
5s ... Frequency analysis unit,
5p ... Feature parameter analysis unit,
6 ... Phoneme dictionary storage unit,
7 ... Symbol quantization unit,
8 ... Phoneme sequence state formation unit,
9 ... State transition determination unit,
10 ... Alignment unit,
11 ... Target frame information holding unit,
12 ... Synthesis unit,
13 ... Inverse fast Fourier transform unit,
14 ... Song data storage unit,
15 ... Sequencer,
16 ... Sound source,
17 ... Mixer,
18 ... Speaker.

Claims

A speech processing apparatus that associates target speech to be aligned with input speech in time series, comprising:
target speech information storage means for storing, in units of frames, a phoneme sequence that is time-series information of the phonemes of the target speech and is obtained by analyzing the target speech in units of frames;
phoneme information storage means for storing a codebook in which representative feature parameters of speech signals are clustered, as feature vectors, into a predetermined number of symbols, and for storing, for each phoneme, state transition probabilities and an observation probability of each of the symbols;
input speech quantization means for analyzing feature parameters of an input speech signal in units of frames and symbol-quantizing the feature parameters of the input speech based on the codebook stored in the phoneme information storage means to obtain observation symbols of the input speech;
state forming means for forming the states of each phoneme on a finite state network by a hidden Markov model, based on the state transition probabilities and the observation probabilities stored in the phoneme information storage means;
state transition determination means for determining state transitions by the Viterbi algorithm in accordance with the observation symbols obtained by the input speech quantization means and the states of each phoneme formed by the state forming means; and
alignment means for specifying, based on the state transitions determined by the state transition determination means, the frame of the target speech whose phoneme matches that of each frame of the input speech.
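By way of illustration only, the determination of state transitions by the Viterbi algorithm can be sketched as follows in Python. The sketch makes several simplifying assumptions that are not taken from the claims: each phoneme of the target speech is modeled by a single state, the network is a plain left-to-right chain with one self-loop probability `p_stay` shared by all states, and the emission table `emit` and the observation symbols are hypothetical values.

```python
import math

def viterbi_align(observations, phonemes, emit, p_stay=0.6):
    """Align observation symbols to a left-to-right chain of phoneme states.

    observations: symbol indices, one per input frame
    phonemes:     phoneme sequence of the target speech (one state each)
    emit:         emit[phoneme][symbol] -> observation probability
    Returns, for every input frame, the index into `phonemes`.
    """
    n_states, n_frames = len(phonemes), len(observations)
    if n_frames < n_states:
        raise ValueError("need at least one frame per phoneme state")
    log_stay, log_next = math.log(p_stay), math.log(1.0 - p_stay)
    neg_inf = -math.inf

    def log_emit(state, symbol):
        p = emit[phonemes[state]][symbol]
        return math.log(p) if p > 0 else neg_inf

    # score[s]: best log-probability of any path that ends in state s.
    score = [neg_inf] * n_states
    score[0] = log_emit(0, observations[0])  # paths start in the first state
    back = []
    for t in range(1, n_frames):
        new_score, pointers = [neg_inf] * n_states, [0] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay
            move = score[s - 1] + log_next if s > 0 else neg_inf
            if move > stay:
                best, pointers[s] = move, s - 1
            else:
                best, pointers[s] = stay, s
            new_score[s] = best + log_emit(s, observations[t])
        score = new_score
        back.append(pointers)

    # Trace back from the last state (paths end in the final phoneme).
    state = n_states - 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical data: four phonemes and four codebook symbols.
phonemes = ["n", "a", "k", "i"]
emit = {
    "n": [0.80, 0.10, 0.05, 0.05],
    "a": [0.10, 0.80, 0.05, 0.05],
    "k": [0.05, 0.05, 0.80, 0.10],
    "i": [0.05, 0.05, 0.10, 0.80],
}
observations = [0, 0, 1, 1, 1, 2, 3, 3]

path = viterbi_align(observations, phonemes, emit)
aligned = [phonemes[s] for s in path]
print(aligned)  # → ['n', 'n', 'a', 'a', 'a', 'k', 'i', 'i']
```

The returned path gives, for each input frame, the phoneme state it occupies. A frame of the target speech carrying the same phoneme can then be looked up from the stored phoneme sequence.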

The speech processing apparatus according to claim 1, wherein
the feature vector includes a vector that characterizes the spectral characteristics of speech by mel-cepstrum coefficients.

The speech processing apparatus according to claim 1, wherein
the feature vector includes a vector that characterizes the spectral characteristics of speech by differential mel-cepstrum coefficients.

The speech processing apparatus according to claim 1, wherein
the feature vector includes a vector that characterizes speech by differential energy.

The speech processing apparatus according to claim 1, wherein
the feature vector includes a vector that characterizes speech by energy.

The speech processing apparatus according to claim 1, wherein
the feature vector includes a vector that characterizes the degree to which the speech is voiced by a zero-crossing rate and a pitch error.

The speech processing apparatus according to claim 1, wherein
the phoneme information storage means stores a codebook generated by vector quantization, using a clustering algorithm, from a large number of prediction vectors of a learning set.
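As an illustrative sketch only, a codebook of this kind could be generated from learning vectors with a clustering algorithm such as k-means. The claim does not name a particular algorithm, so k-means is an assumption here, as are the function name `kmeans_codebook`, the initial codewords, and the sample vectors.

```python
def kmeans_codebook(vectors, initial, iterations=20):
    """Generate a codebook by k-means clustering (an assumed algorithm).

    vectors: learning-set vectors (tuples of floats)
    initial: initial codewords, one per desired symbol
    Returns the list of codewords after convergence or `iterations` passes.
    """
    codebook = [tuple(c) for c in initial]
    for _ in range(iterations):
        # Assignment step: group every vector with its nearest codeword.
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(
                range(len(codebook)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])),
            )
            clusters[nearest].append(v)

        # Update step: move each codeword to the mean of its cluster.
        new_codebook = []
        for old, members in zip(codebook, clusters):
            if members:
                mean = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))
                )
                new_codebook.append(mean)
            else:
                new_codebook.append(old)  # keep a codeword with no members

        if new_codebook == codebook:
            break  # converged
        codebook = new_codebook
    return codebook

# Hypothetical learning set with two well-separated groups.
vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]

codebook = kmeans_codebook(vectors, initial)
print(codebook)  # → [(0.0, 1.0), (10.0, 11.0)]
```

The index of each resulting codeword serves as one symbol, so the number of initial codewords fixes the predetermined number of symbols.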

The speech processing apparatus according to claim 1, wherein
the phoneme information storage means stores, for each phoneme, state transition probabilities and observation symbol probabilities for the feature vectors, obtained by estimating the parameters that maximize the likelihood of the model for learning data.

The speech processing apparatus according to claim 1, wherein
the state transition determination means searches for an optimum state within a range of several states around the state formed by the state forming means, and jumps to the optimum state.

The speech processing apparatus according to claim 1, wherein
the state forming means forms states having a path that allows a jump from a phoneme to a silent state or a breath state regardless of the phoneme sequence of the target speech.

The speech processing apparatus according to claim 1, wherein
the state forming means forms states having paths with equal transition probabilities for a group of similar phonemes in the phoneme sequence of the target speech.

The speech processing apparatus according to claim 1, wherein
the target speech information storage means stores the phoneme sequence in association with regions each consisting of the frames of a consecutive identical phoneme, and
when the number of consecutive frames of the input speech determined by the state transition determination means to correspond to the same phoneme is greater than the number of frames constituting the region of the target speech corresponding to that phoneme, the alignment means interpolates the missing frames of the target speech using predetermined frames stored in advance.

The speech processing apparatus according to claim 1, wherein
the target speech information storage means stores the phoneme sequence in association with regions each consisting of the frames of a consecutive identical phoneme, and
when the number of consecutive frames of the input speech determined by the state transition determination means to correspond to the same phoneme is less than the number of frames constituting the region of the target speech corresponding to that phoneme, the alignment means specifies, for the frame following the frames that correspond to the consecutive identical phoneme, a frame of the target speech with a matching phoneme from the region that follows said region.

The speech processing apparatus according to claim 1, further comprising
synthesizing means for synthesizing the target speech and the input speech in units of frames based on the correspondence established by the alignment means.

The speech processing apparatus according to claim 14, further comprising
frequency analysis means for extracting a sine wave component and a residual component by frequency-analyzing the input speech signal in units of frames, wherein
the target speech information storage means stores, in units of frames, a sine wave component and a residual component extracted in advance by frequency analysis of the target speech, and
the synthesizing means synthesizes the sine wave component and the residual component of the input speech with the sine wave component and the residual component of the target speech at a predetermined ratio.
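For illustration, synthesis at a predetermined ratio can be sketched as a per-frame weighted blend of the corresponding components. The linear blend, the dictionary keys `sine_amplitudes` and `residual`, and the numeric values are assumptions made for this sketch; the claim itself only requires that the components be synthesized at a predetermined ratio.

```python
def mix(input_values, target_values, ratio):
    """Blend two equal-length sequences.

    ratio = 0.0 keeps the input speech, ratio = 1.0 keeps the target speech.
    Assumption: a simple linear blend is used.
    """
    if len(input_values) != len(target_values):
        raise ValueError("components must have the same length")
    return [(1.0 - ratio) * a + ratio * b
            for a, b in zip(input_values, target_values)]

def synthesize_frame(input_frame, target_frame, ratio=0.5):
    """Blend the sine wave and residual components of two aligned frames."""
    return {
        key: mix(input_frame[key], target_frame[key], ratio)
        for key in ("sine_amplitudes", "residual")
    }

# Hypothetical aligned frames (values chosen for illustration).
input_frame = {"sine_amplitudes": [1.0, 0.5], "residual": [0.25]}
target_frame = {"sine_amplitudes": [0.0, 1.5], "residual": [0.75]}

mixed = synthesize_frame(input_frame, target_frame, ratio=0.5)
print(mixed)  # → {'sine_amplitudes': [0.5, 1.0], 'residual': [0.5]}
```

A larger ratio moves the synthesized frame closer to the target speech, and a smaller ratio keeps more of the singer's own voice.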

The speech processing apparatus according to claim 15, further comprising
waveform generation means for generating a synthesized speech waveform by inverse frequency transform of the sine wave component and the residual component synthesized by the synthesizing means.

A karaoke apparatus comprising:
the speech processing apparatus according to claim 1;
song data storage means for storing song data constituting a song;
reproduction means for reproducing the song based on the song data;
synchronization means for synchronizing the phoneme sequence of the target speech, and frame data extracted in advance by frequency analysis of the target speech in units of frames, with the time series of the song data;
synthesizing means for synthesizing the target speech and the input speech in units of frames based on the correspondence established by the alignment means; and
output means for outputting, in synchronization with each other based on the synchronization by the synchronization means, the song reproduced by the reproduction means and the speech synthesized by the synthesizing means.

The karaoke apparatus according to claim 17, wherein
the state transition determination means applies a weighting function to the state transition probabilities in synchronization with the time series of the song data.
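One possible form of such a weighting function is sketched below purely as an illustration. The Gaussian shape, the parameters `width` and `floor`, the function name `weighted_transition`, and the renormalization step are all assumptions; the claim does not specify the form of the weighting function.

```python
import math

def weighted_transition(p_next, frame_time, expected_time,
                        width=0.2, floor=0.1):
    """Weight the probability of advancing to the next phoneme by song time.

    p_next:        unweighted probability of moving to the next state
    frame_time:    time of the current input frame, in seconds
    expected_time: time at which the song data expects the next phoneme
    The weight is 1.0 at expected_time and decays toward `floor` away
    from it (assumed Gaussian shape). Returns (p_stay, p_next), normalized.
    """
    distance = (frame_time - expected_time) / width
    weight = floor + (1.0 - floor) * math.exp(-0.5 * distance ** 2)
    scaled_next = p_next * weight
    scaled_stay = 1.0 - p_next
    total = scaled_stay + scaled_next
    return scaled_stay / total, scaled_next / total

# At the expected time the transition probability is left unchanged.
on_time = weighted_transition(0.4, frame_time=2.0, expected_time=2.0)

# One second too early, advancing to the next phoneme becomes less likely.
too_early = weighted_transition(0.4, frame_time=1.0, expected_time=2.0)

print(on_time, too_early)
```

With a weighting of this kind, the path found by the Viterbi algorithm is less likely to run far ahead of or far behind the position in the song indicated by the song data.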