JP4710130B2 - Audio signal separation method and apparatus

Info

Publication number: JP4710130B2
Application number: JP2000384745A
Authority: JP
Inventors: 多伸近藤
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2000-12-19
Filing date: 2000-12-19
Publication date: 2011-06-29
Anticipated expiration: 2020-12-19
Also published as: JP2002182689A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声信号を含む混合信号から音声信号のみを分離して音声信号及びその他の信号の少なくとも一方を抽出する音声信号分離方法及び装置に関する。
【０００２】
【従来の技術】
複数の音響信号が混在した信号から、特定の信号を強調・抑圧したり分離抽出する技術が知られている。音声信号に対しては、雑音と音声信号が混在した音響信号から雑音のみを抑圧する雑音抑圧方式（例えば特開平９−１５３７６９号、特開平９−２１２１９６号等）が、音楽に対しては演奏に含まれる旋律の分離や除去に関する方式（特開平１１−１４３４６０号等）が様々に提案されている。
【０００３】
雑音抑圧方式は、例えば信号増幅器などの音響処理装置において、聴取したい音声信号が雑音に埋もれてしまい、目的の音声信号が聴き難いといった問題に対処する。また、音楽に対する分離や除去の方式は、例えばある旋律だけを除いてカラオケのようなものを作成したりする。
【０００４】
特開平９−２１２１９６号では、スペクトラルサブトラクションと呼ばれる手法によって雑音抑圧を実現している。これは、入力信号中の音声／非音声を検出し、非音声区間で代表的な雑音振幅スペクトルを求め、音声区間でこれを入力信号の振幅スペクトルから差し引くことで雑音を抑圧する。合成時の位相成分については、混合された状態のときのものを用いる。ここでは、音声の母音が整数次倍音構造を持っていることを利用して、基本周波数とその倍音成分のパワーを加算したものを指標として、非音声の検出をしている。特開平９−２１２１９６号では、この指標に対する閾値を小さくすることで、確実に雑音であると思われる区間から代表的な雑音スペクトルを求め、音声の子音の影響を小さくしている。
【０００５】
特開平１１−１４３４６０号では、楽器音が整数次倍音構造を持っているものが多いことから、基本周波数とその倍音成分を同一楽器からの音であると判断している。そして、これらの周波数成分の時刻、振幅、位相の情報に基づいて波形を加算合成することで抽出や除去後の音を合成している。
【０００６】
【発明が解決しようとする課題】
雑音抑圧方式では、非音声信号とは雑音のことであり、これは不要なものである。従って、基本的には音声の抑圧された非音声側の信号を得ることはない。特開平９−２１２１９６号に開示されたスペクトラルサブトラクション法では、子音部分でも母音部分でも同じ抑圧処理をしている。ここでは、経時的に平均した代表的雑音スペクトルを用いているため、音声とその他の信号の混在区間で雑音抑圧方式に変更を加えずに非音声側の信号を出力しようと思った場合、常に代表的雑音スペクトルが出力されることになってしまい、非音声信号側の経時的な変化に追従できない。
【０００７】
また、音楽に対する分離や除去の方式では、整数次倍音構造を持たない信号は、全てその他の信号として処理されてしまうため、基本周波数の存在しない音声の子音部分に関しては、非音声信号に残留してしまう。非音声信号に対して適切な効果を与える場合に、残留した子音部分によってその効果が損なわれてしまう。例えばテレビのスポーツ実況に残響を付加し、臨場感を高める場合、実況音声と環境音信号とを分離し、環境音のみに残響付加することが望ましい。しかし、環境音側に子音部分だけが残ると、この子音にも残響付加され、高めたいはずの臨場感を損なってしまう。
【０００８】
この発明は、このような問題点に鑑みなされたもので、非音声信号側の経時的な変化に追従可能で、且つ子音部分も精度良く分離可能な音声分離方法及び装置を提供することを目的とする。
【０００９】
【課題を解決するための手段】
この発明に係る音声信号分離方法は、音声信号とその他の信号とが混合された混合信号から音声信号を分離して前記音声信号及びその他の信号の少なくとも一方を抽出する音声信号分離方法において、前記混合信号から整数次倍音構造に基づいて前記音声信号のうちの母音部分を検出して分離する母音処理ステップと、前記混合信号又は前記混合信号から前記母音部分を分離した残りの信号を子音判定対象信号とし、この子音判定対象信号から子音の特性に基づいて前記音声信号のうちの子音部分を検出して分離する子音処理ステップと、前記母音処理ステップで検出された音声信号の母音部分と前記子音処理ステップで検出された音声信号の子音部分とによって音声信号を分離して前記音声信号及びその他の信号の少なくとも一方を抽出する出力ステップとを備えたことを特徴とする。
【００１０】
また、この発明に係る音声信号分離装置は、音声信号とその他の信号とが混合された混合信号から音声信号を分離して前記音声信号及びその他の信号の少なくとも一方を抽出する音声信号分離装置において、前記混合信号から整数次倍音構造に基づいて前記音声信号のうちの母音部分を検出して分離する母音処理手段と、前記混合信号又は前記混合信号から前記母音部分を分離した残りの信号を子音判定対象信号とし、この子音判定対象信号から子音の特性に基づいて前記音声信号のうちの子音部分を検出して分離する子音処理手段と、前記母音処理手段で検出された音声信号の母音部分と前記子音処理手段で検出された音声信号の子音部分とによって音声信号を分離して前記音声信号及びその他の信号の少なくとも一方を抽出する出力手段とを備えたことを特徴とする。
【００１１】
この発明によれば、音声信号とその他の信号とが混合された混合信号から整数次倍音構造に基づいて音声信号のうちの母音部分を抽出すると共に、混合信号又は混合信号から母音部分を分離した残りの信号を子音対象信号として、この子音対象信号から子音の特性に基づいて子音部分を検出してこれを分離するようにしているので、母音部分と子音部分とが分離された残りの非音声信号は、経時的変化が反映されたものとなる。また、子音部分を含んで音声信号が混合信号から分離されるので、非音声信号に子音部分が含まれることがなく、非音声信号を処理する場合にも、精度の良い処理が可能になる。
【００１２】
なお、ここで“母音”とは、この明細書では、母音のみならず、整数次倍音構造を持つ有声子音も含む。また、“子音”とは、整数次倍音構造を持たない無声子音を意味する。子音処理時において子音区間を検出するために使用される子音の特性としては、例えば子音判定対象信号のスペクトル包絡、特定帯域のパワー（例えば４〜１０ｋＨｚ程度）等を使用することができる。スペクトル包絡を使用する場合、子音処理では、例えば混合信号から母音部分を分離した残りの信号のうち非子音区間のスペクトル包絡を経時的に蓄積し、この経時的に蓄積した非子音区間のスペクトル包絡と子音判定対象信号のスペクトル包絡との距離を定量的に評価して子音区間を検出する様にすればよい。また、予め学習された代表的な子音のスペクトル包絡と子音判定対象信号のスペクトル包絡との距離を定量的に評価するようにしても良い。スペクトル包絡間の距離尺度としては、例えば線形予測係数に対する最尤スペクトル距離、ＬＰＣ（線形予測）ケプストラム距離等を使用することができる。更に、特定帯域のパワーを使用する場合には、特定帯域のパワーと所定の閾値との比較を行えば良い。
【００１３】
また、子音処理では、混合信号から母音部分を分離した残りの信号のうち非子音区間のスペクトル包絡を経時的に蓄積し、この経時的に蓄積した非子音区間のスペクトル包絡と子音判定対象信号のスペクトル包絡との間で顕著に異なる帯域を分離する帯域として特定するようにすればよい。この他、混合信号から母音部分を分離した残りの信号のうち非子音区間のスペクトル包絡を経時的に蓄積し、この経時的に蓄積した非子音区間のスペクトル包絡を現在対象としている子音判定対象信号のパワーで正規化したスペクトル包絡と子音判定対象信号のスペクトル包絡との間で所定の閾値以上の関係を有する帯域を分離する帯域として特定するようにしても良い。
【００１４】
なお、子音部分の分離は、時間領域の信号に対しては、例えばバンドパスフィルタやノッチフィルタによる特定帯域のゲイン処理によって行うことができ、周波数領域の信号に対しては、例えばスペクトラルサブトラクションにより行うことができる。
【００１５】
【発明の実施の形態】
以下、図面を参照して、この発明の好ましい実施の形態について説明する。
図１は、この発明の一実施例に係る音声信号分離システムの構成を示すブロック図である。
音声信号とその他の信号（環境音、背景音、雑音等）とを含む混合信号Ｉは、母音処理部１と子音処理部２とに入力されている。母音処理部１では、混合信号Ｉに含まれる基本周波数ｆに基づいて混合信号Ｉから音声信号の母音部分を検出し、母音信号Ｖｖと、その他の信号Ｏ１とに分離する。子音処理部２では、混合信号Ｉのスペクトル包絡の特徴や特定帯域のパワー等に基づいて混合信号Ｉから音声信号に含まれる子音部分を検出し、混合信号Ｉを子音信号Ｖｃとその他の信号Ｏ２とに分離する。母音・子音判定部３は、母音処理部１からの母音／非母音判定結果ｖ／ｏと子音処理部２からの子音／非子音判定結果ｃ／ｏとに基づいて、母音区間、子音区間及び非音声区間を判定し、切替部４の切替制御を行う。切替部４は、母音・子音判定部３により切替制御され、母音区間では母音処理部１で分離された母音信号Ｖｖとその他の信号Ｏ１とを、また非母音区間では子音処理部２で分離された子音信号Ｖｃとその他の信号Ｏ２とを選択し、それぞれ音声信号Ｖ及びその他の信号Ｏとして出力する。また、非子音区間では母音処理部１で分離された母音信号Ｖｖとその他の信号Ｏ１とを、子音区間では子音処理部２で分離された子音信号Ｖｃとその他の信号Ｏ２とを選択し、それぞれ音声信号Ｖ及びその他の信号Ｏとして出力するようにしても良い。
【００１６】
図２は、この発明の他の実施例に係る音声分離システムの構成を示すブロック図である。
母音処理部１、子音処理部２及び母音・子音判定部３は、上述した実施例と同様のものであるが、この実施例では、子音処理部２が母音処理部１で母音信号Ｖｖを抑圧したその他の信号Ｏ１を子音判定対象信号として入力し、母音信号成分が除去された状態で子音部分と非子音部分とを検出し、子音信号Ｖｃとその他の信号Ｏ２とに分離する点が異なっている。この場合には、母音信号成分が除去された信号に対して子音検出を行うため、先の実施例よりも検出精度は上がる。子音処理部２で分離された子音信号Ｖｃは、母音処理部１で分離された母音信号Ｖｖに加算器５で加算されて音声信号Ｖとして出力される。また、母音処理部１で分離されたその他の信号Ｏ１と子音処理部２で分離されたその他の信号Ｏ２とは、母音・子音判定部３での切替制御に従って切替器６よって切り替え他の信号Ｏとして出力される。
【００１７】
これらの実施例において、母音処理部１は、例えば図３に示すように構成されている。
混合信号Ｉは、先ず周波数分析部１１に入力される。周波数分析部１１は、ハニング窓部１１１とＦＦＴ（高速フーリエ変換）部１１２とからなる。混合信号Ｉは、ハニング窓部１１１でフレーム分割されたのち、ＦＦＴ部１１２により周波数分析される。ＦＦＴ部１１２での周波数分析結果は、基本周波数検出部１２と母音分離部１３とに入力されている。基本周波数検出部１２では、ＦＦＴ部１１２による周波数分析結果から整数次倍音構造を評価して基本周波数ｆ′を推定する。母音分離部１３では、基本周波数検出部１２で検出された基本周波数ｆ′から整数次倍音構造の各周波数成分の振幅を振幅推定部１３１₁，１３１₂，…，１３１_nで推定する。各周波数成分の振幅は、例えば複素スペクトル内挿法によって推定することができる。複素スペクトル内挿法は、複素平面上でピークに隣接する複素ベクトルから内積によって真のピークを求める手法であり、これによりハニング窓対応補正された基本周波数ｆ及びその倍音周波数２ｆ，３ｆ，…，ｎｆと、その振幅とが求められる。各補正周波数ｆ，２ｆ，３ｆ，…，ｎｆは、位相推定部１３２₁，１３２₂，…，１３２_nに入力されここで、ハニング窓の特性と該当周波数成分の前後の周波数サンプル値とから位相を推定することができる。これにより線スペクトルが推定され、そこからハニング窓による影響（メインローブ、サイドローブ）を排除することができる。このようにして求められた整数次倍音構造は、ＦＦＴ部１１２の周波数分析結果から減算器１３３によって減算されると共に、ＩＦＦＴ（逆ＦＦＴ）部１３４によって時間領域の信号に戻される。また、減算器１３３の減算結果もＩＦＦＴ部１３５によって時間領域の信号に戻される。これらは、フレーム間のつなぎ部分を滑らかにするため、加算器１３６，１３７においてオーバーラップ／アド用データ１３８，１３９とそれぞれ加算されて、加算器１３６からは混合信号Ｉから母音信号成分のみ強調された母音信号Ｖｖが、また加算器１３７混合信号Ｉから母音信号成分が抑圧されたその他の信号Ｏ１が生成出力される。
【００１８】
図４は、図１及び図２の実施例における子音処理部２の構成例を示すブロック図、図５は、この子音処理部２における子音区間検出処理を示すフローチャートとである。
混合信号Ｉ（図２の実施例では他の信号Ｏ１）は、子音特徴量計算手段であるＬＰＣ（線形予測）分析部２１に与えられ、ここで特徴量計算が実行される。ここでは、子音のうち特に目立つ無声子音の特徴量として、スペクトル包絡特性を計算する。スペクトル包絡特性にて特徴量評価を行うためには、まず、ＬＰＣ係数を計算する（Ｓ１，Ｓ２，Ｓ３）。ＬＰＣ分析部２１では、過去の標本値から現時点での標本値を予測する。このときの予測係数をＬＰＣ係数という。ＬＰＣ分析では、共分散法や自己相関法にて直接ＬＰＣ係数を求める方法もあるが、ＰＡＲＣＯＲ分析によるＰＡＲＣＯＲ係数、ＬＳＰ（線スペクトル対）分析によるＬＳＰ係数と、ＬＰＣ係数とは相互に変換可能である。ここで、ＰＡＲＣＯＲ分析、ＬＳＰ分析は、いずれもＬＰＣ分析法の一種であるが、より性能の改善された手法である。
【００１９】
特徴量評価部２２では、次にＬＰＣケプストラム距離計算部２２１において、非無声子音区間に経時的に平均したＬＰＣ係数２２２との間のＬＰＣケプストラム距離Ｄcepを計算する（Ｓ６，Ｓ７，Ｓ８）。ＬＰＣ係数を経時的に平均化する場合には、求めたＬＰＣ係数（Ｓ４）をＬＳＰ係数（Ｓ１４）に変換して、平均を計算すると良い（Ｓ１５，Ｓ１６，Ｓ１７）。ＬＳＰ係数はＬＰＣ係数やＰＡＲＣＯＲ係数よりも補間性能が良いため、平均操作に適している。そして平均化後のＬＳＰ係数をＬＰＣ係数に戻す。これにより、平均化後のＬＰＣ係数を得る。また、ここで言う経時的な平均化とは、信号の入力の開始から現在までのＬＰＣ係数の全てを重み付け加算することを言う。具体的には、以下のような計算を行えば良い。
【００２０】
【数１】
ａｖｇ（ｉ）＝ｗ＊ｃｕｒ（ｉ）＋（１−ｗ）＊ａｖｇ（ｉ−１）
【００２１】
なお、ここで、ｃｕｒ（ｉ）は現在のＬＰＣ係数、ａｖｇ（ｉ）は経時平均ＬＰＣ係数、ｗは重み関数である。
また、経時的に平均化したＬＰＣ係数２２２の算出精度を高めるため、母音処理部１からの他の信号Ｏ１をＬＰＣ分析部２１に供給してピッチが存在する母音検出区間においても、平均化処理を続行することが望ましい（Ｓ５，Ｓ１４，Ｓ１５，Ｓ１６）。
【００２２】
なお、このとき、経時的に平均したＬＰＣ係数２２２ではなく、予め求めておいた代表的な無声子音のＬＰＣ係数との距離を計算するようにしても良い。予め求めておく代表的な無声子音のＬＰＣ係数は、音声認識データベース等から流用可能である。また、ＬＰＣケプストラム距離やここでは用いていないが最尤スペクトル距離等は、音声認識においてＬＰＣ係数間（スペクトル包絡間）の距離尺度として用いられているものである。
【００２３】
また、無声子音には、有声音と比較して比較的高い４ｋＨｚ以上の周波数成分が多く含まれていることが一般に知られている。このため、子音判定部２２３は、ＬＰＣ分析部２１で求めた入力信号のスペクトル包絡特性２２４から４ｋＨｚ以上の帯域の振幅を閾値と比較し、高いレベルにある帯域を検出する。これは、あまり高い周波数帯域まで調べる必要はなく、１０ｋＨｚ程度までで十分である。比較結果をパラメータＤspecとして数値化する（Ｓ９，Ｓ１０，Ｓ１１）。
【００２４】
子音判定部２２３は、計算されたＤcep及びＤspecと、それぞれ事前に調査して求めた閾値ＴhＤcep及びＴhＤspecとを比較する（Ｓ１２）。これらの総合判定結果から、当該区間が無声子音であるかどうかの判定を行う（Ｓ１３）。なお、閾値ＴhＤcepやＴhＤspecは入力信号に適応して動的に制御することも可能である。無声子音と判定された場合には、入力信号Ｉ又はＯ１と経時的平均ＬＰＣ係数２２２とからそれぞれのスペクトル包絡特性２２４，２２５を求め、これを各周波数成分に対して比較する（Ｓ９，Ｓ１８，Ｓ１９）。このとき、信号パワーへの依存性を減らすため、スペクトル包絡は正規化したものを用いると良い。比較によって経時的平均スペクトル包絡特性２２５に対して、入力の方が高い周波数を特定する（Ｓ２０）。これは、音声信号Ｖのミックスレベルがその他の信号Ｏよりも高いレベルにある場合に相当する。一般の実況放送等では、この条件は十分満たされている。
【００２５】
子音分離部２４では、特定された帯域に、ＦＦＴ部２３でのＦＦＴ結果の振幅スペクトルのゲイン操作を行ったり、時間軸上でフィルタリングすることで、無声子音の強調・抑圧が可能となる。振幅スペクトルのゲイン操作を行った場合、得られた無声子音信号とその他の信号とをＩＦＦＴ部２５，２６でそれぞれ時間軸上の信号に戻すことで子音信号Ｖｖとその他の信号Ｏ２とが得られる。
【００２６】
図１において説明したように、出力時には、ピッチ周波数の有無による母音区間判定、上述した子音区間判定の結果を用いて、母音処理部１からの出力Ｖｖ，Ｏ１を用いるか、子音処理部２からの出力Ｖｃ，Ｏ２を用いるかを切替部４で切り替えるが、このとき、図６に示すように、母音区間、子音区間及び非音声区間の信号を滑らかに接続するため、ハニング窓等のオーバーラップ／アドデータ４１，４２を用いて加算器４３，４４にて信号Ｖｖ／Ｖｃ，Ｏ１／Ｏ２をオーバーラップ／アド処理して出力信号Ｖ，Ｏを得ることが望ましい。
【００２７】
図７は、上述したシステムの適用例を示すものである。同図（ａ）は混合信号Ｉを音声信号Ｖと他の信号Ｏとに分離する強調・抑圧部５０１にこの発明を適用している。分離された音声信号Ｖと他の信号Ｏには、信号処理部５０２，５０３によってそれぞれ別々の信号処理が施され、音声信号Ｖ′及び他の信号Ｏ′として出力される。同図（ｂ）は、非音声信号である他の信号Ｏに対する処理の例として、テレビの実況中継における音場制御の例を示している。テレビ６０１から出力される実況中継の音響信号（混合信号Ｉ）は、この発明に係る強調・抑圧部６０２で実況音声（Ｖ）と、環境音（Ｏ）とに分離される。実況音声については視聴者６０３の前方のフロントスピーカ６０４から出力される。環境音については、残響付加部６０５で残響成分が付加されて、視聴者６０３の前後左右に配置された４つのスピーカ６０６，６０７，６０８，６０９から出力される。これにより臨場感が向上する。同図（ｃ）は、音声認識の例である。即ち、音声強調部７０１は、入力音響信号Ｉから音声信号Ｖ以外の他の信号（雑音）Ｏを抑圧して、これにより音声信号Ｖを分離抽出する。音声認識部７０２は、分離抽出された音声信号Ｖに対して音声認識処理を実行する。このように音声認識において不要な周囲雑音を取り除くことで音声認識精度が向上する。この場合、他の信号Ｏは、不要な雑音成分なので、音声強調部７０１は音声信号Ｖのみを抽出する。
【００２８】
【発明の効果】
以上述べたように、この発明によれば、音声信号とその他の信号とが混合された混合信号から整数次倍音構造に基づいて音声信号のうちの母音部分を抽出すると共に、混合信号又は混合信号から母音部分を分離した残りの信号を子音対象信号として、この子音対象信号から子音の特性に基づいて子音部分を検出してこれを分離するようにしているので、母音部分と子音部分とが分離された残りの非音声信号は、経時的変化が反映されたものとなり、且つ子音部分を含んで音声信号が混合信号から分離されるので、非音声信号に子音部分が含まれることがなく、非音声信号を処理する場合にも、精度の良い処理が可能になるという効果を奏する。
【図面の簡単な説明】
【図１】この発明の一実施例に係る音声信号分離システムのブロック図である。
【図２】この発明の他の実施例に係る音声信号分離システムのブロック図である。
【図３】図２及び図３の実施例における音声処理部の構成を示すブロック図である。
【図４】図２及び図３の実施例における子音処理部の構成を示すブロック図である。
【図５】子音処理部における子音区間検出処理を示すフローチャートである。
【図６】図１のシステムの出力部分の変形例を示す図である。
【図７】同システムの応用例を示すブロック図である。
【符号の説明】
１…母音処理部、２…子音処理部、３…母音・子音判定部、４，６…切替部、５…加算器、１１…周波数分析部、１２…基本周波数検出部、１３…母音分離部、２１…ＬＰＣ分析部、２２…特徴量評価部、２４…子音分離部。
[0001]
[Technical field of the invention]
The present invention relates to an audio signal separation method and apparatus that separate the speech signal from a mixed signal containing a speech signal and extract at least one of the speech signal and the other signals.
[0002]
[Prior art]
Techniques are known for emphasizing, suppressing, or separating and extracting a specific signal from a signal in which several acoustic signals are mixed. For speech signals, noise suppression methods that suppress only the noise in an acoustic signal containing both noise and speech have been proposed (for example, JP-A-9-153769 and JP-A-9-212196). For music, various methods for separating or removing a melody contained in a performance have been proposed (for example, Japanese Patent Laid-Open No. 11-143460).
[0003]
The noise suppression methods address the problem that, in an acoustic processing device such as a signal amplifier, the speech signal the listener wants to hear is buried in noise and becomes hard to hear. The separation and removal methods for music are used, for example, to produce a karaoke-style track by removing only a particular melody.
[0004]
Japanese Patent Laid-Open No. 9-212196 achieves noise suppression by a technique called spectral subtraction. Speech and non-speech sections of the input signal are detected, a representative noise amplitude spectrum is obtained in the non-speech sections, and in the speech sections this spectrum is subtracted from the amplitude spectrum of the input signal to suppress the noise. For the phase component used at synthesis, the phase of the mixed signal is used as is. Non-speech is detected by exploiting the fact that the vowels of speech have an integer-order harmonic structure: the summed power of the fundamental frequency and its harmonic components serves as the index. In Japanese Patent Laid-Open No. 9-212196, the threshold for this index is set low so that the representative noise spectrum is obtained only from sections that are almost certainly noise, which reduces the influence of the consonants of the speech.
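The subtraction step described above can be sketched as follows. This is a minimal illustration, assuming each frame is already available as a list of magnitude-spectrum values; the function names and the clamping of negative results to zero are assumptions made for the example and are not taken from the cited publication.

```python
from typing import List


def average_noise_spectrum(noise_frames: List[List[float]]) -> List[float]:
    """Average the magnitude spectra of frames judged to be non-speech."""
    n_bins = len(noise_frames[0])
    count = len(noise_frames)
    return [sum(frame[k] for frame in noise_frames) / count for k in range(n_bins)]


def spectral_subtract(frame_mag: List[float], noise_mag: List[float]) -> List[float]:
    """Subtract the representative noise magnitude bin by bin.

    Negative results are clamped to zero (an assumed floor).
    """
    return [max(m - n, 0.0) for m, n in zip(frame_mag, noise_mag)]
```

The phase of the mixed signal would then be reattached to the subtracted magnitudes before the inverse transform, as the paragraph above notes.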
[0005]
In Japanese Patent Laid-Open No. 11-143460, because many instrument sounds have an integer-order harmonic structure, a fundamental frequency and its harmonic components are judged to be sound from the same instrument. The sound after extraction or removal is then synthesized by additive synthesis of waveforms based on the time, amplitude, and phase information of these frequency components.
[0006]
[Problems to be solved by the invention]
In the noise suppression methods, the non-speech signal is noise and is therefore unwanted. Consequently, these methods basically never produce the non-speech-side signal in which the speech has been suppressed. The spectral subtraction method disclosed in Japanese Patent Laid-Open No. 9-212196 applies the same suppression processing to consonant portions and vowel portions alike. Because it uses a representative noise spectrum averaged over time, any attempt to output the non-speech-side signal in a section where speech and other signals are mixed, without modifying the noise suppression method, would always output the representative noise spectrum, which cannot follow changes over time in the non-speech signal.
[0007]
In the separation and removal methods for music, on the other hand, every signal that lacks an integer-order harmonic structure is treated as part of the other signals, so the consonant portions of speech, which have no fundamental frequency, remain in the non-speech signal. When an effect is applied to the non-speech signal, the remaining consonant portions impair that effect. For example, when reverberation is added to a televised live sports commentary to heighten the sense of presence, it is desirable to separate the commentary speech from the environmental sound signal and add reverberation only to the environmental sound. If only the consonant portions remain on the environmental sound side, however, reverberation is added to these consonants as well, which impairs the very sense of presence that was to be heightened.
[0008]
The present invention has been made in view of these problems, and its object is to provide a speech separation method and apparatus that can follow changes over time on the non-speech signal side and can also separate consonant portions accurately.
[0009]
[Means for Solving the Problems]
The audio signal separation method according to the present invention is an audio signal separation method for separating a speech signal from a mixed signal in which the speech signal and other signals are mixed, and extracting at least one of the speech signal and the other signals, the method comprising: a vowel processing step of detecting and separating a vowel portion of the speech signal from the mixed signal based on an integer-order harmonic structure; a consonant processing step of taking, as a consonant determination target signal, either the mixed signal or the signal remaining after the vowel portion has been separated from the mixed signal, and detecting and separating a consonant portion of the speech signal from the consonant determination target signal based on characteristics of consonants; and an output step of separating the speech signal by means of the vowel portion detected in the vowel processing step and the consonant portion detected in the consonant processing step, and extracting at least one of the speech signal and the other signals.
[0010]
The audio signal separation apparatus according to the present invention is an audio signal separation apparatus for separating a speech signal from a mixed signal in which the speech signal and other signals are mixed, and extracting at least one of the speech signal and the other signals, the apparatus comprising: vowel processing means for detecting and separating a vowel portion of the speech signal from the mixed signal based on an integer-order harmonic structure; consonant processing means for taking, as a consonant determination target signal, either the mixed signal or the signal remaining after the vowel portion has been separated from the mixed signal, and detecting and separating a consonant portion of the speech signal from the consonant determination target signal based on characteristics of consonants; and output means for separating the speech signal by means of the vowel portion detected by the vowel processing means and the consonant portion detected by the consonant processing means, and extracting at least one of the speech signal and the other signals.
[0011]
According to the present invention, the vowel portion of the speech signal is extracted, based on the integer-order harmonic structure, from the mixed signal in which the speech signal and other signals are mixed, and the consonant portion is detected and separated, based on characteristics of consonants, from a consonant determination target signal that is either the mixed signal or the signal remaining after the vowel portion has been separated from it. The non-speech signal that remains after the vowel and consonant portions have been separated therefore reflects changes over time. Moreover, because the speech signal is separated from the mixed signal together with its consonant portions, the non-speech signal contains no consonant portions, so any subsequent processing of the non-speech signal can also be performed accurately.
[0012]
In this specification, “vowel” covers not only vowels but also voiced consonants that have an integer-order harmonic structure, and “consonant” means an unvoiced consonant, which has no integer-order harmonic structure. The consonant characteristics used to detect consonant sections during consonant processing can be, for example, the spectral envelope of the consonant determination target signal or the power of a specific band (for example, about 4 to 10 kHz). When the spectral envelope is used, the consonant processing may, for example, accumulate over time the spectral envelope of the non-consonant sections of the signal remaining after the vowel portion has been separated from the mixed signal, and detect consonant sections by quantitatively evaluating the distance between this accumulated envelope and the spectral envelope of the consonant determination target signal. Alternatively, the distance between the spectral envelope of representative consonants learned in advance and the spectral envelope of the consonant determination target signal may be evaluated quantitatively. As the distance measure between spectral envelopes, the maximum likelihood spectral distance for linear prediction coefficients or the LPC (linear prediction) cepstrum distance can be used, for example. When the power of a specific band is used, that power is simply compared with a predetermined threshold.
[0013]
In the consonant processing, the spectral envelope of the non-consonant sections of the signal remaining after the vowel portion has been separated from the mixed signal may be accumulated over time, and the bands in which this accumulated envelope and the spectral envelope of the consonant determination target signal differ markedly may be identified as the bands to be separated. Alternatively, the accumulated non-consonant envelope may be normalized by the power of the consonant determination target signal currently being processed, and the bands in which the normalized envelope and the spectral envelope of the consonant determination target signal stand in a relationship at or above a predetermined threshold may be identified as the bands to be separated.
[0014]
The consonant portion can be separated, for a time-domain signal, by gain processing of a specific band using, for example, a band-pass filter or a notch filter, and, for a frequency-domain signal, by spectral subtraction, for example.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of an audio signal separation system according to an embodiment of the present invention.
A mixed signal I containing a speech signal and other signals (environmental sound, background sound, noise, and so on) is input to the vowel processing unit 1 and the consonant processing unit 2. The vowel processing unit 1 detects the vowel portion of the speech signal in the mixed signal I based on the fundamental frequency f contained in the mixed signal I, and separates the mixed signal I into a vowel signal Vv and an other signal O1. The consonant processing unit 2 detects the consonant portion of the speech signal in the mixed signal I based on features of the spectral envelope of the mixed signal I, the power of a specific band, and the like, and separates the mixed signal I into a consonant signal Vc and an other signal O2. The vowel/consonant determination unit 3 determines vowel sections, consonant sections, and non-speech sections based on the vowel/non-vowel determination result v/o from the vowel processing unit 1 and the consonant/non-consonant determination result c/o from the consonant processing unit 2, and controls the switching of the switching unit 4. Under this control, the switching unit 4 selects the vowel signal Vv and the other signal O1 separated by the vowel processing unit 1 in vowel sections, and the consonant signal Vc and the other signal O2 separated by the consonant processing unit 2 in non-vowel sections, and outputs them as the speech signal V and the other signal O, respectively. Alternatively, the switching unit 4 may select Vv and O1 in non-consonant sections and Vc and O2 in consonant sections, and output them as V and O.
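The switching behaviour of FIG. 1 can be sketched per frame as below. This is only an illustration: the function names are hypothetical, frames are assumed to be plain lists of samples, and the rule that a vowel decision takes priority when both detectors fire is an assumption, since the text does not address that case.

```python
from typing import List, Tuple

Frame = List[float]


def classify_section(is_vowel: bool, is_consonant: bool) -> str:
    """Combine the v/o and c/o decisions into one section label.

    Giving the vowel decision priority is an assumption of this sketch.
    """
    if is_vowel:
        return "vowel"
    if is_consonant:
        return "consonant"
    return "non-speech"


def switch_outputs(
    is_vowel: bool, vv: Frame, o1: Frame, vc: Frame, o2: Frame
) -> Tuple[Frame, Frame]:
    """Return (V, O): vowel-path outputs in vowel sections, consonant-path otherwise."""
    return (vv, o1) if is_vowel else (vc, o2)
```

The alternative rule mentioned at the end of the paragraph would key the selection on the consonant decision instead of the vowel decision.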
[0016]
FIG. 2 is a block diagram showing the configuration of a speech separation system according to another embodiment of the present invention.
The vowel processing unit 1, the consonant processing unit 2, and the vowel/consonant determination unit 3 are the same as in the embodiment described above. This embodiment differs in that the consonant processing unit 2 receives, as the consonant determination target signal, the other signal O1 in which the vowel processing unit 1 has suppressed the vowel signal Vv, detects consonant and non-consonant portions with the vowel signal component removed, and separates O1 into the consonant signal Vc and the other signal O2. Because consonant detection is performed on a signal from which the vowel signal component has been removed, the detection accuracy is higher than in the previous embodiment. The consonant signal Vc separated by the consonant processing unit 2 is added by the adder 5 to the vowel signal Vv separated by the vowel processing unit 1, and the sum is output as the speech signal V. The other signal O1 separated by the vowel processing unit 1 and the other signal O2 separated by the consonant processing unit 2 are switched by the switch 6 under the switching control of the vowel/consonant determination unit 3, and the selected signal is output as the other signal O.
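The cascade arrangement of FIG. 2 can be sketched as a small function that chains two processing stages. The two stages are passed in as hypothetical callables, and the rule used for switch 6 (O2 in consonant sections, O1 otherwise) is an assumption of this sketch; the text only says that the switch is controlled by the vowel/consonant determination unit 3.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def separate_cascade(
    frame: Frame,
    vowel_proc: Callable[[Frame], Tuple[Frame, Frame]],
    consonant_proc: Callable[[Frame], Tuple[Frame, Frame, bool]],
) -> Tuple[Frame, Frame]:
    """Return (V, O) for one frame of the mixed signal I."""
    vv, o1 = vowel_proc(frame)                 # vowel signal Vv and residual O1
    vc, o2, is_consonant = consonant_proc(o1)  # consonant detection runs on O1
    v = [a + b for a, b in zip(vv, vc)]        # adder 5: V = Vv + Vc
    o = o2 if is_consonant else o1             # switch 6 (selection rule assumed)
    return v, o
```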
[0017]
In these embodiments, the vowel processing unit 1 is configured as shown in FIG. 3, for example.
The mixed signal I is first input to the frequency analysis unit 11, which consists of a Hanning window unit 111 and an FFT (fast Fourier transform) unit 112. The mixed signal I is divided into frames by the Hanning window unit 111 and then frequency-analyzed by the FFT unit 112. The frequency analysis result of the FFT unit 112 is input to the fundamental frequency detection unit 12 and the vowel separation unit 13. The fundamental frequency detection unit 12 evaluates the integer-order harmonic structure in the frequency analysis result and estimates the fundamental frequency f′. In the vowel separation unit 13, amplitude estimation units 131_1, 131_2, ..., 131_n estimate, from the fundamental frequency f′ detected by the fundamental frequency detection unit 12, the amplitude of each frequency component of the integer-order harmonic structure. The amplitude of each frequency component can be estimated by, for example, complex spectrum interpolation. Complex spectrum interpolation obtains the true peak by an inner product from the complex vectors adjacent to the peak on the complex plane; it yields the fundamental frequency f corrected for the Hanning window, its harmonic frequencies 2f, 3f, ..., nf, and their amplitudes. The corrected frequencies f, 2f, 3f, ..., nf are input to phase estimation units 132_1, 132_2, ..., 132_n, where the phase can be estimated from the characteristics of the Hanning window and the frequency sample values on either side of the corresponding frequency component. A line spectrum is thereby estimated, from which the influence of the Hanning window (main lobe and side lobes) can be removed. The integer-order harmonic structure obtained in this way is subtracted from the frequency analysis result of the FFT unit 112 by the subtracter 133, and is also returned to a time-domain signal by the IFFT (inverse FFT) unit 134. The subtraction result of the subtracter 133 is likewise returned to a time-domain signal by the IFFT unit 135. To smooth the joins between frames, these signals are added to the overlap/add data 138 and 139 in the adders 136 and 137, respectively. The adder 136 outputs the vowel signal Vv, in which only the vowel signal component of the mixed signal I is emphasized, and the adder 137 outputs the other signal O1, in which the vowel signal component of the mixed signal I is suppressed.
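The window, transform, and subtraction stages can be illustrated with a deliberately simplified sketch. It assumes the fundamental is already known as a bin index, and it replaces the complex spectrum interpolation and phase estimation described above with a plain bin mask around each harmonic, so it is not the estimation procedure of this embodiment. All function names are hypothetical, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Tuple


def hann_window(n: int) -> List[float]:
    """Periodic Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) DFT; an FFT would be used in practice."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def split_harmonics(
    spec: List[complex], f0_bin: int, width: int = 1
) -> Tuple[List[complex], List[complex]]:
    """Split a two-sided spectrum into harmonic bins (f0, 2*f0, ...) and the rest."""
    if f0_bin <= width:
        raise ValueError("f0_bin must exceed width")
    n = len(spec)
    harmonic = [0j] * n
    k = f0_bin
    while k + width <= n // 2:
        for b in range(k - width, k + width + 1):
            harmonic[b] = spec[b]          # positive-frequency bin
            harmonic[n - b] = spec[n - b]  # mirrored negative-frequency bin
        k += f0_bin
    # Counterpart of subtracter 133: what is left after removing the harmonics.
    residual = [s - h for s, h in zip(spec, harmonic)]
    return harmonic, residual
```

Each of the two spectra would then be inverse-transformed and overlap-added to give Vv and O1.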
[0018]
FIG. 4 is a block diagram showing a configuration example of the consonant processing unit 2 in the embodiments of FIGS. 1 and 2, and FIG. 5 is a flowchart showing the consonant section detection processing in the consonant processing unit 2.
The mixed signal I (the other signal O1 in the embodiment of FIG. 2) is supplied to an LPC (linear prediction) analysis unit 21, which serves as the consonant feature calculation means and computes the feature quantity. Here, the spectral envelope characteristic is calculated as the feature quantity of unvoiced consonants, which are the most conspicuous of the consonants. To evaluate the feature quantity from the spectral envelope characteristic, the LPC coefficients are calculated first (S1, S2, S3). The LPC analysis unit 21 predicts the current sample value from past sample values; the prediction coefficients used for this are called LPC coefficients. LPC coefficients can be obtained directly by the covariance method or the autocorrelation method, but PARCOR coefficients obtained by PARCOR analysis and LSP coefficients obtained by LSP (line spectrum pair) analysis can also be used, since they are mutually convertible with LPC coefficients. PARCOR analysis and LSP analysis are both forms of LPC analysis, with improved performance.
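As an illustration of the autocorrelation method mentioned above, the following sketch computes LPC coefficients with the Levinson-Durbin recursion. The function names are hypothetical, and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p is a choice made for this sketch; the reflection coefficients returned alongside correspond to PARCOR coefficients up to sign convention.

```python
from typing import List, Tuple


def autocorr(x: List[float], order: int) -> List[float]:
    """Autocorrelation r[0..order] of one frame."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag)) for lag in range(order + 1)]


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], List[float], float]:
    """Solve the normal equations for the predictor polynomial.

    Returns (a, reflection, err) with a[0] == 1, where the prediction of x[t]
    is -sum(a[k] * x[t - k] for k in 1..order). Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    reflection: List[float] = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        reflection.append(k)
        err *= 1.0 - k * k  # residual prediction error energy
    return a, reflection, err
```

For example, `levinson_durbin(autocorr(frame, 10), 10)` would give a tenth-order model of a frame; the order is an arbitrary choice here.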
[0019]
Next, in the feature quantity evaluation unit 22, the LPC cepstrum distance calculation unit 221 calculates the LPC cepstrum distance Dcep between the LPC coefficients of the input and the LPC coefficients 222 averaged over time in sections other than unvoiced consonants (S6, S7, S8). To average the LPC coefficients over time, it is advisable to convert the obtained LPC coefficients (S4) into LSP coefficients (S14) and average those (S15, S16, S17). LSP coefficients interpolate better than LPC or PARCOR coefficients and are therefore well suited to averaging. The averaged LSP coefficients are then converted back to LPC coefficients, which gives the averaged LPC coefficients. Averaging over time here means a weighted sum of all LPC coefficients from the start of signal input up to the present. Specifically, the following calculation may be performed.
[0020]
[Expression 1]
avg(i) = w * cur(i) + (1 - w) * avg(i - 1)
[0021]
Here, cur(i) denotes the current LPC coefficients, avg(i) the time-averaged LPC coefficients, and w a weighting function.
Further, to improve the accuracy of the time-averaged LPC coefficients 222, it is desirable to supply the other signal O1 from the vowel processing unit 1 to the LPC analysis unit 21 and to continue the averaging process even in vowel detection sections, where a pitch is present (S5, S14, S15, S16).
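Expression 1 is a first-order recursive average, which can be sketched as below. The sketch assumes w is a constant scalar between 0 and 1 and that the coefficient vector passed in is whatever representation is being averaged (the text recommends LSP coefficients); the function name is hypothetical.

```python
from typing import List


def update_average(avg_prev: List[float], cur: List[float], w: float) -> List[float]:
    """One step of avg(i) = w * cur(i) + (1 - w) * avg(i - 1), element by element."""
    return [w * c + (1.0 - w) * a for c, a in zip(cur, avg_prev)]
```

Because each step folds the previous average back in, every past frame contributes with a weight that decays geometrically, which matches the description of a weighted sum over all coefficients since the start of input.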
[0022]
Instead of the LPC coefficients 222 averaged over time, the distance may be calculated to the LPC coefficients of representative unvoiced consonants obtained in advance. Such coefficients can be taken from a speech recognition database or the like. The LPC cepstrum distance, and the maximum likelihood spectral distance (not used here), are distance measures between LPC coefficients (that is, between spectral envelopes) used in speech recognition.
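One way to compute a distance of this kind is sketched below. It converts each LPC polynomial to LPC cepstrum coefficients with the standard recursion for an all-pole model 1/A(z) and then takes a plain Euclidean distance. The exact distance formula, the number of cepstrum coefficients, and the function names are assumptions of the sketch; the text does not specify them.

```python
import math
from typing import List


def lpc_to_cepstrum(a: List[float], n_cep: int) -> List[float]:
    """Cepstrum c[1..n_cep] of 1/A(z), with a = [1, a1, ..., ap]."""
    p = len(a) - 1
    c = [0.0] * (n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        c[n] = -(a[n] if n <= p else 0.0) - acc / n
    return c[1:]


def cepstral_distance(a1: List[float], a2: List[float], n_cep: int = 12) -> float:
    """Euclidean distance between two LPC cepstra (assumed form of Dcep)."""
    c1 = lpc_to_cepstrum(a1, n_cep)
    c2 = lpc_to_cepstrum(a2, n_cep)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
```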
[0023]
It is also generally known that unvoiced consonants contain more frequency components at 4 kHz and above than voiced sounds do. The consonant determination unit 223 therefore compares the amplitude of the bands at 4 kHz and above in the spectral envelope characteristic 224 of the input signal, obtained by the LPC analysis unit 21, with a threshold and detects the bands whose level is high. There is no need to examine very high frequency bands; up to about 10 kHz is sufficient. The comparison result is quantified as the parameter Dspec (S9, S10, S11).
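The high-band check can be sketched as follows. The LPC spectral envelope is evaluated as gain / |A(e^jw)| at a grid of frequencies between 4 and 10 kHz, and the score is the fraction of grid points above the threshold. Defining Dspec as that fraction, the 500 Hz grid spacing, and the function names are all assumptions of this sketch; the text only says that the comparison result is quantified.

```python
import cmath
import math
from typing import List


def lpc_envelope(a: List[float], freq_hz: float, sample_rate: float, gain: float = 1.0) -> float:
    """Magnitude of the all-pole envelope gain / A(e^jw) at one frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    denom = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
    return gain / abs(denom)


def high_band_score(
    a: List[float],
    sample_rate: float,
    threshold: float,
    lo_hz: float = 4000.0,
    hi_hz: float = 10000.0,
    step_hz: float = 500.0,
) -> float:
    """Fraction of sampled frequencies in [lo_hz, hi_hz] whose envelope exceeds threshold."""
    freqs = []
    f = lo_hz
    while f <= hi_hz and f < sample_rate / 2.0:
        freqs.append(f)
        f += step_hz
    if not freqs:
        return 0.0  # band lies above the Nyquist frequency
    hits = sum(1 for fr in freqs if lpc_envelope(a, fr, sample_rate) > threshold)
    return hits / len(freqs)
```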
[0024]
The consonant determination unit 223 compares the calculated Dcep and Dspec with thresholds ThDcep and ThDspec, respectively, which are determined in advance by investigation (S12). From the combined result it determines whether the section is an unvoiced consonant (S13). The thresholds ThDcep and ThDspec can also be controlled dynamically in adaptation to the input signal. If the section is determined to be an unvoiced consonant, the spectral envelope characteristics 224 and 225 are obtained from the input signal I or O1 and from the time-averaged LPC coefficients 222, respectively, and compared at each frequency component (S9, S18, S19). To reduce the dependence on signal power, normalized spectral envelopes should be used. The comparison identifies the frequencies at which the input is higher than the time-averaged spectral envelope characteristic 225 (S20). This presupposes that the speech signal V is mixed at a higher level than the other signal O, a condition that is amply satisfied in ordinary live broadcasts.
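The decision and band-selection steps can be sketched as below. Requiring both parameters to exceed their thresholds is an assumption about how the combined result is formed, as is normalizing each envelope to unit energy; the function names are hypothetical.

```python
import math
from typing import List


def is_unvoiced_consonant(d_cep: float, d_spec: float, th_cep: float, th_spec: float) -> bool:
    """Combined decision; the logical AND of both comparisons is assumed."""
    return d_cep > th_cep and d_spec > th_spec


def normalize(env: List[float]) -> List[float]:
    """Scale an envelope to unit energy to reduce dependence on signal power."""
    norm = math.sqrt(sum(e * e for e in env))
    return list(env) if norm == 0.0 else [e / norm for e in env]


def consonant_bands(input_env: List[float], avg_env: List[float]) -> List[int]:
    """Indices where the normalized input envelope exceeds the normalized average."""
    a = normalize(input_env)
    b = normalize(avg_env)
    return [k for k, (x, y) in enumerate(zip(a, b)) if x > y]
```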
[0025]
In the consonant separation unit 24, unvoiced consonants can be emphasized or suppressed by applying a gain to the amplitude spectrum of the FFT result from the FFT unit 23 in the identified bands, or by filtering on the time axis. When the gain is applied to the amplitude spectrum, the resulting unvoiced consonant signal and the other signal are returned to time-domain signals by the IFFT units 25 and 26, respectively, giving the consonant signal Vc and the other signal O2.
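The gain operation on the identified bands can be sketched as a simple split of the spectrum into two parts. The `leak` parameter (how much of a selected bin is left in the other signal) and the function name are assumptions of the sketch. If a two-sided spectrum is used, the caller would need to include the mirrored negative-frequency bins in `bands` as well.

```python
from typing import Iterable, List, Tuple


def split_by_bands(
    spec: List[complex], bands: Iterable[int], leak: float = 0.0
) -> Tuple[List[complex], List[complex]]:
    """Split a spectrum into (consonant part, other part) by per-bin gain."""
    selected = set(bands)
    consonant: List[complex] = []
    other: List[complex] = []
    for k, s in enumerate(spec):
        if k in selected:
            consonant.append(s * (1.0 - leak))  # emphasized in the consonant path
            other.append(s * leak)              # suppressed in the other path
        else:
            consonant.append(0j)
            other.append(s)
    return consonant, other
```

Each part would then be passed through its own inverse transform to obtain time-domain signals.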
[0026]
As described with reference to FIG. 1, at output the switching unit 4 uses the vowel section determination, based on the presence or absence of a pitch frequency, and the consonant section determination described above to switch between the outputs Vv and O1 of the vowel processing unit 1 and the outputs Vc and O2 of the consonant processing unit 2. To connect the signals of the vowel, consonant, and non-speech sections smoothly, it is desirable, as shown in FIG. 6, to apply overlap/add processing to the signals Vv/Vc and O1/O2 in the adders 43 and 44, using overlap/add data 41 and 42 such as a Hanning window, to obtain the output signals V and O.
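A smooth hand-over between two paths can be sketched with complementary half-Hanning fades over the overlap region. This is one possible reading of the overlap/add processing of FIG. 6, not a description of its exact data flow; the function name and the half-sample offset in the fade are choices made for the sketch.

```python
import math
from typing import List


def crossfade(outgoing: List[float], incoming: List[float]) -> List[float]:
    """Blend two equally long overlap segments with complementary Hanning fades."""
    n = len(outgoing)
    out = []
    for i in range(n):
        # Rising half of a Hanning window; the falling half is its complement.
        fade_in = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / n)
        out.append((1.0 - fade_in) * outgoing[i] + fade_in * incoming[i])
    return out
```

Because the two fades always sum to one, a signal that is identical on both paths passes through the overlap unchanged.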
[0027]
FIG. 7 shows application examples of the system described above. In FIG. 7(a), the present invention is applied to an emphasis/suppression unit 501 that separates a mixed signal I into a speech signal V and another signal O. The separated speech signal V and the other signal O are subjected to different signal processing by the signal processing units 502 and 503, respectively, and are output as the speech signal V′ and the other signal O′. FIG. 7(b) shows sound field control for live television broadcasts as an example of processing applied to the other signal O, which is a non-speech signal. The live broadcast sound signal (mixed signal I) output from the television 601 is separated into commentary speech (V) and environmental sound (O) by the emphasis/suppression unit 602 according to the present invention. The commentary speech is output from the front speaker 604 in front of the viewer 603. A reverberation component is added to the environmental sound by the reverberation adding unit 605, and the result is output from four speakers 606, 607, 608, and 609 arranged to the front, back, left, and right of the viewer 603. This heightens the sense of presence. FIG. 7(c) shows an example of speech recognition. The speech enhancement unit 701 suppresses the signal (noise) O other than the speech signal V in the input acoustic signal I, thereby separating and extracting the speech signal V. The speech recognition unit 702 performs speech recognition processing on the extracted speech signal V. Removing unnecessary ambient noise in this way improves recognition accuracy. In this case, since the other signal O is an unwanted noise component, the speech enhancement unit 701 extracts only the speech signal V.
[0028]
[Effect of the invention]
As described above, according to the present invention, the vowel part of the speech signal is detected and separated, based on the integer-order harmonic structure, from a mixed signal in which the speech signal and other signals are mixed. The mixed signal, or the signal remaining after the vowel part has been separated from the mixed signal, is then taken as the consonant determination target signal, and the consonant part is detected from it based on the characteristics of consonants and separated. As a result, the remaining non-speech signal reflects its own changes over time, and because the speech signal including its consonant part is separated from the mixed signal, the non-speech signal contains no consonant part. Consequently, even when the non-speech signal is subsequently processed, the processing can be performed with high accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram of an audio signal separation system according to an embodiment of the present invention.
FIG. 2 is a block diagram of an audio signal separation system according to another embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the vowel processing unit in the embodiments of FIGS. 1 and 2.
FIG. 4 is a block diagram showing the configuration of the consonant processing unit in the embodiments of FIGS. 1 and 2.
FIG. 5 is a flowchart showing consonant section detection processing in a consonant processing unit.
FIG. 6 is a diagram showing a modification of the output part of the system of FIG. 1.
FIG. 7 is a block diagram showing an application example of the system.
[Explanation of symbols]
1 ... vowel processing unit, 2 ... consonant processing unit, 3 ... vowel/consonant determination unit, 4, 6 ... switching unit, 5 ... adder, 11 ... frequency analysis unit, 12 ... fundamental frequency detection unit, 13 ... vowel separation unit, 21 ... LPC analysis unit, 22 ... feature quantity evaluation unit, 24 ... consonant separation unit.

Claims

1. An audio signal separation method for separating a speech signal from a mixed signal in which the speech signal and another signal are mixed, and extracting at least one of the speech signal and the other signal, the method comprising:
a vowel processing step of detecting and separating a vowel part of the speech signal from the mixed signal based on an integer-order harmonic structure;
a consonant processing step of taking the signal remaining after the vowel part has been separated from the mixed signal as a consonant determination target signal, and detecting and separating a consonant part of the speech signal from the consonant determination target signal based on characteristics of consonants; and
an output step of separating the speech signal using the vowel part of the speech signal detected in the vowel processing step and the consonant part of the speech signal detected in the consonant processing step, and extracting and outputting at least one of the speech signal and the other signal.

2. The audio signal separation method according to claim 1, wherein the consonant processing step detects a consonant section of the speech signal based on a spectral envelope of the consonant determination target signal as the characteristic of consonants.

3. The audio signal separation method according to claim 2, wherein the consonant processing step accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, and detects a consonant section of the speech signal by quantitatively evaluating the distance between the accumulated spectral envelope of the non-consonant sections and the spectral envelope of the consonant determination target signal.

4. The audio signal separation method according to claim 2, wherein the consonant processing step detects a consonant section of the speech signal by quantitatively evaluating the distance between a spectral envelope of representative consonants learned in advance and the spectral envelope of the consonant determination target signal.

5. The audio signal separation method according to any one of claims 1 to 4, wherein the consonant processing step detects a consonant section of the speech signal based on the power in a specific band of the consonant determination target signal.

6. The audio signal separation method according to claim 1, wherein the consonant processing step accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, and specifies, as the band to be separated, a band in which the accumulated spectral envelope of the non-consonant sections differs significantly from the spectral envelope of the consonant determination target signal.

7. The audio signal separation method according to any one of claims 1 to 5, wherein the consonant processing step accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, normalizes the accumulated spectral envelope of the non-consonant sections by the power of the consonant determination target signal currently being processed, and specifies, as the band to be separated, a band in which the relationship between the normalized spectral envelope and the spectral envelope of the consonant determination target signal is equal to or greater than a predetermined threshold.

8. An audio signal separation device for separating a speech signal from a mixed signal in which the speech signal and another signal are mixed, and extracting at least one of the speech signal and the other signal, the device comprising:
vowel processing means for detecting and separating a vowel part of the speech signal from the mixed signal based on an integer-order harmonic structure;
consonant processing means for taking the signal remaining after the vowel part has been separated from the mixed signal as a consonant determination target signal, and detecting and separating a consonant part of the speech signal from the consonant determination target signal based on characteristics of consonants; and
output means for separating the speech signal using the vowel part of the speech signal detected by the vowel processing means and the consonant part of the speech signal detected by the consonant processing means, and extracting and outputting at least one of the speech signal and the other signal.

9. The audio signal separation device according to claim 8, wherein the consonant processing means detects a consonant section of the speech signal based on a spectral envelope of the consonant determination target signal as the characteristic of consonants.

10. The audio signal separation device according to claim 9, wherein the consonant processing means accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, and detects a consonant section of the speech signal by quantitatively evaluating the distance between the accumulated spectral envelope of the non-consonant sections and the spectral envelope of the consonant determination target signal.

11. The audio signal separation device according to claim 9, wherein the consonant processing means detects a consonant section of the speech signal by quantitatively evaluating the distance between a spectral envelope of representative consonants learned in advance and the spectral envelope of the consonant determination target signal.

12. The audio signal separation device according to any one of claims 8 to 11, wherein the consonant processing means detects a consonant section of the speech signal based on the power in a specific band of the consonant determination target signal.

13. The audio signal separation device according to any one of claims 8 to 12, wherein the consonant processing means accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, and specifies, as the band to be separated, a band in which the accumulated spectral envelope of the non-consonant sections differs significantly from the spectral envelope of the consonant determination target signal.

14. The audio signal separation device according to any one of claims 8 to 12, wherein the consonant processing means accumulates over time the spectral envelope of non-consonant sections of the signal remaining after the vowel part has been separated from the mixed signal, normalizes the accumulated spectral envelope of the non-consonant sections by the power of the consonant determination target signal currently being processed, and specifies, as the band to be separated, a band in which the relationship between the normalized spectral envelope and the spectral envelope of the consonant determination target signal is equal to or greater than a predetermined threshold.