JP3670562B2

JP3670562B2 - Stereo sound signal processing method and apparatus, and recording medium on which stereo sound signal processing program is recorded

Info

Publication number: JP3670562B2
Application number: JP2000268442A
Authority: JP
Inventors: 昌英水島; 真理子青木; 正人三好
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-09-05
Filing date: 2000-09-05
Publication date: 2005-07-13
Anticipated expiration: 2020-09-05
Also published as: JP2002078100A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声、楽音、各種環境音源などの複数の音源から発せられた複数の音響信号が混ざった２チャネルステレオ信号において、中央付近に定位する音源信号を強調もしくは抑圧するステレオ音響信号処理方法及び装置並びにステレオ音響信号処理プログラムを記録した記録媒体に関し、ステレオ音楽ソースの受聴者の好みに応じた再生や、騒音環境下で目的とする音声だけを受聴する時などに適用される。
【０００２】
【従来の技術】
二本のマイクロホンで収音されたステレオ音響信号、もしくは人工的にチャネル間でレベル差や位相差などをつけることで複数音源を複数位置に定位されたステレオ音響信号から、中央付近に定位する音源信号のみを抑圧するには、片側の信号の正負を反転、逆相にしてもう一方の信号に加算すればよい。これは中央に定位する音源の左右の信号の差違が小さいことにより実現される方法である。この方法は、既にでき上がった音楽信号から歌など主旋律のパートを消去し、伴奏だけを取りだすいわゆるボーカルキャンセル技術として利用される。
【０００３】
しかしこの方法では、ステレオであった伴奏は加算によってモノラルになってしまうという問題があった。加えてこの方法では、中央に定位する音源を抑圧する量を調整することは出来ない。
一方、難聴者は、複数の音源が存在する中から目的とする音源信号を聞き取る能力（いわゆるカクテルパーティー効果と呼ばれる。）が劣っているといわれている。このため、健聴者を対象に作成された音楽信号では、しばしば伴奏が歌より大きく感じられることが指摘されている。この場合にはセンターに定位する歌を強調し、伴奏を抑圧することが望まれるが、これは前述の方法では実現できない。
【０００４】
複数の音源が混合された信号から目的とする音源信号を抽出、もしくは強調する方法は他にもある。
第１の方法は、周期構造を持つ音源を周波数領域において基本周波数を推定し、調波構造を抜きだすことにより、同一音源と推定する成分を再合成する方法である。
しかしこの第１の方法では、音源は調波構造に限定され、さらに音源の調波構造の推定には必ず誤差が生じるため、それが雑音として知覚されることにより、目的音源信号の抽出精度が悪くなる問題があった。
【０００５】
第２の方法は、周波数特性の変動が比較的ゆるやかな定常的な雑音源と周波数特性が定常的音源よりも頻繁に変動する例えば音声のような目的信号音源が重畳された信号から、後者の目的音源信号を抽出、もしくは強調する方法である。これは混合された信号を周波数領域において、まず目的音源信号が重畳されていない部分、すなわち雑音源信号を推定し、雑音源信号の平均的な周波数特性を記憶する。そして、周波数領域において、雑音源信号と目的音源信号が重畳された信号から記憶された雑音源の平均的な周波数構造を減算することで目的音源信号を強調、もしくは抽出する方法である。
【０００６】
しかしこの第２の方法では、雑音源信号が定常であることが必要で、歌の伴奏のように非定常な音源の伴奏のみの個所の推定、及び抑圧は困難であった。
【０００７】
【発明が解決しようとする課題】
本発明の目的は、ステレオ音響信号から中央付近に定位する音源信号を抑圧、もしくは強調する技術において、抑圧、強調する割合の調整を可能とすることである。
本発明の別の目的は原信号の定位を損なわず、中央付近に定位する音源信号だけを強調、もしくは抑圧することである。
【０００８】
本発明の別の目的は、目的とする音源信号の調波構造に依存せずに高精度に抑圧、もしくは強調することである。
本発明の別の目的は、目的外の信号（雑音信号）が非定常な信号であっても高精度に目的音を抑圧、もしくは強調することである。
【０００９】
【課題を解決するための手段】
上記目的を達成するために、本発明のステレオ音響信号処理方法及び装置は、ステレオ音響信号を入力し、二つのチャネル信号を各チャネルごとに複数の周波数帯域に分割し、各周波数帯域ごとにチャネル間の類似度を計算し、類似度などから中央に定位する音源信号を抑圧、もしくは強調するための減衰係数を計算し、この減衰係数を各周波数帯域信号に乗じた後の各チャネルごとの各周波数帯域信号を再合成し、再合成した信号を出力することにより構成される。
【００１０】
本発明は、入力されたステレオ信号をチャネルごとに複数の周波数帯域に分割する。そして、各周波数帯域ごとにチャネル間の信号の類似度をその振幅比や位相差などによって決定する。そして、類似度の高い周波数帯域に比べて類似度の低い周波数帯域に小さな減衰係数を乗算して、各チャネルごとに再合成して出力すれば、減衰係数の下限値に応じて中央に定位する音源が強調される。反対に類似度の低い周波数帯域に比べて、類似度の高い周波数帯域に小さな減衰係数を乗算して各チャネルごとに再合成して出力すれば、減衰係数の下限値に応じて中央に定位する音源が抑圧される。
【００１１】
【発明の実施の形態】
（第１実施例）
図１は本発明の第１の実施例を示すブロック図である。
ステレオ信号入力部102に入力される音響信号は、強調、もしくは抑圧したい目的音源信号が中央付近に定位するように収音されているステレオ信号であれば本発明は有効である。
【００１２】
ステレオ信号入力部102に入力されたステレオ信号は左右のチャネルごとに処理される。以下にその処理方法の詳細を述べる。
左チャネルの信号sLは、左チャネル周波数帯域分割部103によって周波数領域に変換される。同様に右チャネルの信号sRは、右チャネル周波数帯域分割部110によって周波数領域に変換される。ここで帯域分割数をNとする。左チャネルにおいて帯域分割された信号を低い周波数から順にfL(0), fL(1),fL(2),・・・,fL(k),・・・,fL(N−1)とする。右チャネルにおいて帯域分割された信号を低い周波数から順にfR(0), fR(1),fR(2),・・・,fR(k),・・・,fR(N−1)とする。類似度計算部104において、fL(k),fR(k)は、同じ周波数帯域ごとに類似度a(0),a(1),a(2),・・・,a(k),・・・,a(N−1)が計算される。ステレオ信号において、中央付近に定位する音源信号は左右の信号が一致、もしくはその差違が非常に小さい。これは、周波数領域に変換したのちも全ての周波数領域において、左右の成分の差違は小さいことを意味する。このことから類似度は、kが等しい、即ち同じ周波数成分間で、fL(k)とfR(k)の差違で決定することが出来る。
【００１３】
そして、各周波数帯域ごとに計算された類似度a(k)に基づき各周波数帯域ごとに減衰係数計算部105において減衰係数g(k)(k＝０〜N−１）が算出される。減衰係数は同一周波数帯域において、左右チャネル間で同一なものが各周波数帯域信号fL(k)に乗算器116で乗算される。
つまり、各周波数帯域ごとの左右レベル差、位相差から各周波数帯域ごとに類似度、そして減衰係数を計算し、各帯域に乗じて、左右チャネル音源信号合成部106,111で再合成することで、類似度の大きな成分だけの成分集合sL´,sR´が出力され、その結果、中央付近に定位する音源信号だけが残る。
類似度の計算方法
類似度a(k)の計算方法について、左右周波数帯域分割部103,110が短時間フーリエ変換（以下、FFTと略する）である場合について述べる。
【００１４】
FFTで周波数分割した場合、fL(k)およびfR(k)は一般に複素数となり、位相を考慮する必要がある。そこで、各成分の大きさの比と位相差によって二つの類似度を計算する。大きさの比による類似度をai(k)、位相差による類似度ap(k)とすると、
【００１５】
【数１】

ap(k)＝cosθ （２）
ここで、θはfL(k)とfR(k)の位相差を表す。
【００１６】
類似度ai(k),ap(k)は減衰係数計算部105に送られ、減衰係数g(k)が計算される。
減衰係数の計算方法
減衰係数g(k)の計算方法について説明する。
１.中央定位音源信号を強調する場合
中央に定位する音源信号を強調する場合について説明する。
▲１▼大きさの比による減衰係数gi(k)の計算方法を説明する。
【００１７】
（１）式から明らかなように、類似度ai(k)は、fL(k)とfR(k)の大きさが等しい時に１になり、それ以外は１より小さい値となる。したがって、大きさの比による類似度ai(k)を引数とする関数において、単調増加の関数の出力をgi(k)に選べばよい。
図２にその一例を示す。横軸は20log₁₀（ai(k)）、縦軸は20log₁₀（gi(k)）を示している。
【００１８】
ここで、 Ai(k)＝20log₁₀(ai(k)),Gi(k)＝20log₁₀(gi(k)）とすると、

中央に定位する音源信号だけであるならば、全てのkに対してai(k)は１（20log₁₀（ai(k)）＝０）になるが、その他に定位する信号が重畳されることにより、中央定位音源信号が支配的な帯域であっても１よりもやや小さくなることがある。よって図２のように適当な幅εを持たせることが有効である。この適当な幅εは例えばβは左右のレベル差や位相差が僅かで中央に音を知覚させる中央定位音源信号について音質などの変化が無視できる範囲で予め聴感上で決めることが好適である。ただし、εを大きくしすぎると、中央付近で左右いずれかの方向にずれて定位した音源信号などを抑圧することが出来なくなる。よって、εは誤差による中央定位音源信号の音質などの変化が無視できる範囲で０に近い値にすることが望ましい。
【００１９】
Giminは、中央定位音源信号以外の信号の抑圧量に相当する。この値を変化させることで、歌と伴奏に例えるならば、歌の大きさに対する伴奏の大きさを調整することが可能となる。図２において、βをεと一致させてもよいし、一致させなくてもよい。βをεに近づけると中央定位音源信号以外の信号は等しくGiminの減衰量で減衰されることが期待できるが、中央定位音源信号の支配的な帯域が誤って抑圧された場合の誤差の影響も大きくなる。βをεから離すことで中央定位音源信号が支配的な帯域を誤って抑圧した場合の誤差の影響を小さく出来るが、定位する位置によって抑圧量が変わってしまい、歌の伴奏に例えるならば、伴奏楽器間の音量のバランスが変わってしまうことが予想される。よって、中央定位音源信号の音質などの変化が無視できる範囲でβはεに近い値（０＞ε＞β）にすることが望ましい。
▲２▼位相差による減衰係数gp(k)の計算方法を説明する。
【００２０】
（２）式から明らかなように、類似度ap(k)は、fL(k)とfR(k)の位相が一致したときに１になり、それ以外は１より小さい値であり、位相差θがπ／２ラジアンの時に０、θがπラジアンの時、すなわち逆相の時に−１で最小である。一般に位相差による定位知覚は周波数帯域に依存し、大きさの比ほど単純ではない。しかし、少なくとも中央に定位する音源信号に関しては位相差は０に近く、よってap(k)は１に近い値であることが期待できる。このことから位相差による減衰係数gp(k)は例えば図３に示すように計算すればよい。
【００２１】
図３にその一例を示す。横軸はap(k)、縦軸は20log₁₀(gp(k))を表す。
ここで、Gp(k)＝20log₁₀(gp(k))とすると、

中央に定位する音源信号だけであるならば、全てのkに対してap(k)は１になるが、その他の雑音信号が重畳されることにより、中央定位音源信号が支配的な帯域であっても１よりやや小さくなることがある。よって図２のように適当な幅ζを持たせることが有効である。しかしζを大きくしすぎると、中央に定位しない他の音源信号の抑圧が不十分になる。よって、ζは誤差による中央定位音源信号の変化が無視できる範囲で１に近い値（１＞ζ）にすることが望ましい。Gpminは、中央定位音源信号以外の信号の抑圧量に相当する。この値を変化させることで、歌と伴奏に例えるならば、歌の大きさに対する伴奏の大きさを調整することが可能となる。
【００２２】
図３において、αとζとを一致させてもよいし、一致させなくてもよい。αをζに近づけると中央定位音源信号以外の信号は等しくGpminの減衰量で減衰されることが期待できるが、中央定位音源信号の支配的な帯域が誤って抑圧された場合の誤差も大きくなる。αをζから離すことで中央定位音源信号が支配的な帯域を誤って抑圧された場合の誤差の影響を小さく出来るが、位相差による抑圧量の違いは周波数帯域によってその影響度が異なるため、歌の伴奏に例えるならば、伴奏楽器の音量のバランスだけではなく音色などが変わってしまうことなどが予想される。よって、中央定位音源信号の変化が無視できる範囲でαはζに近い値（０＞ζ＞α）にすることが望ましい。
▲３▼二つの減衰係数gi(k)とgp(k)から実際にfL(k),fR(k)に乗算する減衰係数g(k)の計算方法を説明する。
【００２３】
適当な距離を離した二つのマイクロホンで比較的マイクロホンから距離が近い複数の音源信号を収音したステレオ信号が入力信号である場合には、ステレオ再生における定位は左右のマイクロホンに入ってくる信号の位相差と大きさの比（レベル差）に依存する。低い周波数においてはレベル差はつきにくく、位相差が大きく影響する。高い周波数では、大きさの比が大きく影響する。よって、例えば周波数帯域を二つに分けてそれよりも低い周波数においてはgi(k)を、高い周波数においてはgp(k)を採用することが考えられる。しかしながら、壁に囲まれた残響のある部屋において、マイクロホンから離れた位置に存在する音源からの信号は一般に左右のレベル差はほとんどなく、逆に位相が左右のマイクロホンでランダムになるため（２）式の値が０に近くなる。この場合は全ての周波数において優先的にgp(k)を使うことが望ましい。さらにポピュラー音楽等の場合は、直接マイクロホンで収音するだけでなく、左右チャネル信号に大きさの比や時間差、あるいは位相の時間的な変化を人工的に付加することで自然界には存在しない定位を得ることが普通であり、もっと複雑になる。以上のように様々なステレオ入力信号に応じて、最適なg(k)の選択をすることは非常に困難である。しかしながら、どの場合も少なくとも中央に定位する音源信号の大きさの比と位相差は共に小さい、そこで、g(k)として、gi(k)とgp(k)の小さいほうを採用することにする。即ち、
g(k)＝min（gi(k),gp(k)）（３）
ここで、min(A,B）はAとBのどちらか小さい方を出力することを意味する。つまり、どんなステレオ入力信号であっても、大きさか位相の少なくとも一方が左右で異なる場合は抑圧することになり、その結果、中央に定位する音源信号を強調することが可能となる。
【００２４】
上記のように減衰係数計算部105で計算されたg(k)は図１にあるように各チャネル各周波数帯域のfL(k),fR(k)に乗算器116で乗算される。同じ帯域kにおいて左右のチャネルに同じg(k)を乗算することで、中央に定位する音源信号以外の音源信号を定位を維持したまま抑圧することが可能となる。g(k)を乗算した信号は、fL(k)は左チャネル音源信号合成部106で再合成して時間波形sL´に変換される。fR(k)は右チャネル音源信号合成部111で再合成して時間波形sR´に変換される。sL´,sR´はステレオ信号出力部107から、ラウドスピーカ108やステレオヘッドホン109に送られる。
【００２５】
以上の処理により、中央に定位する音源信号を強調、その他の音源信号を抑圧した合成信号をステレオラウドスピーカ108やステレオヘッドホン109等で受聴することが可能となる。
２.中央定位音源信号を抑圧する場合
中央に定位する音源信号を抑圧し、それ以外の音源信号を強調する場合について説明する。
図１において類似度計算部104で類似度ai(k),ap(k)を計算するところまでは先に述べた中央に定位する音源信号を強調する場合と同じであり、類似度から減衰係数を計算する部分が異なる。中央に定位する音源信号を抑圧するのであるから、大きさによる減衰係数gi(k)を図４に示すように計算し、位相による減衰係数を図５に示すように計算すればよい。
【００２６】
図４、５にその一例を示す。
ここで、Ai(k)＝20log₁₀(ai(k)),Gi(k)＝20log₁₀(gi(k)),Gp(k)＝20log₁₀(gp(k))とすると、

即ち、左右の類似度が大きいほど減衰係数を小さくすることによって、中央に定位する音源信号を抑圧することが可能となる。α,β,ε,ζの考え方は前述の中央に定位する音源信号を強調する場合と同様であるため割愛する。gi(k)とgp(k)からg(k)を得る方法も強調の場合と同じ考えで、
g(k)＝max(gi(k),gp(k)）（４）
と計算する。ここで、max(A,B)はAとBから大きいほうを出力することを意味する。即ち、大きさによる減衰係数gi(k)と位相による減衰係数gp(k)の少なくともどちらか一方が大きい場合には、左右チャネル信号に位相差か大きさの違いがあることを意味し、その信号は中央に定位する音源信号ではないと考えるからである。gi(k)とgp(k)が共に小さい場合のみ、中央に定位する音源信号であり、抑圧の対象となる。
【００２７】
減衰係数計算部105で計算されたg(k)を各周波数帯域のfL(k),fR(k)に乗算するところから先は中央に定位する音源を強調する場合と同じであるので割愛する。上記の方法では、原信号における中央付近に定位する音源信号とそれ以外の音源信号（例えば、歌と伴奏に対応する）の音量差にかかわらず、減衰係数g(k)により、一律に抑圧される。つまり、原信号において既に適切な音量差であった場合でも、さらに抑圧され、音量差は拡大する。
【００２８】
そこで、次に、原信号の中央付近に定位する音源信号とそれ以外の音源信号の音量差を推定し、その差を一定にするように減衰係数を決定する方法について説明する。
中央付近に定位する音源信号を強調する場合には、g(k)の値が大きければ中央定位音源信号成分と推測される。そこで、g(k)の値に適当なしきい値gthを設定し、gthよりも大きな周波数成分の大きさの合計を中央付近定位信号の大きさの推定値gcとする。同様にgthよりも小さな周波数成分の大きさの合計を中央以外に定位する音源信号の大きさの推定値gbとする。これらの値は瞬時の値であるため、これらの長時間平均をとる、その方法には、例えば、移動平均法などが考えられる。それらの値をgca,gbaとすると、
rcb＝gca／gba （５）
rcbは騒音で考えると長時間平均のSN比に相当する。Sが中央付近定位音源信号の大きさ（パワー）で、Nがそれ以外の定位音源信号の大きさに相当する。次にこのrcbを基準に所望のSN比にするように「N」を抑圧することを考える。所望のSN比をrd、必要な抑圧量をg2とすると、rcbを抑圧量g2で割った値がrdになればよい。よって、
g2＝rcb／rd (rcb≦rdのとき)
g2＝1.0 (rcb＞rdのとき) （６）
上式では、原信号のSN比が、所望のSN比よりも大きい時には抑圧しない。その際、原信号よりもSN比を強制的に小さくしたい場合には、rcbとrdの大小関係にかかわらず、（６）式上段のみを使用し、１より大きなg2に設定すればよい。
【００２９】
各周波数帯域において、減衰係数が上述のgthよりも小さな帯域成分に対し、g2を乗算することで、平均的なSN比、すなわち中央定位音源信号とそれ以外の音源信号の音量差を所望量にすることが可能となる。
以上の処理により、中央に定位する音源信号を抑圧、その他の音源信号を強調した合成信号をステレオラウドスピーカ108やステレオヘッドホン109等で受聴することが可能となる。
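The level-balancing step of Equations (5) and (6) above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the long-time averaging (for example a moving average) is left to the caller, and bands whose gain exactly equals gth are assigned to the non-center group, a case the text does not specify.

```python
def estimate_levels(mags, gains, gth):
    """Split band magnitudes into center (g > gth) and non-center (g <= gth) sums.

    Returns the instantaneous estimates (gc, gb). Assigning g == gth to gb is
    an assumption; the source only says "larger" and "smaller" than gth.
    """
    gc = sum(m for m, g in zip(mags, gains) if g > gth)
    gb = sum(m for m, g in zip(mags, gains) if g <= gth)
    return gc, gb


def snr_gain(gca, gba, rd):
    """Eq. (5) and (6): extra gain g2 for the non-center bands.

    gca, gba: long-time averages of gc and gb (gba must be non-zero).
    rd: desired ratio of center level to non-center level.
    """
    rcb = gca / gba                        # Eq. (5)
    return rcb / rd if rcb <= rd else 1.0  # Eq. (6)
```

With `gca = 2.0`, `gba = 1.0`, and a desired ratio `rd = 4.0`, `snr_gain` gives 0.5, so halving the non-center bands raises the ratio from 2 to 4.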
（第２実施例）
図６は、本発明の第２の実施例を示すブロック図である。
【００３０】
減衰係数g(k)を乗算した後、左右のチャネル信号を加算器117で加算することでモノラル化する。ステレオの効果はなくなるが、左右のチャネルの信号を加算することで左右チャネルで無相関な音源信号成分をより抑圧することが可能である。
多くの歌の入ったポピュラー音楽において中央には歌の他にベースドラムやベースの音を定位させる場合が多い。これらの主たる周波数成分は歌の周波数成分よりも低いため、これらを抑圧するには、例えば減衰係数計算部105において、図７に示したようなge(k)をg(k)に乗算し、新たなg(k)とすることも有効である。図７において、横軸は分割周波数帯域k、縦軸は20log₁₀（ge(k)）である。kLは低音楽器を抑圧するための下限周波数帯域を示す。kL以下の周波数帯域をGemin抑圧する。mは周波数帯域が低くなるにつれて除々に抑圧量を増やしていくことで周波数軸上の不連続を押さえるための小さな正の整数である。kLやmを大きくしすぎると歌の低域周波数成分を抑圧してしまうため、例えば周波数に換算してkLは100Hzから200Hzくらいが適当である。
【００３１】
逆に中央に定位する歌のみを抑圧し、中央に定位するベースドラムやベースの音を抑圧しないようにするには、g(k)による中央定位音源帯域の抑圧を、低い周波数帯域（例えば、０〜200Hzの帯域）では行わないようにすればよい。
中央に定位するベースドラムなどを抑圧するもう一つの方法を説明する。ベースドラムは、音の立ち上がり時間が音声に比べて速い。そこでベースドラムの主たる周波数帯域において、音の立ち上がり時間を観測し、立ち上がり時間の短さに応じた減衰係数gak(k)をg(k)に乗算して新たなg(k)とすることで立ち上がりの速いベースドラムだけを抑圧することが可能となる。
【００３２】
その一手法について説明する。
ある周波数帯域kのT時刻の左右チャネルの平均を取った大きさをA(k,T)＝(|fL(k,T)|＋|fR(k,T)|)／２とする。１時刻前のA(k,T−1)との比をrとする。ここで、単位時刻は１フレームで通例数十ミリ秒程度である。
r＝A(k,T)／A(k,T−1) （A(k,T)＞A(k,T−1)の時）
r＝1.0 （A(k,T)＜A(k,T−1)の時）（７）
rが大きいほど、kの周波数帯域の立ち上がりが鋭いことを意味するから、立ち上がりの鋭さに対する減衰係数をgak(k,T)を、rが大きいほどgak(k,T)が小さくなるような関数の出力にすればよい。
例えば、
【００３３】
【数２】

rtはrの値に対してどの程度の割合で減衰させるかを表す負の実数である。
図８はrt＝−３の時、横軸にr、縦軸に20log₁₀（gak(k,T)）を示した例である。Gakmin＝−50は、減衰量の下限値を表している。各時刻において常に（８）式におけるgak(k,T)を乗算しているとスペクトルの時間変化が不連続になるので、
gak´(k,T)＝gak´(k,T−1)＋δ(gak(k,T)−gak´(k,T−1)）（９）
（９）式のようにスムージング処理を施したgak´(k,T)を用いるのがよい。ここでδはスムージングのための係数で、０より大きく１以下の実数である。gak´(k,T−1)とgak(k,T)の大小関係で異なる値を用いてもよい。
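Equations (7) and (9) above can be sketched as follows. Equation (8), which maps r to gak(k,T), is an image that is not reproduced in this text, so it is deliberately left out. The function names are hypothetical, and returning 1.0 when the previous magnitude is zero is an assumption the source does not address.

```python
def onset_ratio(a_now, a_prev):
    """Eq. (7): r = A(k,T) / A(k,T-1) while the band magnitude rises, else 1.0."""
    if a_prev > 0.0 and a_now > a_prev:
        return a_now / a_prev
    # Falling or flat magnitude; also covers a_prev == 0 (assumption).
    return 1.0


def smooth_gain(prev_smoothed, gak_now, delta):
    """Eq. (9): first-order smoothing of gak, with 0 < delta <= 1."""
    return prev_smoothed + delta * (gak_now - prev_smoothed)
```

With `delta = 1.0` the smoothed value simply follows gak(k,T); smaller values of `delta` make the gain change more gradually from frame to frame.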
【００３４】
以上の処理により、ステレオ音響信号の中央に定位する音源信号を強調、もしくは抑圧することが可能となる。
本発明のステレオ音声処理装置はＣＰＵやメモリ等を有するコンピュータと、アクセス主体となるユーザが利用するユーザ端末と記録媒体とから構成することができる。記録媒体はCD-ROM、磁気ディスク装置、半導体メモリ等のコンピュータ読み取り可能な記録媒体であり、ここに記録されたステレオ音声処理プログラムはコンピュータに読み取られ、コンピュータの動作を制御し、コンピュータに左右チャネルごとに複数の周波数帯域に分割する処理、各周波数帯域ごとにチャネル間の類似度を計算する処理、類似度から中央付近に定位する音源信号を抑圧、もしくは強調するための減衰係数を計算する処理、その減衰係数を各周波数帯域信号に乗算する処理、及び減衰係数を乗じた後の各チャネルごとの各周波数帯域信号を再合成する処理等のステレオ音響処理方法を実行させる。なお、上記ステレオ音声処理プログラムは通信回線を介して伝送されたものであってもよい。
（利用方法）
次に本発明の利用方法について説明する。
【００３５】
図９は本発明の第１の利用方法を示している。
音楽コンパクトディスク301はステレオ再生用で、その中に中央に定位する主たる音源信号も収録されているものとする。音楽コンパクトディスク301をパーソナルコンピュータ302において、図１，もしくは図４に示した本発明の中央に定位する音源信号を強調する処理と周波数特性等の聴覚補正処理などを施し、出力する。アンプ303で利得を調整した後、ステレオヘッドホン304等で聴取する。これは、例えば聴覚者等が歌に比べて伴奏を小さくして聞きたい場合などに利用できる。
【００３６】
図10は本発明の第２の利用方法でミュージックオンデマンドに本発明を利用する例を示している。
音楽ソースはネットワーク306に接続されたホストコンピュータ305に多数格納されている。利用者はネットワーク306に接続したパーソナルコンピュータ302から、ホストコンピュータ305に自分の聞きたい音楽ソース名と、信号の処理方法を指定する。信号の処理方法の指定とは、例えば、本発明における中央に定位する歌以外の伴奏をどれだけ小さくするか、あるいは周波数特性を自分の好みに応じてどのように調整するか、などの処理の指定である。ホストコンピュータはその指定に従って音楽ソースを検索、指示通りの信号処理をした後、もしくは処理をしながらネットワークを介して利用者のパーソナルコンピュータ302へ音楽信号を送信する。利用者は、パーソナルコンピュータ302から出力された音楽信号をアンプ303で利得を調整し、ステレオヘッドホン304等で送られて来た音楽信号を聴取する。
【００３７】
また、パーソナルコンピュータの代わりに、ネットワークに無線で接続できる機能を内蔵した携帯型の音楽再生機でも、同じことが可能である。
また、以上の利用方法において、中央に定位する歌などを抑圧する処理をすれば、例えば携帯型の簡易カラオケなどに利用することも出来る。
図11は本発明の第３の利用方法を示した図である。
正面で話す話者の声を強調することを目的とする。難聴者の例えば頭部左右に配置した単一指向性マイクロホン201,202で収音した音響信号を、難聴者が携帯する小型の筺体に内蔵した本発明処理部で処理することで正面話者の音声以外の騒音を抑圧する。その後、同筺体に内蔵された音質や利得などの補聴処理を施し、左右のイヤホン203,204へ出力することで騒音を抑圧し、強調した正面話者の音声を受聴することが可能となる。
【００３８】
【発明の効果】
以上の説明のように本発明によれば、ステレオ音響信号から原信号の定位を損ねず、中央に定位する音源信号を所望の量だけ強調、もしくは抑圧することが、音源の定位情報のみで可能となり、以下のような効果が期待できる。
（１）難聴者等が市販の音楽ソースを受聴する際、中央に定位する主たる音源信号である歌とそれ以外の伴奏の音量バランスを、難聴者自身が自由に聞き易いように調整し、音楽をより良く楽しむことが期待できる。
（２）騒音環境下において、正面にいる目的話者の音声のみを強調することが可能となり、快適なコミュニケーションを実現することが期待できる。
（３）中央に定位する歌などを抑圧することでカラオケ音源などを作成することが可能となる。
【図面の簡単な説明】
【図１】本発明の第１の実施例を示すブロック図。
【図２】中央定位音源を強調する時のai(k)とgi(k)の関係を示す図。
【図３】中央定位音源を強調する時のap(k)とgp(k)の関係を示す図。
【図４】中央定位音源を抑圧する時のai(k)とgi(k)の関係を示す図。
【図５】中央定位音源を抑圧する時のap(k)とgp(k)の関係を示す図。
【図６】本発明の第２の実施例を示すブロック図。
【図７】 kとge(k)の関係を示す図。
【図８】 rとgak(k,T)の関係を示す図。
【図９】本発明の第１の利用方法を示す図。
【図１０】本発明の第２の利用方法を示す図。
【図１１】本発明の第３の利用方法を示す図。
【符号の説明】
１０２ステレオ信号入力部
１０３左チャネル周波数帯域分割部
１０４類似度計算部
１０５減衰係数計算部
１０６左チャネル音源信号合成部
１０７ステレオ信号出力部
１０８ステレオラウドスピーカ
１０９ステレオヘッドホン
１１０右チャネル周波数帯域分割部
１１１右チャネル周波数音源信号合成部
１１２音源信号合成部
１１３モノラル信号出力部
１１４ラウドスピーカ
１１５モノラルイヤホン
１１６乗算器
１１７加算器
２０１左ｃｈ単一指向性マイクロホン
２０２右ｃｈ単一指向性マイクロホン
２０３左ｃｈイヤホン
２０４右ｃｈイヤホン
２０５本発明処理部、補聴部
３０１音楽コンパクトディスク
３０２パーソナルコンピュータ
３０３アンプ
３０４ステレオヘッドホン
３０５ホストコンピュータ
３０６ネットワーク
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a stereo sound signal processing method and apparatus for emphasizing or suppressing a sound source signal localized near the center of a two-channel stereo signal in which sound signals from a plurality of sources, such as speech, musical sounds, and various environmental sounds, are mixed, and to a recording medium on which a stereo sound signal processing program is recorded. It is applied, for example, to playing back a stereo music source according to the listener's preference, or to listening to only a target voice in a noisy environment.
[0002]
[Prior art]
To suppress only the sound source signal localized near the center of a stereo sound signal, whether the signal was picked up by two microphones or produced by artificially applying level or phase differences between the channels so that multiple sources are localized at multiple positions, it suffices to invert the polarity of one channel and add it to the other. This works because the left and right signals of a center-localized source differ only slightly. The method is used as the so-called vocal cancellation technique, which removes the main melody part, such as the vocals, from a finished music signal and extracts only the accompaniment.
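The prior-art cancellation described above can be sketched as follows. The toy sample values and the function name `cancel_center` are hypothetical; the point is only that the centered component cancels and that a single channel remains.

```python
def cancel_center(left, right):
    """Prior-art vocal cancellation: invert one channel and add it to the other."""
    return [l - r for l, r in zip(left, right)]


# Hypothetical toy signals: a centered "vocal" plus an accompaniment that is
# louder in the left channel than in the right.
vocal = [0.5, -0.25, 0.75, 0.0]
accomp = [0.25, 0.5, -0.5, 0.25]
left = [v + a for v, a in zip(vocal, accomp)]
right = [v + 0.5 * a for v, a in zip(vocal, accomp)]

# The vocal cancels exactly; what is left is a scaled accompaniment in ONE channel.
mono = cancel_center(left, right)
print(mono)
```

The output contains no trace of `vocal`, but it is a single signal, which is the loss of stereo that the following paragraph criticizes.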
[0003]
However, this method has the problem that the accompaniment, originally stereo, becomes monaural as a result of the addition. In addition, the amount by which the center-localized source is suppressed cannot be adjusted.
Meanwhile, people with hearing loss are said to be less able to pick out a target sound source from among several sources (the ability behind the so-called cocktail party effect). It has therefore been pointed out that, in music produced for listeners with normal hearing, the accompaniment often seems louder to them than the vocals. In that case it is desirable to emphasize the center-localized vocals and suppress the accompaniment, but the method above cannot do this.
[0004]
There are other methods for extracting or enhancing a target sound source signal from a signal in which a plurality of sound sources are mixed.
The first method estimates, in the frequency domain, the fundamental frequency of a source that has a periodic structure, extracts its harmonic structure, and resynthesizes the components presumed to belong to the same source.
However, the first method is limited to sources with a harmonic structure. Moreover, estimating the harmonic structure always involves some error, which is perceived as noise and degrades the accuracy with which the target sound source signal is extracted.
[0005]
The second method extracts or enhances a target source, such as speech, whose frequency characteristics change more often than those of a stationary source, from a signal in which it is superimposed on a stationary noise source whose frequency characteristics change relatively slowly. Working in the frequency domain, it first estimates the portions of the mixed signal where the target is absent, that is, the noise source signal, and stores the average frequency characteristic of that noise. It then subtracts the stored average noise spectrum from the signal containing both noise and target, thereby enhancing or extracting the target sound source signal.
[0006]
However, the second method requires the noise source signal to be stationary. With a non-stationary source such as the accompaniment to a song, it is difficult to estimate the accompaniment-only passages and to suppress the accompaniment.
[0007]
[Problems to be solved by the invention]
An object of the present invention is to make the degree of suppression or emphasis adjustable in a technique that suppresses or emphasizes a sound source signal localized near the center of a stereo sound signal.
Another object of the present invention is to emphasize or suppress only the sound source signal localized near the center without impairing the localization of the original signal.
[0008]
Another object of the present invention is to suppress or emphasize the target sound source signal with high accuracy without depending on its harmonic structure.
Another object of the present invention is to suppress or emphasize the target sound with high accuracy even when the non-target signal (noise signal) is non-stationary.
[0009]
[Means for Solving the Problems]
To achieve the above objects, the stereo sound signal processing method and apparatus of the present invention receive a stereo sound signal, divide each of the two channel signals into a plurality of frequency bands, calculate the similarity between the channels for each frequency band, calculate from the similarity and related quantities an attenuation coefficient for suppressing or emphasizing the sound source signal localized at the center, multiply each frequency band signal by this attenuation coefficient, resynthesize the resulting frequency band signals for each channel, and output the resynthesized signals.
[0010]
The present invention divides the input stereo signal into a plurality of frequency bands for each channel. For each frequency band, the similarity between the channel signals is determined from their amplitude ratio, phase difference, and the like. If bands with low similarity are multiplied by a smaller attenuation coefficient than bands with high similarity, and each channel is then resynthesized and output, the center-localized source is emphasized to a degree set by the lower limit of the attenuation coefficient. Conversely, if bands with high similarity are multiplied by a smaller attenuation coefficient than bands with low similarity, and each channel is then resynthesized and output, the center-localized source is suppressed to a degree set by the lower limit of the attenuation coefficient.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
FIG. 1 is a block diagram showing a first embodiment of the present invention.
The present invention is effective if the acoustic signal input to the stereo signal input unit 102 is a stereo signal that is collected so that the target sound source signal to be emphasized or suppressed is localized near the center.
[0012]
The stereo signal input to the stereo signal input unit 102 is processed for each of the left and right channels. Details of the processing method will be described below.
The left-channel signal sL is transformed into the frequency domain by the left-channel frequency band division unit 103. Likewise, the right-channel signal sR is transformed into the frequency domain by the right-channel frequency band division unit 110. Let N be the number of bands. The band-divided signals of the left channel are denoted, from the lowest frequency upward, fL(0), fL(1), fL(2), ..., fL(k), ..., fL(N−1), and those of the right channel fR(0), fR(1), fR(2), ..., fR(k), ..., fR(N−1). The similarity calculation unit 104 calculates the similarities a(0), a(1), a(2), ..., a(k), ..., a(N−1) between fL(k) and fR(k) for each frequency band. In a stereo signal, the left and right signals of a source localized near the center are identical or differ only very slightly. This means that, after transformation to the frequency domain, the difference between the left and right components is also small in every frequency band. The similarity can therefore be determined from the difference between fL(k) and fR(k) for equal k, that is, between components of the same frequency.
[0013]
Then, based on the similarity a(k) calculated for each frequency band, the attenuation coefficient calculation unit 105 calculates an attenuation coefficient g(k) (k = 0 to N−1) for each frequency band. Within a given frequency band, the same attenuation coefficient is used for the left and right channels, and the multipliers 116 multiply the frequency band signals fL(k) and fR(k) by it.
In short, the similarity and then the attenuation coefficient are calculated for each frequency band from the left/right level difference and phase difference in that band; each band is multiplied by its coefficient and resynthesized by the left and right channel sound source signal synthesis units 106 and 111. The outputs sL′ and sR′ then consist only of the components with high similarity, so that only the sound source signal localized near the center remains.
Similarity calculation method
The calculation of the similarity a(k) is described for the case where the left and right frequency band division units 103 and 110 use the short-time Fourier transform (hereinafter abbreviated as FFT).
[0014]
When frequency division is performed by FFT, fL(k) and fR(k) are in general complex numbers, so phase must be taken into account. Two similarities are therefore calculated, one from the ratio of the component magnitudes and one from their phase difference. Denoting the similarity based on the magnitude ratio by ai(k) and the similarity based on the phase difference by ap(k),
[0015]
[Expression 1]

ap(k) = cos θ (2)
Here, θ represents the phase difference between fL(k) and fR(k).
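The image for Equation (1) is not reproduced in this text. As a minimal sketch, assuming Equation (1) takes the ratio of the smaller magnitude to the larger (which matches the property stated in paragraph [0017] that ai(k) is 1 for equal magnitudes and below 1 otherwise), the two similarities for one frequency bin could be computed as follows. The function names and the handling of silent bins are illustrative assumptions, not part of the patent.

```python
import cmath
import math


def magnitude_similarity(fl, fr):
    """ai(k): smaller magnitude over larger magnitude (ASSUMED form of Eq. (1))."""
    ml, mr = abs(fl), abs(fr)
    if ml == 0.0 and mr == 0.0:
        return 1.0  # assumption: two silent bins count as identical
    return min(ml, mr) / max(ml, mr)


def phase_similarity(fl, fr):
    """ap(k) = cos(theta), theta = phase difference of fL(k) and fR(k) (Eq. (2))."""
    if fl == 0 or fr == 0:
        return 1.0  # assumption: phase is undefined for a silent bin
    theta = cmath.phase(fl) - cmath.phase(fr)
    return math.cos(theta)
```

For example, `magnitude_similarity(2 + 0j, 1 + 0j)` gives 0.5, and `phase_similarity(1 + 0j, -1 + 0j)` gives -1.0 for an antiphase pair.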
[0016]
The similarities ai(k) and ap(k) are sent to the attenuation coefficient calculation unit 105, which calculates the attenuation coefficient g(k).
Attenuation coefficient calculation method
The method of calculating the attenuation coefficient g(k) is described below.
1. Emphasizing the center-localized sound source signal
The case where the sound source signal localized at the center is emphasized is described first.
(1) The method of calculating the attenuation coefficient gi(k) based on the magnitude ratio is described.
[0017]
As is clear from Equation (1), the similarity ai(k) equals 1 when the magnitudes of fL(k) and fR(k) are equal and is less than 1 otherwise. It therefore suffices to choose gi(k) as the output of a monotonically increasing function of the magnitude-ratio similarity ai(k).
An example is shown in FIG. 2. The horizontal axis is 20log₁₀(ai(k)) and the vertical axis is 20log₁₀(gi(k)).
[0018]
Here, letting Ai(k) = 20log₁₀(ai(k)) and Gi(k) = 20log₁₀(gi(k)),

If only a center-localized source were present, ai(k) would be 1 (20log₁₀(ai(k)) = 0) for every k. When signals localized elsewhere are superimposed, however, ai(k) can fall slightly below 1 even in bands where the center-localized source dominates. It is therefore effective to allow a suitable margin ε, as in FIG. 2. This margin is preferably determined in advance by listening, within the range in which the change in sound quality is negligible for center-localized source signals whose left/right level and phase differences are slight enough that the sound is perceived at the center. If the margin is made too wide, however, sources localized slightly to the left or right of center can no longer be suppressed. ε should therefore be as close to 0 as possible within the range in which the resulting change in the sound quality of the center-localized source is negligible.
[0019]
Gimin corresponds to the amount by which signals other than the center-localized source are suppressed. In terms of a song and its accompaniment, changing this value adjusts the level of the accompaniment relative to the vocals. In FIG. 2, β may or may not coincide with ε. Bringing β close to ε means that signals other than the center-localized source can be expected to be attenuated uniformly by Gimin, but it also increases the impact of errors when a band dominated by the center-localized source is suppressed by mistake. Moving β away from ε reduces the impact of such errors, but the amount of suppression then varies with the localization position; in the song example, the volume balance among the accompanying instruments would be expected to change. It is therefore desirable to set β close to ε (0 > ε > β) within the range in which the change in the sound quality of the center-localized source is negligible.
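The piecewise definition of Gi(k) is given only in FIG. 2 and its accompanying equation, neither of which is reproduced in this text. The sketch below assumes a simple shape consistent with the description: 0 dB at or above ε, Gimin at or below β, and a linear ramp in between. The linear ramp, the default parameter values, and the function names are assumptions chosen for illustration.

```python
def gi_db_emphasis(ai_db, eps_db=-1.0, beta_db=-3.0, gimin_db=-20.0):
    """Gi(k) in dB from Ai(k) in dB for center emphasis; needs 0 > eps_db > beta_db."""
    if ai_db >= eps_db:
        return 0.0        # within the margin: treated as center, left untouched
    if ai_db <= beta_db:
        return gimin_db   # clearly off-center: full suppression
    # ASSUMED linear ramp between beta and eps (the real curve is in FIG. 2).
    return gimin_db * (eps_db - ai_db) / (eps_db - beta_db)


def db_to_linear(db):
    """Convert a gain in dB to the linear factor that multiplies fL(k) and fR(k)."""
    return 10.0 ** (db / 20.0)
```

Setting `beta_db` close to `eps_db` makes the curve approach a hard switch, which corresponds to the case where β nearly coincides with ε.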
(2) The method of calculating the attenuation coefficient gp(k) based on the phase difference is described.
[0020]
As is clear from Equation (2), the similarity ap(k) equals 1 when the phases of fL(k) and fR(k) coincide and is less than 1 otherwise; it is 0 when the phase difference θ is π/2 radians and reaches its minimum of −1 when θ is π radians, that is, when the signals are in antiphase. In general, the perception of localization from phase differences depends on the frequency band and is not as simple as for the magnitude ratio. At least for a center-localized source, however, the phase difference is close to 0, so ap(k) can be expected to be close to 1. The attenuation coefficient gp(k) based on the phase difference may therefore be calculated, for example, as shown in FIG. 3.
[0021]
An example is shown in FIG. 3. The horizontal axis is ap(k) and the vertical axis is 20log₁₀(gp(k)).
Here, letting Gp(k) = 20log₁₀(gp(k)),

If only a center-localized source were present, ap(k) would be 1 for every k. When other noise signals are superimposed, however, ap(k) can fall slightly below 1 even in bands where the center-localized source dominates. It is therefore effective to allow a suitable margin ζ, as in FIG. 3. If the margin is made too wide, however, suppression of the other sources that are not localized at the center becomes insufficient. ζ should therefore be close to 1 (1 > ζ) within the range in which the resulting change in the center-localized source is negligible. Gpmin corresponds to the amount by which signals other than the center-localized source are suppressed. In terms of a song and its accompaniment, changing this value adjusts the level of the accompaniment relative to the vocals.
[0022]
In FIG. 3, α may or may not coincide with ζ. Bringing α close to ζ means that signals other than the center-localized source can be expected to be attenuated uniformly by Gpmin, but it also increases the error when a band dominated by the center-localized source is suppressed by mistake. Moving α away from ζ reduces the impact of such errors, but because the effect of phase-dependent differences in suppression varies with the frequency band, in the song example not only the volume balance of the accompanying instruments but also their timbre would be expected to change. It is therefore desirable to set α close to ζ (1 > ζ > α) within the range in which the change in the center-localized source is negligible.
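As with FIG. 2, the curve of FIG. 3 is not reproduced in this text. The sketch below assumes the analogous shape for the phase-based coefficient: 0 dB when ap(k) is at or above ζ, Gpmin when it is at or below α, and a linear ramp in between, with ζ close to 1 and α below it. The ramp, the default values, and the function name are illustrative assumptions.

```python
def gp_db_emphasis(ap, zeta=0.95, alpha=0.7, gpmin_db=-20.0):
    """Gp(k) in dB from ap(k) = cos(theta) for center emphasis; needs 1 > zeta > alpha."""
    if ap >= zeta:
        return 0.0        # nearly in phase: treated as center
    if ap <= alpha:
        return gpmin_db   # clear phase difference: full suppression
    # ASSUMED linear ramp between alpha and zeta (the real curve is in FIG. 3).
    return gpmin_db * (zeta - ap) / (zeta - alpha)
```

An antiphase bin (`ap = -1.0`) receives the full Gpmin attenuation, while an in-phase bin (`ap = 1.0`) passes unchanged.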
(3) The method of calculating, from the two attenuation coefficients gi(k) and gp(k), the attenuation coefficient g(k) that is actually applied to fL(k) and fR(k) is described.
[0023]
When the input is a stereo signal in which several sources relatively close to the microphones are picked up by two microphones spaced a suitable distance apart, localization in stereo playback depends on the phase difference and the magnitude ratio (level difference) of the signals arriving at the left and right microphones. At low frequencies, level differences hardly arise and the phase difference has the larger effect. At high frequencies, the magnitude ratio has the larger effect. One could therefore split the frequency range in two and use gp(k) below the split frequency and gi(k) above it. In a reverberant room enclosed by walls, however, signals from sources far from the microphones generally show almost no left/right level difference, while their phase becomes random between the two microphones, so the value of Equation (2) is close to 0. In that case it is desirable to give priority to gp(k) at all frequencies. With popular music and the like, matters are more complicated still: besides direct microphone pickup, it is common to create localizations that do not occur in nature by artificially adding magnitude ratios, time differences, or time-varying phase to the left and right channel signals. Selecting the optimum g(k) for every kind of stereo input is thus very difficult. In every case, however, at least the source localized at the center has both a small left/right level difference and a small phase difference. The smaller of gi(k) and gp(k) is therefore adopted as g(k). That is,
g(k) = min(gi(k), gp(k)) (3)
Here, min(A, B) denotes the smaller of A and B. Thus, for any stereo input, a band is suppressed whenever at least one of magnitude and phase differs between left and right, and as a result the sound source signal localized at the center is emphasized.
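Equation (3) can be illustrated with a few hypothetical per-band gains. The function name and the sample values are assumptions for illustration; the gains are linear factors, with 1.0 meaning "pass" and 0.1 meaning "suppress".

```python
def combine_for_emphasis(gi, gp):
    """Eq. (3): per-band g(k) = min(gi(k), gp(k))."""
    return [min(a, b) for a, b in zip(gi, gp)]


# Band 0: magnitude and phase both match   -> kept.
# Band 1: magnitudes differ                -> suppressed.
# Band 2: phases differ                    -> suppressed.
# Band 3: both differ                      -> suppressed.
gi = [1.0, 0.1, 1.0, 0.1]
gp = [1.0, 1.0, 0.1, 0.1]
g = combine_for_emphasis(gi, gp)
print(g)
```

Only band 0, where both cues indicate a centered source, keeps a gain of 1.0.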
[0024]
The coefficient g(k), calculated by the attenuation coefficient calculation unit 105 as described above, is multiplied into fL(k) and fR(k) of each frequency band of each channel by the multipliers 116, as shown in FIG. 1. Multiplying the left and right channels by the same g(k) in the same band k makes it possible to suppress sources other than the center-localized one while preserving their localization. After multiplication by g(k), fL(k) is resynthesized by the left-channel sound source signal synthesis unit 106 into the time waveform sL′, and fR(k) is resynthesized by the right-channel sound source signal synthesis unit 111 into the time waveform sR′. sL′ and sR′ are sent from the stereo signal output unit 107 to the loudspeakers 108 or the stereo headphones 109.
[0025]
Through the above processing, a synthesized signal in which the center-localized source is emphasized and the other sources are suppressed can be listened to through the stereo loudspeakers 108, the stereo headphones 109, or the like.
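The whole emphasis path for a single frame can be sketched as follows. This is an illustration under simplifying assumptions, not the patented implementation: it uses a naive DFT on one frame without windowing or overlap-add, and a two-level gain (the case where β coincides with ε and α with ζ). The thresholds `ai_min` and `ap_min`, the gain `floor`, and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def emphasize_center_frame(s_left, s_right, floor=0.1, ai_min=0.9, ap_min=0.9):
    """Split into bins, compare channels per bin, apply one gain to both channels."""
    f_left, f_right = dft(s_left), dft(s_right)
    gains = []
    for fl, fr in zip(f_left, f_right):
        ml, mr = abs(fl), abs(fr)
        if max(ml, mr) < 1e-9:
            gains.append(1.0)  # numerically empty bin: nothing to attenuate
            continue
        ai = min(ml, mr) / max(ml, mr)                     # magnitude similarity
        ap = math.cos(cmath.phase(fl) - cmath.phase(fr))   # phase similarity
        gi = 1.0 if ai >= ai_min else floor                # two-level stand-in for FIG. 2
        gp = 1.0 if ap >= ap_min else floor                # two-level stand-in for FIG. 3
        gains.append(min(gi, gp))                          # Eq. (3)
    # The SAME gain is applied to both channels, so localization is preserved.
    out_left = idft([g * f for g, f in zip(gains, f_left)])
    out_right = idft([g * f for g, f in zip(gains, f_right)])
    return out_left, out_right


# Toy frame: a tone present equally in both channels (center) and a second
# tone present only in the left channel (off-center).
n = 16
center = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
side = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
left = [c + s for c, s in zip(center, side)]
out_left, out_right = emphasize_center_frame(left, center)
```

In this toy frame the centered tone passes unchanged in both channels, while the left-only tone is scaled by `floor`. A practical version would process overlapping windowed frames and use a fast FFT.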
2. Suppressing the center-localized sound source signal
The case where the sound source signal localized at the center is suppressed and the other sound source signals are emphasized is described next.
In FIG. 1, everything up to the calculation of the similarities ai(k) and ap(k) by the similarity calculation unit 104 is the same as in the emphasis case described above; only the calculation of the attenuation coefficients from the similarities differs. Since the center-localized source is to be suppressed, the magnitude-based attenuation coefficient gi(k) is calculated as shown in FIG. 4 and the phase-based attenuation coefficient gp(k) as shown in FIG. 5.
[0026]
An example is shown in FIGS.
Here, Ai(k) = 20 log10(ai(k)), Gi(k) = 20 log10(gi(k)), and Gp(k) = 20 log10(gp(k)).

In other words, the sound source signal localized at the center can be suppressed by making the attenuation coefficient smaller as the similarity between left and right becomes larger. The meaning of α, β, ε, and ζ is the same as in the case of emphasizing the sound source signal localized at the center, so its description is omitted. As in the emphasis case, g(k) is obtained from gi(k) and gp(k), here by calculating
g(k) = max(gi(k), gp(k)) (4)
Here, max(A, B) outputs the larger of A and B. That is, when at least one of the magnitude-based attenuation coefficient gi(k) and the phase-based attenuation coefficient gp(k) is large, there is a magnitude difference or a phase difference between the left and right channel signals, so the band is not a sound source signal localized at the center. Only when both gi(k) and gp(k) are small is the band a sound source signal localized at the center and therefore subject to suppression.
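Equation (4) can be illustrated with a short sketch. The function name `combine_for_suppression` and the sample coefficient values are assumptions made only for this illustration.

```python
def combine_for_suppression(gi, gp):
    """Equation (4): g(k) = max(gi(k), gp(k)) for every band k."""
    return [max(a, b) for a, b in zip(gi, gp)]


# Band 0: small gi and gp (center source) -> stays small, so it is suppressed.
# Band 1: phase differs (large gp)        -> not suppressed.
# Band 2: magnitude differs (large gi)    -> not suppressed.
gi = [0.1, 0.1, 1.0]
gp = [0.1, 1.0, 0.2]
g = combine_for_suppression(gi, gp)
```

Only band 0, where both coefficients are small, keeps a small g(k); a single large coefficient is enough to leave a band untouched.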
[0027]
The multiplication of fL(k) and fR(k) in each frequency band by g(k) calculated by the attenuation coefficient calculation unit 105 is the same as in the case of emphasizing the sound source localized at the center, so its description is omitted. In the above method, the sound source signal localized near the center of the original signal and the other sound source signals (corresponding, for example, to the song and the accompaniment) are suppressed uniformly by the attenuation coefficient g(k), regardless of their volume difference. In other words, even if the original signal already has an appropriate volume difference, suppression is still applied and the volume difference is increased further.
[0028]
Therefore, a method is described below that estimates the volume difference between the sound source signal localized near the center of the original signal and the other sound source signals, and determines the attenuation coefficient so that this difference is kept constant.
When a sound source signal localized near the center is emphasized, a band with a large g(k) is presumed to be a component of the centrally localized sound source signal. An appropriate threshold gth is therefore set for g(k), and the sum of the magnitudes of the frequency components whose g(k) is larger than gth is taken as the estimate gc of the magnitude of the signal localized near the center. Similarly, the sum of the magnitudes of the frequency components whose g(k) is smaller than gth is taken as the estimate gb of the magnitude of the sound source signals localized away from the center. Because these are instantaneous values, they are averaged over a long period, for example by a moving average. Denoting the averaged values by gca and gba,
rcb = gca / gba (5)
rcb corresponds to the long-term average S/N ratio in noise terminology, where S is the magnitude (power) of the sound source signal localized near the center and N is the magnitude of the sound source signals localized elsewhere. Next, N is suppressed so that a desired S/N ratio is obtained on the basis of rcb. Let the desired S/N ratio be rd and the necessary suppression amount be g2. Dividing rcb by g2 should give rd, so
g2 = rcb / rd (when rcb ≤ rd)
g2 = 1.0 (when rcb > rd) (6)
With the above equation, no suppression is performed when the S/N ratio of the original signal is larger than the desired S/N ratio. If it is nevertheless desired to forcibly lower the S/N ratio of the original signal in that case, only the upper line of equation (6) is used, regardless of the magnitude relationship between rcb and rd, so that g2 becomes larger than 1.
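Equations (5) and (6) can be sketched as follows. This is a hedged illustration rather than the method of the specification: the function names, the `force` flag, and the window-based moving average are assumptions, the band magnitudes are assumed to be supplied by the caller, and a band whose g(k) equals gth is counted in neither sum because the text only defines "larger" and "smaller".

```python
def estimate_center_and_other(magnitudes, g, gth):
    """Return the instantaneous estimates (gc, gb) for one frame."""
    gc = sum(a for a, gk in zip(magnitudes, g) if gk > gth)
    gb = sum(a for a, gk in zip(magnitudes, g) if gk < gth)
    return gc, gb


def moving_average(history, window):
    """Average of the last `window` values (fewer if history is shorter)."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def suppression_amount(gca, gba, rd, force=False):
    """Equation (6); `force` uses the upper line unconditionally."""
    rcb = gca / gba  # equation (5); gba is assumed to be non-zero
    if force or rcb <= rd:
        return rcb / rd
    return 1.0


gc, gb = estimate_center_and_other([1.0, 2.0, 3.0], [0.9, 0.1, 0.8], gth=0.5)
gca = moving_average([4.0, 6.0], window=2)  # 5.0
gba = moving_average([2.0, 2.0], window=2)  # 2.0, so rcb = 2.5
g2_needed = suppression_amount(gca, gba, rd=5.0)
g2_none = suppression_amount(gca, gba, rd=2.0)
g2_forced = suppression_amount(gca, gba, rd=2.0, force=True)
```

With rcb = 2.5 and rd = 5.0 the other sources must be halved; with rd = 2.0 the original already exceeds the target, so g2 stays at 1.0 unless the upper line is forced.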
[0029]
In each frequency band, multiplying the band components whose attenuation coefficient is smaller than the above gth by g2 makes it possible to set the average S/N ratio, that is, the volume difference between the sound source signal localized at the center and the other sound source signals, to the desired amount.
Through the above processing, the listener can hear, through the stereo loudspeaker 108, the stereo headphones 109, or the like, a synthesized signal in which the sound source signal localized at the center is suppressed and the other sound source signals are emphasized.
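The g2 scaling of paragraph [0029] can be sketched as below. The function name `apply_g2` is hypothetical, and the text does not state whether g2 replaces g(k) or is applied in addition to it; this sketch assumes, purely for illustration, that only g2 is applied and that bands at or above gth are passed through unchanged.

```python
def apply_g2(f_left, f_right, g, gth, g2):
    """Scale both channels by g2 in every band whose g(k) is below gth."""
    out_left, out_right = [], []
    for fl, fr, gk in zip(f_left, f_right, g):
        scale = g2 if gk < gth else 1.0  # assumption: other bands untouched
        out_left.append(scale * fl)
        out_right.append(scale * fr)
    return out_left, out_right


f_left = [1.0 + 0.0j, 2.0 + 0.0j]
f_right = [1.0 + 0.0j, 4.0 + 0.0j]
out_left, out_right = apply_g2(f_left, f_right, g=[0.9, 0.1], gth=0.5, g2=0.5)
```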
(Second embodiment)
FIG. 6 is a block diagram showing a second embodiment of the present invention.
[0030]
After multiplication by the attenuation coefficient g(k), the left and right channel signals are added by the adder 117 to produce a monaural signal. Although the stereo effect is lost, adding the left and right channel signals further suppresses sound source signal components that are uncorrelated between the two channels.
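The effect of the adder 117 can be seen in a small sketch. The function name `to_mono` and the sample band values are assumptions; the anti-phase band is an extreme case chosen only to make the cancellation visible.

```python
def to_mono(out_left, out_right):
    """Add the left and right band components after multiplication by g(k)."""
    return [l + r for l, r in zip(out_left, out_right)]


# Band 0 is in phase in both channels; band 1 is in anti-phase.
mono = to_mono([1.0 + 0.0j, 1.0 + 0.0j], [1.0 + 0.0j, -1.0 + 0.0j])
```

The in-phase band adds coherently while the anti-phase band cancels, which is why components that differ between the channels are reduced further by the summation.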
In popular music, much of which features singing, the bass drum or bass is often localized at the center in addition to the song. Because the main frequency components of these instruments are lower than those of the song, it is also effective, in order to suppress them, for the attenuation coefficient calculation unit 105 to multiply g(k) by ge(k), for example, as shown in FIG. and to use the product as a new g(k). In FIG. 7, the horizontal axis is the divided frequency band k and the vertical axis is 20 log10(ge(k)). kL is the lower-limit frequency band for suppressing low-pitched instruments; frequency bands below kL are suppressed by Gemin. m is a small positive integer that avoids a discontinuity on the frequency axis by increasing the amount of suppression gradually as the frequency band becomes lower. If kL and m are too large, the low-frequency components of the song are also suppressed, so a kL corresponding to about 100 Hz to 200 Hz is appropriate, for example.
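One possible shape for ge(k) is sketched below. The exact curve of FIG. 7 is not reproduced in this text, so the linear-in-decibel ramp over the m bands above kL is an assumption, as are the function names and the sample values of kL, m, and Gemin.

```python
def low_band_gain_db(k, k_low, m, ge_min_db):
    """Assumed Ge(k) in dB: ge_min_db up to k_low, ramping to 0 dB over m bands."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    if k < k_low:
        return ge_min_db
    if k >= k_low + m:
        return 0.0
    # k_low <= k < k_low + m: suppression grows as the band gets lower
    return ge_min_db * (k_low + m - k) / m


def db_to_linear(level_db):
    """Convert a level in dB to a linear amplitude coefficient."""
    return 10.0 ** (level_db / 20.0)


g = [1.0] * 8
new_g = [
    gk * db_to_linear(low_band_gain_db(k, k_low=4, m=2, ge_min_db=-20.0))
    for k, gk in enumerate(g)
]
```

Bands 0 to 4 receive the full 20 dB of suppression, band 5 receives half of it in decibels, and bands 6 and above are left unchanged.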
[0031]
Conversely, to suppress only the song localized at the center and not the bass drum or bass localized there, the suppression of centrally localized sound source bands by g(k) may simply be omitted in the low frequency bands (for example, the 0 to 200 Hz band).
Another method for suppressing a bass drum or the like localized at the center is described next. The sound of a bass drum rises faster than the voice. Therefore, the rise time of the sound is observed in the main frequency band of the bass drum, and g(k) is multiplied by an attenuation coefficient gak(k) that depends on how short the rise time is, to obtain a new g(k). In this way, only the fast-rising bass drum can be suppressed.
[0032]
One such method is as follows.
Let A(k, T) = (|fL(k, T)| + |fR(k, T)|) / 2 be the average magnitude of the left and right channels at time T in a given frequency band k, and let r be its ratio to A(k, T−1), the value one unit time earlier. The unit time is usually one frame, about several tens of milliseconds.
r = A(k, T) / A(k, T−1) (when A(k, T) > A(k, T−1))
r = 1.0 (when A(k, T) ≤ A(k, T−1)) (7)
A larger r means a sharper rise in frequency band k. The attenuation coefficient for the sharpness of the rise, gak(k, T), is therefore given by a function whose output becomes smaller as r becomes larger.
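The rise ratio of equation (7) can be computed as in the following sketch. The function names are hypothetical, and the small floor `eps` that guards against division by zero when the previous frame is silent is an assumption not found in the text.

```python
def band_level(fl, fr):
    """A(k, T): average magnitude of the left and right band components."""
    return (abs(fl) + abs(fr)) / 2.0


def rise_ratio(a_now, a_prev, eps=1e-12):
    """Equation (7): ratio to the previous frame when rising, otherwise 1.0."""
    if a_now > a_prev:
        return a_now / max(a_prev, eps)  # eps is an assumed safeguard
    return 1.0


a_prev = band_level(1.0 + 0.0j, 2.0 + 0.0j)  # (1 + 2) / 2 = 1.5
a_now = band_level(3.0 + 4.0j, 1.0 + 0.0j)   # (5 + 1) / 2 = 3.0
r_rising = rise_ratio(a_now, a_prev)
r_falling = rise_ratio(a_prev, a_now)
```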
For example,
[0033]
[Expression 2] (equation (8); the expression is not reproduced in this text)
rt is a negative real number that sets the rate of attenuation with respect to the value of r.
FIG. 8 shows an example for rt = −3, with r on the horizontal axis and 20 log10(gak(k, T)) on the vertical axis. Gakmin = −50 is the lower limit of the attenuation amount. If gak(k, T) of equation (8) is applied as it is at every time step, the temporal change of the spectrum becomes discontinuous. It is therefore preferable to use gak′(k, T), obtained by the smoothing of equation (9):
gak′(k, T) = gak′(k, T−1) + δ(gak(k, T) − gak′(k, T−1)) (9)
Here, δ is a smoothing coefficient, a real number greater than 0 and not greater than 1. Different values of δ may be used depending on the magnitude relationship between gak′(k, T−1) and gak(k, T).
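The smoothing of equation (9) is a first-order recursion and can be sketched as below. The function name `smooth_gak`, the parameter names `delta_down` and `delta_up`, and the choice of which δ applies in which direction are assumptions; the text only states that different values may be used depending on the magnitude relationship.

```python
def smooth_gak(prev_smoothed, gak_now, delta_down, delta_up=None):
    """Equation (9) for one band and one time step."""
    if delta_up is None:
        delta_up = delta_down
    # Assumed rule: delta_down when the coefficient falls, delta_up otherwise.
    delta = delta_down if gak_now < prev_smoothed else delta_up
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must satisfy 0 < delta <= 1")
    return prev_smoothed + delta * (gak_now - prev_smoothed)


smoothed = []
state = 1.0  # assumed initial value: no attenuation
for gak in [1.0, 0.0, 0.0, 1.0]:
    state = smooth_gak(state, gak, delta_down=0.5)
    smoothed.append(state)
```

With δ = 0.5 the abrupt drop from 1.0 to 0.0 is spread over several frames, and δ = 1 would reduce the recursion to using gak(k, T) directly.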
[0034]
Through the above processing, it is possible to emphasize or suppress the sound source signal localized at the center of the stereo sound signal.
The stereo sound processing apparatus of the present invention can be constituted by a computer having a CPU, a memory, and the like, a user terminal used by a user who is an access subject, and a recording medium. The recording medium is a computer-readable recording medium such as a CD-ROM, a magnetic disk device, or a semiconductor memory. The stereo sound processing program recorded on it is read by the computer, controls the operation of the computer, and causes the computer to execute the stereo sound processing method, namely: processing to divide each of the left and right channels into a plurality of frequency bands; processing to calculate the similarity between the channels for each frequency band; processing to calculate, from the similarity, an attenuation coefficient for suppressing or enhancing the sound source signal localized near the center; processing to multiply each frequency band signal by the attenuation coefficient; and processing to resynthesize the frequency band signals for each channel after multiplication by the attenuation coefficient. The stereo sound processing program may also be transmitted via a communication line.
(How to Use)
Next, a method of using the present invention will be described.
[0035]
FIG. 9 shows a first usage method of the present invention.
It is assumed that the music compact disc 301 is recorded for stereo reproduction and contains a main sound source signal localized at the center. In the personal computer 302, the signal from the music compact disc 301 is subjected to the processing of the present invention for emphasizing the sound source signal localized at the center, shown in FIG. 1 or FIG. 4, and to auditory correction processing such as frequency characteristic adjustment. After its gain is adjusted by the amplifier 303, the signal is listened to through the stereo headphones 304 or the like. This can be used, for example, when a listener wants the accompaniment to be quieter than the song.
[0036]
FIG. 10 shows an example in which the present invention is used for music on demand in the second usage method of the present invention.
Many music sources are stored in the host computer 305 connected to the network 306. From the personal computer 302 connected to the network 306, the user specifies to the host computer 305 the name of the music source to be heard and the signal processing method. Specifying the signal processing method means, for example, specifying how much the accompaniment, other than the centrally localized song, is to be reduced by the present invention, or how the frequency characteristics are to be adjusted to the user's preference. The host computer 305 retrieves the music source according to the specification and transmits the music signal to the user's personal computer 302 via the network 306, either after performing the signal processing as instructed or while performing it. The user adjusts the gain of the music signal output from the personal computer 302 with the amplifier 303 and listens to it through the stereo headphones 304 or the like.
[0037]
The same can be done with a portable music player that has a built-in function for connecting wirelessly to a network, instead of a personal computer.
Furthermore, if processing for suppressing the song localized at the center is performed in the above usage method, it can be used, for example, as a simple portable karaoke system.
FIG. 11 is a diagram showing a third usage method of the present invention.
The aim is to emphasize the voice of a speaker talking in front of the user. For example, the sound signals collected by the unidirectional microphones 201 and 202 placed on the left and right of the head of a hearing-impaired person are processed by the processing unit of the present invention, built into a small enclosure carried by that person, so that noise other than the voice of the speaker in front is suppressed. Hearing aid processing such as sound quality and gain adjustment, incorporated in the same enclosure, is then applied, and the result is output to the left and right earphones 203 and 204, so that the user hears the emphasized voice of the speaker in front with the noise suppressed.
[0038]
[Effects of the invention]
As described above, according to the present invention, the sound source signal localized at the center of a stereo sound signal can be emphasized or suppressed by a desired amount, using only sound source localization information and without degrading the localization of the original signal. The following effects can therefore be expected.
(1) When a hearing-impaired person listens to a commercially available music source, the volume balance between the song, which is the main sound source signal localized at the center, and the accompaniment can be adjusted so that the music is easier to hear and can be enjoyed more.
(2) In a noisy environment, it is possible to emphasize only the voice of the target speaker in front, and it can be expected to realize comfortable communication.
(3) A karaoke sound source or the like can be created by suppressing a song or the like localized at the center.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a diagram showing the relationship between ai(k) and gi(k) when emphasizing a centrally localized sound source.
FIG. 3 is a diagram showing the relationship between ap(k) and gp(k) when emphasizing a centrally localized sound source.
FIG. 4 is a diagram showing the relationship between ai(k) and gi(k) when suppressing a centrally localized sound source.
FIG. 5 is a diagram showing the relationship between ap(k) and gp(k) when suppressing a centrally localized sound source.
FIG. 6 is a block diagram showing a second embodiment of the present invention.
FIG. 7 is a diagram showing the relationship between k and ge (k).
FIG. 8 is a diagram showing the relationship between r and gak (k, T).
FIG. 9 is a diagram showing a first usage method of the present invention.
FIG. 10 is a diagram showing a second usage method of the present invention.
FIG. 11 is a diagram showing a third usage method of the present invention.
[Explanation of symbols]
102 Stereo signal input unit
103 Left channel frequency band division unit
104 Similarity calculation unit
105 Attenuation coefficient calculation unit
106 Left channel sound source signal synthesis unit
107 Stereo signal output unit
108 Stereo loudspeaker
109 Stereo headphones
110 Right channel frequency band division unit
111 Right channel sound source signal synthesis unit
112 Sound source signal synthesis unit
113 Monaural signal output unit
114 Loudspeaker
115 Monaural earphone
116 Multiplier
117 Adder
201 Left channel unidirectional microphone
202 Right channel unidirectional microphone
203 Left channel earphone
204 Right channel earphone
205 Processing unit of the present invention and hearing aid
301 Music compact disc
302 Personal computer
303 Amplifier
304 Stereo headphones
305 Host computer
306 Network

Claims

In a stereo sound signal processing method for suppressing or enhancing a sound source signal localized near the center from a two-channel sound signal processed for stereo sound collection or stereo reproduction,
A stereo sound signal processing method characterized in that the stereo signal is divided into a plurality of frequency bands for each channel, the similarity between the channels is calculated for each frequency band, an attenuation coefficient for suppressing or enhancing the sound source signal localized near the center is calculated from the similarity, each frequency band signal is multiplied by the attenuation coefficient, and the frequency band signals multiplied by the attenuation coefficient are resynthesized for each channel and output.

The stereo sound signal processing method according to claim 1,
A stereo sound signal processing method characterized in that, for each frequency band, two similarities are obtained from the ratio of the magnitudes of the channel signals and from their phase difference, and two attenuation coefficients for suppressing or enhancing the sound source signal localized near the center are obtained from the two similarities.

The stereo sound signal processing method according to claim 2,
A stereo sound signal processing method characterized in that, of the two attenuation coefficients obtained, each frequency band signal is multiplied by the smaller coefficient when emphasizing the sound source signal localized near the center and by the larger coefficient when suppressing it.

The stereo sound signal processing method according to any one of claims 1 to 3,
A stereo sound signal processing method characterized in that the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals is calculated, the attenuation coefficient required for suppression is calculated from that ratio and a desired constant ratio, and the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals is thereby kept at the desired constant value.

The stereo sound signal processing method according to any one of claims 1 to 4,
A stereo sound signal processing method characterized in that the rise time of the sound is observed for each frequency band, the attenuation coefficient based on the similarity is multiplied by an attenuation coefficient corresponding to the speed of the rise to obtain a new attenuation coefficient, and a sound source signal that is determined to rise faster than the voice is thereby further suppressed.

In a stereo sound signal processing apparatus for suppressing or enhancing a sound source signal localized near the center from a two-channel sound signal processed for stereo sound collection or stereo reproduction,
Frequency band dividing means for dividing the stereo signal into a plurality of frequency bands for each channel;
Similarity calculation means for calculating the similarity between channels for each frequency band;
An attenuation coefficient calculating means for calculating an attenuation coefficient for suppressing or enhancing the sound source signal localized near the center from the similarity;
Multiplication means for multiplying each frequency band signal by the attenuation coefficient;
A stereo sound signal processing apparatus comprising sound source signal synthesis / output means for re-synthesizing and outputting each frequency band signal for each channel after multiplication by an attenuation coefficient.

The stereo sound signal processing device according to claim 6,
The similarity calculation means obtains two similarities based on the ratio of magnitude and phase difference between channel signals for each frequency band,
A stereophonic signal processing apparatus characterized in that the attenuation coefficient calculating means obtains two attenuation coefficients for suppressing or enhancing a sound source signal localized near the center from the two similarities obtained by the similarity calculating means.

The stereo sound signal processing device according to claim 7,
A stereo sound signal processing apparatus characterized in that the multiplication means multiplies each frequency band signal by the smaller of the two obtained attenuation coefficients when emphasizing the sound source signal localized near the center, and by the larger when suppressing it.

The stereo sound signal processing device according to any one of claims 6 to 8,
A stereo sound signal processing apparatus characterized in that the attenuation coefficient calculating means calculates the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals, calculates the attenuation coefficient required for suppression from that ratio and a desired constant ratio, and thereby keeps the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals at the desired constant value.

The stereo sound signal processing device according to any one of claims 6 to 9,
A stereo sound signal processing apparatus characterized in that the attenuation coefficient calculating means observes the rise time of the sound for each frequency band, multiplies the attenuation coefficient based on the similarity by an attenuation coefficient corresponding to the speed of the rise to obtain a new attenuation coefficient, and thereby further suppresses a sound source signal that is localized together with the voice and is determined to rise faster than the voice.

In a computer-readable recording medium recorded with a program for executing a stereo sound signal processing method for suppressing or enhancing a sound source signal localized near the center from a two-channel sound signal processed for stereo sound collection or stereo reproduction,
Processing to divide the stereo signal into multiple frequency bands for each channel;
Processing to calculate the similarity between channels for each frequency band;
A process of calculating an attenuation coefficient for suppressing or enhancing the sound source signal localized near the center from the similarity,
A process of multiplying each frequency band signal by the attenuation coefficient;
Processing to re-synthesize and output each frequency band signal for each channel after multiplying by the attenuation coefficient;
A computer-readable recording medium having recorded thereon a program for executing a stereo sound signal processing method.

A computer-readable recording medium recording a program for executing the stereo sound signal processing method according to claim 11.
wherein the process of calculating the similarity includes a process of obtaining two similarities for each frequency band, from the ratio of the magnitudes of the channel signals and from their phase difference, and
the process of calculating the attenuation coefficient includes a process of obtaining, from the similarities obtained for each frequency band from the ratio of the magnitudes of the channel signals and from their phase difference, two attenuation coefficients for suppressing or enhancing the sound source signal localized near the center.

In the computer-readable recording medium which recorded the program which performs the stereo sound signal processing method of Claim 12,
A computer-readable recording medium recording a program for executing a stereo sound signal processing method, characterized in that the multiplying process includes a process of multiplying each frequency band signal by the smaller of the two obtained attenuation coefficients when emphasizing the sound source signal localized near the center, and by the larger when suppressing it.

A computer-readable recording medium recording a program for executing the stereo sound signal processing method according to any one of claims 11 to 13,
A computer-readable recording medium recording a program for executing a stereo sound signal processing method, characterized in that the process of calculating the attenuation coefficient includes a process of calculating the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals, calculating the attenuation coefficient required for suppression from that ratio and a desired constant ratio, and thereby keeping the ratio between the average power of the sound source signal having only a slight level difference and time difference between the channels and the power of the other sound source signals at the desired constant value.

A computer-readable recording medium recording a program for executing the stereo sound signal processing method according to any one of claims 11 to 14.
A computer-readable recording medium recording a program for executing a stereo sound signal processing method, characterized in that the process of calculating the attenuation coefficient includes a process of observing the rise time of the sound for each frequency band, multiplying the attenuation coefficient based on the similarity by an attenuation coefficient corresponding to the speed of the rise to obtain a new attenuation coefficient, and thereby further suppressing a sound source signal that is localized at the center together with the voice and is determined to rise faster than the voice.