JP2007047427A - Sound processor

Info

Publication number: JP2007047427A
Application number: JP2005231488A
Authority: JP
Inventors: Masato Togami (戸上真人)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-08-10
Filing date: 2005-08-10
Publication date: 2007-02-22

Abstract

PROBLEM TO BE SOLVED: To provide an interfering sound suppression filter that can be adapted even during sections containing the target sound and that introduces little distortion in the output sound.

SOLUTION: A sound processor has an interfering sound generation unit that decides, based on the sparseness of sound, whether each band component of an input signal is an interfering sound or a target sound, extracts only the bands decided to be interfering sound, and generates an interfering sound signal. The sound processor further has an adaptive processing unit that updates a spatial correlation inverse matrix using the generated interfering sound signal.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to sound source separation techniques that restore only a target sound from a signal in which speech, music, and various noises are mixed, observed for example with a plurality of microphone elements.

The minimum variance beamformer method is a known sound source separation technique that uses a plurality of microphones, suppresses an interfering sound by forming a null (blind spot) in the direction of the interfering sound, and emphasizes and extracts only the target sound by forming a beam in the direction of the target sound (see, for example, Non-Patent Document 1). In the minimum variance beamformer method, an inter-microphone correlation matrix, referred to as the spatial correlation inverse matrix, is updated from the input signal. The spatial correlation inverse matrix is then used to generate a linear filter that forms a null in the interfering sound direction and a beam in the target sound direction.

If the spatial correlation inverse matrix is updated in a section where the target sound is present, the minimum variance beamformer erroneously suppresses the target sound along with the interfering sound. The update therefore had to be performed only in sections where the target sound is absent, which required extracting such sections in advance by some method. However, extracting sections without the target sound in advance is difficult. In addition, if the spatial correlation inverse matrix is updated only in sections without the target sound, an interfering sound that arises suddenly while the target sound is present cannot be suppressed.

There is also an interfering sound suppression technique that does not use an adaptive filter. Instead, it has a sound source signal determination step that determines, based on differences in inter-channel parameter values for each band, which sound source each band-divided output channel signal of that band came from, and it suppresses detected signals from sound sources that are not sounding (for example, Patent Document 1).

This technique relies on a property of speech called sparseness. Sparseness means that the frequency components present in speech within a short time of about 40 ms are limited, so the probability that different sound sources hold the same frequency component within that short time is low. Sound source separation therefore becomes possible by assigning each frequency component, for each time and frequency, to exactly one sound source. However, when sound sources do hold the same frequency component within a short time, the interfering sound suppression performance deteriorates and the speech is easily distorted.

Patent Document 1: Japanese Patent Laid-Open No. 10-313497

Non-Patent Document 1: Toshiro Oga, Yoshio Yamazaki, Yutaka Kaneda (大賀寿郎, 山崎芳男, 金田豊), "Acoustic Systems and Digital Processing," IEICE, 1995.
Non-Patent Document 2: Masato Togami, Akio Amano (戸上真人, 天野明雄), "Sound source separation method using sound source localization based on sound source duplication determination," Proc. of the Acoustical Society of Japan, Vol. 1, pp. 441-442, March 2005.

With the minimum variance beamformer method, the spatial correlation inverse matrix cannot be updated in sections where the target sound is present, so an interfering sound that arises suddenly in such a section cannot be suppressed.
Furthermore, the technique described in Patent Document 1 has the problem that, when sound sources hold the same frequency component within a short time, the interfering sound suppression performance deteriorates and the speech is easily distorted.

A representative invention disclosed in the present application is as follows.
A speech processing apparatus has an interfering sound generation unit that determines, for each band component of an input signal, whether the component is an interfering sound or a target sound, extracts the bands determined to be interfering sound, and generates an interfering sound signal. The apparatus further has an adaptive processing unit that updates a spatial correlation inverse matrix using the generated interfering sound signal. It performs sound source separation by applying an interfering sound suppression filter, generated using that spatial correlation inverse matrix, to the microphone array output signal.

According to the configuration of the present invention, even in a section where the target sound is present, the spatial correlation inverse matrix can be updated using only the bands determined to be interfering sound on the basis of the sparseness of speech, so sections containing the target sound no longer need to be extracted in advance. In addition, because the output signal is generated with an adaptive filter, distortion of the speech can be prevented.

Embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a basic configuration diagram of the speech processing apparatus of the present invention. The microphone array 1 acquires sound signals of a plurality of channels. The acquired multi-channel sound signals are sent to the memory 2 and subjected to various kinds of signal processing by the CPU 3. The CPU 3 retrieves items such as a model of sound transmission from the target sound direction from the storage medium 4 as needed and uses them.

FIG. 2 is a block diagram of the embodiment of the present invention. The microphone array 1 acquires sound signals of a plurality of channels. The multi-channel sound signals acquired by the microphone array 1 are converted into digital data by the A/D conversion unit 5 and loaded into the memory 2. The sound signal converted into digital data is denoted X(t). The sound signal is sent to the band division unit 6, where it undergoes a short-time Fourier transform and is divided into frequency bands. The band-divided signal is denoted X_t(f) and is sent to the label generation unit 7.

The label generation unit 7 first determines the sound source direction j_t(f) for each band [equation not reproduced in the source text]. Here, Λ is the range over which the direction is estimated, and a_j(f) is a vector called a position vector, a sound transfer characteristic determined by the arrangement of the microphone array (see, for example, Non-Patent Document 1). If the sound source direction determined for a band lies within a predetermined target sound direction range, that band is judged to be a target sound component. Otherwise, the band is judged to be an interfering sound component.

When a speech signal is viewed over a short time of about 40 ms, the probability that a plurality of sound sources have the same frequency component within that time is low. This property is called sparseness. The frequency assignment performed by the label generation unit 7 therefore makes it possible to separate, with good accuracy, signals in which the interfering sound is dominant from signals in which the target sound is dominant.

As described in the background, interfering sound suppression techniques that assign frequency components on the basis of the sparseness of speech have the problem that the speech is distorted. Although the probability is low, a plurality of sound sources can have the same frequency component. In that case the frequency component is assigned to only one of the interfering sound and the target sound. If it is not assigned to the target sound, that frequency component of the target sound is lost, which causes distortion.
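The labelling step can be sketched as follows. This is a minimal illustration, not the patent's own formula: it assumes that the direction for each band is the candidate in Λ whose position vector a_j(f) best matches the observed band signal, scored as |a_j(f)^H X_t(f)|^2. The far-field linear-array model, the speed of sound, and all function and parameter names are assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(freq_hz, angle_rad, mic_positions_m):
    """Far-field position vector a_j(f) for a linear array (assumed geometry)."""
    delays = mic_positions_m * np.cos(angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def label_bands(X, freqs_hz, angles_rad, mic_positions_m, target_range_rad):
    """Estimate a direction per band and label each band.

    X: complex array (n_bands, n_mics), one STFT frame X_t(f).
    angles_rad: NumPy array of candidate directions (the range Lambda).
    Returns (direction_idx, is_interference).
    """
    n_bands = X.shape[0]
    direction_idx = np.zeros(n_bands, dtype=int)
    for b in range(n_bands):
        # Assumed criterion: pick the direction maximising |a_j(f)^H X_t(f)|^2.
        scores = [
            abs(np.vdot(steering_vector(freqs_hz[b], th, mic_positions_m), X[b])) ** 2
            for th in angles_rad
        ]
        direction_idx[b] = int(np.argmax(scores))
    estimated = angles_rad[direction_idx]
    lo, hi = target_range_rad
    # Bands whose direction falls outside the target range are interference.
    is_interference = (estimated < lo) | (estimated > hi)
    return direction_idx, is_interference
```

With a target range of 80 to 100 degrees, a band whose energy arrives from 30 degrees would be labelled as interference, while a band arriving from 90 degrees would be labelled as target.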

In the present application, the result of the frequency assignment is used only by the adaptive processing unit and the target sound direction estimation unit. The output itself is generated with a linear filter that passes all signals from the target sound direction. Therefore, even when sparseness does not hold, the only effect is a slight degradation of the null-forming performance of the adapted filter, and distortion of the speech can be avoided.

The interfering sound generation unit 8 outputs an interfering sound using the band components that the label generation unit 7 judged to be interfering sound components. The interfering sound output by the interfering sound generation unit 8 is given by an expression [equation not reproduced in the source text].

The adaptive processing unit 9 updates the spatial correlation inverse matrix using the interfering sound output by the interfering sound generation unit 8 [update equation not reproduced in the source text]. Here, β is called the forgetting factor and is set in the range from 0 to 1. The smaller β is, the larger the weight given to the current interfering sound when the spatial correlation inverse matrix is updated. The spatial correlation inverse matrix may also be calculated as the inverse of a matrix given by a further expression [equation not reproduced in the source text].
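The update equations are not available in this text, so the following is only a sketch under stated assumptions. It assumes the interfering sound signal keeps X_t(f) in bands labelled as interference and is zero elsewhere, and that the correlation matrix follows the common exponentially weighted form R_t(f) = β R_{t-1}(f) + (1 - β) N_t(f) N_t(f)^H, which agrees with the statement that a smaller β gives the current interfering sound more weight. The function names and the diagonal loading term are additions for the example.

```python
import numpy as np


def interference_frame(X, is_interference):
    """Keep bands labelled as interference and zero the rest.

    X: complex array (n_bands, n_mics); is_interference: bool array (n_bands,).
    """
    return np.where(is_interference[:, None], X, 0.0)


def update_correlation(R, N, beta):
    """Exponentially weighted update of one (n_mics, n_mics) matrix per band.

    Assumed form: R_t(f) = beta * R_{t-1}(f) + (1 - beta) * N_t(f) N_t(f)^H.
    """
    outer = N[:, :, None] * np.conj(N[:, None, :])
    return beta * R + (1.0 - beta) * outer


def inverse_correlation(R, loading=1e-6):
    """Invert each band's matrix.

    The diagonal loading is an added assumption that keeps the matrix
    invertible when few interference frames have been observed.
    """
    n_mics = R.shape[-1]
    return np.linalg.inv(R + loading * np.eye(n_mics))
```

In bands labelled as target, the assumed update only rescales the previous matrix, so the stored interference directions are retained while the target sound is active.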

Because the spatial correlation inverse matrix is updated using the signal output as the interfering sound, it is constructed so as to contain a large amount of information about the interfering sound directions.
The target sound direction estimation unit 10 uses the per-band sound source direction information calculated by the label generation unit 7 to estimate the sound source direction within the target sound direction range, based on the modified delay-and-sum array method (for example, Non-Patent Document 2), and takes the estimated sound source direction as the target sound direction.
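The modified delay-and-sum array method of Non-Patent Document 2 is not detailed in this text. As a stand-in only, the sketch below picks the target sound direction by a simple vote over the per-band direction estimates that fall inside the target range. The voting rule and all names are assumptions for illustration and should not be read as the cited method.

```python
import numpy as np


def estimate_target_direction(direction_idx, is_interference, n_directions):
    """Return the most frequent direction index among target-labelled bands.

    direction_idx: int array (n_bands,), per-band direction indices.
    is_interference: bool array (n_bands,), labels from the labelling step.
    Returns None when no band falls inside the target range.
    """
    target_idx = direction_idx[~is_interference]
    if target_idx.size == 0:
        return None
    # Histogram of direction indices restricted to target-labelled bands.
    votes = np.bincount(target_idx, minlength=n_directions)
    return int(np.argmax(votes))
```

Restricting the vote to target-labelled bands keeps the estimate inside the predetermined target range, as the description requires.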

In the present application, a sound source outside the target sound direction range that the label generation unit 7 uses to identify target sound components is judged to be an interfering sound, and a null is formed to suppress it. Conversely, no null is formed for a sound source inside the target sound direction range, so such a source can be extracted without being suppressed. If the target sound direction were limited to a single direction, even a slight deviation between the actual and the configured target sound direction could cause the target sound component to be suppressed. Giving the target sound direction a width has the effect that the target sound component is not suppressed even when the actual target sound direction deviates from the configured one.

The filter generation unit 11 uses the spatial correlation inverse matrix updated by the adaptive processing unit 9 and the target sound direction estimated by the target sound direction estimation unit 10 to generate the interfering sound suppression filter [equation not reproduced in the source text]. Here, a_sub(f) is the transfer characteristic of the target sound direction. The generated filter is a linear filter that extracts only the target sound by steering a beam toward the target sound direction and steering nulls toward the interfering sound directions computed from the spatial correlation inverse matrix. This linear filter is applied to the input signal itself.
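The filter expression is missing from this text. A minimal sketch, assuming the filter takes the textbook minimum variance (MVDR) form w(f) = R^{-1}(f) a_sub(f) / (a_sub(f)^H R^{-1}(f) a_sub(f)), is shown below. Whether the patent uses exactly this normalization is an assumption, and the function name is illustrative.

```python
import numpy as np


def mvdr_filter(R_inv, a_target):
    """Minimum variance filter per band (assumed textbook form).

    R_inv: complex array (n_bands, n_mics, n_mics), inverse correlation matrices.
    a_target: complex array (n_bands, n_mics), target transfer characteristic.
    Returns w with shape (n_bands, n_mics).
    """
    # Numerator R^{-1}(f) a(f) for every band.
    num = np.einsum("bij,bj->bi", R_inv, a_target)
    # Denominator a(f)^H R^{-1}(f) a(f), one scalar per band.
    den = np.einsum("bi,bi->b", np.conj(a_target), num)
    return num / den[:, None]
```

With this normalization the response toward the target direction, w(f)^H a_sub(f), equals 1, while directions that dominate the interference-only correlation matrix are attenuated.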

The adaptation in the conventional minimum variance beamformer method tends to suppress the loudest sound sources. As a result, when the input signal contains frequency components in which the target sound is dominant and the direction of the target sound deviates from the assumed direction, the filter adapts so as to suppress the target sound.

In the present application, even when the input signal contains the target sound, the label generation unit 7 and the interfering sound generation unit 8 extract from the input signal a signal in which the interfering sound is dominant (one that contains almost no target sound). Because the spatial correlation matrix is adapted with this interference-dominant signal, a filter can be generated that suppresses only the interfering sound without suppressing the target sound direction.

The filtering unit 12 applies the linear filter output by the filter generation unit 11 to the band-divided signal obtained by the band division unit 6, and generates and outputs the filtered signal [equation not reproduced in the source text]. The waveform generation unit 13 applies an inverse short-time Fourier transform to the filtered signal output by the filtering unit 12, and generates and outputs a time-domain signal.
Although the above embodiment has been described as an apparatus configuration, the present application may also be carried out as a program that is read into and executed by a computer.
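The filtering and waveform generation steps can be sketched in a few lines. The sketch assumes the filtered output is y_t(f) = w(f)^H X_t(f), which is the usual way a linear array filter is applied, and it uses a plain inverse real FFT with overlap-add for the inverse short-time Fourier transform. Window compensation is left out, and the function names are illustrative.

```python
import numpy as np


def apply_filter(w, X):
    """Assumed form y_t(f) = w(f)^H X_t(f) for every band.

    w, X: complex arrays (n_bands, n_mics). Returns (n_bands,).
    """
    return np.einsum("bi,bi->b", np.conj(w), X)


def overlap_add(Y, hop):
    """Inverse real FFT of each frame followed by overlap-add.

    Y: complex array (n_frames, n_bins) of filtered single-channel spectra.
    hop: frame shift in samples. No synthesis window is applied here.
    """
    frames = np.fft.irfft(Y, axis=1)
    frame_len = frames.shape[1]
    out = np.zeros(hop * (Y.shape[0] - 1) + frame_len)
    for t, frame in enumerate(frames):
        out[t * hop : t * hop + frame_len] += frame
    return out
```

Because the filter is applied to the full input signal rather than to the masked bands, target components that were mislabelled still pass through the beam, which is the property the description relies on to avoid distortion.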

FIG. 1 is a basic configuration diagram of the speech processing apparatus of the present invention.
FIG. 2 is a block diagram of the embodiment of the present invention.

Explanation of symbols

1 ... microphone array, 2 ... memory, 3 ... CPU, 4 ... storage medium, 5 ... A/D conversion unit, 6 ... band division unit, 7 ... label generation unit, 8 ... interfering sound generation unit, 9 ... adaptive processing unit, 10 ... target sound direction estimation unit, 11 ... filter generation unit, 12 ... filtering unit, 13 ... waveform generation unit.

Claims

1. A speech processing apparatus comprising:
a microphone array having at least two microphone elements;
a band division unit that outputs, for each channel, a band-divided signal obtained by dividing the signal output from the microphone array into a plurality of frequency bands;
a label generation unit that, for each band-divided signal output by the band division unit, estimates a sound source direction and outputs, based on the sound source direction, a label indicating whether the band-divided signal is an interfering sound or a target sound;
an interfering sound generation unit that outputs a band-divided interfering sound signal based on the label output by the label generation unit;
an adaptive processing unit that calculates a spatial correlation inverse matrix from the band-divided interfering sound signal output by the interfering sound generation unit;
a filter generation unit that generates an interfering sound suppression filter using the calculated spatial correlation inverse matrix; and
a filtering unit that performs sound source separation by applying the filter to the signal output from the microphone array.

2. The speech processing apparatus according to claim 1, further comprising a target sound direction estimation unit that estimates the sound source direction using the sound source direction estimation information calculated by the label generation unit,
wherein the filter generation unit also uses information on the estimated sound source direction.