JP2009094707A

JP2009094707A - Sound signal processor and sound signal processing method

Info

Publication number: JP2009094707A
Application number: JP2007262232A
Authority: JP
Inventors: Takayoshi Kawaguchi; 貴義川口; Yohei Sakuraba; 洋平櫻庭
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-10-05
Filing date: 2007-10-05
Publication date: 2009-04-30

Abstract

PROBLEM TO BE SOLVED: To automatically and appropriately set the inter-channel volume balance of self-amplified sound in an audio signal processing apparatus that forms a stereo-capable loudspeaker communication system having a self-amplified sound output function.
SOLUTION: While the voice of an own-side talker is being input to one microphone, the ratio of the distances from that microphone to the two speakers is obtained as the ratio of the output signal levels of the two adaptive filters that perform echo cancellation on the audio signal picked up by that microphone. Based on the obtained level ratio, the level balance between the L-channel and R-channel audio signals, which carry the own-side talker's voice for output to the L and R channels, is determined.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an audio signal processing apparatus having the audio signal processing function known as echo cancellation, and to a method therefor.

Sound systems that allow talkers at mutually distant places to hold calls and conversations, such as hands-free telephone calls and the audio transmission and reception systems of audio conference and video conference systems, are referred to as loudspeaker communication systems; they are already in practical use and widespread.
In such a loudspeaker communication system, for example, communication terminal devices that can communicate with one another according to a predetermined communication scheme are placed at a plurality of different locations. Sound picked up by a microphone at one communication terminal device is transmitted to the other communication terminal device, which receives it and emits it as sound from a speaker. This enables conversation between talkers at remote locations.

In a loudspeaker communication system, however, the sound from the other communication terminal device that is emitted from the speaker at one communication terminal device is picked up again by the microphone at that device and emitted as sound from the speaker of the other communication terminal device. This operation then repeats in a loop. As a result, a phenomenon called echo occurs: in addition to the other party's voice, a talker hears his or her own voice mixed in from the speaker like a reverberating echo. If the echo becomes loud, the loop repeats without end and the phenomenon called howling occurs. Thus, because the sound picked up by the microphones circulates through the communication between the terminal devices, a loudspeaker communication system suffers from problems such as degraded call quality due to echo and howling, and reduced usability.
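The loop described above can be pictured with a deliberately simplified model that is not part of the original disclosure. It assumes that each round trip through both terminals multiplies the echo amplitude by a single scalar, here called `loop_gain` (a hypothetical parameter). Under that assumption the echo decays when the gain is below 1 and never dies out when it is 1 or more, which corresponds to howling.

```python
def echo_amplitudes(loop_gain, round_trips):
    """Amplitude of the circulating echo after each round trip.

    loop_gain is a hypothetical scalar standing in for the combined
    speaker-to-microphone coupling and amplification of both terminals.
    """
    amplitude = 1.0  # amplitude of the original utterance
    history = []
    for _ in range(round_trips):
        amplitude *= loop_gain
        history.append(amplitude)
    return history


def will_howl(loop_gain):
    """In this simplified model the echo never decays when the gain is >= 1."""
    return loop_gain >= 1.0
```

A real system has frequency-dependent gain and delay, so this scalar model only illustrates why a louder echo leads from echo to howling.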

It is therefore known to provide a loudspeaker communication system with an audio signal processing system for echo cancellation.
Signal processing that uses an adaptive filter system is known for such echo cancellation.
The adaptive filter system obtains the impulse response of the transmission path (echo path) between the speaker and the microphone, takes the sound to be emitted from the speaker as its input signal, and convolves this input signal with the impulse response to generate a pseudo echo signal component as its output. This pseudo echo component is then subtracted from the audio signal that is picked up by the microphone and is to be transmitted to the communication terminal device of the other party. Once the operation of the adaptive filter system has converged, audio from which the echo has been cancelled is transmitted to the other party's communication terminal device, and consequently the echo is no longer audible in the sound emitted from the speaker.
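As a minimal sketch of the adaptive filter principle described above, the following uses the normalized LMS (NLMS) update. The text does not name a particular adaptation algorithm, so NLMS, the tap count, the step size `mu`, and the synthetic echo path in `simulate_echo` are all assumptions made for illustration.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract a pseudo echo from the microphone signal.

    reference: samples sent to the speaker (the filter input)
    mic:       samples picked up by the microphone (the desired signal)
    Returns the residual signal and the final filter weights, which
    estimate the echo-path impulse response.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        # Convolve the reference with the current impulse-response estimate.
        pseudo_echo = sum(w * h for w, h in zip(weights, history))
        e = d - pseudo_echo
        power = sum(h * h for h in history) + eps
        # NLMS update (an assumed algorithm, not stated in the source).
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


def simulate_echo(n=4000, seed=0):
    """Hypothetical test data: white noise through a short echo path."""
    rng = random.Random(seed)
    echo_path = [0.6, 0.3, -0.2, 0.1]
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = [
        sum(h * reference[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    return reference, mic, echo_path
```

Running `nlms_echo_cancel(*simulate_echo()[:2])` on this noise-free data drives the residual toward zero, which corresponds to the converged state in which the echo is cancelled.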

Some loudspeaker communication systems also support stereo (two-channel) voice communication using L (left) and R (right) channels, and echo canceller configurations for such stereo-capable systems are also known.
In a stereo-capable loudspeaker communication system there are four echo paths: from the L-channel speaker to the L-channel microphone, from the L-channel speaker to the R-channel microphone, from the R-channel speaker to the L-channel microphone, and from the R-channel speaker to the R-channel microphone. Accordingly, the echo canceller is provided with four adaptive filter systems, each of which performs echo cancellation for one of the four echo paths.
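The four echo paths can be sketched as a 2x2 set of impulse responses, one per speaker and microphone pair. The function names and the dictionary layout below are hypothetical, chosen only to show that the echo at each microphone is the sum of two filtered speaker signals, which is why four filters are needed.

```python
def convolve(impulse_response, signal):
    """Causal convolution, truncated to the length of the signal."""
    return [
        sum(
            impulse_response[k] * signal[n - k]
            for k in range(len(impulse_response))
            if n - k >= 0
        )
        for n in range(len(signal))
    ]


def stereo_echo(paths, spk_l, spk_r):
    """Echo arriving at each microphone.

    paths[(speaker, mic)] is the impulse response of one echo path,
    e.g. paths[("L", "R")] is L-channel speaker to R-channel microphone.
    """
    echo = {}
    for mic in ("L", "R"):
        from_l = convolve(paths[("L", mic)], spk_l)
        from_r = convolve(paths[("R", mic)], spk_r)
        echo[mic] = [a + b for a, b in zip(from_l, from_r)]
    return echo
```

An echo canceller of the kind described would estimate each of the four `paths` entries with its own adaptive filter and subtract the two corresponding pseudo echoes from each microphone signal.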

Furthermore, when a conference system built from a loudspeaker communication system is used in a very large venue, a talker's voice may be hard to hear for conference participants who are in the same venue but far from the talker. To handle this situation it is known, as described for example in Patent Document 1, to add a PA system function (self-amplified sound output function) with which the communication terminal device amplifies the sound picked up by a near-end microphone and outputs it from the speakers on the same near-end side. The voice of a near-end talker input through a microphone is thus amplified and output from the near-end speakers, so that conference participants can hear talkers in the same venue loudly and clearly.

Against this background, a loudspeaker communication system that is stereo-capable and also has the self-amplified sound output function can be considered. In such a system, the near-end talker's voice picked up by the L-channel microphone and the near-end talker's voice picked up by the R-channel microphone are emitted as sound from the L-channel and R-channel speakers provided on the same near-end side.

Patent Document 1: JP-A-9-307626

When a stereo-capable loudspeaker communication system is given the self-amplified sound output function in this way, one problem that arises is how to set the left/right volume balance of the self-amplified sound. The simplest setting would be to route the sound picked up by the L-channel microphone to the L-channel speaker and the sound picked up by the R-channel microphone to the R-channel speaker.

However, the self-amplified sound output function is normally needed because the venue, such as a conference hall, is correspondingly large. Assuming that the left and right speakers, and the left and right microphones, are in practice placed correspondingly far apart, simply routing the self-amplified sound as described above may fail to achieve the original purpose of the function.
For example, a voice uttered by a talker in a large conference hall becomes harder to hear the farther the listener is from the talker. It is therefore preferable that a voice uttered toward the L-channel microphone be output more loudly from the R-channel speaker, which is far from the talker, than from the L-channel speaker, which is close to the talker. With the simple routing described above, however, the talker's voice is output only from the L-channel speaker, which is the opposite of what is desired.
It might then seem sufficient to swap the input and output channels, so that the sound picked up by the L-channel microphone is output from the R-channel speaker and the sound picked up by the R-channel microphone is output from the L-channel speaker. With this routing, however, a voice uttered toward the L-channel microphone, for example, is not output from the L-channel speaker at all. As noted above, the self-amplified sound output function presupposes a large venue, so if no self-amplified sound at all is output from the speaker near the talker, conference participants in that vicinity will still find the talker's voice hard to hear.

It follows that the left/right volume balance of the self-amplified sound should be adjusted so that sound is output from both the L-channel and R-channel speakers, each at an appropriate volume. In general, this adjustment is made by the simple method of providing a manual balance control, which the user operates while actually listening to the self-amplified sound.
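The three routing options discussed so far (straight, swapped, and a balance that feeds both speakers) can be compared as gain tables applied to the microphone channels. The table layout, the names, and the 0.3/0.7 values are hypothetical and serve only to illustrate the difference.

```python
def route(gain, mic_l, mic_r):
    """Mix one sample from each microphone channel into the two speakers.

    gain[(mic, speaker)] is the level applied from a microphone channel
    to a speaker channel.
    """
    spk_l = gain[("L", "L")] * mic_l + gain[("R", "L")] * mic_r
    spk_r = gain[("L", "R")] * mic_l + gain[("R", "R")] * mic_r
    return spk_l, spk_r


# L microphone to L speaker only: the far side of the room hears nothing.
STRAIGHT = {("L", "L"): 1.0, ("L", "R"): 0.0, ("R", "L"): 0.0, ("R", "R"): 1.0}
# L microphone to R speaker only: the area near the talker hears nothing.
SWAPPED = {("L", "L"): 0.0, ("L", "R"): 1.0, ("R", "L"): 1.0, ("R", "R"): 0.0}
# Both speakers active, the far one louder (illustrative values only).
BLENDED = {("L", "L"): 0.3, ("L", "R"): 0.7, ("R", "L"): 0.7, ("R", "R"): 0.3}
```

For a talker at the L-channel microphone, `route(BLENDED, 1.0, 0.0)` gives a nonzero level at both speakers, which is the property the text asks of the balance.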

However, manual adjustment of the left/right volume balance is burdensome for the user.
An object of the present invention is therefore to enable the inter-channel volume balance of the self-amplified sound to be set automatically and appropriately in an audio signal processing apparatus that forms a multi-channel (stereo) loudspeaker communication system having a self-amplified sound output function.

In view of the above problem, the present invention configures an audio signal processing apparatus as follows.
That is, the apparatus comprises: adaptive processing means that uses, as desired signals, the picked-up audio signals obtained by a plurality of microphones provided for the respective channels of a predetermined multi-channel configuration, that uses, as reference signals, a plurality of speaker-output audio signals for the respective channels, these being audio signals at a predetermined processing stage between reception of the audio signal transmitted from the communication partner and its emission as sound from a plurality of speakers provided for the respective channels, and that performs adaptive processing so as to minimize the output signal obtained by subtracting from each desired signal a cancellation signal generated from the reference signals, the output signal of each channel being the audio signal to be transmitted to the partner communication device; balance setting means that receives the per-channel output signals obtained by the adaptive processing means, and changes and sets the balance of signal levels between the channels before outputting them; synthesizing means that combines the per-channel output signals from the balance setting means so that each is included as a component of the speaker-output audio signal of the corresponding channel; and balance determining means that, when the picked-up audio signal obtained by the microphone of a certain channel has content corresponding to an own-side talker voice input state, determines the balance of signal levels between the channels to be set by the balance setting means in accordance with the distances between that microphone and the plurality of speakers provided for the respective channels.

In the above description, the "own-side talker voice input state" refers to a state in which the picked-up audio signal contains the voice of a talker (own-side talker voice) that is regarded as having been obtained by speaking toward the corresponding microphone, that is, picked up by the microphone at or above a certain level regarded as valid.
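A minimal sketch of a level test for this state is shown below. It assumes that "a certain level regarded as valid" can be approximated by comparing the RMS level of a short frame against a fixed threshold; the threshold value and the function names are hypothetical, and a practical detector would also have to tell the talker's voice apart from far-end sound arriving through the speakers.

```python
import math


def rms(frame):
    """Root-mean-square level of a non-empty frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_own_side_talker_input(frame, threshold=0.05):
    """True when the frame level reaches the (assumed) validity threshold."""
    return rms(frame) >= threshold
```

With the default threshold, a frame such as `[0.2, -0.2, 0.2, -0.2]` counts as talker input, while a near-silent frame does not.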

In the audio signal processing apparatus configured as above, the adaptive processing means for the multi-channel configuration first provides a function of cancelling, by adaptive processing, the echo (the echo of the far-end talker) that arises in a multi-channel loudspeaker communication system when sound picked up by the microphones circulates through communication with the communication partner. With the addition of the balance setting means and the synthesizing means, the apparatus can amplify and output from the speaker of each channel the sound transmitted from the communication partner, and it also gains a multi-channel self-amplified sound output function, completed within the audio signal processing apparatus itself, that amplifies and outputs from the speakers the sound picked up by the microphone of each channel with a required inter-channel volume balance.
Moreover, when the picked-up audio signal of a certain microphone is in the own-side talker voice input state and the own-side talker voice is output from the speakers, the balance of signal levels between the channels to be set by the balance setting means, that is, the volume of the own-side talker voice (self-amplified sound) output from the speaker of each channel, is determined in accordance with the distances between that microphone and the speakers of the respective channels.
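The text states that the balance is determined according to the microphone-to-speaker distances but does not give a formula at this point. The sketch below therefore assumes one simple mapping, in which each speaker's gain is proportional to its distance from the active microphone so that the farther speaker is louder. The function name and the mapping itself are assumptions. The abstract indicates that the distance ratio is estimated from adaptive filter output levels; that estimation is not modelled here, and the distances are simply taken as inputs.

```python
def balance_from_distances(dist_to_spk_l, dist_to_spk_r):
    """Return (gain_l, gain_r) for the self-amplified sound.

    Assumed mapping: gain proportional to distance, normalized so the
    two gains sum to 1. This is an illustration, not the patented rule.
    """
    total = dist_to_spk_l + dist_to_spk_r
    if total <= 0:
        return 0.5, 0.5  # no usable distance information: equal balance
    return dist_to_spk_l / total, dist_to_spk_r / total
```

For a microphone 2 m from the L speaker and 6 m from the R speaker, this mapping sends one quarter of the level to the L speaker and three quarters to the R speaker, so both speakers stay active and the distant one dominates.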

With this configuration, the present invention makes it possible to automatically obtain, for the self-amplified sound output from the speaker of each channel, an appropriate volume balance matched to the distances between the microphone receiving the own-side talker voice and the speakers of the respective channels, for example a balance at which users in the venue can easily hear the own-side talker voice. In particular, even if the distances between the installed speakers and microphones change because the usage environment or location of the loudspeaker communication system changes, an appropriate volume balance is set in adaptation to the change. The user is therefore no longer burdened with troublesome work such as manual volume balance adjustment, and a more convenient loudspeaker communication system can be provided.

As the best mode for carrying out the present invention (hereinafter, the embodiment), the invention is applied to the audio transmission and reception system of a video conference system.
In general, a video conference system has a communication terminal device installed in each of several conference venues at different locations. Each communication terminal device transmits images captured by a camera device and sound picked up by microphones to the other communication terminal devices, and receives the images and sound transmitted from the other communication terminal devices and outputs them from a display device and speakers, respectively. That is, a video conference system comprises a video transmission and reception system that exchanges images and an audio transmission and reception system that exchanges audio. The present embodiment is a communication terminal device (voice communication terminal device) that, as the audio transmission and reception system, transmits and receives audio and is capable of amplified output from speakers or the like. The audio transmission and reception system of the embodiment is configured to support stereo channels, L (left) and R (right), as its multi-channel configuration.

FIG. 1 shows an example system configuration of the audio transmission and reception system in a video conference system, on which the present embodiment is based.
Here, two mutually distant places, place A and place B, serve as conference venues, and voice communication terminal devices 1-1 and 1-2, which form the audio transmission and reception system, are installed at places A and B, respectively. The voice communication terminal devices 1-1 and 1-2 are connected by a communication line conforming to a predetermined communication scheme so that they can communicate with each other. Places A and B are each assumed to be large enough that people far apart in the same room would have difficulty clearly hearing one another at a normal conversational volume.
At place A, microphones 2-1(L) and 2-1(R), corresponding to the L and R channels, are installed. Microphones 2-1(L) and 2-1(R) pick up the voices of the conference participants in place A, and are placed at suitable positions in place A corresponding to the L channel and the R channel, respectively.
Speakers 3-1(L) and 3-1(R) allow participants at place A, as the near end, to hear the voices of the conference participants at the other place (place B), as the far end; they too are placed at suitable positions in place A corresponding to the L channel and the R channel.
Similarly, at place B, microphones 2-2(L) and 2-2(R) and speakers 3-2(L) and 3-2(R), corresponding to the L and R channels, are arranged.
In the following description, where there is no particular need to distinguish between like components at the different places, they are referred to as voice communication terminal device 1, microphone 2 (2(L), 2(R)), and speaker 3 (3(L), 3(R)).

FIG. 2 shows the distance relationships that result from the arrangement of microphones 2-1(L) and 2-1(R) and speakers 3-1(L) and 3-1(R) in place A.
First, speakers 3-1(L) and 3-1(R) are placed at predetermined positions within the left-side and right-side areas of place A, respectively, so that speaker 3-1(L) and speaker 3-1(R) are separated by a distance K0.
Microphone 2-1(L), corresponding to the L channel, is placed in the left-side area of place A together with speaker 3-1(L) of the same L channel. Similarly, microphone 2-1(R), corresponding to the R channel, is placed in the right-side area of place A together with speaker 3-1(R) of the same R channel.
Consequently, the distance Kll between microphone 2-1(L) and speaker 3-1(L) is shorter than the distance Krl between microphone 2-1(L) and speaker 3-1(R). Likewise, the distance Krr between microphone 2-1(R) and speaker 3-1(R) is shorter than the distance Klr between microphone 2-1(R) and speaker 3-1(L).
Microphones 2-2(L) and 2-2(R) and speakers 3-2(L) and 3-2(R) at place B are arranged so as to have the same relationships as microphones 2-1(L) and 2-1(R) and speakers 3-1(L) and 3-1(R) at place A.
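The distance relationships above can be checked numerically for any concrete layout. The coordinates below are hypothetical (metres on a floor plan) and are not taken from FIG. 2; the key names follow the text, with the first letter denoting the speaker channel and the second the microphone channel.

```python
import math


def layout_distances(spk_l, spk_r, mic_l, mic_r):
    """Distances K0, Kll, Krl, Krr, Klr for 2-D positions given as (x, y)."""
    return {
        "K0": math.dist(spk_l, spk_r),   # speaker L to speaker R
        "Kll": math.dist(mic_l, spk_l),  # microphone L to speaker L
        "Krl": math.dist(mic_l, spk_r),  # microphone L to speaker R
        "Krr": math.dist(mic_r, spk_r),  # microphone R to speaker R
        "Klr": math.dist(mic_r, spk_l),  # microphone R to speaker L
    }


# Hypothetical layout: speakers 10 m apart, each microphone near its own speaker.
EXAMPLE = layout_distances(
    spk_l=(0.0, 0.0), spk_r=(10.0, 0.0), mic_l=(2.0, 3.0), mic_r=(8.0, 3.0)
)
```

For this layout `EXAMPLE["Kll"] < EXAMPLE["Krl"]` and `EXAMPLE["Krr"] < EXAMPLE["Klr"]`, matching the relationships stated for place A.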

The description now returns to FIG. 1.
First, at place A, the L-channel and R-channel audio signals (stereo audio signal) picked up by microphones 2-1(L) and 2-1(R) are input to voice communication terminal device 1-1. Voice communication terminal device 1-1 adjusts the volume balance between the L and R channels of the input stereo audio signal as described later, and then transmits it over the communication line to voice communication terminal device 1-2. Voice communication terminal device 1-2 receives the transmitted stereo audio signal and, based on it, outputs the L-channel audio signal from speaker 3-2(L) and the R-channel audio signal from speaker 3-2(R). The conference participants at place B can thus hear the voices of the conference participants at place A.
Similarly, the stereo audio picked up by microphones 2-2(L) and 2-2(R) in place B is transmitted by voice communication terminal device 1-2 to voice communication terminal device 1-1. Based on the received stereo audio signal, voice communication terminal device 1-1 outputs the L-channel and R-channel audio from speakers 3-1(L) and 3-1(R).
In this way, the audio transmission and reception system of the video conference system performs two-way audio communication, which enables, for example, conference participants at one place (the near end) to converse with conference participants at another place (the far end). This video conference system assumes that there are several conference participants at each place, and speakers 3 are provided so that all participants at a place can hear the voices of the participants at the other place. A system that exchanges audio in both directions using speakers in this way is also called a loudspeaker communication system.

The voice communication terminal device 1 of the present embodiment can also output the audio signals picked up by microphones 2(L) and 2(R) from speakers 3(L) and 3(R) as amplified sound. That is, it has a self-amplified sound output function: in addition to the sound transmitted from another voice communication terminal device, it can output from speakers 3 the sound picked up by the microphones connected to itself.
As noted above, places A and B, the conference venues here, are assumed to be quite large. The self-amplified sound output function is provided so that, within the same venue, the voice of a conference participant who speaks (own-side talker voice) can be heard at a sufficiently loud volume by the other conference participants as well.

FIG. 3 shows an example configuration of voice communication terminal device 1. For clarity, voice communication terminal devices 1-1 and 1-2 shown in FIG. 1 both have the configuration shown in FIG. 3.
As shown in the figure, voice communication terminal device 1 comprises, for example, an audio signal processing unit 11, a codec unit 12, and a communication unit 15.

The audio signal processing unit 11 receives the L-channel and R-channel picked-up audio signals obtained by microphones 2(L) and 2(R), and the L-channel and R-channel audio signals output from decoder 14 in the codec unit 12, described later. The audio signal processing unit 11 outputs the L- and R-channel audio signals that have undergone echo cancellation, described later, to encoder 13 in the codec unit 12, and outputs the L- and R-channel audio signals to be emitted as amplified sound to speakers 3(L) and 3(R), respectively.
In practice, components such as A/D converters that convert the analog audio signals picked up by microphones 2 into digital signals, and D/A converters and amplifier circuits that convert the digital audio signals output from the audio signal processing unit 11 into analog signals and amplify them for output to speakers 3(L) and 3(R), would also be provided; they are omitted from the figure to keep the description simple. Each of these components may be provided inside voice communication terminal device 1 or in a device external to it.

As noted earlier, a loudspeaker communication system used as-is suffers from echo and howling. As shown in FIG. 3, sound emitted into the space from the speakers 3 (L) and 3 (R) travels along the spatial propagation paths (echo paths) Sll, Srl, Srr, and Slr and reaches each of the microphones 2 (L) and 2 (R), in practice as a mixture of direct and indirect sound. That is, the other party's speaker voice, transmitted from the voice communication terminal device on the communication partner side and emitted from the speakers 3 (L) and 3 (R), is picked up by the microphones 2 (L) and 2 (R) and transmitted back to the partner's terminal device. On the partner side as well, sound emitted from the speakers is picked up by the microphones and transmitted to the local terminal device. In a loudspeaker communication system, sound once emitted into the space therefore circulates between the voice communication terminal devices. As a result, the sound emitted from a speaker includes the talker's own voice, heard like a reverberation after a certain delay. This is echo, and when the loop repeats beyond a certain degree it becomes howling.
For this reason, loudspeaker communication systems are provided with an echo cancellation system that eliminates or suppresses echo. The audio signal processing unit 11 is configured to have the signal processing function of such an echo cancellation system.
In support of the self-amplified sound output function described above, the audio signal processing unit 11 also has a signal path for self-amplified sound, which outputs the audio signals collected by the microphones 2 (L) and 2 (R) from the speakers 3 (L) and 3 (R).
In practice, the audio signal processing unit 11 is configured, for example, as a DSP (Digital Signal Processor). Its configuration for echo cancellation and its signal path for self-amplified sound are described later.

The L-channel and R-channel audio signals that have undergone echo cancellation processing in the audio signal processing unit 11 are input to the encoder 13 in the codec unit 12. The encoder 13 applies signal processing such as audio compression encoding according to a predetermined scheme to the input L-channel and R-channel audio signals, generates compression-encoded audio signal data in a predetermined stereo format in which, for example, the L-channel and R-channel audio signals are multiplexed, and outputs this data to the communication unit 15 as a transmission audio signal. The communication unit 15 outputs the input transmission audio signal to another voice communication terminal device over a communication line according to a predetermined communication scheme.

The communication unit 15 also receives a transmission audio signal sent from another voice communication terminal device, restores it to an audio signal in the predetermined compression-encoded format, and outputs it to the decoder 14 of the codec unit 12.
The decoder 14 decodes the compression-encoded input audio signal, converts it into, for example, a digital audio signal in a predetermined PCM format, and outputs it to the audio signal processing unit 11. Here, the compression-encoded audio signal for playback is in the same stereo format as, for example, the compression-encoded audio signal data generated by the encoder 13, so the decoding by the decoder 14 yields L-channel and R-channel audio signals in, for example, the predetermined PCM format. The components of these L-channel and R-channel audio signals, after passing through the audio signal processing unit 11, are ultimately output from the speakers 3 (L) and 3 (R) as the other party's speaker voice.

FIG. 4 shows an example of the internal configuration of the audio signal processing unit 11, which is the echo cancellation system of the present embodiment. The figure shows the microphones 2, the speakers 3, and the codec unit 12 (encoder 13 and decoder 14) together with the audio signal processing unit 11.

As also shown in FIG. 3, a stereo-channel loudspeaker communication system such as the voice communication terminal device 1 of the present embodiment has four spatial propagation paths: the path Sll from the L-channel speaker 3 (L) to the L-channel microphone 2 (L), the path Srl from the R-channel speaker 3 (R) to the L-channel microphone 2 (L), the path Slr from the L-channel speaker 3 (L) to the R-channel microphone 2 (R), and the path Srr from the R-channel speaker 3 (R) to the R-channel microphone 2 (R).
As will be understood from the following description, the audio signal processing unit 11, the echo cancellation system of the present embodiment, performs echo cancellation by means of adaptive filter systems. To cancel echo for each of the four spatial propagation paths Sll, Srl, Slr, and Srr, the adaptive filter systems are configured as follows.
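As a compact way to read the four paths, the two microphone signals can be modeled as sums of convolutions. This is only a sketch under the assumption that each echo path behaves as a linear, time-invariant impulse response. The symbols $x_L$, $x_R$ (speaker feeds), $d_L$, $d_R$ (microphone signals), $s_{ll}$, $s_{rl}$, $s_{lr}$, $s_{rr}$ (impulse responses of the paths Sll, Srl, Slr, Srr), and $v_L$, $v_R$ (locally produced sound such as the self-speaker voice) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
d_L(n) &= (s_{ll} * x_L)(n) + (s_{rl} * x_R)(n) + v_L(n) \\
d_R(n) &= (s_{lr} * x_L)(n) + (s_{rr} * x_R)(n) + v_R(n)
\end{aligned}
```

Each microphone therefore sees two echo terms, which is why two adaptive filters are assigned per microphone.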

As illustrated, the adaptive filter system in this case can be regarded as consisting of two adaptive filter systems, 20A and 20B. As will be understood from the following description, the adaptive filter system 20A cancels the echo that arises when sound propagating along the spatial propagation paths Slr and Srr is picked up by the R-channel microphone 2 (R), and the adaptive filter system 20B cancels the echo that arises when sound propagating along the remaining two paths, Sll and Srl, is picked up by the L-channel microphone 2 (L).

First, the configuration on the adaptive filter system 20A side is described.
As illustrated, the adaptive filter system 20A includes, for example, adaptive filters (ADF: Adaptive Digital Filter) 21RR and 21LR, an adder 22 (R), and a subtractor 23 (R).
The adaptive filter 21RR is provided to cancel the echo that arises when sound propagating along the spatial propagation path Srr is picked up by the R-channel microphone 2 (R). Its reference signal is the audio signal to be output to the R-channel speaker 3 (R), which, as illustrated, is the output signal of an adder 33. The adder 33 combines the audio signal of the R-channel other party's speaker voice (Xd1) output from the decoder 14 with the audio signal of the R-channel self-speaker voice output from the self-amplified sound volume 31.
The other adaptive filter, 21LR, is provided to cancel the echo that arises when sound propagating along the spatial propagation path Slr is picked up by the R-channel microphone 2 (R). Its reference signal is the audio signal to be output to the L-channel speaker 3 (L), which is the output signal of an adder 34. The adder 34 combines the audio signal of the L-channel other party's speaker voice (Xd2) output from the decoder 14 with the audio signal of the L-channel self-speaker voice output from the self-amplified sound volume 31.

The output signals (cancellation signals) Yrr and Ylr of the adaptive filters 21RR and 21LR are summed by the adder 22 (R) and input to the subtractor 23 (R). The subtractor 23 (R) subtracts the output signal (Yrr + Ylr) of the adder 22 (R) from the collected sound signal obtained by the microphone 2 (R). The output of the subtractor 23 (R) is the output signal Ye1 of the adaptive filter system 20A.
The collected sound signal from the microphone 2 (R) that is input to the subtractor 23 (R) serves as the desired signal common to the adaptive filters 21RR and 21LR. In other words, the subtractor 23 (R) subtracts the combined output signals (cancellation signals) of the adaptive filters 21RR and 21LR from this desired signal.
The signal Ye1 output from the subtractor 23 (R) is input to the encoder 13 via the transmission sound volume 32 as the R-channel audio signal to be transmitted to the other voice communication terminal device on the communication partner side (far-end side). The signal Ye1 is also fed back to the adaptive filters 21RR and 21LR as their common error signal (residual signal). Furthermore, in the present embodiment, the signal Ye1 is also input to the self-amplified sound volume 31 as the R-channel self-speaker voice signal.
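The signal flow through the adaptive filter system 20A for a single sample period can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function and variable names are hypothetical, the filter coefficients are simply taken as given, and the filters are assumed to be FIR filters applied to newest-first histories of the reference signals.

```python
def fir_output(coeffs, history):
    """Dot product of FIR coefficients with a newest-first reference history."""
    return sum(c * x for c, x in zip(coeffs, history))


def system_20a_sample(mic_r, ref_r_hist, ref_l_hist, w_rr, w_lr):
    """Compute one sample of Ye1 from the R microphone sample.

    mic_r      : desired signal (sample collected by microphone 2 (R))
    ref_r_hist : recent samples of the adder 33 output (R speaker feed)
    ref_l_hist : recent samples of the adder 34 output (L speaker feed)
    w_rr, w_lr : current coefficients of adaptive filters 21RR and 21LR
    """
    y_rr = fir_output(w_rr, ref_r_hist)   # cancellation signal Yrr
    y_lr = fir_output(w_lr, ref_l_hist)   # cancellation signal Ylr
    cancel = y_rr + y_lr                  # adder 22 (R)
    ye1 = mic_r - cancel                  # subtractor 23 (R)
    return ye1, y_rr, y_lr
```

The returned Ye1 is the single value that, per the description, goes to three places: the transmission sound volume 32, the two adaptive filters as their common error signal, and the self-amplified sound volume 31.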

Next, the configuration on the adaptive filter system 20B side is described.
As illustrated, the adaptive filter system 20B includes, for example, adaptive filters 21RL and 21LL, an adder 22 (L), and a subtractor 23 (L).
The adaptive filter 21RL is provided to cancel the echo that arises when sound propagating along the spatial propagation path Srl is picked up by the L-channel microphone 2 (L). Its reference signal is the signal output from the adder 33 and destined for the R-channel speaker 3 (R).
The adaptive filter 21LL is provided to cancel the echo that arises when sound propagating along the spatial propagation path Sll is picked up by the L-channel microphone 2 (L). Its reference signal is the signal output from the adder 34 and destined for the L-channel speaker 3 (L).

The output signals (cancellation signals) Yrl and Yll of the adaptive filters 21RL and 21LL are summed by the adder 22 (L) and input to the subtractor 23 (L). The subtractor 23 (L) subtracts the output signal (Yrl + Yll) of the adder 22 (L) from the collected sound signal obtained by the microphone 2 (L). The output of the subtractor 23 (L) is the output signal Ye2 of the adaptive filter system 20B.
Here, the collected sound signal from the microphone 2 (L) that is input to the subtractor 23 (L) serves as the desired signal common to the adaptive filters 21RL and 21LL, and the subtractor 23 (L) subtracts the combined output signals (cancellation signals) of the adaptive filters 21RL and 21LL from this desired signal.
The signal Ye2 is input to the encoder 13 via the transmission sound volume 32 as the L-channel audio signal to be transmitted to the other voice communication terminal device on the communication partner side (far-end side). The signal Ye2 is also fed back to the adaptive filters 21RL and 21LL as their common error signal (residual signal). Furthermore, the signal Ye2 is also input to the self-amplified sound volume 31 as the L-channel self-speaker voice signal.

As described above, the self-amplified sound volume 31 receives the signals Ye1 and Ye2 as the R-channel and L-channel self-speaker voices, respectively. After adjusting the volume balance between the L-channel and R-channel sound as described later, it outputs the audio signal Xsr, the R-channel self-speaker voice, to the adder 33 and the audio signal Xsl, the L-channel self-speaker voice, to the adder 34.

The transmission sound volume 32 can attenuate the level (gain) of the input R-channel and L-channel audio signals by a set attenuation rate before outputting them. It is provided to reinforce the cancellation of the echo of the other party's speaker voice that the adaptive filter systems 20A and 20B perform, as described next. Even when the adaptive filters 21 (RR), 21 (LR), 21 (RL), and 21 (LL) in the adaptive filter systems 20A and 20B have converged and a sufficiently effective echo cancellation is considered to be obtained, a small echo component may in reality remain. When the adaptive filters have converged and a so-called other-party (far-end) single-talk state is detected, in which the collected sound signal contains no self-speaker voice and can contain only the echo of the other party's speaker voice, the transmission sound volume 32 either applies an attenuation rate of nearly 100% so that the input signal is not output, or applies an attenuation rate at or above a certain value before outputting it. This makes the residual echo component inaudible, or hard to hear, at the partner's communication terminal device.
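The gating behavior of the transmission sound volume 32 can be sketched as below. The function names and the `attenuation` parameter are hypothetical, and the two boolean flags are assumed to come from convergence and single-talk detectors that this passage does not describe. Passing the signals unchanged outside the far-end single-talk state is also an assumption made for the sketch.

```python
def transmission_gain(filters_converged, far_end_single_talk, attenuation=1.0):
    """Return the linear gain applied by the transmission sound volume 32.

    attenuation is the fraction of the level removed (1.0 means about 100 %).
    Both flags are assumed inputs; their detection is not modeled here.
    """
    if not 0.0 <= attenuation <= 1.0:
        raise ValueError("attenuation must be between 0.0 and 1.0")
    if filters_converged and far_end_single_talk:
        return 1.0 - attenuation     # suppress the residual echo
    return 1.0                       # assumed: pass the signals unchanged


def apply_transmission_volume(ye1, ye2, gain):
    """Scale one sample of Ye1 (R) and Ye2 (L) on the way to the encoder 13."""
    return ye1 * gain, ye2 * gain
```

With the default `attenuation=1.0` the gate mutes the outgoing signal entirely in the far-end single-talk state, and a smaller value corresponds to the "attenuation rate at or above a certain value" alternative.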

The operation of the adaptive filter systems 20A and 20B in the audio signal processing unit 11 configured as above is described first for the adaptive filter system 20A.
Although not illustrated, each of the adaptive filters 21 (RR) and 21 (LR) in the adaptive filter system 20A includes, for example, an FIR (Finite Impulse Response) digital filter of the required order, through which the reference signal described above passes, and a coefficient setting circuit that variably sets the coefficients (filter coefficients) of this digital filter according to a predetermined adaptive algorithm. The output of the digital filter is the output signal of the adaptive filter 21 (RR) or 21 (LR) and serves as a pseudo echo signal (cancellation signal).

In the adaptive filters 21 (RR) and 21 (LR), the coefficient setting circuit keeps updating the filter coefficients of the coefficient multipliers at the required taps so that the output signal (cancellation signal) always minimizes the residual indicated by the error signal (Ye1).
As a result, the coefficient vector of the adaptive filter 21 (RR) (the array of coefficients across the taps) forms an impulse response that approximates the transfer function of the sound that travels from the R-channel speaker 3 (R) along the spatial propagation path Srr and is picked up by the microphone 2 (R). Likewise, the coefficient vector of the adaptive filter 21 (LR) forms an impulse response that approximates the transfer function of the sound that travels from the L-channel speaker 3 (L) along the spatial propagation path Slr and is picked up by the microphone 2 (R). These operations amount to adaptively canceling the signal components of the sounds arriving via the spatial propagation paths Srr and Slr according to the state of the signals being processed (reference signal and desired signal) at that time.
The sound picked up by the microphone 2 (R) via the echo path Srr is an echo component based on the audio signal to be reproduced by the R-channel speaker 3 (R), and the sound picked up by the microphone 2 (R) via the echo path Slr is an echo component based on the audio signal to be reproduced by the L-channel speaker 3 (L). The output signals Yrr and Ylr (cancellation signals) of the adaptive filters 21 (RR) and 21 (LR) can therefore be regarded as pseudo echoes of the R-channel and L-channel speaker output sounds, respectively, arriving at the microphone 2 (R). In the adaptive filter system 20A, the subtractor 23 (R) subtracts the combined cancellation signals from the adaptive filters 21 (RR) and 21 (LR), that is, the pseudo echo, from the collected sound signal of the R-channel microphone 2 (R). In this way, the adaptive filter system 20A adaptively removes the echo component from the sound picked up by the R-channel microphone 2 (R).
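The coefficient update driven by the shared error signal can be sketched as follows. The text only refers to "a predetermined adaptive algorithm", so the normalized LMS (NLMS) update used here is an assumption chosen for illustration, as are the function names, the step size `mu`, the tap count, and the made-up echo-path impulse responses in the demo.

```python
import random


def nlms_pair(ref_r, ref_l, mic_r, taps=4, mu=0.5, eps=1e-8):
    """Adapt two FIR filters (21RR, 21LR) from one shared error signal Ye1.

    NLMS is an assumed algorithm; the source does not name the one it uses.
    """
    w_rr = [0.0] * taps          # coefficient vector of filter 21RR
    w_lr = [0.0] * taps          # coefficient vector of filter 21LR
    hist_r = [0.0] * taps        # newest-first history of the R speaker feed
    hist_l = [0.0] * taps        # newest-first history of the L speaker feed
    ye1 = []
    for x_r, x_l, d in zip(ref_r, ref_l, mic_r):
        hist_r = [x_r] + hist_r[:-1]
        hist_l = [x_l] + hist_l[:-1]
        y_rr = sum(w * x for w, x in zip(w_rr, hist_r))   # pseudo echo Yrr
        y_lr = sum(w * x for w, x in zip(w_lr, hist_l))   # pseudo echo Ylr
        e = d - (y_rr + y_lr)                             # subtractor 23 (R)
        power = sum(x * x for x in hist_r + hist_l) + eps
        step = mu * e / power                             # normalized step
        w_rr = [w + step * x for w, x in zip(w_rr, hist_r)]
        w_lr = [w + step * x for w, x in zip(w_lr, hist_l)]
        ye1.append(e)
    return w_rr, w_lr, ye1


def demo(n=2000, seed=0):
    """Simulate echo paths Srr and Slr (made-up values) and adapt to them."""
    rng = random.Random(seed)
    s_rr = [0.6, 0.3, 0.0, 0.1]      # hypothetical impulse response of Srr
    s_lr = [0.2, -0.1, 0.05, 0.0]    # hypothetical impulse response of Slr
    ref_r = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ref_l = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic_r = []
    for i in range(n):
        d = 0.0
        for k in range(len(s_rr)):
            if i - k >= 0:
                d += s_rr[k] * ref_r[i - k] + s_lr[k] * ref_l[i - k]
        mic_r.append(d)
    w_rr, w_lr, ye1 = nlms_pair(ref_r, ref_l, mic_r, taps=len(s_rr))
    return s_rr, s_lr, w_rr, w_lr, ye1
```

In the demo the two reference signals are independent noise and the microphone contains echo only, so each coefficient vector can settle on the impulse response of its own path. When the L and R speaker feeds are strongly correlated, the individual coefficient vectors are generally not uniquely determined even if the error becomes small.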

In the adaptive filter system 20B, the adaptive filter 21 (RL) generates and outputs the audio signal (cancellation signal) Yrl of a pseudo echo corresponding to the echo that travels from the R-channel speaker 3 (R) along the echo path Srl and is picked up by the L-channel microphone 2 (L), and the adaptive filter 21 (LL) generates and outputs the audio signal (cancellation signal) Yll of a pseudo echo corresponding to the echo that travels from the L-channel speaker 3 (L) along the echo path Sll and is picked up by the L-channel microphone 2 (L). These audio signals are combined by the adder 22 (L) and output to the subtractor 23 (L).
The subtractor 23 (L) takes the audio signal picked up by the L-channel microphone 2 (L) as the desired signal, subtracts from it the combined cancellation signal (Yrl + Yll) output from the adder 22 (L), and outputs the result as the signal Ye2. Following the description of the adaptive filter system 20A above, the signal Ye2 is the audio signal picked up by the L-channel microphone 2 (L) with the echo components arriving from the speakers 3 (R) and 3 (L) via the spatial propagation paths Srl and Sll canceled.

In this way, the audio signal processing unit 11 of the present embodiment is basically configured, for stereo channels, to cancel each of the echo components generated by sound propagating along the spatial propagation paths Sll, Srl, Slr, and Srr.
The audio signals Ye1 and Ye2 obtained by canceling the echo as described above are transmitted as the R-channel and L-channel transmission audio signals, respectively, from the transmission sound volume 32 via the encoder 13 to the other voice communication terminal device. As a result, the echo is also removed from the sound heard when the other voice communication terminal device emits the received audio signal from its speakers. This is how the echo cancellation effect is obtained.

The audio signals Ye1 and Ye2 are also input to the self-amplified sound volume 31 as the signals from which the audio signals Xsr and Xsl, the self-speaker voice to be output from the speakers 3 (R) and 3 (L), are derived.
For example, when a self-speaker voice (voice picked up by a microphone at or above a level regarded as valid) is input to the microphone 2 (R), its audio signal component is input to the self-amplified sound volume 31 as the signal Ye1 via the subtractor 23 (R) of the adaptive filter system 20A. Likewise, when a self-speaker voice is input to the microphone 2 (L), its audio signal component is input to the self-amplified sound volume 31 as the signal Ye2 via the subtractor 23 (L) of the adaptive filter system 20B.
As described later, the self-amplified sound volume 31 uses the input signals Ye1 and Ye2 to set attenuation rates and adjust the balance, and outputs the audio signals Xsr and Xsl. These audio signals carry the self-speaker voice picked up by the microphone 2 (R) or the microphone 2 (L). The audio signals Xsr and Xsl are output to the speakers 3 (R) and 3 (L) via the adders 33 and 34, respectively. As a result, the self-speaker voice picked up by the microphones 2 (R) and 2 (L) is emitted as sound from the speakers 3 (R) and 3 (L) in the same location. That is, the self-amplified sound output function is realized.
The self-speaker voice emitted from the speakers 3 (R) and 3 (L) in this way also propagates along the spatial propagation paths Sll, Srl, Slr, and Srr and reaches the microphones 2 (L) and 2 (R), so an echo of the self-speaker voice arises. As described above, however, when the adaptive filter systems 20A and 20B operate to cancel the echo propagating along the spatial propagation paths (echo paths) Sll, Srl, Slr, and Srr, the reference signal input to the adaptive filters 21 (RR) and 21 (RL) is the output signal of the adder 33, which includes the audio signal Xsr, the R-channel self-speaker voice output from the self-amplified sound volume 31. Similarly, the reference signal input to the adaptive filters 21 (LR) and 21 (LL) is the output signal of the adder 34, which includes the audio signal Xsl, the L-channel self-speaker voice. The echo of the self-speaker voice is therefore canceled as well.
Incidentally, a configuration that omits the path by which the signals Ye1 and Ye2 pass through the self-amplified sound volume 31 and enter the adders 33 and 34 as the signals Xsr and Xsl is equivalent to an ordinary stereo echo cancellation system without the self-amplified sound output function.
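The reason the self-speaker echo is also canceled can be read from the signals feeding the speakers. Writing the outputs of the adders 33 and 34 as $x_R$ and $x_L$ (names assumed here for illustration, not used in the original), the relations described above are:

```latex
\begin{aligned}
x_R(n) &= X_{d1}(n) + X_{sr}(n) \quad \text{(adder 33: reference for 21RR and 21RL)} \\
x_L(n) &= X_{d2}(n) + X_{sl}(n) \quad \text{(adder 34: reference for 21LR and 21LL)}
\end{aligned}
```

Because the reference signals already contain Xsr and Xsl, the adaptive filters model the echo of everything the speakers emit, not only the other party's speaker voice. Setting Xsr and Xsl to zero recovers the ordinary stereo echo canceller mentioned above.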

Next, the left/right (L, R) volume balance setting for the self-amplified sound performed by the audio signal processing unit 11 is described.
FIG. 5 shows an example of the internal configuration of the self-amplified sound volume 31 shown in FIG. 4.
As described with reference to FIG. 4, the self-amplified sound volume 31 receives the signal Ye1, based on the collected sound signal of the R-channel microphone 2 (R), and the signal Ye2, based on the collected sound signal of the L-channel microphone 2 (L).
The signal Ye1 is branched into a signal Ye1(R) for the R channel and a signal Ye1(L) for the L channel. The signal Ye1(R) is input to a volume varying unit 41A, and the signal Ye1(L) is input to a volume varying unit 41B.
Similarly, the signal Ye2 is branched into a signal Ye2(R) for the R channel and a signal Ye2(L) for the L channel. The signal Ye2(R) is input to a volume varying unit 41C, and the signal Ye2(L) is input to a volume varying unit 41D.

The volume varying units 41A, 41B, 41C, and 41D can each independently perform signal processing that changes the level (volume) of its input signal (Ye1(R), Ye1(L), Ye2(R), and Ye2(L), respectively).
The volume varying units 41A and 41B can also vary the volume in coordination with each other to set the level (volume) balance between the signals Ye1(R) and Ye1(L). Similarly, the volume varying units 41C and 41D can vary the volume in coordination with each other to set the level (volume) balance between the signals Ye2(R) and Ye2(L).

The signal Ye1(R), after volume adjustment by the volume varying unit 41A, is output to the adder 42A as the signal Yf1(R), and the signal Ye1(L), after volume adjustment by the volume varying unit 41B, is output to the adder 42B as the signal Yf1(L). Likewise, the signal Ye2(R), after volume adjustment by the volume varying unit 41C, is output to the adder 42A as the signal Yf2(R), and the signal Ye2(L), after volume adjustment by the volume varying unit 41D, is output to the adder 42B as the signal Yf2(L).
The adder 42A combines the signals Yf1(R) and Yf2(R) input as described above and supplies the result to the adder 33 (FIG. 4) as the signal Xsr. The adder 42B combines the signals Yf1(L) and Yf2(L) input as described above and supplies the result to the adder 34 (FIG. 4) as the signal Xsl.
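The signal routing of FIG. 5 described above can be sketched as follows. This is only an illustrative model, not the implementation of the embodiment: the function name, the dictionary keys, and the use of linear gain factors per sample are assumptions made for the example.

```python
def self_amplified_mix(ye1, ye2, gains):
    """Mix one sample of Ye1 and Ye2 into the pair (Xsr, Xsl).

    `gains` maps the hypothetical keys "41A".."41D" to linear gain
    factors standing in for the four volume varying units.
    """
    yf1_r = gains["41A"] * ye1  # Ye1(R) -> Yf1(R)
    yf1_l = gains["41B"] * ye1  # Ye1(L) -> Yf1(L)
    yf2_r = gains["41C"] * ye2  # Ye2(R) -> Yf2(R)
    yf2_l = gains["41D"] * ye2  # Ye2(L) -> Yf2(L)
    xsr = yf1_r + yf2_r  # adder 42A, towards adder 33
    xsl = yf1_l + yf2_l  # adder 42B, towards adder 34
    return xsr, xsl
```

Setting the gains for 41A and 41B in a fixed proportion corresponds to setting the L/R balance for the voice picked up by the microphone 2 (R); the gains for 41C and 41D play the same role for the microphone 2 (L).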

With the self-amplified sound volume unit 31 configured as described above, the own-side speaker voice picked up by the microphone 2 (R) and input as the signal Ye1 can be output from the speaker 3 (R) and the speaker 3 (L) after the volume balance between the L channel and the R channel has been set. Similarly, the own-side speaker voice picked up by the microphone 2 (L) and input as the signal Ye2 can be output from the speaker 3 (R) and the speaker 3 (L) after the volume balance between the L channel and the R channel has been set.

The audio signal processing unit 11 of the present embodiment, which includes the self-amplified sound volume unit 31 configured as described above, executes the processing procedure shown in the flowchart of FIG. 6 during operation. When the audio signal processing unit 11 is implemented as a DSP, the processing shown in this figure is realized by a program (referred to as instructions or the like) supplied to the DSP.

First, in step S101, the process waits for the content of the collected sound signal obtained by the R-channel microphone 2 (R) (the Rch collected sound signal) to change from a state other than the own-side speaker voice input state to the own-side speaker voice input state. The own-side speaker voice input state here is a state in which the collected sound signal contains own-side speaker voice that can be regarded as having been spoken into the microphone 2 (R), that is, picked up by the microphone at or above a certain level regarded as valid.

Several methods are conceivable for determining that the transition to the own-side speaker voice input state has occurred. One is to compare the collected sound signal from the microphone 2 (R), which is input as the desired signal to the subtractor 23 (R) in the adaptive filter system 20A, with the signal Ye1 output from the same subtractor 23 (R).
To make the explanation easier to follow, consider the case in which, around the time of the transition to the own-side speaker voice input state, the adaptive filters in the adaptive filter system 20A have converged so as to cancel the echo of the other-side speaker voice. Suppose that the state other than the own-side speaker voice input state is, for example, an echo sound input state, in which the microphone 2 (R) picks up only the echo sound, that is, the sound arriving from at least one of the speakers 3 (L) and 3 (R) via the spatial propagation paths (Srr, Slr).
In the echo sound input state, the desired signal from the microphone 2 (R) has a level that exceeds that of the signal Ye1 by at least a certain amount. In the single-talk state, the desired signal is the collected sound signal of the echo sound, that is, the sound reaching the microphone 2 (R) from the speakers 3 (R) and 3 (L) via the spatial propagation paths Srr and Slr, whereas the signal Ye1, which is the output signal of the adaptive filter system 20A, has had this echo sound cancelled and is at a very low level.
In the own-side speaker voice input state, on the other hand, the desired signal from the microphone 2 (R) and the signal Ye1 have levels close enough to be regarded as approximately equal. At this time, the desired signal, which is the collected sound signal of the microphone 2 (R), is dominated by the component corresponding to the own-side speaker voice, and at this point the adaptive filter system 20A does not cancel that component. The own-side speaker voice component therefore remains in the signal Ye1 without being cancelled.
When it is determined in step S101 that the transition to the own-side speaker voice input state has occurred, the procedure proceeds to step S102 and the subsequent steps.
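The comparison used for step S101 can be sketched as below. This is a minimal illustration under stated assumptions: the frame-based RMS level measure, the function names, and the values of `ratio_threshold` and `min_level` are hypothetical and are not given in the text, which only says that the two levels are compared.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_own_side_voice_input(desired_frame, ye1_frame,
                            ratio_threshold=0.5, min_level=1e-3):
    """Return True when Ye1 is roughly as large as the desired signal.

    Echo only:       Ye1 is much smaller than the desired signal (echo cancelled).
    Own-side voice:  Ye1 is about equal to the desired signal (voice not cancelled).
    Both thresholds are assumed values chosen for illustration.
    """
    desired_level = rms(desired_frame)
    if desired_level < min_level:
        return False  # microphone signal too weak to count as valid input
    return rms(ye1_frame) / desired_level >= ratio_threshold
```

In practice the decision would normally be smoothed over several frames so that a brief level fluctuation does not trigger step S102; that refinement is outside what the text describes.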

When the collected sound signal of the microphone 2 (R) has changed to the own-side speaker voice input state as described above, a collected sound signal dominated by the own-side speaker voice is input to the self-amplified sound volume unit 31 as the signal Ye1. As shown in FIG. 5, the input signal Ye1 is branched into the signals Ye1(R) and Ye1(L), which are input to the volume varying units 41A and 41B, respectively.
In step S102, the volume varying units 41A and 41B are first operated in cooperation to initialize the level balance of the signals Yf1(R) and Yf1(L) to equal levels (Yf1(R) : Yf1(L) = 1 : 1). With this initial setting, the own-side speaker voice picked up by the microphone 2 (R) is output at the same volume from each of the speakers 3 (R) and 3 (L).

With the initial setting of step S102 in effect, so that the own-side speaker voice picked up by the microphone 2 (R) is being output at the same volume from the speakers 3 (R) and 3 (L), the audio signal processing unit 11 executes the procedure of step S103.
In step S103, the level Vrr of the cancellation signal Yrr output from the adaptive filter 21 (RR) and the level Vlr of the cancellation signal Ylr output from the adaptive filter 21 (LR) are detected. Both filters belong to the adaptive filter system 20A, which performs echo cancellation using the collected sound signal of the microphone 2 (R) as the desired signal. For this detection, the levels of the cancellation signals Yrr and Ylr may, for example, actually be measured. Alternatively, they can be estimated from the coefficient vectors of the adaptive filters 21 (RR) and 21 (LR).
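The two detection options mentioned for step S103 can be illustrated as follows. The text does not say how a level is measured or how it is estimated from a coefficient vector, so both functions are assumptions: the first uses the RMS of a frame of the cancellation signal, and the second uses the Euclidean norm of the FIR coefficients as a rough proxy for the gain of the modelled path.

```python
import math

def measured_level(cancel_frame):
    """Option 1 (assumed measure): RMS of a frame of Yrr or Ylr."""
    return math.sqrt(sum(s * s for s in cancel_frame) / len(cancel_frame))

def estimated_level_from_coeffs(coeffs):
    """Option 2 (assumed estimator): Euclidean norm of the FIR coefficient vector."""
    return math.sqrt(sum(c * c for c in coeffs))
```

The coefficient-based estimate only tracks the measured ratio when both filters are driven by signals of equal level, which is the situation the 1:1 initial setting of step S102 is meant to create.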

Here, the ratio between the level Vrr of the cancellation signal Yrr and the level Vlr of the cancellation signal Ylr reflects the ratio between the volume of the own-side speaker voice reaching the microphone 2 (R) from the speaker 3 (R) via the spatial propagation path Srr and the volume of the own-side speaker voice reaching the microphone 2 (R) from the speaker 3 (L) via the spatial propagation path Slr.
For example, to simplify the explanation, assume that only direct sound reaches the microphone 2 (R) from the speakers 3 (R) and 3 (L), and that the ratio of the distance Krr from the speaker 3 (R) to the microphone 2 (R) to the distance Klr from the speaker 3 (L) to the microphone 2 (R) is Krr : Klr = 1 : 2. If sound intensity (volume) is assumed simply to decrease in inverse proportion to the square of the distance, then, with the own-side speaker voice being emitted at the same volume from the speakers 3 (R) and 3 (L) under the initial setting of step S102, the ratio of the volume (Arr) of the own-side speaker voice reaching the microphone 2 (R) from the speaker 3 (R) via the spatial propagation path Srr to the volume (Alr) of the own-side speaker voice reaching the microphone 2 (R) from the speaker 3 (L) via the spatial propagation path Slr is Arr : Alr = 4 : 1.
The collected sound signal of the microphone 2 (R) contains the own-side speaker voice component that has reached the microphone 2 (R) from the speaker 3 (R) via the spatial propagation path Srr and the own-side speaker voice component that has reached the microphone 2 (R) from the speaker 3 (L) via the spatial propagation path Slr, and the level ratio of the former to the latter is likewise 4 : 1. As a result of the adaptive processing performed with this signal as the desired signal, the ratio of the level Vrr of the cancellation signal Yrr output by the adaptive filter 21 (RR), which corresponds to the spatial propagation path Srr, to the level Vlr of the cancellation signal Ylr output by the adaptive filter 21 (LR), which corresponds to the spatial propagation path Slr, also becomes Vrr : Vlr = 4 : 1. As can be understood from the description so far, the ratio of the levels Vrr and Vlr detected in this way reflects the ratio of the distances from the R-channel microphone 2 (R), into which the own-side speaker voice is input, to the speakers 3 (R) and 3 (L). In other words, detecting the levels Vrr and Vlr in step S103 makes it possible to obtain the ratio of the distances from the microphone 2 (R) to the speakers 3 (R) and 3 (L).
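The arithmetic of this example can be written out as a short worked equation. It uses only the simplifying assumptions stated in the text (direct sound only, inverse-square decay, equal output volume from both speakers):

```latex
\[
A_{rr} : A_{lr}
  = \frac{1}{K_{rr}^{2}} : \frac{1}{K_{lr}^{2}}
  = \frac{1}{1^{2}} : \frac{1}{2^{2}}
  = 4 : 1,
\qquad
V_{rr} : V_{lr} = A_{rr} : A_{lr} = 4 : 1 .
\]
```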

In the next step S104, the volume varying units 41A and 41B change the level balance of the signals Yf1(R) and Yf1(L) from the initial setting so that
Yf1(R) : Yf1(L) = (Vlr)^2 : (Vrr)^2 (where ^2 denotes squaring) ... (Equation 1)
is satisfied.
For example, if the environment is such that Krr : Klr = 1 : 2 as described above, and the levels detected in step S103 are therefore related by Vrr : Vlr = 4 : 1, the signal level balance is set (determined) so that
Yf1(R) : Yf1(L) = 1 : 16.
As a result, the own-side speaker voice picked up by the microphone 2 (R) is, in relative terms, output from the speaker 3 (L) at 16 times the volume at which it is output from the speaker 3 (R).
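Equation 1 can be turned into a pair of gain factors as sketched below. The function name and the normalization (scaling so that the larger gain is 1.0) are assumptions made for the example; the text specifies only the ratio between the two levels, not their absolute values.

```python
def balance_gains(vrr, vlr):
    """Return (gain_r, gain_l) with gain_r : gain_l = vlr**2 : vrr**2 (Equation 1).

    The pair is normalized so that the larger gain equals 1.0; this
    normalization is an assumption, only the ratio comes from the text.
    """
    if vrr <= 0 or vlr <= 0:
        raise ValueError("levels must be positive")
    weight_r = vlr ** 2  # applied to Yf1(R), i.e. volume varying unit 41A
    weight_l = vrr ** 2  # applied to Yf1(L), i.e. volume varying unit 41B
    peak = max(weight_r, weight_l)
    return weight_r / peak, weight_l / peak

# Worked example from the text: Vrr : Vlr = 4 : 1 gives Yf1(R) : Yf1(L) = 1 : 16.
gain_r, gain_l = balance_gains(4.0, 1.0)
```

With equal levels (Vrr = Vlr) the function returns the 1 : 1 balance of the initial setting in step S102.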

As mentioned earlier, when the self-amplified sound is output in stereo in a fairly large conference room, it is preferable that the voice picked up by the microphone of one channel be output from the speaker of the same channel at or above a certain volume and be output at a higher volume from the speaker of the other channel, with the balance between the two adjusted appropriately. With such a balance setting, conference participants near the speaker of the channel opposite to the microphone into which the own-side speaker voice is input can hear that voice clearly, and participants near the speaker of the same channel as that microphone can also hear it clearly at a moderate self-amplified volume.
With the balance setting of step S104, the own-side speaker voice picked up by the R-channel microphone 2 (R) is, as described above, output from both of the speakers 3 (R) and 3 (L). Relative to the volume output from the R-channel speaker 3 (R), which is on the same channel as the microphone 2 (R), the L-channel speaker 3 (L) outputs it at a multiple determined by the microphone-to-speaker distance ratio (16 times if Krr : Klr = 1 : 2). Conference participants near the speaker 3 (L), who are far from the talker using the microphone 2 (R), can then hear the own-side speaker voice clearly from the sound output by the speaker 3 (L). Participants near the talker using the microphone 2 (R) can also hear the voice more clearly, thanks to the self-amplified sound output from the speaker 3 (R), than if they heard only the talker's unamplified voice. In other words, a volume balance suited to a fairly large conference room is obtained, and this volume balance setting is performed automatically.
Furthermore, with the signal level balance setting of step S104, if the ratio of the distance Klr from the speaker 3 (L) to the microphone 2 (R) to the distance Krr from the speaker 3 (R) to the microphone 2 (R) increases, the volume of the speaker 3 (L) becomes still larger and its ratio to the volume of the speaker 3 (R) also increases. Conversely, if the ratio of the distance Klr to the distance Krr decreases, the volume of the speaker 3 (L) decreases, the volume of the speaker 3 (R) increases, and the balance ratio decreases. That is, with step S104, an appropriate volume balance can always be obtained by adapting to changes in the distance relationship between the speakers and the microphone, for example when the microphone is passed between participants during a conference, or when the conference location changes and the distance relationship differs from that of the previous location. If the balance were adjusted manually, the user would have to perform a balance adjustment operation every time such a change occurred. The present embodiment requires no such operation and is highly convenient in this respect.

Here, the signal level balance set in step S104 is maintained until a transition to the single-talk state occurs. When the transition to the single-talk state occurs, the above signal level balance setting is, for example, cancelled, and the setting is changed to a predetermined level balance suited to the single-talk state.

The procedure of FIG. 6 described so far is executed in the same manner on the side of the volume varying units 41C and 41D, for the signals Yf2(R) and Yf2(L) obtained from the signal Ye2, when the content of the collected sound signal of the L-channel microphone 2 (L) changes to the own-side speaker voice input state.

In step S104, the balance of the signals Yf1(R) and Yf1(L) (or the signals Yf2(R) and Yf2(L)) is determined from the squares of the levels Vrr and Vlr detected in step S103, but this is merely an example. If a more suitable volume balance can be obtained in practice, the balance may be determined according to another rule. It is also possible to prepare a plurality of balance determination algorithms for step S104 and let the user select one in advance by an operation or the like.

The flowchart of FIG. 7 shows processing that the audio signal processing unit 11 of the present embodiment executes together with the self-amplified sound balance adjustment of FIG. 6. This processing controls the operation of the adaptive filter 21 (RR) and of the volume varying unit 41A in the self-amplified sound volume unit 31 according to the input states of the reference signal and the desired signal of the adaptive filter 21 (RR).
As described later, processing similar to that of FIG. 7 is also executed for each of the remaining three adaptive filters 21 (LR), 21 (RL), and 21 (LL).
When the audio signal processing unit 11 is implemented as a DSP, the processing shown in this figure is likewise realized by a program (referred to as instructions or the like) supplied to the DSP.
When the processing shown in this figure is first started, the adaptive processing of the adaptive filter systems 20A and 20B (the adaptive filters 21 (RR), 21 (LR), 21 (RL), and 21 (LL)) is also assumed to start in the executing state. For confirmation, while the adaptive filter systems 20A and 20B are executing adaptive processing, they take the audio signals output from the adders 33 and 34 as reference signals and the collected sound signals from the microphones 2 (R) and 2 (L) as desired signals, and each of the adaptive filters 21 (RR), 21 (LR), 21 (RL), and 21 (LL) varies the coefficient vector of its FIR filter so that the error signals (Ye1, Ye2) output from the subtractors 23 (R) and 23 (L) are minimized.
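The adaptive processing summarized above, for the microphone 2 (R) side, can be sketched as a single coefficient update. The text does not name the adaptation algorithm, so the normalized LMS (NLMS) rule used here, the function name, and the values of `mu` and `eps` are assumptions chosen only to make the idea concrete.

```python
def nlms_step(w_rr, w_lr, x_r_hist, x_l_hist, desired, mu=0.5, eps=1e-8):
    """One joint NLMS update for the two FIR filters feeding subtractor 23 (R).

    w_rr, w_lr : coefficient vectors of adaptive filters 21 (RR) and 21 (LR)
    x_r_hist   : most recent R-channel reference samples, same length as w_rr
    x_l_hist   : most recent L-channel reference samples, same length as w_lr
    desired    : current sample from microphone 2 (R)
    NLMS is an assumed algorithm; the source only says the error is minimized.
    """
    yrr = sum(w * x for w, x in zip(w_rr, x_r_hist))  # cancellation signal Yrr
    ylr = sum(w * x for w, x in zip(w_lr, x_l_hist))  # cancellation signal Ylr
    error = desired - (yrr + ylr)                     # error signal Ye1
    power = sum(x * x for x in x_r_hist) + sum(x * x for x in x_l_hist) + eps
    step = mu * error / power
    new_rr = [w + step * x for w, x in zip(w_rr, x_r_hist)]
    new_lr = [w + step * x for w, x in zip(w_lr, x_l_hist)]
    return new_rr, new_lr, error
```

Stopping the adaptive processing, as done later in the flow of FIG. 7, corresponds to computing `error` with the current coefficients but skipping the two coefficient updates.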

In the description of FIG. 7, the input state of the reference signal and the desired signal of an adaptive filter 21 is treated as being one of four states: the "other-side single-talk state", the "own-side single-talk state", the "double-talk state", and the "non-talk state".
The "other-side single-talk state" is a state in which the reference signal contains an audio signal of other-side speaker voice, output from the decoder 14, at or above a certain level regarded as valid, while the microphone 2 of the channel from which the desired signal is obtained is not picking up own-side speaker voice at or above a certain level regarded as valid.
For example, the adaptive filter 21 (RR) is in the other-side single-talk state when the R-channel output audio signal Xd1 (reference signal) from the decoder 14 contains an audio signal of other-side speaker voice at or above a certain level regarded as valid, while the R-channel microphone 2 (R) (the microphone of the channel from which the desired signal is obtained) is not picking up own-side speaker voice at or above a certain level regarded as valid.

The "own-side single-talk state" is a state in which the reference signal does not contain an audio signal of other-side speaker voice from the decoder 14 that is regarded as valid, but the microphone 2 of the channel from which the desired signal is obtained is picking up own-side speaker voice at or above a certain level regarded as valid.
The "double-talk state" is a state in which the reference signal contains an audio signal of other-side speaker voice from the decoder 14 that is regarded as valid, and the microphone 2 of the channel from which the desired signal is obtained is also picking up own-side speaker voice at or above a certain level regarded as valid.
The "non-talk state" is a state in which the reference signal does not contain an audio signal of other-side speaker voice from the decoder 14 that is regarded as valid, and the microphone 2 of the channel from which the desired signal is obtained is not picking up own-side speaker voice at or above a certain level regarded as valid either.
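The four states defined above amount to the four combinations of two yes/no conditions. A minimal sketch follows; the enum, the function, and the parameter names are hypothetical, and how each condition is decided from signal levels is left outside the example.

```python
from enum import Enum

class TalkState(Enum):
    OTHER_SIDE_SINGLE_TALK = "other-side single-talk"
    OWN_SIDE_SINGLE_TALK = "own-side single-talk"
    DOUBLE_TALK = "double-talk"
    NON_TALK = "non-talk"

def classify_talk_state(reference_has_other_voice, mic_has_own_voice):
    """Map the two validity conditions to one of the four talk states.

    reference_has_other_voice: the reference signal contains valid other-side voice
    mic_has_own_voice:         the microphone picks up valid own-side voice
    """
    if reference_has_other_voice and mic_has_own_voice:
        return TalkState.DOUBLE_TALK
    if reference_has_other_voice:
        return TalkState.OTHER_SIDE_SINGLE_TALK
    if mic_has_own_voice:
        return TalkState.OWN_SIDE_SINGLE_TALK
    return TalkState.NON_TALK
```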

First, in step S201, it is determined whether the level (value) of the error signal (audio signal) Ye1, which is the output of the subtractor 23 (R) and corresponds to the adaptive filter 21 (RR), is at or below a certain proportion of the level (value) of the R-channel audio signal Xd1 input from the decoder 14 to the adder 33, that is, whether Ye1 ≤ Xd1 * m (where m is a predetermined positive number less than 1).
The state in which the level of the audio signal Ye1 is at or below this proportion of the audio signal Xd1 is the state in which an R-channel audio signal of other-side speaker voice at or above a certain level regarded as valid is being input from the R channel of the decoder 14, while the microphone 2 (R) is not picking up own-side speaker voice at or above a certain level regarded as valid. It is therefore the other-side single-talk state.
That is, if an audio signal of other-side speaker voice at or above a certain level regarded as valid is being input from the R-channel output of the decoder 14, the audio signal Xd1 has a large level (amplitude) value. As for the audio signal Ye1, assuming that the adaptive filter 21 (RR) has converged so as to cancel the echo of the other-side speaker voice, the echo of the other-side speaker voice reaching the microphone 2 (R) from the speaker 3 (R) via the spatial propagation path Srr is properly cancelled, so that Ye1 is at a very low level.

In states other than the other-side single-talk state (the own-side single-talk state, the double-talk state, and the non-talk state), the level of the audio signal Ye1 exceeds the above proportion of the level of the audio signal Xd1.
First, in the own-side single-talk state, the signal of the own-side speaker voice picked up by the microphone 2 (R) passes through without being cancelled by the adaptive filter 21 (RR), so the audio signal Ye1 has a correspondingly high level. The audio signal Xd1, in contrast, is at a very low level, because no audio signal regarded as valid is output from the R-channel output of the decoder 14. The error signal Ye1 is therefore larger than the audio signal Xd1 and exceeds the above proportion.
In the double-talk state, although they differ to some degree, both the signal of the own-side speaker voice picked up by the microphone 2 (R) and the signal of the other-side speaker voice from the R-channel output of the decoder 14 are at or above a certain level regarded as valid. The level difference between the audio signal Ye1 and the audio signal Xd1 is therefore smaller than in the other-side single-talk state, and Ye1 exceeds the above proportion.
In the non-talk state, neither the signal of the own-side speaker voice picked up by the microphone 2 (R) nor the signal of the other-side speaker voice from the R-channel output of the decoder 14 reaches a certain level regarded as valid. In this case too, the level difference between the audio signal Ye1 and the audio signal Xd1 is smaller than in the other-side single-talk state, and Ye1 exceeds the above proportion.
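The determination of step S201 reduces to one inequality on two levels. The sketch below assumes the levels have already been measured as non-negative numbers; the function name and the value `m=0.1` are hypothetical, since the text only requires m to be a predetermined positive number less than 1.

```python
def is_other_side_single_talk(ye1_level, xd1_level, m=0.1):
    """Step S201: True when Ye1 <= Xd1 * m.

    ye1_level: level of the error signal Ye1 (output of subtractor 23 (R))
    xd1_level: level of the R-channel decoder output Xd1
    m:         assumed example value; the text requires only 0 < m < 1
    """
    if not 0 < m < 1:
        raise ValueError("m must be a positive number less than 1")
    return ye1_level <= xd1_level * m
```

For example, a converged canceller during other-side speech might give levels such as 0.01 for Ye1 and 1.0 for Xd1, which satisfies the inequality, whereas own-side speech alone (Ye1 large, Xd1 near zero) does not.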

If a positive determination result is obtained in step S201 because the other-side single-talk state has occurred, the process proceeds to step S202.
In step S202, an attenuation factor at or above a certain value is set for the volume varying unit 41A in the self-amplified sound volume unit 31, so that the volume varying unit 41A blocks its input signal. This is equivalent to a state in which the reference signal of the adaptive filter 21 (RR) does not contain the component of the audio signal Yf1(R), that is, the self-amplified sound component. At this time, the volume is varied according to the above attenuation factor using the volume set by the volume balance adjustment described with reference to FIG. 6 as the reference (for example, treating it as the maximum level, that is, minimum attenuation). The same applies to the attenuation factor settings for the volume varying unit 41A in steps S207 and S208 described later.

In step S203, which follows step S202, it is determined whether the adaptive filter 21 (RR) is in a state regarded as sufficiently converged. For example, a positive determination result is obtained when the coefficient vector of the FIR filter of the adaptive filter 21 (RR) is judged to have reached a predetermined state regarded as sufficiently converged. Alternatively, the adaptive filter 21 may be configured to output an evaluation value, such as a degree of convergence, for its own convergence state, and the determination of step S203 can then be made by referring to this evaluation value.

上記ステップＳ２０３において、先ず、適応フィルタ２１（ＲＲ）が収束していないとして否定の判別結果が得られた場合には、ステップＳ２０４に進んで、適応フィルタ２１（ＲＲ）については、その適応処理を実行させる（活性傾向の状態とする）ように制御する。例えば、このステップＳ２０４に至る時点まで、適応フィルタ２１（ＲＲ）としての適応処理が実行されていたのであれば、ステップＳ２０４では、これまでの適応処理を継続させる。これに対して、適応フィルタ２１（ＲＲ）としての適応処理が停止されていた状態にあったのであれば、ステップＳ２０４により適応処理の実行を開始させることになる。
確認のために述べておくと、ステップＳ２０２において音量可変部４１Ａは信号遮断状態が設定されていることから、このステップＳ２０４により実行される適応処理としては、先にも述べたように適正、良好な動作が得られる。 If a negative determination result is obtained in step S203 because the adaptive filter 21 (RR) has not converged, the process proceeds to step S204, where the adaptive filter 21 (RR) is controlled so that its adaptive processing is executed (set to an active-tendency state). For example, if the adaptive processing of the adaptive filter 21 (RR) was being executed up to the point of reaching step S204, that adaptive processing is continued in step S204. If, on the other hand, the adaptive processing of the adaptive filter 21 (RR) was stopped, its execution is started in step S204.
To confirm, since the volume variable unit 41A has been set to the signal cutoff state in step S202, the adaptive processing executed in step S204 operates properly and well, as described above.

これに対して、ステップＳ２０３において適応フィルタ２１（ＲＲ）が収束しているとして肯定の判別結果が得られた場合には、ステップＳ２０５に進み、適応フィルタ２１（ＲＲ）による適応処理の実行を停止させる（停止傾向の状態とする）。この場合にも、ステップＳ２０５に至るまでの時点において、適応フィルタ２１（ＲＲ）の適応処理が実行されていたのであれば、ステップＳ２０５では、この適応処理が停止される状態に変更することになる。また、適応処理が停止されていたのであれば、この状態を継続させることになる。
ここで、例えば上記ステップＳ２０５により、適応処理が実行されていた状態から停止状態に変更された場合、適応フィルタ２１（ＲＲ）におけるＦＩＲフィルタの係数ベクトルは、停止直前の設定状態が固定して維持されることになる。即ち、適応フィルタ２１（ＲＲ）に入力される参照信号（加算器３３の出力）は、このようにして係数ベクトルが固定された状態で加算器２２（Ｒ）から減算器２３（Ｒ）に入力されて、適応フィルタ２１（ＲＲ）の出力信号（キャンセル用信号）Ylrと減算され、この成分が音声信号Ye1に含まれることになる。
なお、相手側シングルトーク状態の場合には、適応フィルタシステムが収束している状態にあって適応処理を継続させたとしても、特に問題になることはない。しかし、ステップＳ２０５のようにして適応処理を停止させれば、例えばその間は、適応処理に必要とされる演算を実行しなくともよくなるので、処理負担やリソースの軽減を図ることができる。
上記ステップＳ２０４、Ｓ２０５の手順を実行したとされると、例えばステップＳ２０１に戻る。 On the other hand, if a positive determination result is obtained in step S203 because the adaptive filter 21 (RR) has converged, the process proceeds to step S205, where execution of the adaptive processing by the adaptive filter 21 (RR) is stopped (set to a stop-tendency state). Here too, if the adaptive processing of the adaptive filter 21 (RR) was being executed up to the point of reaching step S205, it is changed to the stopped state in step S205. If the adaptive processing was already stopped, that state is continued.
Here, when the adaptive processing is changed from the executing state to the stopped state in step S205, the coefficient vector of the FIR filter in the adaptive filter 21 (RR) is held fixed at the setting it had immediately before the stop. That is, with the coefficient vector fixed in this way, the reference signal input to the adaptive filter 21 (RR) (the output of the adder 33) is input from the adder 22 (R) to the subtracter 23 (R), subtraction with the output signal (cancellation signal) Ylr of the adaptive filter 21 (RR) is performed, and the resulting component is included in the audio signal Ye1.
In the other party single talk state, continuing the adaptive processing while the adaptive filter system is converged causes no particular problem. However, if the adaptive processing is stopped as in step S205, the computation required for adaptive processing need not be performed during that period, so the processing load and resource usage can be reduced.
After the procedure of step S204 or S205 has been executed, the process returns to, for example, step S201.
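
The effect of stopping the adaptive processing while keeping the coefficient vector fixed can be sketched as follows. The description specifies only an FIR filter with a coefficient vector. The normalized LMS (NLMS) update used here, the class name, and the default parameter values are assumptions chosen for illustration.

```python
import numpy as np

class FirAdaptiveFilter:
    """FIR adaptive filter whose coefficient update can be switched off.

    The NLMS update rule is an assumption; the description does not name
    the adaptation algorithm.
    """

    def __init__(self, taps: int, mu: float = 0.5, eps: float = 1e-8):
        self.w = np.zeros(taps)    # coefficient vector of the FIR filter
        self.buf = np.zeros(taps)  # most recent reference samples, newest first
        self.mu = mu               # step size
        self.eps = eps             # regularization against division by zero
        self.adapting = True       # False: coefficients are held fixed

    def process(self, x: float, d: float) -> float:
        """Take one reference sample x and one desired sample d; return the error."""
        self.buf = np.roll(self.buf, 1)
        self.buf[0] = x
        y = float(self.w @ self.buf)   # cancellation signal
        e = d - y                      # residual after cancellation
        if self.adapting:
            norm = float(self.buf @ self.buf) + self.eps
            self.w += self.mu * e * self.buf / norm
        return e
```

Setting `adapting = False` skips the coefficient update, so the filter keeps producing a cancellation signal from the last coefficient vector while avoiding the update arithmetic.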

ステップＳ２０１にて否定の判別結果が得られた場合、即ち、自己側シングルトーク状態、ダブルトーク状態、及び非トーク状態のうちの何れかの状態の場合には、ステップＳ２０６に進む。
ステップＳ２０６においては、先のステップＳ２０３と同様にして、適応フィルタ２１（ＲＲ）が収束しているか否かについての判別を行う。ただし、どの程度の収束度である場合に適応フィルタ２１（ＲＲ）が収束している状態であるとして判別するのかについては、相手側シングルトーク状態と、これ以外のトーク状態であることに対応させて、ステップＳ２０３とステップＳ２０６とでそれぞれ異なる条件が設定されてもよい。さらには、ステップＳ２０６の実際としては、自己側シングルトーク状態、ダブルトーク状態、非トーク状態のそれぞれに適合させた収束度の条件を設定したうえで、判別処理を行うようにされてもよい。 If a negative determination result is obtained in step S201, that is, if any of the self-side single talk state, the double talk state, and the non-talk state, the process proceeds to step S206.
In step S206, as in step S203 above, it is determined whether the adaptive filter 21 (RR) has converged. However, the degree of convergence at which the adaptive filter 21 (RR) is judged to be converged may be set differently in step S203 and step S206, corresponding to the other party single talk state and the remaining talk states respectively. Furthermore, in practice, step S206 may perform the determination after setting convergence conditions suited to each of the self-side single talk state, the double talk state, and the non-talk state.

ステップＳ２０６において肯定の判別結果が得られた場合には、ステップＳ２０７に進み、音量可変部４１Ａについて一定以下の減衰率を設定することで、音量可変部４１Ａにおいて入力信号を通過させるのと等価の状態とする。これに対して、ステップＳ２０６において否定の判別結果が得られた場合には、ステップＳ２０８により、一定以上に対応した所定の減衰率（ステップＳ２０２と同じ減衰率でなくともよい）を設定することで、音量可変部４１Ａにおいて入力信号を遮断して出力させないのと等価の状態とする。 If a positive determination result is obtained in step S206, the process proceeds to step S207, where an attenuation rate at or below a certain level is set for the volume variable unit 41A, which is equivalent to a state in which the input signal passes through the volume variable unit 41A. If, on the other hand, a negative determination result is obtained in step S206, a predetermined attenuation rate at or above a certain level (not necessarily the same attenuation rate as in step S202) is set in step S208, which is equivalent to a state in which the input signal is blocked at the volume variable unit 41A and not output.

ステップＳ２０７、Ｓ２０８の手順を実行した後は、ステップＳ２０９により、先のステップＳ２０５と同様にして、適応フィルタ２１（ＲＲ）の適応処理を停止させ、ステップＳ２０１に戻る。確認のために述べておくと、このステップＳ２０９により、これまで実行されていた適応処理を停止させることとなった場合には、ステップＳ２０５の場合と同様に、適応フィルタ２１（ＲＲ）の適応フィルタ２１におけるＦＩＲフィルタの係数ベクトルは、停止直前の設定状態が固定して維持されることとなる。 After the procedure of step S207 or S208 has been executed, the adaptive processing of the adaptive filter 21 (RR) is stopped in step S209 in the same manner as in step S205 above, and the process returns to step S201. To confirm, if step S209 stops adaptive processing that had been executing until then, the coefficient vector of the FIR filter in the adaptive filter 21 (RR) is held fixed at the setting it had immediately before the stop, as in step S205.
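
The branch structure of steps S201 to S209 can be condensed into a small decision function. This is only a sketch of the two-valued control as described: the type and function names are hypothetical, and the talk-state detection and the convergence determination are assumed to be supplied from elsewhere.

```python
from dataclasses import dataclass
from enum import Enum, auto

class TalkState(Enum):
    OTHER_PARTY_SINGLE = auto()  # other party single talk state
    SELF_SIDE_SINGLE = auto()    # self-side single talk state
    DOUBLE = auto()              # double talk state
    NONE = auto()                # non-talk state

@dataclass
class Control:
    volume_pass: bool  # True: volume variable unit 41A passes the signal
    adapt: bool        # True: adaptive filter 21 (RR) executes adaptive processing

def fig7_step(state: TalkState, converged: bool) -> Control:
    """One pass through the FIG. 7 procedure for adaptive filter 21 (RR)."""
    if state is TalkState.OTHER_PARTY_SINGLE:   # S201: positive
        # S202: block 41A. S203: converged? S204 (adapt) or S205 (stop).
        return Control(volume_pass=False, adapt=not converged)
    # S201: negative, so S206 checks convergence.
    # S207 (pass) when converged, S208 (block) otherwise; S209 stops adaptation.
    return Control(volume_pass=converged, adapt=False)
```

For instance, `fig7_step(TalkState.DOUBLE, converged=True)` yields a passing volume unit with adaptation stopped, which is the case in which self-amplified sound is output.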

これまでに説明した図７の処理によっては、適応フィルタ２１（ＲＲ）に入力される参照信号及び所望信号の状態に応じて、適応フィルタ２１（ＲＲ）の適応処理の動作実行と音量可変部４１Ａについて、次のようにして制御することになる。
先ず、相手側シングルトーク状態では、ステップＳ２０２からステップＳ２０３を経て、ステップＳ２０４又はステップＳ２０５の何れかの処理を行うことになる。これにより、先ず、音量可変部４１Ａについては、ステップＳ２０２により音声信号Yf1(R)を遮断して出力させない状態が設定される。 With the processing of FIG. 7 described so far, execution of the adaptive processing of the adaptive filter 21 (RR) and the volume variable unit 41A are controlled as follows, according to the states of the reference signal and the desired signal input to the adaptive filter 21 (RR).
First, in the other party single talk state, the process goes from step S202 through step S203 and then performs either step S204 or step S205. As a result, the volume variable unit 41A is first set in step S202 to a state in which the audio signal Yf1(R) is blocked and not output.

上記のようにして、相手側シングルトーク状態に対応させて音量可変部４１Ａについて信号遮断状態を設定するのは、次のような理由による。
先ずは、相手側シングルトーク状態では、所望信号Mrの状態として、有効とみなされる自己側話者音声がマイクロフォン２（Ｒ）にて収音されている状態にはない、即ち、マイクロフォン２（Ｒ）側にて自己拡声が必要な音声信号は得られていない状態にある。従って、音量可変部４１Ａについて信号遮断状態を設定したとしても何ら問題はない。なお、このようにして音量可変部４１Ａを信号遮断状態としたときの、適応フィルタ２１（ＲＲ）に対応するエコーキャンセルの系（空間伝搬経路Srrを経由するエコー音をキャンセルする系）は、自己拡声音出力機能を省略した通常のエコーキャンセルの系と等価の回路構成を形成しているといえる。
また、音量可変部４１Ａが信号通過状態のままであると、相手側シングルトーク状態において、実際の適応フィルタ２１（ＲＲ）が充分に収束していない状態のときには、音声信号Ye1に含まれる、空間伝搬経路Srrを経由したエコー音の残留成分が、自己拡声音出力のための系（自己拡声音用ボリューム３１、合成器３３）を経由して適応フィルタ２１（ＲＲ）、及びスピーカ３（Ｒ）に対して再び入力されることになる。適応フィルタ２１（ＲＲ）にとって必要な参照信号は、デコーダ１４からのＲチャンネル出力である音声信号Xd1の成分のみである。このために、音量可変部４１Ａ経由の音声信号Yf1(R)が参照信号として適応フィルタ２１（ＲＲ）に入力されると、適応フィルタ２１（ＲＲ）の適正な適応処理が阻害される可能性が出てくる。また、現実においては適応フィルタ２１（ＲＲ）が充分に収束している状態であっても、或る程度のエコー音の残留成分が誤差信号Ye1に現れる可能性もある。
そこで、ステップＳ２０２により音量可変部４１Ａについて信号遮断状態を設定することで、正常で良好な適応フィルタ２１（ＲＲ）の適応処理を確保するものである。 As described above, the signal cut-off state is set for the volume varying unit 41A in correspondence with the counterparty single talk state for the following reason.
First, in the other party single talk state, the desired signal Mr is not in a state in which self-side speaker voice regarded as valid is being picked up by the microphone 2 (R); that is, no audio signal requiring self-amplification is obtained on the microphone 2 (R) side. Therefore, setting the signal cutoff state for the volume variable unit 41A causes no problem. When the volume variable unit 41A is in the signal cutoff state in this way, the echo cancellation system corresponding to the adaptive filter 21 (RR) (the system that cancels the echo sound passing through the spatial propagation path Srr) can be said to form a circuit configuration equivalent to an ordinary echo cancellation system without the self-amplified sound output function.
Also, if the volume variable unit 41A were left in the signal passing state, then in the other party single talk state, when the actual adaptive filter 21 (RR) has not sufficiently converged, the residual component of the echo sound that passed through the spatial propagation path Srr, contained in the audio signal Ye1, would be input again to the adaptive filter 21 (RR) and the speaker 3 (R) via the system for self-amplified sound output (self-amplified sound volume 31, synthesizer 33). The only reference signal the adaptive filter 21 (RR) needs is the component of the audio signal Xd1, which is the R channel output from the decoder 14. For this reason, if the audio signal Yf1(R) passing through the volume variable unit 41A is input to the adaptive filter 21 (RR) as part of the reference signal, proper adaptive processing of the adaptive filter 21 (RR) may be hindered. Moreover, in practice, even when the adaptive filter 21 (RR) has sufficiently converged, some residual component of the echo sound may appear in the error signal Ye1.
Therefore, setting the signal cutoff state for the volume variable unit 41A in step S202 ensures normal and good adaptive processing of the adaptive filter 21 (RR).

なお、相手側シングルトーク状態において、例えば一時的に自己話者音声がマイクロフォン２（Ｒ）により収音されてダブルトーク状態に遷移するような状況もあると考えられる。しかし、相手側シングルトーク状態においては、会議参加者は、デコーダ１４から出力される相手側話者音声を主体として聴くことになるので、そのときに例えば一時的に同じ場所内において或る会議参加者が声を発したとしても、これがスピーカから聴こえないことについて、会議参加者は違和感を持たない。従って、上記のような状態遷移が生じたとしても、音量可変部４１Ａについて信号遮断状態を設定しておくことについては、特に問題を生じない。 In the other party single talk state, for example, it may be considered that there is a situation in which, for example, the self-speaker voice is temporarily picked up by the microphone 2 (R) and transitions to the double talk state. However, in the other party single talk state, the conference participant mainly listens to the other party's speaker voice output from the decoder 14, and at that time, for example, temporarily joins a conference in the same place. Even if a person utters a voice, the conference participants do not feel uncomfortable that this cannot be heard from the speaker. Therefore, even if the state transition as described above occurs, there is no particular problem with setting the signal cutoff state for the sound volume variable unit 41A.

また、送信音用ボリューム３２は、先にも述べたように、適応フィルタ２１（ＲＲ）の収束時に出力される音声信号Ye1におけるエコー音の残留成分を抑制することなどに使用されるもので、この点で、送信音用ボリューム３２における減衰率の調整は相応に微妙で、制御も或る程度高度なものとなる。例えば極端な減衰率を設定すると、相手方の音声通信端末装置側にて聴こえる音声が不自然なものとなる可能性が高くなる。これに対して、相手側シングルトーク状態時においては、音量可変部４１Ａについて、信号出力遮断のために、例えば１００％、若しくはこれに近い強い減衰率を設定したとしても、先に述べたようにして何ら支障はない。 Further, as described above, the transmission sound volume 32 is used for suppressing the residual component of the echo sound in the audio signal Ye1 output when the adaptive filter 21 (RR) converges. In this respect, the adjustment of the attenuation factor in the transmission sound volume 32 is correspondingly delicate, and the control is somewhat advanced. For example, if an extreme attenuation rate is set, there is a high possibility that the sound heard on the other party's voice communication terminal device side becomes unnatural. On the other hand, in the other party single talk state, as described above, even if the volume variable unit 41A is set to 100% or a strong attenuation factor close to this to cut off the signal output, for example. There is no hindrance.

また、同じ相手側シングルトーク状態において、ステップＳ２０３の判別結果として、適応フィルタ２１（ＲＲ）が収束していない状態にあるときには、適応フィルタ２１（ＲＲ）が適応処理を実行する状態として（ステップＳ２０３、Ｓ２０４）、収束している状態にあるときには、適応フィルタ２１（ＲＲ）の適応処理が停止される状態となるようにしている（ステップＳ２０３、Ｓ２０５）
先ず、相手側シングルトーク状態は、本来キャンセルすべき相手側話者音声として有効な音声信号成分が近端側に入力されている状態である。このことは、適応フィルタ２１（ＲＲ）が収束していない状態なのであれば、空間伝搬経路Srrを経由する相手側話者音声のエコー音がキャンセルされる状態で収束するようにして適応フィルタ２１（ＲＲ）について積極的に適応処理を実行させるべきときであるということがいえる。
そこで、適応フィルタ２１（ＲＲ）が収束していない状態のときには、その適応処理を実行させることとしている。そして、本実施の形態においては、先にも述べたように、ステップＳ２０２の処理によって、音量可変部４１Ａが信号遮断状態とされることで、適応フィルタ２１（ＲＲ）に入力される参照信号は、デコーダ１４からの音声信号Xd1の成分のみとなる。このために、ステップＳ２０４に対応して実行される適応処理は、本来のキャンセル対象音をキャンセルするための適正な動作となるものである。 Further, in the same counterparty single talk state, when the adaptive filter 21 (RR) is not converged as a result of determination in step S203, the adaptive filter 21 (RR) is in a state in which adaptive processing is executed (step S203). , S204) When the state is converged, the adaptive processing of the adaptive filter 21 (RR) is stopped (steps S203, S205).
First, the other party single talk state is a state in which an audio signal component that is valid as the other party's speaker voice, which is what should be canceled, is being input at the near end. This means that, if the adaptive filter 21 (RR) has not converged, this is precisely when the adaptive filter 21 (RR) should actively execute adaptive processing so that it converges to a state in which the echo sound of the other party's speaker voice passing through the spatial propagation path Srr is canceled.
Therefore, when the adaptive filter 21 (RR) has not converged, its adaptive processing is executed. In the present embodiment, as described above, the processing of step S202 puts the volume variable unit 41A in the signal cutoff state, so that the reference signal input to the adaptive filter 21 (RR) consists only of the component of the audio signal Xd1 from the decoder 14. Consequently, the adaptive processing executed in step S204 operates properly to cancel the sound that is the original cancellation target.

また、上記図７の処理によれは、ステップＳ２０１にて否定の判別結果が得られた場合に対応する、自己側シングルトーク状態、ダブルトーク状態、若しくは非トーク状態にあっては、音量可変部４１Ａについて、適応フィルタ２１（ＲＲ）が収束している状態に対応しては信号通過状態を設定し(Ｓ２０６、Ｓ２０７)、収束していない状態に対応しては信号遮断状態を設定する（Ｓ２０６、Ｓ２０８）ことになる。また、適応フィルタ２１（ＲＲ）については、共通に適応処理を停止させた状態とする（Ｓ２０９）ことになる。かかる音声信号処理部１１の状態を設定する理由について、上記の３状態ごとに対応させて説明する。 Further, according to the processing of FIG. 7 described above, in the self-side single talk state, double talk state, or non-talk state corresponding to the case where a negative determination result is obtained in step S201, the volume variable unit For 41A, a signal passing state is set corresponding to the state where the adaptive filter 21 (RR) has converged (S206, S207), and a signal blocking state is set corresponding to the state where the adaptive filter 21 (RR) has not converged (S206). , S208). Further, the adaptive filter 21 (RR) is in a state where the adaptive processing is stopped in common (S209). The reason for setting the state of the audio signal processing unit 11 will be described in correspondence with each of the three states.

先ず、ダブルトーク状態、及び自己側シングルトーク状態との対応を考えてみる。ダブルトーク状態は、有効とみなされる相手側話者音声が、音声信号Xd1として得られているともに、有効とされる自己側話者音声がマイクロフォン２（Ｒ）にて収音されたことで、これが所望信号Mrの成分として得られている状態である。一方、自己側シングルトーク状態は、有効とされる自己側話者音声が所望信号として得られてはいるが、有効とみなされる相手側話者音声は信号Xd1として得られていない状態である。つまり、ダブルトーク状態と自己側シングルトーク状態は、自己側話者音声の音声信号が得られているという点で共通している。 First, consider the correspondence between the double talk state and the self-side single talk state. In the double talk state, the other party's speaker voice regarded as valid is obtained as the voice signal Xd1, and the valid self-speaker voice is picked up by the microphone 2 (R). This is a state obtained as a component of the desired signal Mr. On the other hand, the self-side single talk state is a state in which a valid self-speaker voice is obtained as a desired signal, but a counterpart-speaker voice deemed valid is not obtained as the signal Xd1. That is, the double talk state and the self-side single talk state are common in that an audio signal of the self-side speaker voice is obtained.

このようにして、少なくともマイクロフォン２（Ｒ）により、自己側話者音声の音声信号が得られている状態では、自己音声拡声機能を有している以上、この自己側話者音声の音声信号についてはできるだけ、先の図６の手順により設定される音量バランスにより、スピーカ３（Ｒ）から再生出力(自己拡声)させるべきであることになる。このことからすれば、音量可変部４１Ａについては信号通過状態を設定すればよいことになる。
しかし、適応フィルタ２１（ＲＲ）は、本来は、参照信号Xd1として、デコーダ１４のＲチャンネル出力からの相手側話者音声に対応する音声信号成分のみを入力し、かつ、所望信号Mrとしても、スピーカ３（Ｒ）から空間伝経路Srrを経由してマイクロフォン２（Ｒ）に到達してきた音声の音声信号成分のみを入力することにより、相手側話者音声のエコー音をキャンセルするようにして収束することができるものである。
仮に、適応フィルタ２１（ＲＲ）が収束していない状態にあって、音量可変部４１Ａを信号通過状態にしてしまうと、ダブルトーク状態では、適応フィルタ２１（ＲＲ）の参照信号に、マイクロフォン２（Ｒ）により収音された自己側話者音声の音声信号成分が、相当量含まれることになる。また、自己側シングルトーク状態においては、参照信号は、自己側話者音声の音声信号成分が支配的になる。また、適応フィルタ２１（ＲＲ）の所望信号Mrにも、マイクロフォン２（Ｒ）に向かって発話して得られた自己側話者音声の成分が相当量含まれることになる。また、自己側シングルトーク状態においては、所望信号Mrは、自己側話者音声の音声信号成分が支配的になる。
上記の状態で適応フィルタ２１（ＲＲ）の適応処理を実行させたとすると、適応フィルタ２１（ＲＲ）の本来の目的である、空間伝搬経路Srrを経由する相手側話者音声のエコー音をキャンセルできる状態に収束していくことができず、かえって、収束からは遠い係数ベクトルが設定されていってしまうようなことにもなる。すると、このダブルトーク状態において、相手側話者音声のエコー音は多く残留することになって、スピーカ３から聴こえる音は非常に聞き苦しいものとなってしまう。また、以降において、例えば相手側シングルトーク状態に遷移したときなどに収束に至るまでの時間もそれだけ長くなってしまう。 Thus, in a state where the audio signal of the self-side speaker voice is being obtained at least by the microphone 2 (R), since a self-voice amplification function is provided, this audio signal should as far as possible be reproduced and output (self-amplified) from the speaker 3 (R) with the volume balance set by the procedure of FIG. 6 above. From this standpoint, it would suffice to set the signal passing state for the volume variable unit 41A.
However, the adaptive filter 21 (RR) is by nature able to converge so as to cancel the echo sound of the other party's speaker voice when it receives, as the reference signal Xd1, only the audio signal component corresponding to the other party's speaker voice from the R channel output of the decoder 14 and, as the desired signal Mr, only the audio signal component of the sound that has reached the microphone 2 (R) from the speaker 3 (R) via the spatial propagation path Srr.
Suppose the adaptive filter 21 (RR) has not converged and the volume variable unit 41A is put in the signal passing state. In the double talk state, the reference signal of the adaptive filter 21 (RR) then contains a considerable amount of the audio signal component of the self-side speaker voice picked up by the microphone 2 (R). In the self-side single talk state, the reference signal is dominated by the audio signal component of the self-side speaker voice. The desired signal Mr of the adaptive filter 21 (RR) likewise contains a considerable amount of the self-side speaker voice component obtained by speaking toward the microphone 2 (R), and in the self-side single talk state the desired signal Mr is dominated by the audio signal component of the self-side speaker voice.
If the adaptive processing of the adaptive filter 21 (RR) were executed in this state, it could not converge toward the state that is its original purpose, namely one in which the echo sound of the other party's speaker voice passing through the spatial propagation path Srr can be canceled; on the contrary, coefficient vectors far from convergence would come to be set. Then, in this double talk state, much of the echo sound of the other party's speaker voice would remain, and the sound heard from the speaker 3 would become very unpleasant to listen to. In addition, the time needed to reach convergence afterwards, for example after a transition to the other party single talk state, would be correspondingly longer.

このことに基づいて、ダブルトーク状態若しくは自己側シングルトーク状態にあって、先ず、適応フィルタ２１（ＲＲ）が収束しているときには、音量可変部４１Ａを信号通過状態としたうえで、適応フィルタ２１（ＲＲ）については適応処理が停止されるようにしている。
これにより、先ず、マイクロフォン２（Ｒ）により収音される自己側話者音声の音声信号は、適応フィルタ２１（ＲＲ）から音量可変部４１Ａを通過し、さらに加算器３３を経由してスピーカ３（Ｒ）から音として出力されることになる。つまり、自己拡声音として出力される。ただし、このときに適応フィルタ２１（ＲＲ）の適応処理は、これまでの収束した状態（係数ベクトル）が固定された状態で停止している。このために、適応フィルタ２１（ＲＲ）が自己側話者音声の音声信号が支配的な参照信号を入力して収束状態から離れていくような変化を生じることはない。
また、このときには、スピーカ３（Ｒ）からマイクロフォン２（Ｒ）に対して空間伝搬経路Srrを経由して伝達する伝達音には、自己側話者音声の成分が相応に含まれる、あるいは支配的となっており、これがエコー音として生じることになる。しかし、この自己側話者音声のエコー音も、空間伝搬経路Srrを経由してスピーカ３（Ｒ）からマイクロフォン２（Ｒ）に伝達される。従って、適応フィルタ２１（ＲＲ）が収束状態で固定されていることで、相手側話者音声のエコー音とともに、自己側話者音声のエコー音も適正にキャンセルされることになる。 Based on this, when the adaptive filter 21 (RR) has converged in the double talk state or the self-side single talk state, the volume variable unit 41A is set in the signal passing state, and then the adaptive filter 21 is set. For (RR), the adaptive processing is stopped.
As a result, first, the audio signal of the self-side speaker voice picked up by the microphone 2 (R) passes from the adaptive filter 21 (RR) through the volume variable unit 41A and is then output as sound from the speaker 3 (R) via the adder 33. That is, it is output as self-amplified sound. At this time, however, the adaptive processing of the adaptive filter 21 (RR) is stopped with the converged state (coefficient vector) reached so far held fixed. Therefore, the adaptive filter 21 (RR) does not undergo a change that moves it away from the converged state as a result of receiving a reference signal dominated by the audio signal of the self-side speaker voice.
Also at this time, the sound transmitted from the speaker 3 (R) to the microphone 2 (R) via the spatial propagation path Srr contains a corresponding amount of the self-side speaker voice component, or is dominated by it, and this arises as an echo sound. However, this echo sound of the self-side speaker voice is likewise transmitted from the speaker 3 (R) to the microphone 2 (R) via the spatial propagation path Srr. Accordingly, because the adaptive filter 21 (RR) is fixed in the converged state, the echo sound of the self-side speaker voice is properly canceled together with the echo sound of the other party's speaker voice.

また、適応フィルタ２１（ＲＲ）が収束していないときには、音量可変部４１Ａについて信号遮断状態を設定することとなる。仮にマイクロフォン２（Ｒ）により収音された自己側話者音声の成分をスピーカ３（Ｒ）から出力させたとすると、適応フィルタ２１（ＲＲ）は収束していないので、この自己側話者音声についてのエコー音が多く残留して、非常に聴きにくいものとなってしまい、ハウリングが生じる可能性もそれだけ高くなる。そこで、この場合にはエコー音やハウリングをできるだけ抑制、キャンセルすべきことを優先することとして、自己側話者音声をスピーカ３から出力させないようにしているものである。なお、ダブルトーク状態においては、収束度合いに応じて残留する相手側話者音声のエコー音が聴こえることになるが、これに自己側話者音声のエコー音も加わる状況と比較すれば、よりエコー音が抑制された状態が得られていることになるものである。
また、このときには適応フィルタ２１（ＲＲ）の適応処理が停止されるが、これによっては、所望信号としてマイクロフォン２（Ｒ）により収音された自己側話者音声の成分が含まれる（あるいは支配的である）のにもかかわらず、適応フィルタ２１（ＲＲ）がこれ以上収束から離れていく方向に変化していくことはなくなる。 When the adaptive filter 21 (RR) has not converged, on the other hand, the signal cutoff state is set for the volume variable unit 41A. If the self-side speaker voice component picked up by the microphone 2 (R) were output from the speaker 3 (R), much of the echo sound of this self-side speaker voice would remain because the adaptive filter 21 (RR) has not converged, making the sound very hard to listen to and correspondingly raising the likelihood of howling. In this case, therefore, priority is given to suppressing and canceling echo sound and howling as far as possible, and the self-side speaker voice is not output from the speaker 3. In the double talk state, the echo sound of the other party's speaker voice that remains according to the degree of convergence will be heard, but compared with a situation in which the echo sound of the self-side speaker voice is added to it, the echo sound is more suppressed.
Also at this time, the adaptive processing of the adaptive filter 21 (RR) is stopped. As a result, even though the desired signal contains (or is dominated by) the self-side speaker voice component picked up by the microphone 2 (R), the adaptive filter 21 (RR) no longer changes in a direction that takes it further from convergence.

また、非トーク状態は、相手側話者音声、自己側話者音声の音声信号が何れも得られていない状態であり、従って、相手側話者音声の音声信号からなる有効な参照信号と、相手側話者音声のエコー音の音声信号からなる有効な所望信号は、何れも得られていない状態であることになる。この場合には、適応フィルタ２１（ＲＲ）により適応処理を実行させたとしても収束していく動作が得られない。従って、適応フィルタ２１（ＲＲ）の適応処理が停止されることで、やはり、適応フィルタ２１（ＲＲ）が、より収束から離れた状態に遷移していってしまうことが防がれ、例えば相手側シングルトーク状態に遷移したときには、可能な範囲で収束に最も近いとされる状態から適応処理を開始させることができる。
そのうえで、適応フィルタ２１（ＲＲ）が収束している状態のときには音量可変部４１について信号通過状態を設定しておくことで、例えば、非トーク状態から、自己側シングルトーク状態若しくはダブルトーク状態などの自己側話者音声の音声信号が所望信号として得られる状態に遷移したときには、例えばその冒頭部分が途切れるようなことなく、迅速に、自己側話者音声をスピーカ３から出力させることが可能になる。
また、適応フィルタ２１（ＲＲ）が収束していない状態に対応して音量可変部４１Ａを信号停止状態にしておけば、やはり、非トーク状態から、自己側シングルトーク状態若しくはダブルトーク状態（自己側話者音声の音声信号が所望信号として得られる状態）に遷移したときには、既に、先に説明した自己側シングルトーク状態及びダブルトーク状態時において、適応フィルタ２１（ＲＲ）が収束していないときに対応した音声信号処理部１１の状態が得られていることになるものである。 Further, the non-talk state is a state in which neither the other party's speaker voice nor the self-side speaker's voice signal is obtained, and therefore, an effective reference signal composed of the other party's speaker's voice signal, No effective desired signal composed of the echo signal of the other party's speaker voice is obtained. In this case, even if the adaptive processing is executed by the adaptive filter 21 (RR), the operation of convergence cannot be obtained. Therefore, the adaptation process of the adaptive filter 21 (RR) is stopped, so that the adaptive filter 21 (RR) is prevented from making a transition to a state far from convergence. When transitioning to the single talk state, the adaptive processing can be started from a state that is closest to convergence within a possible range.
In addition, by setting the signal passing state for the volume variable unit 41A when the adaptive filter 21 (RR) is converged, the self-side speaker voice can be output from the speaker 3 promptly, for example without its beginning being cut off, when the non-talk state changes to a state in which the audio signal of the self-side speaker voice is obtained as the desired signal, such as the self-side single talk state or the double talk state.
Likewise, if the volume variable unit 41A is kept in the signal cutoff state while the adaptive filter 21 (RR) has not converged, then when the non-talk state changes to the self-side single talk state or the double talk state (a state in which the audio signal of the self-side speaker voice is obtained as the desired signal), the audio signal processing unit 11 is already in the state described above for the self-side single talk and double talk states when the adaptive filter 21 (RR) has not converged.

このようにして本実施の形態による音声信号処理装置１１としての構成を採ることで、空間伝搬経路Srrを経由するエコー音をキャンセルする系は、適応フィルタ２１（ＲＲ）が収束した状態に至ってさえいれば、相手側シングルトーク状態時だけではなく、ダブルトーク状態時においても、相手側話者音声のエコー音をキャンセル可能となる。さらに、ダブルトーク状態時においては、自己側話者音声のエコー音もキャンセルされるようになっている。また、自己側シングルトーク状態においても、自己側話者音声のエコー音がキャンセルされる。即ち、空間伝搬経路Srrを経由するエコー音をキャンセルする系において、相手側話者音声のエコー音と自己側話者音声のエコー音の双方を適正にキャンセルすることが可能とされている。 Thus, by adopting the configuration as the audio signal processing device 11 according to the present embodiment, the system that cancels the echo sound that passes through the spatial propagation path Srr has reached the state where the adaptive filter 21 (RR) has converged. Thus, the echo sound of the other party's speaker voice can be canceled not only in the other party's single talk state but also in the double talk state. Further, in the double talk state, the echo sound of the self-speaker voice is also canceled. Even in the self-side single talk state, the echo sound of the self-side speaker voice is canceled. That is, in the system for canceling the echo sound passing through the spatial propagation path Srr, it is possible to appropriately cancel both the echo sound of the partner speaker voice and the echo sound of the self speaker voice.

そして、上記の図７と同等の処理は、他の３つの適応フィルタ２１（ＬＲ）、２１（ＲＬ）、２１（ＬＬ）ごとにも対応して実行される。
つまり、適応フィルタ２１（ＬＲ）の参照信号（Xd2）及び所望信号(Mr)の入力状態（信号入力状態）に応じて、適応フィルタ２１（ＬＲ）の適応処理についての実行・停止、及び音量可変部４１Ｃについての通過・遮断についての制御を実行する。
また、適応フィルタ２１（ＲＬ）の参照信号（Xd1）及び所望信号(Ml)の入力状態（信号入力状態）に応じて、適応フィルタ２１（ＲＬ）の適応処理についての実行・停止、及び音量可変部４１Ｂについての通過・遮断についての制御を実行する。
また、適応フィルタ２１（ＬＬ）の参照信号（Xd2）及び所望信号(Ml)の入力状態（信号入力状態）に応じて、適応フィルタ２１（ＬＬ）の適応処理についての実行・停止、及び音量可変部４１Ｄについての通過・遮断についての制御を実行する。
このような構成により、空間伝搬経路Srrとともに、空間伝搬経路Slr、Srl、Sllのそれぞれを経由するエコー音をキャンセルする系についても、相手側話者音声のエコー音と自己側話者音声のエコー音の双方を適正にキャンセルすることが可能となる。 Processing equivalent to that of FIG. 7 above is also executed for each of the other three adaptive filters 21 (LR), 21 (RL), and 21 (LL).
That is, according to the input states (signal input states) of the reference signal (Xd2) and the desired signal (Mr) of the adaptive filter 21 (LR), control is performed over execution and stopping of the adaptive processing of the adaptive filter 21 (LR) and over passing and blocking at the volume variable unit 41C.
Likewise, according to the input states (signal input states) of the reference signal (Xd1) and the desired signal (Ml) of the adaptive filter 21 (RL), control is performed over execution and stopping of the adaptive processing of the adaptive filter 21 (RL) and over passing and blocking at the volume variable unit 41B.
Likewise, according to the input states (signal input states) of the reference signal (Xd2) and the desired signal (Ml) of the adaptive filter 21 (LL), control is performed over execution and stopping of the adaptive processing of the adaptive filter 21 (LL) and over passing and blocking at the volume variable unit 41D.
With this configuration, both the echo sound of the other party's speaker voice and the echo sound of the self-side speaker voice can be properly canceled not only in the system that cancels the echo sound passing through the spatial propagation path Srr but also in the systems that cancel the echo sounds passing through the spatial propagation paths Slr, Srl, and Sll.
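
The pairing of each adaptive filter with its reference signal, desired signal, and volume variable unit, as listed above, can be held in a small lookup table so that the same FIG. 7 control can be applied to all four systems. The table and type names below are hypothetical; only the signal and unit labels come from the description.

```python
from typing import NamedTuple

class EchoSystem(NamedTuple):
    reference: str    # reference signal fed to the adaptive filter
    desired: str      # desired signal (microphone side)
    volume_unit: str  # volume variable unit controlled together with the filter

# Keys are the suffixes of adaptive filters 21 (RR), 21 (LR), 21 (RL), 21 (LL).
ECHO_SYSTEMS = {
    "RR": EchoSystem(reference="Xd1", desired="Mr", volume_unit="41A"),
    "LR": EchoSystem(reference="Xd2", desired="Mr", volume_unit="41C"),
    "RL": EchoSystem(reference="Xd1", desired="Ml", volume_unit="41B"),
    "LL": EchoSystem(reference="Xd2", desired="Ml", volume_unit="41D"),
}

def systems_sharing_microphone(desired: str) -> list[str]:
    """Return the filter suffixes whose desired signal is the given one."""
    return [name for name, s in ECHO_SYSTEMS.items() if s.desired == desired]
```

Each microphone signal serves as the desired signal of two filters, one per loudspeaker channel, which is why four filters are needed for the two-microphone, two-speaker arrangement.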

また、上記図７に示す処理を実行するのにあたり、本実施の形態の音声信号処理部１１の構成では、本来的には、相手側話者音声のエコー音をキャンセルするための適応フィルタを、自己側話者音声のエコー音のキャンセルにも用いるようにされている。つまり、自己側話者音声のエコー音キャンセルのために、新たに適応フィルタを設ける構成としていないものであり、その分の演算量、リソースの低減が図られることにもなる。 In executing the processing shown in FIG. 7, the configuration of the audio signal processing unit 11 of the present embodiment is essentially configured with an adaptive filter for canceling the echo sound of the other party's speaker audio. It is also used to cancel the echo sound of the self-speaker voice. That is, it is not configured to newly provide an adaptive filter for canceling the echo sound of the self-speaker speech, and the amount of computation and resources can be reduced accordingly.

ところで、図７の処理におけるステップＳ２０２、Ｓ２０７、Ｓ２０８では、音量可変部４１Ａ（４１Ｂ、４１Ｃ、４１Ｄ）について、信号遮断状態と信号通過状態の２状態に対応した減衰率を設定するものとして説明しているが、実際においては、この減衰率（若しくはこれに準ずる制御値）について、連続的な値の変更制御が行われるようにしてもよい。
例えば、音量可変部４１Ａ（４１Ｂ、４１Ｃ、４１Ｄ）における信号通過の度合いを示す制御値λを定義する。この制御値λは、信号が完全に通過する状態ではλ＝１となり、完全に遮断される状態ではλ＝０となるものであるとする。
そのうえで、実際において、減衰率を設定するのにあたっては、音量可変部４１Ａの場合であれば、例えば、λ＝max(1,Ye1/Xd1)*ｗ（max(1,Ye1/Xd1)は、1と音声信号Yのレベルと音声信号Xdのレベルとで大きい方の値を選択することを意味し、係数ｗは適応フィルタ２１（ＲＲ）の収束度を示す）により表されるような演算を行う。このようにして得られる制御値に応じて、より柔軟に音量可変部の減衰率を設定できるようにするものである。
また、同様にして、適応フィルタ２１についても、図７のステップＳ２０４、Ｓ２０５、Ｓ２０９では、適応処理について実行、停止の何れかの状態とする２値的な制御としているが、これについても、連続的な制御が行えるようにすることができる。つまり、適応処理について、これを活性化させる傾向の状態（活性傾向の状態）と、停止若しくは停止に近くなっていく傾向の状態(停止傾向の状態)との間で連続的に遷移させるようにすることができる。
このためには、例えば、適応フィルタ２１のパラメータの１つであり、ＦＩＲフィルタの係数更新量を設定するためのステップサイズパラメータμについて、μ＝(1-λ)*( max(1,Y/Xd))により表されるような演算を行うこととして、適応フィルタ２１の適応処理の応答速度を変更するような構成とすることができる。
このような連続的制御を行うこととすれば、例えば、先に述べた相手側シングルトーク状態、自己側シングルトーク状態、ダブルトーク状態、及び非トーク状態の間での中間的な状態にもより適合した信号処理の動作を得ることができる。例えば、ダブルトーク状態であっても、自己側話者音声が小さく、相手側シングルトーク状態に近いとされるトーク状態では、適応フィルタ２１が収束していなければ、対応する音量可変部４１の減衰率を或る程度高めにして、自己拡声音が抑えられるようにすると共に、適応処理を或る程度活性化させて収束方向に向かわせることが可能になる。 By the way, in steps S202, S207, and S208 in the process of FIG. 7, the sound volume variable unit 41A (41B, 41C, and 41D) will be described as setting attenuation factors corresponding to two states of a signal blocking state and a signal passing state. However, in practice, continuous change control of the attenuation rate (or a control value corresponding thereto) may be performed.
For example, a control value λ indicating the degree of signal passage in the sound volume variable unit 41A (41B, 41C, 41D) is defined. This control value λ is assumed to be λ = 1 when the signal is completely passed, and λ = 0 when the signal is completely cut off.
In addition, in setting the attenuation factor, in the case of the volume variable unit 41A, for example, λ = (max (1, Ye1 / Xd1) * w (max (1, Ye1 / Xd1) is 1 and the level of the audio signal Y and the level of the audio signal Xd means that the larger value is selected, and the coefficient w represents an operation expressed by the adaptive filter 21 (RR). In accordance with the control value obtained in this way, the attenuation rate of the volume variable section can be set more flexibly.
Similarly, with regard to the adaptive filter 21 as well, in steps S204, S205, and S209 in FIG. 7, the adaptive processing is set to either the execution state or the stop state, but this is also continuous. Control can be performed. In other words, the adaptive process is continuously transitioned between a state of activation tendency (active tendency state) and a state of tendency to stop or stop (stop tendency state). can do.
For this purpose, for example, one of the parameters of the adaptive filter 21, and the step size parameter μ for setting the coefficient update amount of the FIR filter, μ = (1-λ) * (max (1, Y / By performing the calculation represented by Xd)), the response speed of the adaptive processing of the adaptive filter 21 can be changed.
If such continuous control is performed, for example, it depends on the intermediate state between the partner-side single talk state, the self-side single talk state, the double talk state, and the non-talk state described above. An adapted signal processing operation can be obtained. For example, even in the double talk state, if the adaptive filter 21 does not converge in a talk state in which the self-side speaker voice is low and is close to the counterpart single talk state, the attenuation of the corresponding volume variable unit 41 is attenuated. It is possible to increase the rate to some extent so as to suppress the self-speaking sound and to activate the adaptive processing to some extent so as to make it converge.
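The two formulas above can be illustrated with a minimal Python sketch. The function names, the small `eps` guard against division by zero, the clamping of λ to the range 0 to 1, the lower bound of 0 on μ, and the assumption that w lies between 0 (not converged) and 1 (converged) are all illustrative choices that do not appear in the specification.

```python
def level_ratio(num_level, den_level, eps=1e-12):
    """Ratio of two signal levels; eps (an assumption) avoids division by zero."""
    return num_level / max(den_level, eps)


def pass_control(ye_level, xd_level, w):
    """lambda = max(1, Ye/Xd) * w, as written in the text.

    The result is clamped to [0, 1] so that it can be read as a degree of
    signal passage (1 = fully passed, 0 = fully blocked). The clamp is an
    assumption of this sketch.
    """
    lam = max(1.0, level_ratio(ye_level, xd_level)) * w
    return min(1.0, max(0.0, lam))


def step_size(lam, y_level, xd_level):
    """mu = (1 - lambda) * max(1, Y/Xd), as written in the text.

    The lower bound of 0 is an assumption of this sketch.
    """
    mu = (1.0 - lam) * max(1.0, level_ratio(y_level, xd_level))
    return max(0.0, mu)


# Hypothetical levels: filter a quarter of the way to convergence.
lam = pass_control(ye_level=0.5, xd_level=1.0, w=0.25)
mu = step_size(lam, y_level=0.5, xd_level=1.0)
```

Under these assumptions, an unconverged filter (w near 0) yields a small λ, so the volume variable unit mostly blocks the signal, and a large μ, so adaptation proceeds quickly; a converged filter (w near 1) yields the opposite.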

The embodiments described so far show a configuration for stereo channels as the multi-channel format. However, the configuration based on the present invention can also be applied to a loudspeaker communication system that transmits and receives multi-channel audio of three or more channels, in which case the volume balance is set among the three or more channels.

As the adaptive algorithm used in the adaptive filters 21 (RR), 21 (LR), 21 (RL), and 21 (LL) in the adaptive filter systems 20A and 20B shown in FIG. 4 and elsewhere, an appropriate one may be selected from among those known to date as well as those proposed in the future. Each of the adaptive filter systems 20A and 20B shown in FIG. 4 first combines the cancellation signals output from its two adaptive filters and then, taking the sound collected by the microphone of the corresponding channel as the desired signal, subtracts the combined cancellation signal to obtain the output of the adaptive filter system (Ye1, Ye2). Following a more basic form, however, the system may instead, for each of the two adaptive filters, first subtract the cancellation signal from the sound collected by the microphone of the corresponding channel to obtain an echo-cancelled output for each spatial propagation path, and then add these echo-cancelled outputs to obtain the signals Ye1 and Ye2. In any case, the configuration of the adaptive filter systems 20A and 20B shown in FIG. 4 is merely one basic configuration, chosen to keep the explanation easy to follow; in practice, a more developed and improved configuration may be adopted.
Likewise, to keep the explanation easy to follow, the embodiments so far have taken as an example a configuration in which the audio signal processing unit 11 performs audio signal processing over the entire audible band. In practice, however, the collected sound signal obtained by the microphone 2 and the audio signal received by the decoder 14 may each be divided into predetermined frequency bands, with the configuration of the audio signal processing unit 11 shown in FIG. 4 assigned to each of the divided bands, in a so-called filter bank configuration.
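The first structure described for the adaptive filter systems 20A and 20B, in which the two cancellation signals are combined and then subtracted from the microphone signal, can be sketched for a single output sample as follows. This is only an illustration: the function names, the plain-list representation of FIR coefficients and reference-signal history, and the sample values are assumptions, and the coefficient update (the adaptive algorithm itself) is deliberately left out because the specification does not fix one.

```python
def fir_output(coeffs, history):
    """Cancellation signal of one adaptive filter for the current sample:
    the inner product of its FIR coefficients and the recent reference samples."""
    return sum(c * x for c, x in zip(coeffs, history))


def echo_cancel_sample(mic_sample, coeffs_a, history_a, coeffs_b, history_b):
    """Combine the two cancellation signals, then subtract the sum from the
    microphone (desired) signal to obtain one sample of the system output."""
    combined_cancel = fir_output(coeffs_a, history_a) + fir_output(coeffs_b, history_b)
    return mic_sample - combined_cancel


# Hypothetical values: two echo paths contribute 1.0 each, and the
# self-side speaker contributes 0.5 to the microphone sample.
ye = echo_cancel_sample(
    mic_sample=2.5,
    coeffs_a=[0.5, 0.25], history_a=[1.0, 2.0],
    coeffs_b=[0.25], history_b=[4.0],
)
```

When the coefficients match the echo paths, as in this made-up example, only the self-side speaker's contribution remains in the output.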

In the embodiments so far, the audio signal processing unit 11 serving as an echo canceller has been described as performing digital signal processing. However, the present invention is also applicable when at least part of the echo cancellation operation performed by the audio signal processing unit 11 is implemented with analog circuitry.
The description of the embodiments so far also assumes that two audio communication terminal devices 1-1 and 1-2 communicate one-to-one in the video conference system. This is because the simplest form of video conference system was chosen as the example to keep the explanation simple. In practice, a video conference system may be built from three or more audio communication terminal devices to perform one-to-many communication, and in such a system the configuration based on the present invention can be applied to each individual audio communication terminal device.
The processing of the transmission audio signal and the reproduction audio signal in the audio communication terminal device 1 is described as mainly digital signal processing, but the format of these signals when digital signal processing is applied is not particularly limited. For example, when the reproduction audio signal is output, a configuration in which a ΔΣ-modulated bitstream audio signal is reproduced by class D amplification is conceivable in some cases.
Finally, although the embodiment takes as its example an audio communication terminal device provided for audio transmission and reception in a video conference system, the present invention is applicable to devices in general that can be regarded as loudspeaker communication systems, including audio conference systems and the hands-free call function of telephone devices.

FIG. 1 is a block diagram showing a configuration example of the audio transmission/reception system in a video conference system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the distance relationships determined by the placement of the microphones and speakers at a location where the audio transmission/reception system shown in FIG. 1 is installed.
FIG. 3 is a block diagram showing an example of the internal configuration of the audio communication terminal device shown in FIG. 1.
FIG. 4 is a diagram showing a configuration example of the audio signal processing unit according to the embodiment.
FIG. 5 is a diagram showing a configuration example of the self-amplified sound volume provided in the audio signal processing unit according to the embodiment.
FIG. 6 is a flowchart showing an example of a procedure executed by the audio signal processing unit of the embodiment.
FIG. 7 is a flowchart showing an example of a procedure executed by the audio signal processing unit of the embodiment.

Explanation of symbols

1 (1-1, 1-2): audio communication terminal device; 2 (R), 2 (L) (2-1 (R), 2-1 (L), 2-2 (R), 2-2 (L)): microphones; 3 (R), 3 (L) (3-1 (R), 3-1 (L), 3-2 (R), 3-2 (L)): speakers; 11: audio signal processing unit; 12: codec unit; 13: encoder; 14: decoder; 15: communication unit; 20A, 20B: adaptive filter systems; 21 (RR), 21 (LR), 21 (RL), 21 (LL): adaptive filters; 22 (R), 22 (L): adders; 23 (R), 23 (L): subtractors; 31: self-amplified sound volume; 32: transmission sound volume; 33, 34: adders; 41A, 41B, 41C, 41D: volume variable units; 42A, 42B: adders

Claims

An audio signal processing device comprising:
adaptive processing means for executing adaptive processing for each channel of a predetermined multi-channel configuration, in which each collected sound signal obtained by a plurality of microphones provided for the respective channels is used as a desired signal, a plurality of speaker-output audio signals corresponding to the respective channels are used as reference signals, the speaker-output audio signals being audio signals that are received from a communication partner and pass through a predetermined processing stage before being emitted as sound from a plurality of speakers corresponding to the respective channels, and the adaptive processing is performed so as to minimize an output signal obtained by subtracting, from the desired signal, a cancellation signal generated on the basis of the reference signals, the output signal serving as the audio signal to be transmitted to the communication partner device;
balance setting means for receiving the output signal of each channel obtained by the adaptive processing means, changing and setting the balance of signal levels among the channels, and outputting the result;
synthesis means for combining the output signal of each channel output from the balance setting means so that it is included as a component of the speaker-output audio signal of the corresponding channel; and
balance determining means for determining, when the collected sound signal obtained by the microphone corresponding to a certain channel among the plurality of microphones has content corresponding to a state in which the self-side speaker's voice is being input, the balance of signal levels among the channels to be set by the balance setting means, in accordance with the respective distances between the microphone corresponding to that channel and the plurality of speakers corresponding to the respective channels of the multi-channel configuration.

The audio signal processing device according to claim 1, wherein
the balance determining means determines the balance of signal levels among the channels to be set by the balance setting means on the basis of the volume ratio of the sounds arriving at the microphone corresponding to the certain channel from the respective speakers, the ratio being obtained when sound of the same volume is emitted from the plurality of speakers corresponding to the respective channels of the multi-channel configuration.

An audio signal processing method comprising:
an adaptive processing procedure of executing adaptive processing for each channel of a predetermined multi-channel configuration, in which each collected sound signal obtained by a plurality of microphones provided for the respective channels is used as a desired signal, a plurality of speaker-output audio signals corresponding to the respective channels are used as reference signals, the speaker-output audio signals being audio signals that are received from a communication partner and pass through a predetermined processing stage before being emitted as sound from a plurality of speakers corresponding to the respective channels, and the adaptive processing is performed so as to minimize an output signal obtained by subtracting, from the desired signal, a cancellation signal generated on the basis of the reference signals, the output signal serving as the audio signal to be transmitted to the communication partner device;
a balance setting procedure of receiving the output signal of each channel obtained by the adaptive processing procedure, changing and setting the balance of signal levels among the channels, and outputting the result;
a synthesis procedure of combining the output signal of each channel output by the balance setting procedure so that it is included as a component of the speaker-output audio signal of the corresponding channel; and
a balance determination procedure of determining, when the collected sound signal obtained by the microphone corresponding to a certain channel among the plurality of microphones has content corresponding to a state in which the self-side speaker's voice is being input, the balance of signal levels among the channels to be set by the balance setting procedure, in accordance with the respective distances between the microphone corresponding to that channel and the plurality of speakers corresponding to the respective channels of the multi-channel configuration.