JP2007151038A

JP2007151038A - Sound processing apparatus

Info

Publication number: JP2007151038A
Application number: JP2005346182A
Authority: JP
Inventors: Yasuhiko Kato; 靖彦加藤; Nobuyuki Kihara; 信之木原; Yasuhiro Kodama; 康広小玉; Yohei Sakuraba; 洋平櫻庭
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2005-11-30
Filing date: 2005-11-30
Publication date: 2007-06-14

Abstract

PROBLEM TO BE SOLVED: To provide a sound processing apparatus that reduces the computational complexity required for echo removal, performs effective echo processing, and can remove echo even during the period from a system fluctuation until convergence.

SOLUTION: A band selecting means 3 receives a sound emission signal band-divided by a sound emission signal dividing means 1, holds the signal components of the frequency bands in which the sound emission signal is selected, and removes the signal components of the frequency bands in which it is not selected. Conversely, for a sound collection signal band-divided by a sound collection signal dividing means 2, the signal components of the frequency bands in which the sound emission signal is selected are removed, and the signal components of the frequency bands in which the sound emission signal is not selected are held. A sound emission signal synthesizing means 4 synthesizes the sound emission signal from which the signal components of the predetermined frequency bands have been removed. A sound collection signal synthesizing means 5 synthesizes the sound collection signal, in which the signal components of the frequency bands removed from the sound emission signal are held and the signal components of the frequency bands held in the sound emission signal are removed. Thus, the frequency components of the sound emission signal and the sound collection signal do not overlap.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a sound processing apparatus, and more particularly to a sound processing apparatus that performs sound processing for full-duplex calls in a loudspeaker call system equipped with a speaker and a microphone.

In a loudspeaker call system in which each device has a speaker and a microphone, such as a hands-free telephone or a video conference system, sound picked up by the microphone of the far-end device is sent to the near-end device and emitted from the speaker of the near-end device. Likewise, the voice of the near-end talker picked up by the microphone of the near-end device is sent to the far-end device and emitted from its speaker. As a result, at both the far end and the near end, the other party's voice emitted from the speaker enters the microphone. If no processing is performed, this sound is sent back to the other device, causing the phenomenon called "echo", in which a talker hears their own voice from the speaker after a short delay, like a mountain echo. When the echo grows large, it enters the microphone again, circulates around the loop, and causes howling.

Conventionally, such loudspeaker call systems incorporate an echo canceller as a sound processing apparatus for preventing echo and howling. A typical echo canceller uses an adaptive filter to measure the impulse response between the speaker and the microphone, generates a pseudo echo by convolving this impulse response with the reference signal emitted from the speaker, and removes the echo component by subtracting the pseudo echo from the speaker sound that enters the microphone.
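The conventional scheme described above can be sketched as follows. This is a minimal illustration, assuming a normalized LMS (NLMS) update, which the text does not name; the function names, tap count, step size, and the four-tap echo path are all hypothetical.

```python
import random

def simulate_echo(reference, path):
    """Convolve the reference (speaker) signal with an assumed echo path."""
    return [
        sum(path[k] * reference[n - k] for k in range(len(path)) if n - k >= 0)
        for n in range(len(reference))
    ]

def nlms_echo_canceller(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated pseudo echo from the microphone signal.

    NLMS is an assumption made for this sketch, not the method of the text.
    """
    w = [0.0] * taps        # estimated speaker-to-microphone impulse response
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        pseudo_echo = sum(wi * hi for wi, hi in zip(w, history))
        e = d - pseudo_echo  # what remains after echo subtraction
        norm = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual, w

random.seed(0)
true_path = [0.6, 0.3, -0.2, 0.1]  # hypothetical room response
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = simulate_echo(reference, true_path)
residual, w = nlms_echo_canceller(reference, mic)
early_energy = sum(e * e for e in residual[:100])
late_energy = sum(e * e for e in residual[-100:])
```

In this sketch the residual is large at first and shrinks only as the coefficients converge, which is the convergence delay discussed in the surrounding text.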

However, because the impulse response between the speaker and the microphone changes with room temperature and with the movement of people, it has been difficult to cancel the echo sufficiently with a full-band echo canceller of the fixed-coefficient type. A band-division echo canceller that divides the sound signal into a plurality of bands and performs echo cancellation for each band has therefore been proposed (see, for example, Patent Document 1).
Patent Document 1: JP 59-64932 A

However, in a sound processing apparatus that removes echo with a conventional adaptive filter, it is difficult to generate an accurate pseudo echo, and an enormous amount of computation is therefore needed to cancel the echo sufficiently.

As described above, the impulse response between the speaker and the microphone changes whenever the sound reflection conditions change, for example when a person in the room moves, and the adaptive filter needs a certain amount of time to track the change and converge. Moreover, by its principle, an adaptive filter cannot adapt to frequency components that are absent from the sound emitted from the speaker. Convergence is therefore fast for sound that contains all frequencies, such as white noise, but is known to take a certain amount of time when a human voice is emitted from the speaker, as in a video conference. Because an accurate pseudo echo cannot be generated between the time the system changes and the time the adaptive filter converges, echo remains or howling is caused during that period. The band-division type also uses adaptive filters and therefore has the same problem.

In addition, the computational load of an adaptive filter is generally larger than that of a fast Fourier transform (FFT) or a filter bank, which is a burden in a low-cost system. In particular, when the apparatus is used for sound signal processing in a large space such as a gymnasium, the distance from the speaker to the microphone increases and the reverberation lasts longer, so the adaptive filter is known to need a long tap length. In this case the amount of computation increases further and the burden becomes heavier.
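A rough back-of-envelope calculation illustrates why a longer echo tail raises the load. All numbers here are assumptions, not values from the text: a 16 kHz sampling rate, echo tails of 0.3 s and 2.0 s, and about two multiply-accumulate operations per tap per sample for a time-domain LMS-type filter (one for filtering, one for the coefficient update). The function name is hypothetical.

```python
def adaptive_filter_load(sample_rate_hz, echo_tail_s):
    """Estimate tap count and multiply-accumulates per second (rough model)."""
    taps = round(sample_rate_hz * echo_tail_s)  # one tap per sample of echo tail
    macs_per_second = 2 * taps * sample_rate_hz  # assumed 2 MACs per tap per sample
    return taps, macs_per_second

# Hypothetical rooms: a small meeting room and a gymnasium-sized space.
room_taps, room_macs = adaptive_filter_load(16000, 0.3)
gym_taps, gym_macs = adaptive_filter_load(16000, 2.0)
```

Under these assumptions the larger space needs several times as many taps, and the per-second operation count grows in the same proportion.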

The present invention has been made in view of these points, and an object thereof is to provide a sound processing apparatus that reduces the amount of computation required for echo removal, enables effective echo processing, and can remove echo even during the period from a system fluctuation until convergence.

To solve the above problems, the present invention provides a sound processing apparatus that performs sound processing for full-duplex calls in a loudspeaker call system equipped with a speaker and a microphone, the apparatus comprising sound emission signal dividing means, sound collection signal dividing means, band selecting means, sound emission signal synthesizing means, and sound collection signal synthesizing means. The sound emission signal dividing means divides the sound emission signal, which is acquired from another device and output from the speaker, into a plurality of frequency bands. The sound collection signal dividing means divides the sound collection signal input from the microphone into the same frequency bands. The band selecting means divides a predetermined frequency band range, which may be the entire range of the plurality of frequency bands, into frequency bands in which the sound emission signal is selected and frequency bands in which the sound collection signal is selected, and, for each frequency band, removes the signal component of whichever signal was not selected. The sound emission signal synthesizing means synthesizes the band-divided sound emission signal from which the band selecting means has removed the signal components of the unselected frequency bands. Similarly, the sound collection signal synthesizing means synthesizes the band-divided sound collection signal from which the band selecting means has removed the signal components of the unselected frequency bands.

According to such a sound processing apparatus, the band selecting means divides a predetermined frequency band range, which may be the entire range of the plurality of frequency bands, into frequency bands in which the sound emission signal is selected and frequency bands in which the sound collection signal is selected, so that in each frequency band belonging to the predetermined range exactly one of the two signals is selected. In each such band, the signal component of the selected signal is held and that of the other is removed. The sound emission signal dividing means divides the sound emission signal, which is acquired from another device and output from the speaker, into a plurality of frequency bands and outputs it to the band selecting means. For the band-divided sound emission signal, the band selecting means holds the signal components of the selected frequency bands, removes those of the unselected frequency bands, and outputs the result to the sound emission signal synthesizing means. The sound emission signal synthesizing means synthesizes the sound emission signal from which the signal components of the unselected frequency bands have been removed and outputs it to the speaker. The speaker thus emits a sound signal (sound emission signal) from which the signal components of the unselected frequency bands have been removed. Meanwhile, the sound collection signal dividing means divides the sound collection signal input from the microphone into the same plurality of frequency bands as the sound emission signal and outputs it to the band selecting means. For the band-divided sound collection signal, the band selecting means holds the signal components of the frequency bands in which the sound collection signal is selected and the sound emission signal is not, removes the signal components of the frequency bands in which the sound emission signal is selected and the sound collection signal is not, and outputs the result to the sound collection signal synthesizing means. The sound collection signal synthesizing means then synthesizes the sound collection signal from which the signal components of the unselected frequency bands have been removed. As a result, a sound collection signal is synthesized from which the signal components of the frequency bands carrying the superimposed speaker sound have been removed, and it is sent to the other device.

In the sound processing apparatus of the present invention, the sound emitted from the speaker and the sound picked up by the microphone are divided into a plurality of frequency regions, and in any region where the signal component of one is kept, the signal component of the other is removed, so that the frequency components of the emitted sound signal and the collected sound signal do not overlap. The sound picked up by the microphone has the speaker sound superimposed on it, but the superimposed component lies only in those frequency regions of the sound collection signal whose signal components are removed. As a result, two-way simultaneous calls in a loudspeaker call system can be realized without echo or howling.

Furthermore, compared with an adaptive filter, which takes a certain amount of time to converge after a system change and requires considerable computation, the apparatus needs no convergence time and less computation. It therefore remains effective even when the microphone or speaker is moved or the talker moves.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. First, the concept of the invention applied to the embodiments is described, and then the specific contents of the embodiments are described.
FIG. 1 is a conceptual diagram of the invention applied to the embodiments.

The sound processing apparatus according to the present invention includes sound emission signal dividing means 1 that divides a sound emission signal into a plurality of frequency bands, sound collection signal dividing means 2 that divides a sound collection signal into a plurality of frequency bands, band selecting means 3 that selects the bands used by each signal so that the frequency bands containing signal components of the band-divided sound emission signal and sound collection signal do not overlap, sound emission signal synthesizing means 4 that synthesizes the sound emission signal after band selection, and sound collection signal synthesizing means 5 that synthesizes the sound collection signal after band selection.

When the sound emission signal dividing means 1 receives the sound emission signal that is acquired from another device and output from the speaker, that is, the sound signal picked up by the other device's microphone, it divides the sound signal (sound emission signal) into a plurality of frequency bands by a time-domain-to-frequency-domain transform such as the Fourier transform or by multirate signal processing using a filter bank. The band-divided sound emission signal is output to the band selecting means 3.

When the sound collection signal dividing means 2 receives the sound collection signal input from the microphone, that is, a sound signal on which the sound emission signal output from the speaker is superimposed, it divides the sound signal (sound collection signal) into a plurality of frequency bands by the same method as the sound emission signal dividing means 1. The band-divided sound collection signal is output to the band selecting means 3.

The band selecting means 3 receives the band-divided sound emission signal and sound collection signal and, for a predetermined frequency band range, determines the frequency bands in which the sound emission signal is selected and the frequency bands in which the sound collection signal is selected. The predetermined frequency band range may be the entire frequency band range or a part of it. In a frequency band in which the sound emission signal is selected, the signal component of the sound emission signal is retained and the signal component of the sound collection signal is removed. Conversely, in a frequency band in which the sound collection signal is selected, the signal component of the sound emission signal is removed and that of the sound collection signal is retained. Consequently, the frequency bands containing signal components of the sound signal output to the speaker (sound emission signal) and of the sound signal input from the microphone (sound collection signal) no longer overlap, so even if the sound emission signal is superimposed on the sound collection signal input from the microphone, it can be removed. The signal component of a removed frequency band may also be interpolated from the signal components of frequency bands that were not removed. The frequency band range and the selection of frequency bands are described in detail in the embodiments.
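The selection rule can be sketched as a pair of complementary masks. This is a minimal illustration, assuming each band is represented by a single number and that every band not assigned to the sound emission signal is assigned to the sound collection signal; the function and variable names are hypothetical.

```python
def select_bands(emission_bands, collection_bands, emission_selected):
    """Keep each band in exactly one of the two signals.

    emission_bands, collection_bands: per-band signal components (same length).
    emission_selected: set of band indices assigned to the sound emission
    signal; all other bands are assigned to the sound collection signal.
    """
    if len(emission_bands) != len(collection_bands):
        raise ValueError("both signals must be divided into the same bands")
    emission_out, collection_out = [], []
    for band, (e, c) in enumerate(zip(emission_bands, collection_bands)):
        if band in emission_selected:
            emission_out.append(e)      # retained on the speaker path
            collection_out.append(0.0)  # removed on the microphone path
        else:
            emission_out.append(0.0)    # removed on the speaker path
            collection_out.append(c)    # retained on the microphone path
    return emission_out, collection_out

emission = [1.0, 2.0, 3.0, 4.0]      # toy band components of the far-end sound
collection = [5.0, 6.0, 7.0, 8.0]    # toy band components of the microphone sound
to_speaker, to_far_end = select_bands(emission, collection, emission_selected={0, 2})
```

After selection, no band index carries a nonzero component in both outputs, which is the non-overlap property the paragraph describes.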

The sound emission signal synthesizing means 4 synthesizes the sound emission signal that has been band-divided by the sound emission signal dividing means 1 and from which the band selecting means 3 has removed the signal components of the predetermined frequency bands, and outputs the result to the speaker.

The sound collection signal synthesizing means 5 synthesizes the sound collection signal that has been band-divided by the sound collection signal dividing means 2 and processed by the band selecting means 3 so that the signal components of the frequency bands removed from the sound emission signal are held and the signal components of the frequency bands held in the sound emission signal are removed. The synthesized sound collection signal is sent to the other device as the sound emission signal to be output from that device's speaker.

In a sound processing apparatus with this configuration, when a sound signal picked up by the counterpart device in full-duplex communication is input, sound processing is performed to emit it from the speaker. The sound collection signal acquired from the counterpart device is treated as the sound emission signal and is divided into a plurality of frequency bands by the sound emission signal dividing means 1. After the band selecting means 3 removes the signal components of the predetermined frequency bands, the band-divided sound emission signal is synthesized by the sound emission signal synthesizing means 4 and emitted from the speaker. The sound signal emitted from the speaker therefore has the signal components of the predetermined frequency bands removed.

The sound signal emitted from the speaker is input to the sound processing apparatus through the microphone together with the talker's voice. In the sound collection signal input from the microphone, the sound signal emitted from the speaker is superimposed on the talker's sound signal. The sound collection signal dividing means 2 divides this sound collection signal into a plurality of frequency bands in the same way as the sound emission signal dividing means 1 and outputs it to the band selecting means 3. In the sound processing of the band selecting means 3, the signal components of the sound collection signal are removed in the frequency bands in which the sound emission signal was selected, and are not removed in the frequency bands in which the sound emission signal was not selected, that is, in which the sound collection signal was selected. The signal components of the frequency bands on which the sound emission signal is superimposed are thereby removed from the band-divided sound collection signal. The sound collection signal synthesizing means 5 synthesizes this sound collection signal and transmits it to the counterpart device. The signal component of the sound emitted from the speaker, that is, the echo component, has been removed from the synthesized sound collection signal, and outputting the echo-free sound collection signal to the other device also prevents howling.
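The whole path from far-end signal to transmitted signal can be simulated on a single short frame. This is a sketch under stated assumptions: a plain DFT stands in for the band division, the frame is 16 samples long, odd DFT bins are assigned to the speaker and even bins to the microphone, and the room is modeled as a circular 3-sample delay with gain 0.5. None of these choices come from the text, all names are hypothetical, and the bin indices are 0-based DFT bins rather than the channel numbers of the embodiment.

```python
import cmath
import math

N = 16  # frame length (assumption)

def dft(x):
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / N) for t in range(N))
            for k in range(N)]

def idft(spectrum):
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / N)
                 for k in range(N)) / N).real for t in range(N)]

def keep_bins(x, keep):
    """Zero every DFT bin for which keep(k) is false, then resynthesize."""
    return idft([v if keep(k) else 0.0 for k, v in enumerate(dft(x))])

def tone(k):
    return [math.cos(2 * math.pi * k * t / N) for t in range(N)]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def is_emission_bin(k):
    return k % 2 == 1

def is_collection_bin(k):
    return k % 2 == 0

far_end = add(tone(1), tone(2))                 # signal received from the other device
speaker = keep_bins(far_end, is_emission_bin)   # only bin 1 (and its mirror) survives
echo = [0.5 * speaker[(t - 3) % N] for t in range(N)]  # assumed room path
near_talker = add(tone(3), tone(4))
microphone = add(near_talker, echo)             # talker plus superimposed speaker sound
sent = keep_bins(microphone, is_collection_bin) # what goes back to the other device
echo_bin_before = abs(dft(microphone)[1])
echo_bin_after = abs(dft(sent)[1])
```

Because a linear room path does not move energy to other frequencies, the echo stays in the emission bins and is discarded with them. The sketch also makes the cost visible: the talker's own component in bin 3 is discarded along with the echo.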

As described above, the sound processing apparatus according to the present invention suppresses echo and howling and realizes two-way simultaneous calls. Because it does so by keeping the frequency components of the sound signal output from the speaker (sound emission signal) and of the sound signal input from the microphone (sound collection signal) from overlapping, it also requires less computation than an adaptive filter, copes with changes in the system, and needs no time to converge.

Hereinafter, the embodiments are described in detail with reference to the drawings, taking as an example a case where they are applied to the sound processing unit of a video conference system.
FIG. 2 is a diagram showing the configuration of a video conference system according to an embodiment of the present invention. Processing units relating to images, which are not relevant to the description of the present invention, are omitted from the figure.

In the video conference system of the present embodiment, a conference terminal 10a to which a speaker 21a and a microphone 22a are connected and a conference terminal 10b to which a speaker 21b and a microphone 22b are connected are linked by a communication line 23. Hereinafter, the conference terminal 10a located near a given talker is called the near-end device 10a, and the conference terminal 10b, which is connected to the near-end device 10a via the communication line 23 and located far from that talker, is called the far-end device 10b. The near-end device 10a and the far-end device 10b have the same configuration, and the internal block diagram of the far-end device 10b is omitted from the figure. The communication line 23 is a general digital communication line such as Ethernet (registered trademark).

The speaker 21a connected to the near-end device 10a emits the sound that was picked up by the microphone 22b connected to the far-end device 10b and processed by the near-end device 10a. The microphone 22a connected to the near-end device 10a picks up the speech of the video conference participants at the near-end device 10a. At this time, the sound emitted from the speaker 21a arrives through the room and is picked up superimposed on that speech. The same applies to the far-end device 10b.

The internal configuration of the near-end device 10a and the far-end device 10b is described below, taking the near-end device 10a as an example. The near-end device 10a includes a D/A converter 11 connected to the speaker, an A/D converter 12 connected to the microphone 22a, a signal processing unit 13 that processes sound signals, an audio codec 14 that encodes and decodes sound signals, and a communication unit 15 connected to the communication line 23.

The D/A converter 11 converts the digital sound data processed by the signal processing unit 13 into an analog signal. The analog sound signal is amplified by an amplifier (not shown) and then emitted from the speaker 21a. The A/D converter 12 converts the analog sound signal, obtained by amplifying the sound picked up by the microphone 22a with an amplifier (not shown), into digital sound data. The signal processing unit 13 is a digital signal processor (DSP). It converts input and output sound data into the desired form and performs sound processing so that the frequency components of the collected sound and the emitted sound do not overlap. This sound processing is described in detail later. The audio codec 14 converts the sound data based on the input of the microphone 22a, sent from the signal processing unit 13, into a code defined as standard for video conference communication, and also decodes the sound data encoded by the far-end device 10b, sent from the communication unit 15, and passes it to the signal processing unit 13. The communication unit 15 exchanges input and output data, including the encoded sound data, with the far-end device 10b via the communication line 23 according to a predetermined digital data communication protocol.

Next, the sound processing performed by the signal processing unit 13 is described in detail.
First, as a first embodiment, a signal processing unit is described that performs sound processing by dividing the entire frequency range of the sound signal into frequency bands in which the signal component of the emitted sound is selected and frequency bands in which the signal component of the collected sound is selected.

FIG. 3 is a diagram showing the configuration of the signal processing unit according to the first embodiment of the present invention. The signal processing unit 30 is incorporated in the signal processing unit 13 of the conference terminal shown in FIG. 2.
The signal processing unit 30 according to the first embodiment of the present invention includes an analysis filter bank 31 that divides the sound emission signal into a plurality of frequency bands, a synthesis filter bank 32 that synthesizes the band-divided sound emission signal, an analysis filter bank 33 that divides the sound collection signal into a plurality of frequency bands, and a synthesis filter bank 34 that synthesizes the band-divided sound collection signal.

The analysis filter bank 31 is the sound emission signal dividing means and divides the sound signal data input from the audio codec 14 into 128 channels of frequency bands from low to high. For the purposes of explanation, the channels are numbered in order, with the lowest-frequency channel as the 1st channel and the highest-frequency channel as the 128th channel. This band division is implemented, for example, with the DFT filter bank described in 渡口和信 et al., "Subband Adaptive Filter Using a Perfect Reconstruction DFT Filter Bank" (IEICE, August 1996, Vol. J79-A, No. 8, pp. 1385-1393). Processing in which a signal is divided into bands, downsampled, processed, and then resynthesized is called multirate signal processing. Besides the DFT filter bank, various band division methods are known for different applications, such as the QMF filter bank. The embodiments describe the case where a DFT filter bank is used, but band division may be performed by another method. As an alternative to a filter bank, a method for which a transform from the time domain to the frequency domain and its inverse are defined, such as the Fourier transform, can also be used. The functions of a DFT filter bank are divided into analysis and synthesis. It is known that sound data divided into bands by the analysis stage can be resynthesized into the original sound data by the synthesis filter bank. Depending on the method, the resynthesized signal may differ slightly from the original signal, but the apparatus can be configured so that this has no essential effect on the present invention.

The synthesis filter bank 32 is the sound emission signal synthesizing means. It receives audio data in which, of the 128 channels produced by the analysis filter bank 31, the signal components of the even-numbered channels have been removed by a band selection means (not shown), and synthesizes the components of all bands into a single audio signal. The audio signal is output to the speaker 21a via the D/A converter 11.

The analysis filter bank 33 is the sound collection signal dividing means. Like the analysis filter bank 31, it divides the audio signal data input from the A/D converter 12 into 128 frequency-band channels, from the low band to the high band. The analysis filter bank 33 has the same configuration as the analysis filter bank 31.

The synthesis filter bank 34 is the sound collection signal synthesizing means. It receives audio data in which, of the 128 channels produced by the analysis filter bank 33, the signal components of the odd-numbered channels have been removed by the band selection means (not shown), and synthesizes the components of all bands into a single audio signal. The audio signal is transmitted to the other device via the audio codec 14.

The band selection means selects the frequency bands (channels) that carry each audio signal so that the frequency components of the audio signal emitted from the speaker 21a and of the audio signal collected by the microphone 22a do not overlap. Here, with the channels numbered 1 to 128 from the low band upward, the odd-numbered channels (1, 3, ..., 127) are selected for the emitted audio signal and the even-numbered channels (2, 4, ..., 128) are selected for the audio signal collected by the microphone. That is, the emitted audio signal uses the signal components of the frequency bands corresponding to the odd-numbered channels, and the signal components of the even-numbered channels are removed. Conversely, for the collected audio signal, the signal components of the odd-numbered channels are removed and the signal components of the even-numbered channels are used. Separating the channels in this way prevents the emitted audio signal from being superimposed on the collected audio signal.
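The channel assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `N_CHANNELS`, `SPEAKER_CHANNELS`, `MIC_CHANNELS`, and `keep_channels` are hypothetical, and a frame is modeled as a plain list holding one value per channel (index 0 is the 1st channel) rather than real filter bank output.

```python
N_CHANNELS = 128

# Odd-numbered channels carry the emitted (speaker) signal,
# even-numbered channels carry the collected (microphone) signal.
SPEAKER_CHANNELS = set(range(1, N_CHANNELS + 1, 2))
MIC_CHANNELS = set(range(2, N_CHANNELS + 1, 2))

def keep_channels(frame, keep):
    """Return a copy of frame with every channel not in `keep` set to 0.

    frame[0] is the 1st channel, so the channel number is index + 1.
    """
    return [value if (index + 1) in keep else 0.0
            for index, value in enumerate(frame)]

example = keep_channels([1.0, 2.0, 3.0, 4.0], SPEAKER_CHANNELS)
```

Because the two sets are disjoint, any channel kept on the speaker path is set to 0 on the microphone path, and the reverse.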

The audio processing executed by the signal processing unit 30 is described below with reference to flowcharts.
First, the processing of the audio signal emitted from the speaker (hereinafter, speaker audio processing) is described. FIG. 4 is a flowchart showing the speaker audio processing procedure of the near-end device including the signal processing unit of the first embodiment. The far-end device 10b performs audio processing by the same procedure.

[Step S01] The communication unit 15 receives encoded audio data from the far-end device 10b via the communication line 23.
[Step S02] The audio codec 14 decodes the audio data received in step S01 and generates digital audio data, for example 16-bit linear PCM sampled at 32 kHz. This digital audio data is sent to the signal processing unit 30, which is implemented by a DSP.

[Step S03] The signal processing unit 30 applies band division by the analysis filter bank 31 to the input audio data. Here, a DFT filter bank divides the audio data to be emitted, which was input from the other device, into 128 frequency-band channels from the low band to the high band.

[Step S04] Of the 128 channels, the signal components of the channels assigned to the emitted audio signal (here, the odd-numbered channels) are passed unchanged to the synthesis filter bank 32, and the signal components of the channels not assigned to the emitted audio signal (here, the even-numbered channels) are not output to the synthesis filter bank 32. That is, of the 128 frequency-band outputs of the analysis filter bank 31, the even-numbered outputs are set to 0 before being passed to the synthesis filter bank 32. This removes the signal components of the even-numbered frequency bands from the emitted audio signal.

[Step S05] The synthesis filter bank 32 receives the audio data from which the signal components of the frequency bands not assigned to the emitted audio signal (here, the even-numbered channels) have been removed, synthesizes the signal components of all bands into a single audio signal, and sends it to the D/A converter 11.

[Step S06] The D/A converter 11 converts the audio signal for emission synthesized by the synthesis filter bank 32 into an analog audio signal and outputs it toward the speaker 21a.
[Step S07] The analog audio signal obtained from the D/A converter 11 is amplified by an amplifier and emitted from the speaker 21a.

Through the speaker audio processing procedure above, the audio signal that is input from the other device and emitted from the speaker 21a is divided into 128 channels, has the signal components of its even-numbered frequency bands removed, is synthesized into a single audio signal, and is then output from the speaker 21a.

Next, the processing of the audio signal collected by the microphone (hereinafter, microphone audio processing) is described. FIG. 5 is a flowchart showing the microphone audio processing procedure of the near-end device including the signal processing unit of the first embodiment. The far-end device 10b performs audio processing by the same procedure.

[Step S11] The microphone 22a collects the speech of the video conference attendees on the near-end device side together with the sound emitted from the speaker 21a. The A/D converter 12 converts the collected sound into digital audio data in 16-bit linear PCM sampled at 32 kHz. The converted audio data is output to the analysis filter bank 33.

[Step S12] The analysis filter bank 33 receives the audio data generated in step S11 and, like the analysis filter bank 31 for the sound emission signal, divides it into 128 channels of audio data.

[Step S13] Of the 128 channels, the signal components of the channels assigned to the sound collection signal, which are the complement of those assigned to the sound emission signal (here, the even-numbered channels), are passed unchanged to the synthesis filter bank 34, and the signal components of the channels assigned to the emitted audio signal (here, the odd-numbered channels) are not output to the synthesis filter bank 34. That is, of the 128 frequency-band outputs of the analysis filter bank 33, the odd-numbered outputs, the opposite of those removed from the emitted audio signal, are set to 0 before being passed to the synthesis filter bank 34. This removes the signal components of the odd-numbered frequency bands from the collected audio signal.

[Step S14] The synthesis filter bank 34 receives the audio data from which the signal components of the frequency bands not assigned to the collected audio signal (here, the odd-numbered channels, which are used by the sound emission signal) have been removed, synthesizes the signal components of all bands into a single audio signal, and sends it to the audio codec 14.

[Step S15] The audio codec 14 encodes the audio signal input from the synthesis filter bank 34 into a predetermined code and sends it to the communication unit 15.
[Step S16] The communication unit 15 transmits the encoded audio data to the far-end device 10b via the communication line 23.

Through the processing procedure above, the audio signal that is collected by the microphone 22a and transmitted to the far-end device 10b is divided into 128 channels, has the signal components of its odd-numbered frequency bands removed, is synthesized into a single audio signal, is encoded, and is then transmitted by the communication unit 15.

As described above, the audio signal that is emitted from the speaker 21a and then collected by the microphone 22a retains only the signal components of the frequency bands corresponding to the odd-numbered channels. Therefore, removing the odd-numbered channel components from the collected audio signal removes the signal components that originate from the audio signal emitted from the speaker 21a. In other words, the echo component is removed from the collected audio signal.
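The echo-removal argument can be checked with a toy calculation. The sketch below is an illustration under assumptions, not the described apparatus: it uses 8 channels instead of 128, models a frame as one number per channel, and assumes the echo path is a simple gain of 0.5 on every channel. The helper name `remove_parity` is hypothetical.

```python
def remove_parity(frame, parity):
    """Set to 0 every channel whose 1-based number has the given parity ("odd" or "even")."""
    remainder = 1 if parity == "odd" else 0
    return [0.0 if (index + 1) % 2 == remainder else value
            for index, value in enumerate(frame)]

far_end = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # received audio, one value per channel
near_talk = [10.0] * 8                               # near-end speech, one value per channel

speaker_out = remove_parity(far_end, "even")         # step S04: only odd channels are emitted
echo = [0.5 * value for value in speaker_out]        # assumed echo path: gain 0.5 per channel
mic_in = [n + e for n, e in zip(near_talk, echo)]    # the microphone hears speech plus echo
sent = remove_parity(mic_in, "odd")                  # step S13: only even channels are sent
```

In `sent`, every surviving (even) channel holds exactly the near-end value, so no far-end component remains. The odd-channel part of the near-end speech is removed as well, which is the price of this scheme.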

As described above, in the first embodiment, the frequency components of the sound emitted from the speaker and of the sound collected by the microphone are separated into the odd-numbered and even-numbered frequency bands so that they do not overlap, which suppresses echo and howling and makes simultaneous two-way (full-duplex) communication possible.

Although the description above divides the channels into even-numbered and odd-numbered channels, the scheme works as long as the frequency components of the microphone and the speaker do not overlap, so many divisions other than even/odd are possible.

Instead of alternating the microphone and speaker channels one at a time as even and odd, the channels may be divided in groups of two or more frequency bands. For example, they may be alternated two at a time:
Channels selected for microphone audio processing: 1, 2, 5, 6, ..., 125, 126
Channels selected for speaker audio processing: 3, 4, 7, 8, ..., 127, 128
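The grouped division in the example above can be generated programmatically. The following sketch assumes a hypothetical helper `grouped_split` with a `group` parameter for the number of adjacent channels per block; neither name comes from the original text.

```python
def grouped_split(n_channels=128, group=2):
    """Alternate blocks of `group` adjacent channels between microphone and speaker.

    Channels are 1-based. The first block goes to the microphone,
    the second to the speaker, and so on.
    """
    mic, speaker = [], []
    for channel in range(1, n_channels + 1):
        block = (channel - 1) // group
        if block % 2 == 0:
            mic.append(channel)
        else:
            speaker.append(channel)
    return mic, speaker

mic_channels, speaker_channels = grouped_split()
```

With the defaults, `mic_channels` starts 1, 2, 5, 6 and ends 125, 126, and `speaker_channels` starts 3, 4, 7, 8 and ends 127, 128, matching the example.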

Besides simply fixing the selected channels in advance, the channels can also be selected dynamically according to the characteristics of the audio.
Because the audio processing of this embodiment removes some components of the audio, it can affect sound quality. A known technique for reducing the audible effect of removing some components, for example in audio compression, is to exploit the auditory masking effect.

FIG. 6 is a diagram showing the characteristics of an audio signal. It is a graph of the output obtained when the audio to be output to the speaker at a certain time is processed by the analysis filter bank 31, with the 128 filter bank channels arranged from the low band to the high band on the horizontal axis and power on the vertical axis.

In this figure, the characteristic has peaks at the 15th, 60th, and 78th channels. In this case, a channel selection that exploits the auditory masking effect while keeping the frequency components of the speaker audio processing and the microphone audio processing from overlapping is made, for example, as follows.

Channels selected for speaker audio processing: 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120, 123, 126
Channels selected for microphone audio processing: channels 1 to 128 other than the above
That is, the speaker audio processing selects channels at intervals of two channels (every third channel) so that the peak channels 15, 60, and 78 are included, and the microphone audio processing selects the remaining channels. The selection interval used with the auditory masking effect is not limited to the two-channel spacing above and can be adjusted to any value. The peaks of the audio frequency components can be detected either from the spectrum at a single instant or from a time-averaged spectrum.
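One way to derive such a peak-aligned selection is sketched below. The rule used here (pick the channel offset that covers the most peaks, then take every third channel) is an assumption made for illustration; the text only gives the resulting channel list. The function name `select_speaker_channels` and its parameters are hypothetical.

```python
def select_speaker_channels(peaks, n_channels=128, stride=3):
    """Pick every `stride`-th channel, aligned so that as many peaks as possible are kept.

    Returns (speaker_channels, mic_channels) as lists of 1-based channel numbers.
    """
    # Count how many peaks fall on each possible alignment and keep the best one.
    best_offset = max(range(stride),
                      key=lambda offset: sum(1 for p in peaks if p % stride == offset))
    speaker = [c for c in range(1, n_channels + 1) if c % stride == best_offset]
    mic = [c for c in range(1, n_channels + 1) if c % stride != best_offset]
    return speaker, mic

speaker_channels, mic_channels = select_speaker_channels([15, 60, 78])
```

For the peaks 15, 60, and 78 this yields the 42 channels 3, 6, ..., 126 for the speaker and the remaining 86 channels for the microphone.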

The description above shows a configuration that uses filter banks to divide the signal into frequency components so that the frequency components of the microphone audio processing and the speaker audio processing do not overlap. Instead of filter banks, a Fourier transform, which performs a similar time-frequency transform, can also be used. In this case, the analysis filter banks 31 and 33 are replaced by a 128-point Fourier transform, and the synthesis filter banks 32 and 34 by the corresponding inverse transform. Besides filter banks and the Fourier transform, the time-frequency transform can also be a discrete cosine transform (DCT) or a wavelet transform.
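The Fourier-transform variant can be sketched with a plain 128-point DFT written with the Python standard library. This is a minimal illustration under assumptions: each DFT bin k is treated as channel k + 1, a single block is processed without windowing or overlap, and the function names are hypothetical. The test block contains one tone on an even channel and one on an odd channel.

```python
import cmath
import math

def dft(x):
    """Direct O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def remove_even_channels(block):
    """Analyze, set even-numbered channels (channel = bin + 1) to 0, resynthesize."""
    spectrum = dft(block)
    kept = [v if (k + 1) % 2 == 1 else 0.0 for k, v in enumerate(spectrum)]
    return idft(kept)

N = 128
# The sine sits on bin 3 (channel 4, removed); the cosine on bin 4 (channel 5, kept).
block = [math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 4 * t / N)
         for t in range(N)]
out = remove_even_channels(block)
```

With an even transform length, a bin and its mirror bin map to channels of the same parity, so the resynthesized block stays real-valued up to rounding error; only the cosine component survives in `out`.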

In the first embodiment the removed frequency components are set to 0, but a small nonzero value comparable to the environmental noise may be used instead.
Next, the second embodiment is described. In the first embodiment, the entire frequency range of the audio signal is divided into frequency bands from which signal components are selected for sound emission and frequency bands from which signal components are selected for sound collection. In the second embodiment, this processing is applied to only part of the frequency range, and audio processing using adaptive filters is applied to the remaining frequency bands.

FIG. 7 is a diagram showing the configuration of the signal processing unit of the second embodiment. The components that provide the processing functions of the signal processing unit in the second embodiment are the same as those of the first embodiment shown in FIG. 3. The functions of the second embodiment are therefore described using the reference numerals of the components shown in FIG. 3.

In the second embodiment, the analysis filter banks 31 and 33 divide the signal into 256 channels, numbered in order from the 1st channel, which outputs the lowest-frequency component, to the 256th channel, which outputs the highest-frequency component. The 1st to 128th channels are processed using adaptive filters 41, 42, 43, ..., 44, and the 129th to 256th channels are processed in the same way as in the first embodiment. Adaptive filtering is applied to the 1st to 128th channels because the audio signal is generally concentrated in the lower frequency bands.

For the 129th to 256th channels, the speaker audio processing removes the signal components of the even-numbered channels from the channel signals produced by the analysis filter bank 31 and sends the result to the synthesis filter bank 32. The microphone audio processing removes the signal components of the odd-numbered channels from the channel signals produced by the analysis filter bank 33 and sends the result to the synthesis filter bank 34. With this processing, the same effect as in the first embodiment is obtained for the frequency bands of the 129th to 256th channels.
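The per-channel routing of the second embodiment can be summarized as a small lookup. The sketch below is illustrative only; the function name `plan_channel` and the action strings are hypothetical labels for the behavior described in the text.

```python
ADAPTIVE_CHANNELS = range(1, 129)    # 1st to 128th: adaptive filters 41, 42, 43, ..., 44
SELECTED_CHANNELS = range(129, 257)  # 129th to 256th: processed as in the first embodiment

def plan_channel(channel):
    """Return (speaker_action, microphone_action) for a 1-based channel number."""
    if channel in ADAPTIVE_CHANNELS:
        # The speaker path passes the channel through and also feeds it to the
        # adaptive filter as the reference; the microphone path cancels the echo.
        return ("pass and use as reference", "subtract pseudo echo")
    if channel in SELECTED_CHANNELS:
        if channel % 2 == 1:
            return ("pass", "set to 0")   # odd channel: kept for the speaker
        return ("set to 0", "pass")       # even channel: kept for the microphone
    raise ValueError("channel must be between 1 and 256")
```

Only the upper 128 channels lose components, 64 on each path, while the lower 128 channels keep their full content on both paths.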

Next, the audio processing procedure of the second embodiment is described. FIG. 8 is a flowchart showing the audio processing procedure of the signal processing unit of the second embodiment. Description of the processing steps that are the same as in the first embodiment is omitted.

The speaker audio processing is described first.
[Step S21] The analysis filter bank 31 divides the audio signal to be emitted into 256 channels, from the 1st channel (the lowest frequency component) to the 256th channel.

[Step S22] Of these 256 channels, for the 129th to 256th channels, the even-numbered outputs are set to 0 and the result is output to the synthesis filter bank 32.
[Step S23] Of these 256 channels, for the 1st to 128th channels, the signal components of all channels are sent to the synthesis filter bank 32 and are also sent, as the reference audio for adaptive filtering, to the adaptive filters (processing units) 41, 42, 43, ..., 44 provided for the respective channels.

[Step S24] The signal components of all bands are synthesized into a single audio signal. This audio signal is output from the speaker 21a via the D/A converter 11.
The microphone audio processing is described next.

[Step S25] In parallel with the speaker audio processing, the audio signal input through the microphone 22a is divided into 256 channels by the analysis filter bank 33, which has the same configuration as the analysis filter bank 31.

[Step S26] Of the 256 channels, for the 129th to 256th channels, the odd-numbered outputs are set to 0, the opposite of the speaker audio processing, and the result is output to the synthesis filter bank 34. As in the first embodiment, the output audio to the speaker 21a and the input audio from the microphone 22a then contain, in the 129th to 256th channels, only odd-channel and only even-channel components respectively, so the same frequency components do not overlap.

[Step S27] Of these 256 channels, the signal components of the 1st to 128th channels are sent to the adaptive filters (processing units) 41, 42, 43, ..., 44 provided for the respective channels.

[Step S28] For the 1st to 128th channels, the adaptive filters (processing units) 41, 42, 43, ..., 44 perform adaptive filtering with the LMS (Least Mean Square) algorithm, using the audio signal to be emitted from the speaker (input in step S23) as the reference signal and the audio signal collected by the microphone (input in step S27) as the desired signal. The adaptive filtering itself is not directly related to the present invention, so its description is omitted. The adaptive filtering generates a pseudo echo.

[Step S29] For the 1st to 128th channels, the adaptive filters (processing units) 41, 42, 43, ..., 44 subtract the pseudo echo calculated in step S28 from the audio signal collected by the microphone (input in step S27) and send the result to the synthesis filter bank 34.
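Although the text leaves the adaptive filtering itself out of scope, a generic textbook LMS update may help in reading steps S28 and S29. The sketch below is an assumption-laden illustration and not the filter used in the apparatus: it works on real-valued samples of a single channel, uses 4 taps and a fixed step size `mu`, and simulates the echo path as a known two-tap response. All names are hypothetical.

```python
import random

def lms_echo_cancel(reference, microphone, taps=4, mu=0.1):
    """Generic LMS echo canceller for one channel.

    reference:  samples sent toward the speaker (step S23)
    microphone: samples collected by the microphone (step S27)
    Returns (residual, weights), where residual is microphone minus pseudo echo.
    """
    weights = [0.0] * taps
    history = [0.0] * taps                   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        pseudo_echo = sum(w * h for w, h in zip(weights, history))   # step S28
        error = d - pseudo_echo                                       # step S29
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Simulated echo path (an assumption): 0.5 * current sample + 0.25 * previous sample.
mic = [0.5 * ref[n] + (0.25 * ref[n - 1] if n > 0 else 0.0) for n in range(len(ref))]
residual, weights = lms_echo_cancel(ref, mic)
```

In this noise-free simulation the weights settle close to the simulated echo path (0.5, 0.25, 0, 0) and the residual decays toward zero. A real system would also have near-end speech and noise in the microphone signal.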

[Step S30] The synthesis filter bank 34 synthesizes the components of all bands, namely the signal components of the 129th to 256th channels input in step S26 and the signal components of the 1st to 128th channels input in step S29, into a single audio signal. This audio signal is transmitted to the other device via the audio codec 14.

In the second embodiment described above, the configuration of the first embodiment is combined with echo cancellation by adaptive filters: signal components that strongly affect sound quality are processed by the adaptive filters, and signal components that have little effect on sound quality are processed by the configuration of the first embodiment. In the frequency bands where the first embodiment is applied, the amount of computation is reduced as in the first embodiment, so a system can be designed that balances sound quality against the amount of computation.

The processing of the 129th to 256th channels in the second embodiment is the same as the processing of the first embodiment and can be modified in the same ways.
Various adaptive filtering methods other than the LMS algorithm are known and may be used instead. Various control methods for improving adaptive filter performance are also known, and applying them to the adaptive processing (step S28) improves performance.

In multirate signal processing using a DFT filter bank or the like, a frequency transform is performed, and a known technique therefore downsamples the filter outputs to reduce the amount of computation. This technique can also reduce the amount of computation in this embodiment, particularly in the adaptive filtering.

The first and second embodiments above describe in detail how some components are removed so that the frequency components of the microphone audio and the speaker audio do not overlap. In this configuration, the frequency components removed from the microphone audio are those contained in the sound emitted from the speaker, and the frequency components removed from the sound emitted from the speaker are those that are kept in the sound collected by the microphone; the two sets are complementary. If a near-end device and a far-end device with exactly the same configuration are connected, the frequency components removed at the far-end device's microphone are exactly the components that the near-end device's speaker processing keeps. The audio to be output at the near end has therefore already been removed at the far end, and nothing can be heard. Several ways of solving this problem are described below.

As a first audio processing method, the components removed from the microphone audio can be interpolated from the components that were not removed before the audio is sent to the far-end device. The interpolation method is described in detail below.
As detailed in the first embodiment, the odd-numbered channels of the 128-channel output of the analysis filter bank 33 are set to 0 and sent to the synthesis filter bank 34. In step S13 of the microphone audio processing of the first embodiment shown in FIG. 5, each odd channel to be removed is instead replaced with the data of the adjacent even channel, which is not removed. For example, the 1st channel, which would be removed, is given the same data as the 2nd channel. Likewise, the 3rd channel takes the data of the 4th channel, the 5th takes that of the 6th, and so on up to the 128th channel. Apart from this interpolation, the method can be implemented with the same configuration as the first embodiment.
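The interpolation in step S13 amounts to copying each even channel into the odd channel just below it. A minimal sketch, assuming a frame is a list with one value per channel (index 0 is the 1st channel) and using the hypothetical name `interpolate_odd_from_even`:

```python
def interpolate_odd_from_even(frame):
    """Replace each odd channel with the data of the adjacent even channel.

    Channel 1 takes the data of channel 2, channel 3 that of channel 4, and so on.
    """
    out = list(frame)
    for index in range(0, len(out) - 1, 2):   # index 0, 2, ... are channels 1, 3, ...
        out[index] = out[index + 1]
    return out

example = interpolate_odd_from_even([9.0, 1.0, 9.0, 2.0])
```

The odd-channel values of the input (which contain the echo) are discarded entirely, so the echo-removal property is preserved while every channel carries some signal.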

As a second audio processing method, when the audio sent from the far-end device contains removed components, those components are filled in before the audio is output from the speaker.
The interpolation method is described in detail below. As detailed in the first embodiment, the even channels of the 128-channel output of the analysis filter bank 31 are set to 0 and sent to the synthesis filter bank 32. After step S03 of the speaker audio processing of the first embodiment shown in FIG. 4, it is determined whether any frequency component has been removed. The power of each filter output is accumulated over a fixed period, and if the result is at or below a threshold, that frequency component is judged to have been removed. A component judged to have been removed is replaced with the data of the adjacent odd channel. For example, if the component of the 2nd channel is judged to have been removed, it is given the same data as the 1st channel. After this determination and replacement, processing proceeds to step S04. Apart from this determination and replacement, the method can be implemented with the same configuration as the first embodiment.
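The power-based detection and fill-in can be sketched as follows. This is an illustration under assumptions: the channels are paired as (1, 2), (3, 4), and so on, a channel judged removed is filled from its partner only when the partner itself is above the threshold, and the text's own example (channel 2 filled from channel 1) is one case of this rule. The name `fill_removed_channels` and the threshold value are hypothetical.

```python
def fill_removed_channels(frames, threshold):
    """Fill channels whose accumulated power is at or below `threshold`.

    frames: list of frames over a fixed period, each a list with one value per channel
            (index 0 is the 1st channel).
    """
    n = len(frames[0])
    # Accumulate the power of each channel over all frames in the period.
    power = [sum(abs(frame[i]) ** 2 for frame in frames) for i in range(n)]
    filled = [list(frame) for frame in frames]
    for i in range(n):
        partner = i - 1 if i % 2 == 1 else i + 1   # other channel of the same pair
        if power[i] <= threshold and partner < n and power[partner] > threshold:
            for frame in filled:
                frame[i] = frame[partner]
    return filled

received = [[1.0, 0.0, 2.0, 0.0],
            [3.0, 0.0, 4.0, 0.0]]
restored = fill_removed_channels(received, threshold=1e-6)
```

Here channels 2 and 4 have zero accumulated power, so they are filled from channels 1 and 3 respectively before processing continues with step S04.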

Alternatively, the frequency components other than those removed from the microphone sound of the far-end device are used as the frequency components to be removed from the microphone sound of the near-end device. As described above, the frequency components of the sound emitted from the speaker and the frequency components of the sound collected by the microphone are mutually exclusive. Therefore, if this relationship is reversed between the near-end device and the far-end device, the situation in which the near-end speaker tries to emit components that were already removed from the microphone sound of the far-end device, so that no sound is output at all, does not occur. In step S04 of the speaker sound processing of the first embodiment shown in FIG. 4, the outputs of the even-numbered channels were set to 0; here the step is configured as follows. For the output of step S03, the sum of the powers of the even-numbered channels and the sum of the powers of the odd-numbered channels are each integrated over a certain period. The two values are compared, and the channel group (even or odd) with the smaller value is set to 0. In step S13 of the microphone sound processing shown in FIG. 5, the odd-numbered channels are set to 0 if the even-numbered channels were set to 0 above, and the even-numbered channels are set to 0 if the odd-numbered channels were set to 0 above. Apart from the determination and operation described above, this can be realized with the same configuration as the first embodiment. In the above description, the presence or absence of removed components is determined from the power of the filter bank outputs. However, a configuration is also possible in which the far-end device and the near-end device each send the other device a control signal indicating which components were removed, and control is performed according to that control signal.
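The group selection described above for steps S04 and S13 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names, the list-of-frames layout, and the handling of a tie (falling back to zeroing the even group, as in the first embodiment) are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def choose_group_to_zero(frames: Sequence[Sequence[complex]]) -> str:
    """Return "even" or "odd": the channel group with less integrated power.

    frames[t][i] is the step S03 output of channel number i + 1 at time t
    (hypothetical layout; channel 1 is stored at index 0).
    """
    even_power = 0.0
    odd_power = 0.0
    for frame in frames:
        for index, value in enumerate(frame):
            if (index + 1) % 2 == 0:
                even_power += abs(value) ** 2
            else:
                odd_power += abs(value) ** 2
    # Assumption: on a tie, zero the even group as in the first embodiment.
    return "even" if even_power <= odd_power else "odd"


def complementary(group: str) -> str:
    """Group to zero in the microphone path (step S13)."""
    return "odd" if group == "even" else "even"


def zero_group(frame: Sequence[complex], group: str) -> List[complex]:
    """Set every channel of the given group ("even" or "odd") to 0."""
    remainder = 0 if group == "even" else 1
    return [
        0.0 if (index + 1) % 2 == remainder else value
        for index, value in enumerate(frame)
    ]
```

In use, the speaker path would call `choose_group_to_zero` on the recent step S03 output and apply `zero_group` with the result, while the microphone path would apply `zero_group` with `complementary(...)` of that result, so that the two paths never keep the same channel.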

Performing any of the above processes avoids the problem that arises, for example, when a frequency component removed at the microphone of the near-end device is selected as a component to be output from the speaker of the far-end device: the audio signal component to be output has already been removed, so that no signal component remains.

FIG. 1 is a conceptual diagram of the invention applied to the embodiments.
FIG. 2 is a diagram showing the configuration of the video conference system according to an embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of the signal processing unit according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the speaker sound processing procedure of the near-end device including the signal processing unit of the first embodiment.
FIG. 5 is a flowchart showing the microphone sound processing procedure of the near-end device including the signal processing unit of the first embodiment.
FIG. 6 is a diagram showing the characteristics of an audio signal.
FIG. 7 is a diagram showing the configuration of the signal processing unit of the second embodiment.
FIG. 8 is a flowchart showing the sound processing procedure of the signal processing unit of the second embodiment.

Explanation of symbols

1 ... Sound emission signal dividing means, 2 ... Sound collection signal dividing means, 3 ... Band selection means, 4 ... Sound emission signal synthesizing means, 5 ... Sound collection signal synthesis means

Claims

A speech processing apparatus that performs voice processing when a full-duplex call is made in a loudspeaker call system including a speaker and a microphone, the apparatus comprising:
sound emission signal dividing means for dividing a sound emission signal, obtained from another device and to be output from the speaker, into a plurality of frequency bands;
sound collection signal dividing means for dividing a sound collection signal input from the microphone into the plurality of frequency bands;
band selection means for dividing a predetermined frequency band range, which covers the entire range of the plurality of frequency bands, into frequency bands in which the sound emission signal is selected and frequency bands in which the sound collection signal is selected, and for removing, in each frequency band, the signal component of whichever of the sound emission signal and the sound collection signal was not selected;
sound emission signal synthesizing means for synthesizing the band-divided sound emission signal from which the signal components of the frequency bands not selected by the band selection means have been removed; and
sound collection signal synthesis means for synthesizing the band-divided sound collection signal from which the signal components of the frequency bands not selected by the band selection means have been removed.

The speech processing apparatus according to claim 1, wherein the sound emission signal dividing means and the sound collection signal dividing means separate the sound emission signal and the sound collection signal into the plurality of frequency bands by performing a conversion process from the time domain to the frequency domain, including a Fourier transform process.

The speech processing apparatus according to claim 1, wherein the sound emission signal dividing means and the sound collection signal dividing means separate the sound emission signal and the sound collection signal into the plurality of frequency bands by performing multi-rate signal processing using a predetermined filter bank.

The speech processing apparatus according to claim 1, wherein the band selection means sequentially assigns numbers to the frequency bands included in the predetermined frequency band range; removes, from the band-divided sound emission signal, the signal components of one group of frequency bands, either the even-numbered or the odd-numbered; and, for the band-divided sound collection signal, keeps the signal components of the group of frequency bands whose sound emission signal components were removed and removes the signal components of the other group, so that the frequency bands contained in the sound emission signal and those contained in the sound collection signal do not overlap.

The speech processing apparatus according to claim 4, wherein whether the band selection means selects the odd-numbered or the even-numbered frequency bands as those from which the signal components of the band-divided sound emission signal and sound collection signal are removed is determined so that the combination does not overlap with the selection of frequency bands from which signal components are removed at the other device that is the other party of the full-duplex call.

The speech processing apparatus according to claim 1, wherein the band selection means determines, according to the level of the signal component in each frequency band of the sound emission signal obtained from the other device and band-divided by the sound emission signal dividing means, whether the signal component of a given frequency band was removed at the other device, and, when a frequency band from which the signal component was removed is detected, selects the detected frequency band as a frequency band from which the signal component of the sound emission signal is removed.

The speech processing apparatus according to claim 1, wherein the band selection means determines the frequency bands in which the sound emission signal is selected according to the sound characteristics of the sound emission signal input from the other device, so that the signal components of the sound emission signal in the frequency bands having those sound characteristics are emitted.

The speech processing apparatus according to claim 7, wherein the band selection means dynamically selects the frequency bands in which the sound emission signal is selected, according to the characteristics of the sound.

The speech processing apparatus according to claim 1, wherein the band selection means performs voice processing using an adaptive filter for a part of the divided frequency band range, and, for the remaining frequency band range, performs voice processing that removes the signal components of the sound emission signal and the sound collection signal in the predetermined frequency bands.

The speech processing apparatus according to claim 6, wherein the band selection means selects the frequency band to be processed using the adaptive filter according to the characteristics of the voice of the sound emission signal input from the other device.

The speech processing apparatus according to claim 1, wherein, for the band-divided sound collection signal from which the signal components of the predetermined frequency bands have been removed, the band selection means interpolates the signal component of the sound collection signal in a frequency band from which the signal component was removed, using the signal component of the sound collection signal in a frequency band from which the signal component was not removed.

The speech processing apparatus according to claim 1, wherein the band selection means determines, according to the level of the signal component in each frequency band of the sound emission signal obtained from the other device and band-divided by the sound emission signal dividing means, whether the signal component of a given frequency band was removed at the other device, and, when a frequency band from which the signal component was removed is detected, interpolates the signal component of the sound emission signal in that frequency band using the signal component of the sound emission signal in a frequency band from which the signal component was not removed.