JPWO2011004503A1

JPWO2011004503A1 - Noise removing apparatus and noise removing method

Info

Publication number: JPWO2011004503A1
Application number: JP2011521766A
Authority: JP
Inventors: 真人戸上; 洋平川口; 浩明小窪
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-07-08
Filing date: 2009-07-08
Publication date: 2012-12-13
Anticipated expiration: 2029-07-08
Also published as: JP5382745B2; WO2011004503A1

Abstract

A noise removal device removes noise from sound collected by a microphone array composed of a plurality of microphones. The sound collected by the microphone array is input to the noise removal device as an analog signal. The noise removal device generates a plurality of noise suppression filters based on the noise contained in the sound collected by the microphone array, applies each noise suppression filter to the digital signal produced by an AD converter, selects the noise suppression filter for which the volume of the noise-removed digital signal is smallest, and uses the selected noise suppression filter to remove noise from the digital signal input from the AD converter.

Description

The present invention relates to a noise suppression technique for extracting only a specific sound from sounds collected using a plurality of microphones.

In general, techniques for extracting sound from a specific direction out of sounds collected using a plurality of microphones include the minimum variance beamformer method (see, for example, O. L. Frost III, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, 1972). The minimum variance beamformer method extracts only the signal (sound) from a specific direction by applying a linear filter with spatial directivity to the input signals from a plurality of microphones.
However, for noise arriving from inside a device, such as the operating sound of an actuator inside a robot, the spatial position of the sound source changes from moment to moment according to the operating state of the actuator. With the minimum variance beamformer method it is therefore difficult to form a spatial null accurately, and suppression performance is low.
There is also a noise suppression method called spectral subtraction that can be applied even with a single microphone (see, for example, S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. ASSP, vol. 27, no. 2, pp. 113-120, 1979). Rather than removing noise by spatial directivity, spectral subtraction estimates the statistics of the noise and removes noise whose amplitude characteristics are relatively stationary.
However, when the amplitude characteristics of the noise are non-stationary, spectral subtraction not only loses removal performance but also badly degrades the speech to be acquired. For example, the operating sound of an actuator inside a robot is non-stationary noise whose sound quality changes with the operating state of the actuator, so this problem arises.

The problem to be solved is to remove, with high accuracy, non-stationary noise generated inside a robot, such as actuator operating sound, whose source position and amplitude characteristics change from moment to moment.
A representative example of the present invention is as follows. It is a noise removal device that removes noise from sound collected by a microphone array composed of a plurality of microphones, wherein the sound collected by the microphone array is input to the noise removal device as an analog signal. The noise removal device comprises a microprocessor, a storage device connected to the microprocessor, a memory connected to the microprocessor, and an AD converter that is connected to the microprocessor and converts the analog signal into a digital signal. The storage device stores a noise suppression filter generation program, which generates noise suppression filters for removing noise contained in the sound collected by the microphone array, and a noise removal program, which uses the noise suppression filters to remove that noise. The noise removal device generates a plurality of the noise suppression filters based on the noise contained in the sound collected by the microphone array, applies each of the noise suppression filters to the digital signal produced by the AD converter, selects the noise suppression filter for which the volume of the noise-removed digital signal is smallest, and uses the selected noise suppression filter to remove noise from the digital signal input from the AD converter.
According to the present invention, noise can be removed accurately in accordance with the attributes of the noise.

FIG. 1 is a block diagram of a hardware configuration of an internal noise removal apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram of an example of a program stored in the storage device according to the first embodiment of this invention.
FIG. 3 is a block diagram showing an example of processing executed by the internal noise removal apparatus of the present invention.
FIG. 4 is a detailed block diagram of the internal noise removal processing according to the first embodiment of this invention.
FIG. 5 is an explanatory diagram illustrating an example of an internal noise signal according to the first embodiment of this invention.
FIG. 6 is an explanatory diagram illustrating an example of types of internal noise according to the first embodiment of this invention.
FIG. 7 is an example timing chart that defines the types of internal noise signals corresponding to the operating states of the three actuators according to the first embodiment of the present invention.
FIG. 8 is a flowchart of power minimization filtering processing according to the first embodiment of this invention.
FIG. 9 is a timing chart showing an example of the target sound covariance update, noise suppression filter update, noise covariance matrix selection, and internal noise generation status in the first embodiment of the present invention.
FIG. 10 is a block diagram illustrating details of the noise covariance estimation process according to the first embodiment of this invention.
FIG. 11 is an explanatory diagram illustrating a data structure of the vector V_{q,j}(z, τ) according to the first embodiment of this invention.
FIG. 12 is a flowchart illustrating details of processing in clustering according to the first embodiment of this invention.
FIG. 13 is a block diagram of a hardware configuration of an internal noise removal device in the audio conference system according to the second embodiment of this invention.
FIG. 14 is an explanatory diagram illustrating an example of an internal noise signal corresponding to the operation sound of the keyboard according to the second embodiment of this invention.
FIG. 15 is a block diagram of processing executed by each device according to the second embodiment of this invention.
FIG. 16 is an explanatory diagram illustrating an example of a user use scene in the audio conference system according to the second embodiment of this invention.
FIG. 17 is a block diagram of a hardware configuration of an internal noise removing device in an audio conference system including a touch panel according to the third embodiment of the present invention.
FIG. 18 is an explanatory diagram illustrating an example of an internal noise signal corresponding to the operation sound of the touch panel according to the third embodiment of this invention.
FIG. 19 is an explanatory diagram illustrating an example of a user use scene in the audio conference system including the touch panel according to the third embodiment of this invention.
FIG. 20 is a block diagram of a configuration of speech recognition processing including internal noise removal processing according to the fourth embodiment of this invention.
FIG. 21 is a block diagram showing details of the internal noise removal processing according to the fifth embodiment of the present invention.
FIG. 22 is an explanatory diagram illustrating an example of the index Ind(j, τ) according to the first embodiment of this invention.
FIG. 23 is an explanatory diagram showing an example of a robot having a voice recognition function according to the present invention.
FIG. 24 is an explanatory diagram showing a device configuration of a video conference system including a projector according to the present invention.
FIG. 25 is an explanatory diagram showing an example of noise generated from the projector in the present invention.

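The present invention efficiently removes noise generated by actuators and the like in a system having a sound recording function, such as a robot having a speech recognition function or a video conference system having a projector.
FIG. 23 is an explanatory diagram showing an example of a robot having a speech recognition function according to the present invention.
The noise removal mechanism of the present invention is implemented, for example, in a robot 2201 having a speech recognition function as shown in FIG. 23.
The robot 2201 includes an arm control actuator 2202 for controlling the arms of the robot 2201 and a leg control actuator 2203 for controlling the legs of the robot 2201. The robot 2201 also includes a speech recognition microphone array 2204 for recognizing the voice of a user who interacts with the robot 2201.
Speech recognition normally assumes that the microphone array 2204 collects only the user's voice, with no surrounding sound mixed in. It is therefore known that speech recognition performance degrades when the microphone array 2204 collects sound contaminated by surrounding sounds.
In the robot 2201 shown in FIG. 23, the microphones collect the user's voice together with the operating sounds of the arm control actuator 2202 and the leg control actuator 2203.
The position at which the operating sounds of the arm control actuator 2202 and the leg control actuator 2203 are generated changes as the arms and legs move. The operating sounds also differ depending on whether each actuator is starting, running, or stopping.
The present invention efficiently removes sounds generated inside a device, such as actuator operating sounds.
FIG. 24 is an explanatory diagram showing a device configuration of a video conference system including a projector according to the present invention.
At site A, the microphone array 2301 collects the voice of a person speaking at site A, and the collected audio is sent to the computer 2305.
The computer 2305 extracts only the voice of the person speaking at site A from the sound collected by the microphone array 2301 and transmits the extracted voice to the other site B via the network 2306. The scene at site A captured by the camera 2304 at site A is likewise transmitted to site B.
At site B, the received voice and the image captured by the camera 2304 are taken into the computer 2305. The image captured by the camera 2304 is projected by the projector 2303. The received voice is played through the speaker 2302.
Similarly, the voice collected by the microphone array 2301 installed at site B and the image captured by the camera 2304 are taken into the computer 2305 and then transmitted to site A via the network 2306.
At site A, as at site B, the image captured by the camera 2304 is projected by the projector 2303 and the received voice is played through the speaker 2302.
The conference system shown in FIG. 24 has the problem that noise generated by the projector 2303, such as fan noise, is mixed into the sound collected by the microphone array 2301.
FIG. 25 is an explanatory diagram showing an example of noise generated from the projector 2303 in the present invention.
The projector 2303 generates a characteristic sound at each stage of its operation. For each noise generated by the projector 2303, an operation sound name 2400 is defined in association with a generation timing 2410. The operation sound name 2400 is an identifier for identifying noise generated by the projector 2303. The generation timing 2410 indicates when the noise corresponding to the operation sound name 2400 occurs.
In the example shown in FIG. 25, the operation sound names 2400 are "projector startup sound" and "projector operating sound". The generation timing 2410 of the "projector startup sound" is "when the projector starts up", and the generation timing 2410 of the "projector operating sound" is "while the projector is operating".
The following describes a configuration and method for removing internal noise generated inside a device, such as the actuator operating sounds of the robot 2201 and the fan noise of the projector 2303 described above.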
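First Embodiment
FIG. 1 is a block diagram of the hardware configuration of the internal noise removal device according to the first embodiment of the present invention.
The first embodiment describes equipment including an internal noise removal device 100, a microphone array 101, an actuator control device 104, and an actuator 105.
The internal noise removal device 100 includes an AD converter 102, a central processing unit 103, a storage device 106, and a volatile memory 107.
The AD converter 102 converts an input analog signal into a digital signal that the central processing unit 103 can process. In the example shown in FIG. 1, the analog signal from the microphone array 101 is input to the AD converter 102.
The central processing unit 103 executes various programs loaded into the volatile memory 107. Specifically, the central processing unit 103 removes internal noise from the digital signal produced by the AD converter 102 and extracts only the desired sound (hereinafter referred to as the internal-noise-removed sound). The extracted internal-noise-removed sound is output to an external playback device (not shown) and played by that device.
The storage device 106 stores programs for removing internal noise and data on internal noise. The programs stored in the storage device 106 are described later with reference to FIG. 2. The volatile memory 107 provides working memory during program execution.
The actuator control device 104 controls the actuator 105. For example, it controls the actuators (the arm control actuator 2202 and the leg control actuator 2203) installed in the arms and legs of the robot 2201 equipped with a speech recognition device. The actuator control device 104 controls the actuator 105 based on an actuator control signal.
The actuator 105 is, for example, an actuator (the arm control actuator 2202 or the leg control actuator 2203) installed in an arm or leg of the robot 2201 equipped with a speech recognition device. The sound generated when the actuator 105 operates propagates and is collected by the microphone array 101.
The sound collected by the microphone array 101 is a mixture of the desired sound required by a sound processing application (not shown) and the noise generated when the actuator operates (hereinafter referred to as internal noise).
The internal noise removal device 100 may include at least one of the microphone array 101, the actuator control device 104, and the actuator 105. The AD converter 102 or the storage device 106 may be provided outside the internal noise removal device 100.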
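FIG. 2 is a block diagram of an example of the programs stored in the storage device 106 according to the first embodiment of the present invention.
The storage device 106 stores a covariance matrix learning program 1061 and a noise suppression program 1062.
The covariance matrix learning program 1061 generates the covariance matrices used to generate noise suppression filters. The noise suppression program 1062 selects the optimum noise suppression filter for the sound collected by the microphone array 101 and removes the internal noise.
The storage device 106 may store other programs.
FIG. 3 is a block diagram of an example of the processing executed by the internal noise removal device 100 of the present invention.
The internal noise removal device 100 executes noise covariance estimation processing 301 and internal noise removal processing 303. Specifically, the noise covariance estimation processing 301 is executed when the central processing unit 103 runs the covariance matrix learning program 1061, and the internal noise removal processing 303 is executed when the central processing unit 103 runs the noise suppression program 1062.
The noise covariance estimation processing 301 calculates noise covariance matrices. Specifically, an internal noise signal containing information on the internal noise is input to the central processing unit 103, and a noise covariance matrix is calculated based on the input internal noise signal. The information contained in the internal noise signal includes attributes of the internal noise, such as its type and the timing at which it occurred.
The noise covariance estimation processing 301 calculates a statistic of the internal noise (a noise covariance matrix) for each attribute of the internal noise. The specific calculation method is described later with reference to FIG. 10.
The calculated noise covariance matrices are stored in the volatile memory 107 or the storage device 106. Hereinafter, the noise covariance matrices stored in the volatile memory 107 or the storage device 106 are referred to as the noise covariance matrix DB 302.
It is desirable for the internal noise removal device 100 to perform a calibration step in advance, that is, to execute the noise covariance estimation processing 301 on internal noise collected beforehand.
The internal noise removal processing 303 removes internal noise from sound, collected by the microphone array 101, in which internal noise and the target sound are mixed.
Specifically, the internal noise removal device 100 uses the noise covariance matrix DB 302 to remove the internal noise from the mixed sound actually collected by the microphone array 101, and outputs sound in which only the target sound is extracted. Details of the internal noise removal processing 303 are described later with reference to FIG. 4.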
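FIG. 4 is a detailed block diagram of the internal noise removal processing 303 according to the first embodiment of the present invention.
The internal noise removal processing 303 includes multichannel frequency analysis 401, target sound covariance update 403, noise suppression filter update 404, noise suppression filtering 405, noise covariance matrix selection 406, and power minimization filtering 408. Each of these is executed by the central processing unit 103.
The internal noise removal processing 303 is executed each time a fixed number of samples (the frame shift L_{shift}) of the digitized input signal has been obtained on each channel. In this embodiment, L_{shift} is set to a duration on the order of several tens of milliseconds.
For example, when the sampling rate of the AD converter 102 is 8 kHz, L_{shift} is set to about 256 points. Hereinafter, the processing executed each time one frame shift of samples has been obtained is referred to as frame processing.
In one frame processing, for the input signal from each microphone, the processing is applied to the most recent frame-size block of samples (L_{frame}) as of the time the input arrived. Let τ be the index denoting the frame number.
In frame τ, the digital signal from point (τ × L_{shift}) to point (τ × L_{shift} + L_{frame} − 1) is processed for each microphone. Let p be the index counting points from the first point of frame τ.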
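The frame indexing described above can be illustrated with a minimal sketch. The function names and the frame size of 512 points used in the usage note are assumptions for illustration; the text only fixes the frame shift (about 256 points at 8 kHz) and the index formula.

```python
def frame_bounds(tau, l_shift, l_frame):
    """Return the (first, last) sample indices, inclusive, processed in frame tau."""
    first = tau * l_shift
    last = tau * l_shift + l_frame - 1
    return first, last


def frames(signal, l_shift, l_frame):
    """Yield successive frames of one microphone channel (a list of samples)."""
    tau = 0
    while tau * l_shift + l_frame <= len(signal):
        first, last = frame_bounds(tau, l_shift, l_frame)
        yield signal[first:last + 1]
        tau += 1


def shift_duration_ms(l_shift, sampling_rate_hz):
    """Duration of one frame shift in milliseconds."""
    return 1000.0 * l_shift / sampling_rate_hz
```

With a hypothetical L_{frame} of 512 points, `frame_bounds(2, 256, 512)` covers points 512 through 1023, and `shift_duration_ms(256, 8000)` gives 32 ms, consistent with the "several tens of milliseconds" stated above.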
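The input signal given by equation (1) is input to the internal noise removal device 100 from the m-th microphone.

First, in the multichannel frequency analysis 401, the internal noise removal device 100 applies the discrete Fourier transform of equation (2) to the data from p = 0 to p = L_{frame} − 1 of the input signal from each microphone.

Equation (2) yields the time-frequency domain signal x_m(f, τ) of each microphone. Here f denotes frequency, and the window function w(p) is, for example, a Hanning window as given by equation (3).

The discrete Fourier transform may be computed with an algorithm such as the fast Fourier transform.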
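As a hedged illustration of the multichannel frequency analysis, the sketch below applies a Hanning-windowed discrete Fourier transform to one frame and then stacks the per-microphone spectra by frequency. The exact forms of equations (2) and (3) are not reproduced in this text, so the periodic Hanning window and the textbook DFT used here are assumptions, and all function names are hypothetical. A direct O(N²) sum is used for clarity; an FFT would normally replace it.

```python
import cmath
import math


def hann(p, l_frame):
    """Periodic Hanning window (an assumed form of w(p))."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * p / l_frame)


def windowed_dft(frame):
    """Windowed DFT of one frame of one microphone: returns x_m(f, tau) for all f."""
    n = len(frame)
    return [
        sum(hann(p, n) * frame[p] * cmath.exp(-2j * math.pi * f * p / n)
            for p in range(n))
        for f in range(n)
    ]


def stack_channels(spectra):
    """Turn spectra[m][f] into x[f] = [x_1(f), ..., x_M(f)], one vector per frequency."""
    return [list(column) for column in zip(*spectra)]
```

For M microphones, `stack_channels([windowed_dft(frame_m) for frame_m in frames_of_all_mics])` gives one length-M vector per frequency bin, which is the layout the later covariance and filtering steps operate on.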
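The time-frequency domain signals of the individual microphones are grouped by frequency, as shown in equation (4), and processed.

Here M is the number of microphones, and T is the operator denoting the transpose of a vector or matrix.
The internal noise removal device 100 then executes the target sound covariance update 403, the noise suppression filter update 404, the noise suppression filtering 405, the noise covariance matrix selection 406, and the power minimization filtering 408 on the per-frequency signal of equation (4).
In the target sound covariance update 403, the internal noise removal device 100 updates the target sound covariance matrix R_s(f) using equation (5).

Here α is an update coefficient taking a value from 0 to 1, and * denotes the transpose of a matrix or vector (the conjugate transpose for complex-valued signals).
The internal noise removal device 100 refers to the internal noise signal and executes the target sound covariance update 403 only while no internal noise is occurring. While internal noise is occurring, the target sound covariance matrix is not updated and the value from before the internal noise began is retained. This is because, if the target sound covariance matrix were updated while internal noise is occurring, information on the internal noise would be mixed into it, and the noise suppression filtering 405 would then emphasize the noise instead of suppressing it.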
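The gated update can be sketched as follows. Equation (5) itself is not reproduced in this text, so the recursive form R_s ← α R_s + (1 − α) x x^* is an assumption based on the stated role of α as an update coefficient between 0 and 1; the function names are hypothetical.

```python
def outer(x):
    """Outer product x x^* of a complex vector given as a list."""
    return [[xi * xj.conjugate() for xj in x] for xi in x]


def update_target_cov(r_s, x, alpha, noise_active):
    """Assumed recursive update of R_s(f); frozen while internal noise is active."""
    if noise_active:
        # Hold the last value so internal noise does not leak into R_s(f).
        return r_s
    xx = outer(x)
    m = len(x)
    return [[alpha * r_s[i][j] + (1.0 - alpha) * xx[i][j] for j in range(m)]
            for i in range(m)]
```

The `noise_active` flag would be set between an "internal noise start" signal and the matching "internal noise end" signal, so that the matrix is only adapted in noise-free periods.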
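FIG. 5 is an explanatory diagram showing an example of the internal noise signal according to the first embodiment of the present invention.
For the internal noise signal, an internal noise signal name 500 is defined in association with a generation timing 510. The internal noise signal name 500 is an identifier for identifying the internal noise. The generation timing 510 indicates when the internal noise corresponding to the internal noise signal name 500 occurs.
In the example shown in FIG. 5, the internal noise signal names 500 are "internal noise start" and "internal noise end". The generation timing 510 of "internal noise start" is "the time at which internal noise began", and the generation timing 510 of "internal noise end" is "the time at which internal noise stopped".
In other words, one internal noise signal is output when internal noise begins and another is output when it stops.
Each of these two internal noise signals also includes information indicating the type of internal noise.
FIG. 6 is an explanatory diagram showing an example of the types of internal noise according to the first embodiment of the present invention. FIG. 6 shows an example of the types of internal noise for which the internal noise signal name 500 is "internal noise start".
For the types of internal noise, an operation sound name 600 is defined in association with a generation timing 610. The operation sound name 600 is an identifier that identifies the internal noise generated by the operation of the actuator 105. The generation timing 610 is the timing at which the internal noise corresponding to the operation sound name 600 occurs.
For example, the operating sound whose operation sound name 600 is "Motor 1" is defined as the sound generated when motor 1 is driven. Similarly, the operating sound whose operation sound name 600 is "Motor 2" is defined as the sound generated when motor 2 is driven.
When the operating sounds of motor 1 and motor 2 are present at the same time, the operation sound name 600 is "Motor 1·2", which is defined as an operating sound distinct from those of motor 1 and motor 2.
FIG. 7 is an example timing chart that defines the types of internal noise signal according to the operating states of three actuators in the first embodiment of the present invention.
In the example shown in FIG. 7, when only motor 1 starts operating, the operation sound name 600 is defined as "Motor 1". When only motor 2 starts operating, the operation sound name 600 is defined as "Motor 2". When only motor 3 starts operating, the operation sound name 600 is defined as "Motor 3". When motor 1 and motor 2 start operating, the operation sound name 600 is defined as "Motor 1·2".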
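As a small illustration of how the set of running actuators could be turned into an operation sound name, the sketch below builds names of the form used in FIG. 6 and FIG. 7. The function name, the use of a Python set for the motor state, and the `separator` parameter are assumptions for illustration only.

```python
def operation_sound_name(active_motors, separator="·"):
    """Map the set of running motor numbers to a name such as 'Motor 1·2'.

    Returns None when no motor is running (no internal noise).
    """
    if not active_motors:
        return None
    return "Motor " + separator.join(str(n) for n in sorted(active_motors))
```

A combination of motors gets its own name, so `operation_sound_name({1, 2})` is treated as a noise type distinct from those of motor 1 alone and motor 2 alone, as described above.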
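The description now returns to FIG. 4.
In the noise covariance matrix selection 406, the internal noise removal device 100 selects a noise covariance matrix R_n(f), based on the input internal noise signal, from the storage device 106 or volatile memory 107 that holds the noise covariance matrix DB 302, and outputs the selected noise covariance matrix R_n(f) to the noise suppression filter update 404. When the noise covariance matrix DB 302 holds more than one noise covariance matrix for a single internal noise signal, the internal noise removal device 100 may select a plurality of noise covariance matrices R_n(f).
In the noise suppression filter update 404, the internal noise removal device 100 generates a noise suppression filter w_i(f) from the target sound covariance matrix R_s(f) and the noise covariance matrix R_n(f). For example, the noise suppression filter w_i(f) is generated using equation (6).

Here maxeig is the operator that computes the eigenvector giving the largest eigenvalue.
The index i denotes the i-th noise covariance matrix. That is, when a plurality of noise covariance matrices R_n(f) are selected in the noise covariance matrix selection 406, the internal noise removal device 100 generates a noise suppression filter for each noise covariance matrix.
In the noise suppression filtering 405, the internal noise removal device 100 applies the noise suppression filter w_i(f) corresponding to each noise covariance matrix to the input signal x(f, τ), as shown in equation (7), and calculates the noise-suppressed signal y_i(f, τ).

The internal noise removal device 100 outputs the noise-suppressed signal y_i(f, τ) to the power minimization filtering 408.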
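The filter generation and filtering steps can be sketched under explicit assumptions. Equations (6) and (7) are not reproduced in this text; the sketch assumes the common maximum-SNR form w_i(f) = maxeig(R_n(f)^{-1} R_s(f)) and the output y_i(f, τ) = w_i(f)^* x(f, τ). It is restricted to two microphones so that the matrix inverse can be written out, and it finds the principal eigenvector by power iteration. All function names are hypothetical.

```python
def inv2(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def maxeig2(a, iterations=200):
    """Principal eigenvector of a 2x2 matrix by power iteration.

    Assumes the start vector is not orthogonal to the principal eigenvector.
    """
    v = [1.0 + 0j, 1.0 + 0j]
    for _ in range(iterations):
        v = [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]
        norm = (abs(v[0]) ** 2 + abs(v[1]) ** 2) ** 0.5
        v = [v[0] / norm, v[1] / norm]
    return v


def noise_suppression_filter(r_s, r_n):
    """Assumed form of equation (6): w = maxeig(R_n^{-1} R_s)."""
    return maxeig2(matmul2(inv2(r_n), r_s))


def apply_filter(w, x):
    """Assumed form of equation (7): y = w^* x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))
```

In this sketch, a target arriving in phase at both microphones and a noise source arriving in anti-phase yield a filter that passes the vector [1, 1] and cancels [1, −1]. One such filter would be built per selected noise covariance matrix.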
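The internal noise removal device 100 does not execute the noise suppression filtering 405 while no internal noise is occurring. During that time, the internal noise removal device 100 may instead take the input signal from any one microphone in the multichannel frequency analysis 401 and output it to the power minimization filtering 408 as the noise-suppressed signal y_i(f, τ).
In the power minimization filtering 408, the internal noise removal device 100 finds, among the noise-suppressed signals y_i(f, τ) input from the noise suppression filtering 405, the signal y_{min}(f, τ) whose squared absolute value |y_i(f, τ)|² is smallest.
Instead of the absolute value |y_i(f, τ)|, the internal noise removal device 100 may take as y_{min}(f, τ) the noise-suppressed signal y_i(f, τ) for which P_i(f, τ), calculated from a moving average of the power as shown in equation (8), is smallest.

Here β is the coefficient used to calculate the moving average and takes a value from 0 to 1.
FIG. 8 is a flowchart of the power minimization filtering 408 according to the first embodiment of the present invention.
In initialization 801, the internal noise removal device 100 sets the variables to their initial values. Specifically, it sets the noise suppression filter index i to "0", sets the minimum value P_{min} to |y_0(f, τ)|², and sets i_{min}, the index of the noise suppression filter that gives P_{min}, to "0". After setting the initial values, the internal noise removal device 100 proceeds to decision 805.
In decision 805, the internal noise removal device 100 determines whether the noise suppression filter index i is greater than the total number of noise suppression filters i_{max}, that is, whether all noise suppression filters have been processed.
If the index i is determined to be less than or equal to i_{max}, the internal noise removal device 100 proceeds to noise suppression filtering 802.
In noise suppression filtering 802, the internal noise removal device 100 applies the noise suppression filter to the input signal x(f, τ), calculates the noise-suppressed signal y_i(f, τ), and proceeds to decision 803.
In decision 803, the internal noise removal device 100 determines whether the squared absolute value of the noise-suppressed signal y_i(f, τ) is smaller than P_{min}. When i = 0, the internal noise removal device 100 skips the comparison with P_{min} in decision 803, updates the index i to "1", returns to decision 805, and performs the same processing for the next noise suppression filter.
If the squared absolute value of the noise-suppressed signal y_i(f, τ) is determined to be greater than or equal to P_{min}, the internal noise removal device 100 updates i to i + 1, returns to decision 805, and performs the same processing for the next noise suppression filter.
If the squared absolute value of the noise-suppressed signal y_i(f, τ) is determined to be smaller than P_{min}, the internal noise removal device 100 proceeds to minimum update 804.
In minimum update 804, the internal noise removal device 100 updates P_{min} and i_{min}, updates i to i + 1, returns to decision 805, and processes the next noise suppression filter.
If the index i is determined in decision 805 to be greater than i_{max}, the internal noise removal device 100 determines that all noise suppression filters have been processed, outputs the index and the noise-suppressed signal that give P_{min} to the time domain transformation 409 as i_{min} and y_{min}(f, τ), and ends the processing.
Through this processing, the internal noise removal device 100 obtains the noise suppression filter that yields the smallest volume when applied to the signal from the microphone array 101, together with the corresponding output signal.
In other words, in this embodiment a plurality of noise suppression filters are generated for each attribute of the internal noise, such as its type and timing, and the filter that yields the smallest volume after suppression is selected from among them. Noise can therefore be removed accurately.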
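The selection loop of FIG. 8 can be condensed into the following sketch. It iterates over a list of already filtered outputs y_i(f, τ) for one frequency and frame rather than reproducing the flowchart's index tests literally. The moving-average form P_i(f, τ) = β P_i(f, τ − 1) + (1 − β)|y_i(f, τ)|² is an assumption, since equation (8) is not reproduced in this text; function names are hypothetical.

```python
def select_min_power(outputs):
    """Return (i_min, y_min) for the output with the smallest |y_i|^2."""
    i_min = 0
    p_min = abs(outputs[0]) ** 2          # initialization 801
    for i in range(1, len(outputs)):      # decisions 805 and 803
        p = abs(outputs[i]) ** 2
        if p < p_min:                     # minimum update 804
            p_min, i_min = p, i
    return i_min, outputs[i_min]


def smoothed_power(p_prev, y, beta):
    """Assumed moving-average power used as an alternative selection criterion."""
    return beta * p_prev + (1.0 - beta) * abs(y) ** 2
```

Selecting on the smoothed power rather than the instantaneous power would make the choice of filter less sensitive to frame-to-frame fluctuations, at the cost of reacting more slowly when the noise type changes.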
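In the time domain transformation 409, the internal noise removal device 100 calculates the time-domain noise-suppressed signal y_{min}(p) by applying the inverse Fourier transform of equation (9) to the noise-suppressed signal y_{min}(f, τ) calculated for each frequency.

Here f_{max} is the frequency corresponding to 0.5 times the sampling rate.
In the time domain transformation 409, the internal noise removal device 100 applies a function corresponding to the inverse of the window function to the time-domain noise-suppressed signal y_{min}(p), sums the results across frames, and outputs the final signal.
FIG. 9 is a timing chart showing an example of the target sound covariance update 403, the noise suppression filter update 404, the noise covariance matrix selection 406, and the occurrence of internal noise in the first embodiment of the present invention.
As shown in FIG. 9, the target sound covariance update 403 is executed during periods in which no internal noise is occurring. During those periods, the noise suppression filter update 404 and the noise covariance matrix selection 406 are not executed.
When internal noise occurs, the actuator control device 104 outputs an internal noise start signal at that moment. The internal noise start signal includes attributes of the internal noise, such as the timing at which it occurred and its type.
One way to obtain the internal noise start signal is to use the drive signal sent to the actuator control device 104. That is, the drive signal sent to the actuator control device 104 is input to the central processing unit 103 as the internal noise start signal.
For example, when internal noise of type "A" occurs, the start signal for internal noise A is input to the noise covariance matrix selection 406. The start signal for internal noise A is also input to the noise suppression filter update 404 via the target sound covariance update 403.
In the noise covariance matrix selection 406, the internal noise removal device 100 reads the noise covariance matrix corresponding to internal noise A from the noise covariance matrix DB 302.
In the noise suppression filter update 404, the internal noise removal device 100 generates a noise suppression filter from the target sound covariance matrix and the noise covariance matrix corresponding to internal noise A. The noise suppression filter is generated using, for example, equation (6).
While internal noise A is occurring, the internal noise removal device 100 may execute the noise suppression filter update 404 every frame. The target sound covariance update 403 is not executed while internal noise A is occurring.
When the end signal for internal noise A is input to the target sound covariance update 403, the internal noise removal device 100 resumes the target sound covariance update 403.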
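The noise covariance estimation processing 301 is described in detail below.
FIG. 10 is a block diagram illustrating details of the noise covariance estimation processing 301 according to the first embodiment of the present invention.
The noise covariance estimation processing 301 includes multichannel frequency analysis 1001, feature extraction 1002, feature vector generation 1003, clustering 1004, and noise covariance update 1005. Each of these is executed by the central processing unit 103.
In this embodiment, training internal noise signals are prepared in advance for each attribute of the internal noise before the noise covariance estimation processing 301 is executed. The prepared training signals include the time period (timing) in which the noise occurred, so the noise covariance estimation processing 301 can extract and learn from only the internal noise signal of the relevant time period. More than one training signal may be prepared.
In the multichannel frequency analysis 1001, the internal noise removal device 100 converts each internal noise signal, frame by frame, into a frequency-domain signal x(f, τ). Here q is the index denoting the type of internal noise, j is the index over the multiple internal noise signals of internal noise q, and x_{q,j}(f, τ) is the frequency-domain signal of frame τ of the j-th signal of internal noise q.
In the feature extraction 1002, the internal noise removal device 100 generates the feature V_{q,j}(f, τ) from x_{q,j}(f, τ) using equation (10).

In the feature vector generation 1003, the internal noise removal device 100 first divides the frequencies into Z subgroups. Here z is the index denoting the subgroup.
The internal noise removal device 100 then concatenates the features V_{q,j}(f, τ) of the same frame τ that belong to subgroup z into a single vector V_{q,j}(z, τ).
FIG. 11 is an explanatory diagram showing the data structure of the vector V_{q,j}(z, τ) according to the first embodiment of the present invention.
As shown in FIG. 11, the vector V_{q,j}(z, τ) is a vector whose elements are the per-frequency features V_{q,j}(f, τ).
In the clustering 1004, the internal noise removal device 100 performs clustering, for each subgroup, on the data of all training signals and all frames of internal noise q. Specifically, for each signal index and frame index, the internal noise removal device 100 outputs an index Ind(j, τ) that defines the cluster to which the feature of that signal and frame belongs.
FIG. 22 is an explanatory diagram showing an example of the index Ind(j, τ) according to the first embodiment of the present invention.
As a result of the clustering 1004, an index Ind(j, τ) is assigned to the internal noise signal at each time of the internal noise.
In the example shown in FIG. 22, the indices Ind(j, τ) A to C are assigned to the internal noise signal.
In the noise covariance update 1005, for each cluster c, the internal noise removal device 100 uses equation (11) to calculate a per-frequency covariance matrix from the input data x(f, τ) that were used to calculate the features belonging to that cluster.

The number of clusters equals the number of noise covariance matrices and of noise suppression filters. The number of clusters may be set in advance.
In the noise covariance update 1005, the internal noise removal device 100 stores the noise covariance matrices calculated for each internal noise signal in the storage device 106 or the volatile memory 107 as the noise covariance matrix DB 302, and ends the processing.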
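The per-cluster covariance calculation can be sketched as follows for a single frequency. Equation (11) is not reproduced in this text, so the sketch assumes the usual sample covariance, the average of x(f, τ) x(f, τ)^* over the frames assigned to each cluster. The function name and the dictionary layout of the result are hypothetical.

```python
def cluster_covariances(frames, labels):
    """Average x x^* over the frames of each cluster.

    frames: list of complex vectors x(f, tau) for one frequency f
    labels: cluster index Ind for each frame, in the same order
    Returns a dict mapping cluster index to an M x M covariance matrix.
    """
    sums, counts = {}, {}
    for x, c in zip(frames, labels):
        m = len(x)
        acc = sums.setdefault(c, [[0j] * m for _ in range(m)])
        for i in range(m):
            for j in range(m):
                acc[i][j] += x[i] * x[j].conjugate()
        counts[c] = counts.get(c, 0) + 1
    return {c: [[value / counts[c] for value in row] for row in acc]
            for c, acc in sums.items()}
```

The result holds one matrix per cluster, which matches the statement above that the number of clusters equals the number of noise covariance matrices and noise suppression filters.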
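FIG. 12 is a flowchart showing details of the processing in the clustering 1004 according to the first embodiment of the present invention.
In the following, the index q denoting the type of internal noise and the subgroup index z are omitted. In addition, τ is re-indexed as τ = j × T + τ, where T is the number of frames in each internal noise signal.
In initialization 1201, the internal noise removal device 100 sets the centroid C(c) of each cluster to one of the features V(τ) chosen at random. Here random is a variable that randomly selects one of all the internal noise signals and all the frames. The clustering 1004 also initializes the variable end to "FALSE" and Ind_{pre}(τ) to "0".
In decision 1202, the internal noise removal device 100 determines whether the variable end is "TRUE", which indicates the end state.
If the variable end is determined to be "TRUE", the internal noise removal device 100 ends the processing.
If the variable end is determined not to be "TRUE", the internal noise removal device 100 proceeds to initialization 1203.
In initialization 1203, the internal noise removal device 100 initializes the index τ to its smallest value, "1", and proceeds to decision 1204.
In decision 1204, the internal noise removal device 100 determines whether the index τ is less than or equal to the maximum value T_{max}.
If the index τ is determined to be less than or equal to T_{max}, the internal noise removal device 100 proceeds to initialization 1205.
In initialization 1205, the internal noise removal device 100 initializes the variable Ind(τ) to "1", the cluster index c to "1", and the variable min to "−1", and proceeds to decision 1206.
In decision 1206, the internal noise removal device 100 determines whether the cluster index c is less than or equal to the number of clusters C.
If the cluster index c is determined to be greater than the number of clusters C, the internal noise removal device 100 updates τ to τ + 1 and returns to decision 1204.
If the cluster index c is determined to be less than or equal to the number of clusters C, the internal noise removal device 100 proceeds to distance calculation 1207.
In distance calculation 1207, the internal noise removal device 100 uses a function D to calculate the distance between the centroid C(c) of the cluster and the feature V(τ), and proceeds to decision 1208. The function D may be, for example, |C(c) − V(τ)|. The calculated distance is stored in the variable dis.
In decision 1208, the internal noise removal device 100 determines whether the variable dis is smaller than the variable min.
If the variable dis is determined to be greater than or equal to the variable min, the internal noise removal device 100 updates the cluster index c to c + 1 and returns to decision 1206.
If the variable dis is determined to be smaller than the variable min, the internal noise removal device 100 proceeds to minimum replacement 1209.
In minimum replacement 1209, the internal noise removal device 100 replaces Ind(τ) with the cluster index c and replaces the variable min with the variable dis. It then updates the cluster index c to c + 1 and returns to decision 1206.
If the index τ is determined in decision 1204 to be greater than T_{max}, the internal noise removal device 100 proceeds to update 1210.
In update 1210, the internal noise removal device 100 updates the centroids C(c) using equation (12). Specifically, the centroid of each cluster is updated by calculating the mean of the features V(τ) in that cluster.

After the update, the internal noise removal device 100 proceeds to decision 1211.
In decision 1211, the internal noise removal device 100 determines whether Ind_{pre}(τ) and Ind(τ) are equal for every index τ.
If the condition of decision 1211 is determined not to be satisfied, the internal noise removal device 100 assigns Ind(τ) to Ind_{pre}(τ) for every index τ and returns to decision 1202.
If Ind_{pre}(τ) and Ind(τ) are determined to be equal for every index τ, the internal noise removal device 100 sets the variable end to "TRUE" and returns to decision 1202.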
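The flowchart of FIG. 12 corresponds to a standard k-means procedure, which can be sketched as below. The sketch departs from the flowchart in two labeled ways: the running minimum distance starts at infinity (a start value of −1 would never be undercut by a non-negative distance), and the convergence check is made before the centroid update rather than after. The function names and the `seed` parameter are hypothetical.

```python
import random


def distance(a, b):
    """Euclidean distance |C(c) - V(tau)| between two feature vectors."""
    return sum(abs(ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def kmeans(features, n_clusters, seed=0):
    """Cluster feature vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialization 1201: pick random features as the initial centroids.
    centroids = [list(v) for v in rng.sample(features, n_clusters)]
    previous = None
    while True:
        labels = []
        for v in features:
            best, best_dis = 0, float("inf")
            for c, centroid in enumerate(centroids):
                dis = distance(centroid, v)          # distance calculation 1207
                if dis < best_dis:                   # decision 1208
                    best, best_dis = c, dis          # minimum replacement 1209
            labels.append(best)
        if labels == previous:                       # decision 1211
            return labels, centroids
        for c in range(n_clusters):                  # update 1210
            members = [v for v, l in zip(features, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        previous = labels
```

The returned labels play the role of Ind(τ): one cluster index per training frame, which can then be used to group the frames for the per-cluster covariance calculation.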
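Second Embodiment
The second embodiment of the present invention is described below. The second embodiment assumes an audio conference system. The description focuses on the differences from the first embodiment.
FIG. 13 is a block diagram of the hardware configuration of the internal noise removal device in the audio conference system according to the second embodiment of the present invention.
The second embodiment describes an audio conference system including an internal noise removal device 1300, a microphone array 1301, a keyboard signal recognition device 1304, a keyboard 1305, a voice transmission device 1308, a voice reception device 1309, a DA converter 1310, and a speaker 1311.
The internal noise removal device 1300 includes an AD converter 1302, a central processing unit 1303, a storage device 1306, and a volatile memory 1307.
The AD converter 1302 converts the analog signal from the microphone array 1301 into a digital signal that the central processing unit 1303 can process. In the example shown in FIG. 13, the analog signal is input from the microphone array 1301 to the AD converter 1302.
The central processing unit 1303 executes various programs loaded into the volatile memory 1307. Specifically, the central processing unit 1303 removes internal noise from the digital signal produced by the AD converter 1302 and extracts only the internal-noise-removed sound.
In the second embodiment, the internal-noise-removed sound is the sound from which the noise produced when the user operates the keys of the keyboard 1305 has been removed (the keyboard-removed sound). The extracted internal-noise-removed sound (keyboard-removed sound) is sent to the voice transmission device 1308.
The storage device 1306 stores programs for removing internal noise and data on internal noise. The programs stored in the storage device 1306 are the same as in the first embodiment. The volatile memory 1307 provides working memory during program execution.
The keyboard signal recognition device 1304 detects which key of the keyboard 1305 was operated and when. The detected information is sent to the central processing unit 1303.
The voice transmission device 1308 transmits the internal-noise-removed sound received from the central processing unit 1303 to the other party of the audio conference.
The voice reception device 1309 receives the audio signal sent from the other party of the audio conference and sends it to the central processing unit 1303. The central processing unit 1303 sends the received audio signal to the DA converter 1310.
The DA converter 1310 converts the received audio signal into an analog audio signal and sends it to the speaker 1311.
The speaker 1311 plays the analog audio signal sent from the DA converter 1310. The analog audio signal played by the speaker 1311 (referred to as the speaker playback signal) is collected by the microphone array 1301. The speaker playback signal contained in the sound collected by the microphone array 1301 is removed by acoustic echo canceller processing executed by the central processing unit 1303.
The internal noise removal device 1300 may include at least one of the microphone array 1301, the keyboard signal recognition device 1304, the keyboard 1305, the voice transmission device 1308, the voice reception device 1309, the DA converter 1310, and the speaker 1311. The AD converter 1302 or the storage device 1306 may be provided outside the internal noise removal device 1300.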
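FIG. 14 is an explanatory diagram showing an example of the internal noise signal corresponding to the operation sounds of the keyboard 1305 according to the second embodiment of the present invention.
An internal noise signal is issued when each key of the keyboard 1305 is operated. The internal noise signal is defined so that the key whose operation produced the sound can be identified.
Specifically, for the internal noise signal, an operation sound name 1400 is defined in association with a generation timing 1410. The operation sound name 1400 is an identifier for identifying an operation sound of the keyboard 1305. The generation timing 1410 indicates when the internal noise corresponding to the operation sound name 1400 occurs.
FIG. 15 is a block diagram of the processing executed by each device according to the second embodiment of the present invention.
The sound collected by the microphone array 1301 is sent to the AD converter 1302. The AD converter 1302 executes AD conversion processing 1502 on the received audio signal to convert it into a digital signal, and sends the digitized audio signal to the central processing unit 1303.
Besides the user's voice, the digitized audio signal contains the sound output from the speaker 1311 and collected by the microphone array 1301 (acoustic echo) and the noise generated when the keyboard 1305 is operated.
The central processing unit 1303 executes the echo canceller 1505 on the audio signal sent from the AD converter 1302.
The echo canceller 1505 removes the acoustic echo component with a common algorithm such as NLMS, using the audio signal output from the speaker 1311 as the reference signal.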
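For reference, a minimal single-channel NLMS (normalized least mean squares) echo canceller can be sketched as follows. This is a generic textbook form, not the specific implementation of the echo canceller 1505; the tap count, step size `mu`, regularization `eps`, and function name are illustrative assumptions.

```python
def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Remove the echo of `ref` (speaker signal) from `mic` (microphone signal).

    Returns (echo-removed signal, final filter coefficients).
    """
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, zero-padded at the start.
        u = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * uk for wk, uk in zip(w, u))
        e = mic[n] - echo_estimate                 # echo-removed sample
        norm = sum(uk * uk for uk in u) + eps      # input power normalization
        w = [wk + mu * e * uk / norm for wk, uk in zip(w, u)]
        out.append(e)
    return out, w
```

In a microphone array system the same adaptation would be run per microphone channel, and the echo-removed outputs would then be passed on to the internal noise removal processing.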
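The audio signal from which the acoustic echo component has been removed is output to the internal noise removal processing 1503. The central processing unit 1303 executes the internal noise removal processing 1503 on that signal and removes the keyboard operation sounds. The internal noise removal processing 1503 has the same configuration as the internal noise removal processing 303 of the first embodiment.
The audio signal from which the internal noise has been removed is transmitted to the conference partner over the network by voice transmission 1508.
The conference partner's voice is received over the network by voice reception 1507. The received voice is sent to the DA converter 1310.
The DA converter 1310 executes DA conversion processing 1504 on the received voice to convert it into an analog audio signal, and sends the analog audio signal to the speaker 1311.
The speaker 1311 plays the received analog audio signal.
FIG. 16 is an explanatory diagram showing an example of a user scene in the audio conference system according to the second embodiment of the present invention.
When the user operates a button on the keyboard 1601, noise is generated at the position of the operated button. The generated noise is collected by the microphone array 1603 together with the user's speech in the audio conference system. The microphone array 1603 may be placed, for example, on top of the display device 1602 of a personal computer.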
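Third Embodiment
The third embodiment of the present invention is described below. The third embodiment assumes an audio conference system equipped with a touch panel. The description focuses on the differences from the first embodiment.
FIG. 17 is a block diagram of the hardware configuration of the internal noise removal device in the audio conference system equipped with a touch panel according to the third embodiment of the present invention.
The third embodiment describes an audio conference system including an internal noise removal device 1700, a microphone array 1701, a touch position recognition device 1704, a touch panel 1705, a voice transmission device 1708, a voice reception device 1709, a DA converter 1710, and a speaker 1711.
The internal noise removal device 1700 includes an AD converter 1702, a central processing unit 1703, a storage device 1706, and a volatile memory 1707.
The AD converter 1702 converts the analog signal from the microphone array 1701 into a digital signal that the central processing unit 1703 can process. In the example shown in FIG. 17, the analog signal is input from the microphone array 1701 to the AD converter 1702.
The central processing unit 1703 executes various programs loaded into the volatile memory 1707. Specifically, the central processing unit 1703 removes internal noise from the digital signal produced by the AD converter 1702 and extracts only the internal-noise-removed sound.
In the third embodiment, the internal-noise-removed sound is the sound from which the noise produced when the user operates the touch panel 1705 has been removed (the touch-panel-removed sound). The extracted internal-noise-removed sound (touch-panel-removed sound) is sent to the voice transmission device 1708.
The storage device 1706 stores programs for removing internal noise and data on internal noise. The programs stored in the storage device 1706 are the same as in the first embodiment. The volatile memory 1707 provides working memory during program execution.
The touch position recognition device 1704 detects which position on the touch panel 1705 was operated and when. The detected information is sent to the central processing unit 1703.
The voice transmission device 1708 transmits the internal-noise-removed sound received from the central processing unit 1703 to the other party of the audio conference.
The voice reception device 1709 receives the audio signal sent from the other party of the audio conference and sends it to the central processing unit 1703. The central processing unit 1703 sends the received audio signal to the DA converter 1710.
The DA converter 1710 converts the received audio signal into an analog audio signal and sends it to the speaker 1711.
The speaker 1711 plays the analog audio signal sent from the DA converter 1710. The analog audio signal played by the speaker 1711 (hereinafter referred to as the speaker playback signal) is collected by the microphone array 1701. The speaker playback signal contained in the sound collected by the microphone array 1701 is removed by acoustic echo canceller processing executed by the central processing unit 1703.
The internal noise removal device 1700 may include at least one of the microphone array 1701, the touch position recognition device 1704, the touch panel 1705, the voice transmission device 1708, the voice reception device 1709, the DA converter 1710, and the speaker 1711. The AD converter 1702 or the storage device 1706 may be provided outside the internal noise removal device 1700.
FIG. 18 is an explanatory diagram showing an example of the internal noise signal corresponding to the operation sounds of the touch panel 1705 according to the third embodiment of the present invention.
An internal noise signal is issued when each touch position on the touch panel 1705 is operated. The internal noise signal contains information that identifies, for each touch position on the touch panel 1705, which position was operated to produce the sound.
Specifically, for the internal noise signal, a touch position name 1800 is defined in association with a generation timing 1810. The touch position name 1800 is an identifier for identifying the operation sound of each touch position on the touch panel 1705. The generation timing 1810 indicates when the internal noise corresponding to the touch position name 1800 occurs.
FIG. 19 is an explanatory diagram showing an example of a user scene in the audio conference system equipped with the touch panel 1705 according to the third embodiment of the present invention.
The sound of operating the touch panel 1902 is collected by the microphone array 1901 together with the voice of the user of the audio conference system.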
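Fourth Embodiment
The fourth embodiment of the present invention is described below. The fourth embodiment assumes a robot having a speech recognition function (see FIG. 23). The description focuses on the differences from the first embodiment.
The robot 2201 of the fourth embodiment includes the internal noise removal device 100. The hardware configuration and processing configuration of the internal noise removal device 100 are the same as in the first embodiment, so their description is omitted.
FIG. 20 is a block diagram of the configuration of speech recognition processing including the internal noise removal processing according to the fourth embodiment of the present invention.
The internal noise removal device 100 executes AD conversion processing 2002 on the audio signal collected by the speech recognition microphone array 2204 and converts it into a digital audio signal. The digital audio signal is output to the internal noise removal processing 303.
The internal noise removal device 100 executes the internal noise removal processing 303, removes the internal noise contained in the digital audio signal, and extracts only the voice of the person whose speech is to be recognized. The extracted voice is output to speech recognition 2004.
In the speech recognition 2004, common feature extraction such as MFCC is performed, and Viterbi decoding is performed between the features and a pre-trained acoustic model to recognize what was spoken. The internal noise removal device 100 outputs the recognition result and ends the processing.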
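Fifth Embodiment
The fifth embodiment of the present invention is described below. The fifth embodiment presents a variation of the method of estimating the noise covariance matrix and the target sound covariance matrix and of the method of adapting the noise suppression filter. The description focuses on the differences from the first embodiment.
The hardware configuration and processing configuration of the fifth embodiment are the same as those of the first embodiment, so their description is omitted.
FIG. 21 is a detailed block diagram of the internal noise removal processing 303 according to the fifth embodiment of the present invention.
For each frame, the internal noise removal device 100 executes multichannel frequency analysis 2101 on the audio signal collected by the microphone array 101 and converts it into a frequency-domain signal. The frequency-domain signal is output, frequency by frequency, to sound source direction estimation 2102.
The internal noise removal device 100 executes the sound source direction estimation 2102 on the frequency-domain signal and identifies the direction of the sound source. The sound source direction estimation 2102 may use, for example, the GCC-PHAT method based on the phase differences between microphones, or the delay-and-sum array method.
In the sound source direction estimation 2102, a target sound direction is set in advance, and the internal noise removal device 100 determines, for each frame and each frequency, whether the sound source direction matches the preset target sound direction.
If the sound source direction is determined to match the target sound direction, the internal noise removal device 100 treats the sound of the components (frame and frequency) that satisfy the condition as the target sound and executes target sound adaptation 2103 on it. Specifically, the internal noise removal device 100 updates the target sound covariance matrix R_s(f) using equation (5).
If the sound source direction does not match the target sound direction, the internal noise removal device 100 treats the sound of the components (frame and frequency) that do not satisfy the condition as noise and executes noise adaptation 2104. Specifically, the internal noise removal device 100 updates the noise covariance matrix R_b(f) using equation (13).

In internal noise addition 2105, the internal noise removal device 100 adds the noise covariance matrix R_b(f) to each noise covariance matrix corresponding to the internal noise signal.
In filter adaptation 2106, the internal noise removal device 100 substitutes the target sound covariance matrix and the noise covariance matrix to which R_b(f) has been added into equation (6) and generates the noise suppression filter.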
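The direction-gated adaptation and the internal noise addition can be sketched as follows for one frequency bin. Equations (5) and (13) are not reproduced in this text, so both are assumed to be the same recursive average applied to R_s(f) and R_b(f) respectively. The angular tolerance used to decide that a direction "matches" is also an assumption, as are all function and parameter names.

```python
def ema_update(r, x, alpha):
    """Assumed recursive update: r <- alpha * r + (1 - alpha) * x x^*."""
    m = len(x)
    return [[alpha * r[i][j] + (1.0 - alpha) * x[i] * x[j].conjugate()
             for j in range(m)] for i in range(m)]


def adapt(r_s, r_b, x, est_dir_deg, target_dir_deg, alpha=0.9, tol_deg=10.0):
    """Route one time-frequency component to target or noise adaptation."""
    if abs(est_dir_deg - target_dir_deg) <= tol_deg:
        return ema_update(r_s, x, alpha), r_b   # target sound adaptation 2103
    return r_s, ema_update(r_b, x, alpha)       # noise adaptation 2104


def add_background(noise_covs, r_b):
    """Internal noise addition 2105: add R_b(f) to every stored noise covariance."""
    m = len(r_b)
    return [[[r[i][j] + r_b[i][j] for j in range(m)] for i in range(m)]
            for r in noise_covs]
```

Each matrix returned by `add_background` would then be combined with the current target sound covariance matrix to generate one noise suppression filter, so that both the pre-learned internal noise and the currently observed external noise are taken into account.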
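According to one embodiment of the present invention, the internal noise removal device 100 generates a plurality of noise covariance matrices according to the attributes of the internal noise, such as its type and timing, selects a plurality of noise covariance matrices corresponding to the internal noise that has occurred, generates a plurality of noise suppression filters from those noise covariance matrices, and then selects an appropriate noise suppression filter from among them. This makes it possible to remove noise appropriately even when it is non-stationary noise whose sound quality changes with the operating state of an actuator.
Noise can also be removed accurately for operation sounds other than actuator operating sounds, such as those of the keyboard 1305 or the touch panel 1705.
The present invention efficiently removes noise generated by actuators and the like in a system having a sound recording function, such as a robot having a voice recognition function or a video conference system having a projector.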
FIG. 23 is an explanatory diagram showing an example of a robot having a voice recognition function according to the present invention.
The noise removal mechanism of the present invention is mounted on, for example, a robot 2201 having a voice recognition function as shown in FIG.
The robot 2201 includes an arm control actuator 2202 for controlling the arm of the robot 2201 and a leg control actuator 2203 for controlling the leg of the robot 2201. The robot 2201 includes a voice recognition microphone array 2204 for recognizing a voice of a user who interacts with the robot 2201.
Speech recognition normally assumes that the microphone array 2204 collects only the user's speech, with no surrounding sounds mixed in. It is therefore known that voice recognition performance deteriorates when the microphone array 2204 collects speech mixed with surrounding sounds.
In the robot 2201 shown in FIG. 23, the microphone array 2204 collects the user's voice together with the operation sounds of the arm control actuator 2202 and the leg control actuator 2203.
The position at which the operation sound of the arm control actuator 2202 or the leg control actuator 2203 is generated changes as the arm or leg moves. In addition, the operation sound of each actuator differs depending on whether the actuator is starting, operating, or stopping.
In the present invention, sound generated inside the device, such as operation sound of the actuator, is efficiently removed.
FIG. 24 is an explanatory diagram showing a device configuration of a video conference system including a projector according to the present invention.
At site A, the voice of a person speaking at site A is collected by the microphone array 2301, and the collected voice information is transmitted to the computer 2305.
The computer 2305 extracts only the person's voice from the sounds collected by the microphone array 2301 and transmits the extracted voice to another site B via the network 2306. The scene at site A photographed by the camera 2304 at site A is transmitted to site B in the same manner.
At site B, the received voice and the image taken by the camera 2304 are taken into the computer 2305. The image taken by the camera 2304 is projected by the projector 2303, and the received voice is reproduced by the speaker 2302.
Similarly, the voice collected by the microphone array 2301 installed at site B and the image taken by the camera 2304 are captured by the computer 2305 and then transmitted to site A via the network 2306.
At site A, as at site B, the image taken by the camera 2304 is projected by the projector 2303, and the received voice is reproduced by the speaker 2302.
In the conference system shown in FIG. 24, there is a problem that noise such as fan noise generated from the projector 2303 is mixed into the sound collected by the microphone array 2301.
FIG. 25 is an explanatory diagram showing an example of noise generated from the projector 2303 in the present invention.
The projector 2303 generates a characteristic noise for each of its operations. For each noise generated from the projector 2303, an operation sound name 2400 is defined in correspondence with a generation timing 2410. The operation sound name 2400 is an identifier for identifying noise generated from the projector 2303. The generation timing 2410 indicates the timing at which the noise corresponding to the operation sound name 2400 is generated.
In the example shown in FIG. 25, the operation sound name 2400 includes “projector activation sound” and “projector operation sound”. Also, the “projector activation sound” generation timing 2410 is “when the projector is activated”, and the “projector operation sound” generation timing 2410 is “when the projector is operating”.
In the following description, a configuration and a method for removing internal noise generated inside the apparatus such as the operation sound of the actuator of the robot 2201 and the fan noise of the projector 2303 as described above will be described.
First embodiment
FIG. 1 is a block diagram of a hardware configuration of an internal noise removal apparatus according to the first embodiment of the present invention.
In the first embodiment, a device including the internal noise removing device 100, the microphone array 101, the actuator control device 104, and the actuator 105 will be described.
The internal noise removal device 100 includes an AD conversion device 102, a central processing unit 103, a storage device 106, and a volatile memory 107.
The AD converter 102 converts the input analog signal into a digital signal that can be processed by the central processing unit 103. In the example illustrated in FIG. 1, an analog signal input from the microphone array 101 is input to the AD converter 102.
The central processing unit 103 executes various programs loaded into the volatile memory 107. Specifically, the central processing unit 103 removes internal noise from the digital signal converted by the AD converter 102 and extracts only the desired sound (hereinafter referred to as internal noise removed sound). The extracted internal noise removed sound is output to an external device (not shown), which reproduces it.
The storage device 106 stores a program for removing internal noise and data related to internal noise. The program stored in the storage device 106 will be described later with reference to FIG. 2. The volatile memory 107 is used to secure work memory during program execution.
The actuator control device 104 is a device that controls the actuator 105. For example, it controls the actuators (arm control actuator 2202 and leg control actuator 2203) installed on the arms and legs of the robot 2201 having the speech recognition device. The actuator control device 104 controls the actuator 105 based on the actuator control signal.
The actuator 105 is, for example, an actuator (arm control actuator 2202 and leg control actuator 2203) installed on an arm, a leg, or the like of a robot 2201 provided with a voice recognition device. Sound generated when the actuator 105 operates propagates and is collected by the microphone array 101.
Note that the sound collected by the microphone array 101 includes a desired sound required by a sound processing application (not shown) and noise generated when the actuator operates (hereinafter referred to as internal noise).
The internal noise removal device 100 may include at least one of the microphone array 101, the actuator control device 104, and the actuator 105. Further, the AD conversion device 102 or the storage device 106 may be provided outside the internal noise removal device 100.
FIG. 2 is a block diagram illustrating an example of a program stored in the storage device 106 according to the first embodiment of this invention.
The storage device 106 stores a covariance matrix learning program 1061 and a noise suppression program 1062.
The covariance matrix learning program 1061 is a program for generating a covariance matrix used for generating a noise suppression filter. The noise suppression program 1062 is a program for selecting an optimal noise suppression filter for the sound collected by the microphone array 101 and removing internal noise.
Note that the storage device 106 may store other programs.
FIG. 3 is a block diagram illustrating an example of processing executed by the internal noise removal apparatus 100 according to the present invention.
The internal noise removal apparatus 100 executes a noise covariance estimation process 301 and an internal noise removal process 303. Specifically, when the central processing unit 103 executes the covariance matrix learning program 1061, the noise covariance estimation process 301 is executed. Further, the central processing unit 103 executes the noise suppression program 1062, whereby an internal noise removal process 303 is executed.
The noise covariance estimation process 301 is a process for calculating a noise covariance matrix. Specifically, an internal noise signal including information related to internal noise is input to the central processing unit 103, and a noise covariance matrix is calculated based on the input internal noise signal. Note that the information related to the internal noise included in the internal noise signal includes information related to the internal noise attributes such as the type of internal noise and the timing at which the internal noise occurs.
The noise covariance estimation process 301 is a process for calculating the internal noise statistic (noise covariance matrix) for each internal noise attribute. A specific method for calculating the noise covariance matrix will be described later with reference to FIG. 10.
The calculated noise covariance matrix is stored in the volatile memory 107 or the storage device 106. Hereinafter, the noise covariance matrix stored in the volatile memory 107 or the storage device 106 is referred to as a noise covariance matrix DB302.
The internal noise removal apparatus 100 desirably performs a so-called calibration process for executing the noise covariance estimation process 301 on the internal noise collected in advance.
The internal noise removal process 303 is a process for removing the internal noise from the sound in which the internal noise collected by the microphone array 101 and the target sound are mixed.
Specifically, the internal noise removal apparatus 100 uses the noise covariance matrix DB 302 to remove the internal noise from the sound actually collected by the microphone array 101, in which the internal noise and the target sound are mixed, and outputs sound in which only the target sound has been extracted. Details of the internal noise removal processing 303 will be described later with reference to FIG. 4.
FIG. 4 is a detailed block diagram of the internal noise removal processing 303 according to the first embodiment of this invention.
The internal noise removal processing 303 includes multi-channel frequency analysis 401, target sound covariance update 403, noise suppression filter update 404, noise suppression filtering 405, noise covariance matrix selection 406, and power minimization filtering 408, each of which is executed by the central processing unit 103.
The internal noise removal process 303 is executed each time a fixed number of samples (frame shift: L_shift) of the digitized input signal is obtained for each channel. In the present embodiment, L_shift is set to a time length of about several tens of ms.
For example, when the sampling rate of the AD converter 102 is 8 kHz, L_shift is set to about 256 points. Hereinafter, a process executed every time samples amounting to one frame shift are obtained is referred to as a frame process.
In one frame process, the most recent L_frame samples (the frame size) of the input signal from each microphone are processed. Here, the index representing the frame number is τ.
In the τ-th frame, the digital signal from point (τ × L_shift) to point (τ × L_shift + L_frame − 1) is processed for each microphone. Here, p is an index representing the number of points from the first point of the τ-th frame.
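The frame arithmetic above can be checked with a short Python sketch. The sampling rate and L_shift are the values given in the text; the frame size of 512 points and the function name `frame_bounds` are assumptions used only for illustration.

```python
def frame_bounds(tau, l_shift, l_frame):
    """Return (first, last) sample indices of frame tau, both inclusive."""
    first = tau * l_shift
    last = tau * l_shift + l_frame - 1
    return first, last

SAMPLING_RATE_HZ = 8000   # value from the text
L_SHIFT = 256             # value from the text
L_FRAME = 512             # assumed frame size, not given in the text

# Frame shift expressed in milliseconds: 256 points at 8 kHz.
shift_ms = 1000.0 * L_SHIFT / SAMPLING_RATE_HZ
print(shift_ms, frame_bounds(3, L_SHIFT, L_FRAME))
```

With these values the frame shift is 32 ms, which is consistent with "about several tens of ms".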
The internal noise removal apparatus 100 receives an input signal represented by Expression (1) from the mth microphone.

First, in the multi-channel frequency analysis 401, the internal noise removal apparatus 100 applies the discrete Fourier transform represented by Expression (2) to the data from p = 0 to p = L_frame − 1 of the input signal from each microphone.

Expression (2) yields the time-frequency domain signal x_m(f, τ) of each microphone. Here, f represents a frequency, and the window function w(p) is, for example, a Hanning window as shown in Expression (3).

The discrete Fourier transform may use an algorithm such as a fast Fourier transform.
The time-frequency domain signals of the microphones are collected into one vector for each frequency, as shown in Expression (4).

Here, M is the number of microphones. T is an operator representing transposition of a vector or matrix.
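A minimal NumPy sketch of the multi-channel frequency analysis described above is given below. Expressions (1) to (4) are not reproduced in this text, so the exact window definition and normalization are assumptions; `np.hanning` and `np.fft.rfft` stand in for the window of Expression (3) and the transform of Expression (2), and the function name and array layout are chosen for illustration.

```python
import numpy as np

def multichannel_frequency_analysis(signals, l_frame, l_shift):
    """Windowed DFT of every frame of every microphone signal.

    signals: (M, N) real array, one row per microphone, with N >= l_frame.
    Returns X with shape (num_frames, num_bins, M); X[tau, f] is the
    M-element vector x(f, tau) collected over the microphones.
    """
    signals = np.asarray(signals, dtype=float)
    _, n = signals.shape
    window = np.hanning(l_frame)                 # Hanning window w(p)
    num_frames = 1 + (n - l_frame) // l_shift
    frames = []
    for tau in range(num_frames):
        start = tau * l_shift
        segment = signals[:, start:start + l_frame] * window   # (M, l_frame)
        spectrum = np.fft.rfft(segment, axis=1)                # (M, num_bins)
        frames.append(spectrum.T)                              # (num_bins, M)
    return np.stack(frames)
```

Only the non-negative frequencies are kept, which is sufficient for real-valued microphone signals.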
Hereinafter, the internal noise removal apparatus 100 performs the target sound covariance update 403, the noise suppression filter update 404, the noise suppression filtering 405, the noise covariance matrix selection 406, and the power minimization filtering 408 on the signal collected for each frequency as shown in Expression (4).
In the target sound covariance update 403, the internal noise removal apparatus 100 updates the target sound covariance matrix R_s(f) using Expression (5).

Here, α is an update coefficient and takes a value from 0 to 1. The operator * represents the transpose of a matrix or vector.
Also, the internal noise removal apparatus 100 refers to the internal noise signal and executes the target sound covariance update 403 only while no internal noise is generated. While internal noise is generated, the target sound covariance matrix is not updated, and the value from before the internal noise was generated is retained. This is because, if the target sound covariance matrix were updated while internal noise is occurring, information about the internal noise would be mixed into the target sound covariance matrix, and the noise suppression filtering 405 would then emphasize the noise rather than suppress it.
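The gating of the target sound covariance update can be sketched as follows. Expression (5) is not reproduced in this text; the recursive form below, the reading of * as a conjugate transpose, and the function name are assumptions for illustration only.

```python
import numpy as np

def update_target_covariance(R_s, x, alpha, internal_noise_active):
    """Return the target sound covariance of one frequency bin after one frame.

    Assumed form of Expression (5): R_s <- alpha * R_s + (1 - alpha) * x x^H.
    """
    if internal_noise_active:
        # Keep the value from before the internal noise started.
        return R_s
    return alpha * R_s + (1.0 - alpha) * np.outer(x, x.conj())
```

Freezing R_s during internal noise keeps the noise statistics out of the target sound model, which is the reason given in the text.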
FIG. 5 is an explanatory diagram illustrating an example of an internal noise signal according to the first embodiment of this invention.
The internal noise signal name 500 is defined for the internal noise signal corresponding to the generation timing 510. The internal noise signal name 500 is an identifier for identifying internal noise. The generation timing 510 indicates the timing at which internal noise corresponding to the internal noise signal name 500 is generated.
In the example shown in FIG. 5, the internal noise signal names 500 include “internal noise generation” and “internal noise end”. The generation timing 510 of “internal noise generation” is “timing when internal noise occurs”, and the generation timing 510 of “internal noise end” is “timing when internal noise stops”.
That is, it can be seen that there are an internal noise signal output at the timing when the internal noise occurs and an internal noise signal output at the timing when the internal noise stops.
Each of the two internal noise signals described above further includes information indicating the type of internal noise.
FIG. 6 is an explanatory diagram illustrating an example of types of internal noise according to the first embodiment of this invention. FIG. 6 shows an example of the types of internal noise whose internal noise signal name 500 is “internal noise generation”.
For the type of internal noise, an operation sound name 600 is defined corresponding to the generation timing 610. The operation sound name 600 is an identifier for identifying internal noise generated by the operation of the actuator 105. The generation timing 610 is a timing at which internal noise corresponding to the operation sound name 600 is generated.
For example, an operation sound whose operation sound name 600 is “motor 1” is defined as an operation sound generated when the motor 1 is driven. Similarly, an operation sound whose operation sound name 600 is “motor 2” is defined as an operation sound generated when the motor 2 is driven.
In addition, the operation sound produced when the operation sounds of the motor 1 and the motor 2 exist at the same time is defined as an operation sound whose operation sound name 600 is “motors 1 and 2”, and is distinguished from the individual operation sounds of the motor 1 and the motor 2.
FIG. 7 is an example timing chart that defines the types of internal noise signals corresponding to the operating states of the three actuators according to the first embodiment of the present invention.
In the example illustrated in FIG. 7, when only the motor 1 starts operation, the operation sound name 600 is defined as “motor 1”. When only the motor 2 starts operation, the operation sound name 600 is defined as “motor 2”. When only the motor 3 starts operating, the operation sound name 600 is defined as “motor 3”. When the motor 1 and the motor 2 start operation, the operation sound name 600 is defined as “motors 1 and 2”.
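One way to derive an operation sound name from the set of running motors is sketched below. The function name and the naming rule for three or more simultaneous motors are assumptions; the text only gives the names for single motors and for motors 1 and 2 together.

```python
def operation_sound_name(active_motors):
    """Map the set of running motor numbers to an operation sound name.

    Returns None when no motor is running.
    """
    motors = sorted(active_motors)
    if not motors:
        return None
    if len(motors) == 1:
        return f"motor {motors[0]}"
    # Combined operation is treated as its own type of internal noise.
    return "motors " + ", ".join(str(m) for m in motors[:-1]) + f" and {motors[-1]}"
```

Treating each combination as a separate type lets a separate noise covariance matrix be learned for it.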
Returning to the description of FIG. 4.
In the noise covariance matrix selection 406, the internal noise removal apparatus 100 selects, based on the input internal noise signal, the noise covariance matrix R_n(f) from the storage device 106 or the volatile memory 107 in which the noise covariance matrix DB 302 is stored, and outputs the selected noise covariance matrix R_n(f) to the noise suppression filter update 404. When a plurality of noise covariance matrices in the noise covariance matrix DB 302 correspond to one internal noise signal, the internal noise removal apparatus 100 may select a plurality of noise covariance matrices R_n(f).
In the noise suppression filter update 404, the internal noise removal apparatus 100 generates the noise suppression filter w_i(f) from the target sound covariance matrix R_s(f) and the noise covariance matrix R_n(f). For example, the noise suppression filter w_i(f) is generated using Expression (6).

Here, maxeig is an operator that calculates an eigenvector that gives the maximum eigenvalue.
Here, i is the index of the noise covariance matrix. That is, when a plurality of noise covariance matrices R_n(f) are selected in the noise covariance matrix selection 406, the internal noise removal apparatus 100 generates a noise suppression filter for each noise covariance matrix.
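The filter generation can be sketched in NumPy as follows. Expression (6) is not reproduced in this text; the sketch assumes the common maximum signal-to-noise-ratio form, in which the filter is the eigenvector of R_n^{-1} R_s with the largest eigenvalue. That form, the function names, and the requirement that each R_n is invertible are assumptions.

```python
import numpy as np

def maxeig(A):
    """Eigenvector of A belonging to the eigenvalue with the largest real part."""
    eigenvalues, eigenvectors = np.linalg.eig(A)
    return eigenvectors[:, np.argmax(eigenvalues.real)]

def noise_suppression_filters(R_s, noise_covs):
    """One filter w_i per selected noise covariance matrix (single frequency bin)."""
    # Assumed form of Expression (6): w_i = maxeig(R_n,i^{-1} R_s).
    # np.linalg.solve(R_n, R_s) computes R_n^{-1} R_s without forming the inverse.
    return [maxeig(np.linalg.solve(R_n, R_s)) for R_n in noise_covs]
```

Each returned vector is defined only up to a complex scale factor, so a practical system would also fix its normalization.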
In the noise suppression filtering 405, the internal noise removal apparatus 100 applies the noise suppression filter w_i(f) corresponding to each noise covariance matrix to the input signal x(f, τ), as shown in Expression (7), and calculates the noise-suppressed signal y_i(f, τ).

The internal noise removal apparatus 100 outputs the noise suppression signal y_i(f, τ) to the power minimization filtering 408.
The internal noise removal apparatus 100 does not execute the noise suppression filtering 405 while the internal noise is not generated. While the internal noise is not generated, the internal noise removal apparatus 100 may output the input signal from any one of the microphones, as converted by the multi-channel frequency analysis 401, to the power minimization filtering 408 as the noise suppression signal y_i(f, τ).
In the power minimization filtering 408, the internal noise removal apparatus 100 selects, from among the noise suppression signals y_i(f, τ) input from the noise suppression filtering 405, the signal for which the square of the absolute value |y_i(f, τ)| is smallest, as the noise suppression signal y_min(f, τ).
Alternatively, instead of the absolute value |y_i(f, τ)|, the internal noise removal apparatus 100 may use P_i(f, τ), which is calculated from the moving average of the power as shown in Expression (8), and take the noise suppression signal y_i(f, τ) that minimizes P_i(f, τ) as the noise suppression signal y_min(f, τ).

Here, β is a coefficient for calculating the moving average, and takes a value from 0 to 1.
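The moving-average variant can be sketched as follows. Expression (8) is not reproduced in this text, so the first-order recursive average below, together with the function name and argument layout, is an assumption for illustration.

```python
def select_by_smoothed_power(outputs, powers, beta):
    """Pick the filter output with the smallest smoothed power.

    outputs: list of complex y_i(f, tau) for the current frame.
    powers:  list of P_i(f, tau - 1) carried over from the previous frame.
    Returns (i_min, y_min, updated_powers).
    """
    # Assumed form of Expression (8): P_i <- beta * P_i + (1 - beta) * |y_i|^2
    updated = [beta * p + (1.0 - beta) * abs(y) ** 2
               for y, p in zip(outputs, powers)]
    i_min = min(range(len(updated)), key=updated.__getitem__)
    return i_min, outputs[i_min], updated
```

Smoothing makes the selection less sensitive to a single frame in which an otherwise poor filter happens to give a small output.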
FIG. 8 is a flowchart of processing of the power minimizing filtering 408 according to the first embodiment of this invention.
In initialization 801, the internal noise removal apparatus 100 sets various variables to initial values. Specifically, the internal noise removal apparatus 100 sets the index i of the noise suppression filter to “0”, sets P_min to the square of |y_0(f, τ)|, and sets the index i_min of the noise suppression filter that gives the minimum to “0”. After setting the variables to their initial values, the internal noise removal apparatus 100 proceeds to decision 805.
In decision 805, the internal noise removal apparatus 100 determines whether the index i of the noise suppression filter is greater than the total number of noise suppression filters i_max. That is, it is determined whether or not the processing has been completed for all noise suppression filters.
When the index i of the noise suppression filter is determined to be less than or equal to the total number of noise suppression filters i_max, the internal noise removal apparatus 100 proceeds to the noise suppression filtering 802.
In the noise suppression filtering 802, the internal noise removal apparatus 100 causes the noise suppression filter to act on the input signal x(f, τ) to calculate the noise suppression signal y_i(f, τ), and proceeds to decision 803.
In decision 803, the internal noise removal apparatus 100 determines whether the square of the absolute value of the noise suppression signal y_i(f, τ) is smaller than P_min. When i = 0, the internal noise removal apparatus 100 skips the comparison with P_min in decision 803, updates the index i of the noise suppression filter to “1”, and returns to decision 805 to perform the same processing for the next noise suppression filter.
If the square of the absolute value of the noise suppression signal y_i(f, τ) is determined to be greater than or equal to P_min, the internal noise removal apparatus 100 updates i to i + 1, returns to decision 805, and performs the same processing for the next noise suppression filter.
If the square of the absolute value of the noise suppression signal y_i(f, τ) is determined to be smaller than P_min, the internal noise removal apparatus 100 proceeds to the minimum value update 804.
In the minimum value update 804, the internal noise removal apparatus 100 updates P_min and i_min, updates i to i + 1, and returns to decision 805 to execute processing for the next noise suppression filter.
In decision 805, when the index i of the noise suppression filter is determined to be greater than the total number of noise suppression filters i_max, the internal noise removal apparatus 100 determines that the processing has been completed for all the noise suppression filters, outputs the noise suppression filter index i and the noise suppression signal y_i(f, τ) that give P_min to the time domain transform 409 as i_min and the noise suppression signal y_min(f, τ), and ends the process.
Through the above-described processing, the internal noise removal apparatus 100 can obtain the noise suppression filter that minimizes the volume of the signal input from the microphone array 101 after filtering, and the output signal whose volume after noise suppression is smallest.
In other words, in the present embodiment, a plurality of noise suppression filters are generated for each attribute such as the type of internal noise and the generation timing, and the noise suppression filter that minimizes the volume after noise suppression can be selected from among them. Therefore, noise can be accurately removed.
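The selection loop of FIG. 8 can be sketched in plain Python as follows, for one frequency bin and one frame. Expression (7) is not reproduced in this text, so the form y = w^H x is an assumption, as are the function names; the loop uses zero-based indices over the list of filters instead of the i_max comparison of the flowchart.

```python
def apply_filter(w, x):
    """Assumed form of Expression (7): y = w^H x for one frequency bin."""
    return sum(wk.conjugate() * xk for wk, xk in zip(w, x))

def power_minimization(filters, x):
    """Return (i_min, y_min) over all filters, following the flow of FIG. 8."""
    i_min = 0
    y_min = apply_filter(filters[0], x)          # initialization 801
    p_min = abs(y_min) ** 2
    for i in range(1, len(filters)):             # decision 805
        y_i = apply_filter(filters[i], x)        # noise suppression filtering 802
        p_i = abs(y_i) ** 2
        if p_i < p_min:                          # decision 803
            p_min, i_min, y_min = p_i, i, y_i    # minimum value update 804
    return i_min, y_min
```

The same call would be repeated for every frequency, so different frequencies of one frame may end up using different filters.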
In the time domain transform 409, the internal noise removal apparatus 100 applies the inverse Fourier transform shown in Expression (9) to the noise suppression signal y_min(f, τ) calculated for each frequency, and calculates the time domain noise suppression signal y_min(p).

Here, f_max is the frequency corresponding to 0.5 times the sampling rate.
In the time domain transform 409, the internal noise removal apparatus 100 applies a function corresponding to the inverse of the window function to the time domain noise suppression signal y_min(p), adds the results across the frames, and outputs the sum as the final signal.
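A possible overlap-add reconstruction is sketched below in NumPy. The text does not specify the function that compensates for the window, so dividing the overlap-added signal by the accumulated window weight is an assumption, as are the function name and the use of `np.fft.irfft` in place of Expression (9).

```python
import numpy as np

def time_domain_transform(Y, l_frame, l_shift):
    """Inverse transform and overlap-add of the selected spectra.

    Y: (num_frames, num_bins) array of y_min(f, tau), with
       num_bins = l_frame // 2 + 1. Returns the time domain signal.
    """
    window = np.hanning(l_frame)
    num_frames = Y.shape[0]
    n = (num_frames - 1) * l_shift + l_frame
    out = np.zeros(n)
    weight = np.zeros(n)
    for tau in range(num_frames):
        start = tau * l_shift
        out[start:start + l_frame] += np.fft.irfft(Y[tau], n=l_frame)
        weight[start:start + l_frame] += window
    # Undo the analysis window wherever the accumulated weight is not near zero.
    covered = weight > 1e-8
    out[covered] /= weight[covered]
    return out
```

With a Hanning analysis window and a frame shift of half the frame size, this recovers the original samples except at the very first and last points, where the window is zero.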
FIG. 9 is a timing chart illustrating an example of the target sound covariance update 403, the noise suppression filter update 404, the noise covariance matrix selection 406, and the internal noise generation status according to the first embodiment of the present invention.
As shown in FIG. 9, the target sound covariance update 403 is executed in a time zone when no internal noise occurs. During the time period, the noise suppression filter update 404 and the noise covariance matrix selection 406 are not executed.
When internal noise occurs, an internal noise generation signal is output from the actuator control device 104. The output internal noise generation signal includes internal noise attributes such as the timing at which the internal noise occurs and the type of internal noise.
The drive signal transmitted to the actuator control device 104 can be used as the internal noise generation signal. That is, the drive signal transmitted to the actuator control device 104 is input to the central processing unit 103 as an internal noise generation signal.
For example, when internal noise of the type “A” is generated, the generation signal of the internal noise A is input to the noise covariance matrix selection 406. The generation signal of the internal noise A is also input to the noise suppression filter update 404 via the target sound covariance update 403.
In the noise covariance matrix selection 406, the internal noise removal apparatus 100 reads a noise covariance matrix corresponding to the input internal noise A from the noise covariance matrix DB302.
In the noise suppression filter update 404, the internal noise removal apparatus 100 generates a noise suppression filter from the target sound covariance matrix and the noise covariance matrix corresponding to the internal noise A. The noise suppression filter is generated using, for example, Expression (6).
During the generation of the internal noise A, the internal noise removal apparatus 100 may execute the noise suppression filter update 404 every frame. During the generation of the internal noise A, the target sound covariance update 403 is not executed.
When the end signal of the internal noise A is input to the target sound covariance update 403, the internal noise removal apparatus 100 resumes the target sound covariance update 403.
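The switching shown in FIG. 9 can be pictured as a small state holder that reacts to the internal noise signals. The class, its method names, and the string labels are hypothetical and serve only to illustrate which processes run in which period.

```python
class UpdateScheduler:
    """Tracks which updates run in a frame, following the timing of FIG. 9."""

    def __init__(self):
        self.active_noise = None   # type of the internal noise in progress, if any

    def on_signal(self, name, noise_type=None):
        """Handle an internal noise signal from the actuator control device."""
        if name == "internal noise generation":
            self.active_noise = noise_type
        elif name == "internal noise end":
            self.active_noise = None

    def steps_for_frame(self):
        """Return the names of the processes to run for the current frame."""
        if self.active_noise is None:
            return ["target sound covariance update 403"]
        return ["noise covariance matrix selection 406",
                "noise suppression filter update 404"]
```

While `active_noise` is set, the target sound covariance stays frozen, and it resumes updating as soon as the end signal arrives.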
Details of the noise covariance estimation process 301 will be described below.
FIG. 10 is a block diagram illustrating details of the noise covariance estimation process 301 according to the first embodiment of this invention.
The noise covariance estimation process 301 includes a multi-channel frequency analysis 1001, feature quantity extraction 1002, feature quantity vector generation 1003, clustering 1004, and noise covariance update 1005, and each process is executed by the central processing unit 103.
In the present embodiment, when the noise covariance estimation process 301 is executed, an internal noise signal for learning is prepared in advance for each internal noise attribute. The learning internal noise signal prepared in advance includes the time zone (timing) in which the noise occurs, so the noise covariance estimation process 301 can extract only the internal noise signal in that time zone and learn from it. A plurality of learning internal noise signals may be prepared in advance.
In the multi-channel frequency analysis 1001, the internal noise removal apparatus 100 converts each internal noise signal into a frequency domain signal x(f, τ) for each frame. Here, the index representing the type of internal noise is q, the index representing a plurality of internal noise signals of the internal noise q is j, and the frequency domain signal of the frame τ of the j-th signal of the internal noise q is denoted x_{q,j}(f, τ).
In the feature quantity extraction 1002, the internal noise removal apparatus 100 generates the feature V_{q,j}(f, τ) from x_{q,j}(f, τ) using Expression (10).

In the feature vector generation 1003, the internal noise removal apparatus 100 first divides each frequency into Z subgroups. Here, z is an index representing a subgroup.
In the feature vector generation 1003, the internal noise removal apparatus 100 concatenates the features V_{q,j}(f, τ) of the same frame τ that belong to the subgroup z into one vector V_{q,j}(z, τ).
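The grouping step can be sketched in plain Python as below. The text does not say how frequencies are assigned to subgroups, so the use of contiguous frequency bands of nearly equal size is an assumption, and the per-frequency features of Expression (10) are treated as opaque values supplied by the caller.

```python
def subgroup_vectors(features, num_subgroups):
    """Split per-frequency features of one frame into Z subgroup vectors.

    features: list of feature values V(f, tau), ordered by frequency.
    Returns a list of Z lists; element z is the vector V(z, tau).
    """
    n = len(features)
    # Band boundaries for contiguous, nearly equal-sized frequency bands.
    bounds = [z * n // num_subgroups for z in range(num_subgroups + 1)]
    return [list(features[bounds[z]:bounds[z + 1]]) for z in range(num_subgroups)]
```

Clustering is then run separately on the vectors of each subgroup.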
FIG. 11 is an explanatory diagram showing the data structure of the vector V_{q,j}(z, τ) according to the first embodiment of the present invention.
As shown in FIG. 11, the vector V_{q,j}(z, τ) is a vector whose elements are the feature values V_{q,j}(f, τ) for each frequency.
In the clustering 1004, the internal noise removal apparatus 100 performs, for each subgroup, a clustering process over all learning signals of the internal noise q and all of their frames. Specifically, for each signal index and frame index, the internal noise removal apparatus 100 outputs an index Ind(j, τ) identifying the cluster to which the feature of that signal and frame belongs.
FIG. 22 is an explanatory diagram illustrating an example of the index Ind (j, τ) according to the first embodiment of this invention.
As a result of executing clustering 1004, an index Ind (j, τ) is assigned to the internal noise signal for each time of the internal noise.
In the example shown in FIG. 22, indexes Ind (j, τ) of A to C are given to the internal noise signal.
In the noise covariance update 1005, the internal noise removal apparatus 100 calculates, for each cluster c, a covariance matrix for each frequency using Expression (11), from the input data x(f, τ) that was used to calculate the features belonging to that cluster.

The number of clusters is the number of noise covariance matrices and noise suppression filters. Note that the number of clusters may be set in advance.
In the noise covariance update 1005, the internal noise removal apparatus 100 stores the noise covariance matrix calculated for each internal noise signal in the storage device 106 or the volatile memory 107 as the noise covariance matrix DB302, and the process ends.
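The per-cluster covariance calculation can be sketched in NumPy as follows, for a single frequency. Expression (11) is not reproduced in this text; averaging x x^H over the frames of each cluster is an assumption, as are the function name and the handling of empty clusters.

```python
import numpy as np

def cluster_covariances(X, labels, num_clusters):
    """Noise covariance matrix of every cluster at one frequency.

    X: (num_frames, M) complex array of x(f, tau).
    labels: cluster index Ind per frame (zero-based).
    Returns a list with one (M, M) matrix per cluster.
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    m = X.shape[1]
    covs = []
    for c in range(num_clusters):
        members = X[labels == c]                 # frames assigned to cluster c
        if len(members) == 0:
            covs.append(np.zeros((m, m), dtype=complex))
            continue
        # Assumed form of Expression (11): average of x x^H over the cluster.
        covs.append(members.T @ members.conj() / len(members))
    return covs
```

The resulting list, computed for every frequency and internal noise type, corresponds to the contents of the noise covariance matrix DB 302.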
FIG. 12 is a flowchart showing details of processing in the clustering 1004 according to the first embodiment of this invention.
Hereinafter, the index q indicating the type of internal noise and the index z of the subgroup are omitted. Also, the signal index j and the frame index τ are combined into a single index by the variable transformation τ = j × T + τ. Here, T is the number of frames of each internal noise signal.
In initialization 1201, the internal noise removal apparatus 100 sets the centroid C(c) of each cluster to a randomly chosen feature value V(τ). Here, random is a variable that randomly selects one of all internal noise signals and all frames. Further, in the clustering 1004, the variable end is set to “FALSE” and Ind_pre(τ) is initialized to “0”.
In determination 1202, the internal noise removal apparatus 100 determines whether or not the variable end is “TRUE”, which indicates the end state.
When it is determined that the variable end is “TRUE”, the internal noise removal apparatus 100 ends the process.
If it is determined that the variable end is not “TRUE”, the internal noise removal apparatus 100 proceeds to initialization 1203.
In initialization 1203, the internal noise removal apparatus 100 initializes the index τ to the smallest value “1”, and proceeds to determination 1204.
In determination 1204, the internal noise removal apparatus 100 determines whether or not the index τ is less than or equal to the maximum value T_max.
If the index τ is determined to be less than or equal to the maximum value T_max, the internal noise removal apparatus 100 proceeds to initialization 1205.
In initialization 1205, the internal noise removal apparatus 100 initializes the variable Ind (τ) to “1”, the cluster index c to “1”, and the variable min to “−1”, and proceeds to decision 1206.
In determination 1206, the internal noise removal apparatus 100 determines whether or not the cluster index c is equal to or less than the number C of clusters.
If it is determined that the cluster index c is greater than the cluster number C, the internal noise removal apparatus 100 updates τ to τ + 1 and returns to determination 1204.
When it is determined that the cluster index c is equal to or less than the number C of clusters, the internal noise removal device 100 proceeds to the distance calculation 1207.
In the distance calculation 1207, the internal noise removal apparatus 100 calculates the distance between the centroid C(c) of the cluster and the feature value V(τ) using the function D, and the process proceeds to determination 1208. As the function D, for example, |C(c) − V(τ)| can be used. The calculated distance is stored in the variable dis.
In determination 1208, the internal noise removal apparatus 100 determines whether or not the variable dis is smaller than the variable min.
If it is determined that the variable dis is greater than or equal to the variable min, the internal noise removal device 100 updates the cluster index c to c + 1 and returns to determination 1206.
If it is determined that the variable dis is smaller than the variable min, the internal noise removal apparatus 100 proceeds to minimum value replacement 1209.
In the minimum value replacement 1209, the internal noise removal apparatus 100 replaces Ind (τ) with the cluster index c. Also, the internal noise removal device 100 replaces the variable min with the variable dis. The internal noise removal apparatus 100 then updates the cluster index c to c + 1 and returns to decision 1206.
If it is determined in determination 1204 that the index τ is larger than the maximum value T_max, the internal noise removal apparatus 100 proceeds to update 1210.
In update 1210, the internal noise removal apparatus 100 updates the centroid C(c) using Equation (12). Specifically, the centroid C(c) of each cluster is updated by calculating the average of the feature values V(τ) belonging to that cluster.

After the update, the internal noise removal device 100 proceeds to decision 1211.
In decision 1211, the internal noise removal apparatus 100 determines whether or not Ind_pre(τ) and Ind(τ) are equal for all indices τ.
When it is determined that the condition of decision 1211 is not satisfied, the internal noise removal apparatus 100 assigns Ind(τ) to Ind_pre(τ) for all indices τ, and returns to determination 1202.
If it is determined that Ind_pre(τ) and Ind(τ) are equal for all indices τ, the internal noise removal apparatus 100 sets the variable end to “TRUE” and returns to determination 1202.
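The flow of FIG. 12 resembles the well-known k-means procedure, and the following Python sketch illustrates it under several assumptions that are not taken from the specification: the feature values are lists of floats, the function D is the Euclidean distance, cluster indices start at 0, the running minimum starts at infinity so that the first comparison succeeds, and an empty cluster keeps its previous centroid. The function name cluster_features is hypothetical.

```python
import math
import random


def cluster_features(features, num_clusters, seed=0):
    """k-means style clustering following the flow of FIG. 12 (illustrative)."""
    rng = random.Random(seed)
    # Initialization 1201: randomly chosen feature values become the centroids.
    centroids = [list(v) for v in rng.sample(features, num_clusters)]
    ind_pre = [-1] * len(features)  # previous assignment (-1: no cluster yet)
    end = False
    while not end:  # determination 1202
        ind = []
        for v in features:  # loop over tau (1203, 1204)
            # Initialization 1205; starting min at infinity is an assumption.
            best_c, best_dis = 0, math.inf
            for c, centroid in enumerate(centroids):  # determination 1206
                dis = math.dist(centroid, v)  # distance calculation 1207
                if dis < best_dis:  # determination 1208
                    best_c, best_dis = c, dis  # minimum value replacement 1209
            ind.append(best_c)
        # Update 1210: each centroid becomes the mean of its member features.
        for c in range(num_clusters):
            members = [v for v, i in zip(features, ind) if i == c]
            if members:  # an empty cluster keeps its old centroid (assumption)
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Decision 1211: stop once no assignment has changed.
        end = ind == ind_pre
        ind_pre = ind
    return centroids, ind_pre
```

Calling cluster_features(features, C) returns the final centroids together with the cluster index assigned to each feature value, which corresponds to Ind(τ) at the time the variable end becomes “TRUE”.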
Second embodiment
Hereinafter, a second embodiment of the present invention will be described. The second embodiment of the present invention assumes an audio conference system. Hereinafter, the difference from the first embodiment of the present invention will be mainly described.
FIG. 13 is a block diagram of a hardware configuration of an internal noise removal device in the audio conference system according to the second embodiment of this invention.
In the second embodiment, an audio conference system including an internal noise removal device 1300, a microphone array 1301, a keyboard signal recognition device 1304, a keyboard 1305, an audio transmission device 1308, an audio reception device 1309, a DA conversion device 1310, and a speaker 1311 will be described.
The internal noise removal device 1300 includes an AD conversion device 1302, a central processing unit 1303, a storage device 1306, and a volatile memory 1307.
The AD converter 1302 converts the analog signal input from the microphone array 1301 into a digital signal that can be processed by the central processing unit 1303. In the example illustrated in FIG. 13, an analog signal is input from the microphone array 1301 to the AD conversion device 1302.
The central processing unit 1303 executes various programs loaded into the volatile memory 1307. Specifically, the central processing unit 1303 executes a process of removing internal noise from the digital signal converted by the AD converter 1302 and extracting only the internal noise removal sound.
In the second embodiment, the internal noise removal sound is a sound (keyboard removal sound) obtained by removing the sound generated when the user operates a key of the keyboard 1305. The extracted internal noise removal sound (keyboard removal sound) is transmitted to the voice transmission device 1308.
The storage device 1306 stores a program for removing internal noise and data related to internal noise. The program stored in the storage device 1306 is the same as that in the first embodiment. The volatile memory 1307 is used to secure work memory during program execution.
The keyboard signal recognition device 1304 detects information indicating which key of the keyboard 1305 has been operated and when. The detected information is transmitted to the central processing unit 1303.
The voice transmission device 1308 transmits the internal noise removal sound received from the central processing unit 1303 to the voice conference destination.
The voice receiving device 1309 receives a voice signal sent from the voice conference destination and transmits the received voice signal to the central processing unit 1303. The central processing unit 1303 transmits the received audio signal to the DA converter 1310.
The DA converter 1310 converts the received audio signal into an analog audio signal and transmits the analog audio signal to the speaker 1311.
The speaker 1311 reproduces the analog audio signal transmitted from the DA converter 1310. Note that analog audio signals reproduced from the speaker 1311 (referred to as speaker reproduction signals) are collected by the microphone array 1301. In this case, the speaker reproduction signal included in the sound collected by the microphone array 1301 is removed by an acoustic echo canceller process executed by the central processing unit 1303.
The internal noise removal apparatus 1300 may include at least one of a microphone array 1301, a keyboard signal recognition apparatus 1304, a keyboard 1305, an audio transmission apparatus 1308, an audio reception apparatus 1309, a DA conversion apparatus 1310, and a speaker 1311. Further, the AD conversion device 1302 or the storage device 1306 may be provided outside the internal noise removal device 1300.
FIG. 14 is an explanatory diagram illustrating an example of an internal noise signal corresponding to the operation sound of the keyboard 1305 according to the second embodiment of this invention.
An internal noise signal is issued when each key of the keyboard 1305 is operated. The internal noise signal is defined so that it can be identified which key of the keyboard 1305 produced the operation sound.
Specifically, the operation sound name 1400 is defined for the internal noise signal corresponding to the generation timing 1410. The operation sound name 1400 is an identifier for identifying the operation sound of the keyboard 1305. The generation timing 1410 indicates a timing at which internal noise corresponding to the operation sound name 1400 is generated.
FIG. 15 is a block diagram of processing executed by each device according to the second embodiment of this invention.
The sound collected by the microphone array 1301 is transmitted to the AD conversion device 1302. The AD conversion device 1302 executes AD conversion processing 1502 on the received audio signal, and converts the received audio signal into a digital signal. The AD converter 1302 transmits the digitized audio signal to the central processing unit 1303.
Note that the digitized audio signal includes, in addition to the voice emitted by the user, the sound reproduced from the speaker 1311 and collected by the microphone array 1301 (acoustic echo) and the noise generated when the keyboard 1305 is operated.
The central processing unit 1303 executes an echo canceller 1505 on the audio signal transmitted from the AD conversion device 1302.
The echo canceller 1505 removes the acoustic echo component using a general algorithm such as NLMS, with the audio signal output from the speaker 1311 as a reference signal.
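As an illustration of how an NLMS-based echo canceller of this kind can work, the following Python sketch adapts a short FIR filter so that the filtered reference signal matches the echo contained in a single microphone channel. It is only a sketch under assumptions: the specification does not give the filter length, the step size, or the per-channel handling, so the function name nlms_echo_cancel and the parameters num_taps, mu, and eps are hypothetical.

```python
def nlms_echo_cancel(mic, ref, num_taps=8, mu=0.5, eps=1e-8):
    """Remove the echo of `ref` from `mic` with an NLMS adaptive FIR filter."""
    w = [0.0] * num_taps    # current estimate of the echo path
    buf = [0.0] * num_taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        echo_est = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_est    # echo-suppressed output sample
        # Normalizing by the reference power keeps the step size stable.
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out, w
```

Here mic stands for one channel of the digitized microphone signal and ref for the signal sent to the speaker 1311. The returned samples are what would be passed on to the internal noise removal processing 1503.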
The sound signal from which the acoustic echo component has been removed is output to the internal noise removal processing 1503. The central processing unit 1303 performs internal noise removal processing 1503 on the audio signal from which the acoustic echo component has been removed, and removes the operation sound of the keyboard. Note that the internal noise removal processing 1503 has the same configuration as the internal noise removal processing 303 of the first embodiment.
The voice signal from which the internal noise has been removed is transmitted to the conference partner via the network by voice transmission 1508.
The voice of the conference partner is received by voice reception 1507 via the network. The received voice is transmitted to the DA converter 1310.
The DA converter 1310 converts the received voice into an analog voice signal by executing a DA conversion process 1504 on the received voice. Further, the DA converter 1310 transmits an analog audio signal to the speaker 1311.
The speaker 1311 reproduces the received analog audio signal.
FIG. 16 is an explanatory diagram illustrating an example of a user use scene in the audio conference system according to the second embodiment of this invention.
When the user operates a button arranged on the keyboard 1601, noise is generated from the operated button position. The generated noise is collected by the microphone array 1603 together with the voice of the user in the voice conference system. Note that the microphone array 1603 can be arranged on a display device 1602 of a personal computer, for example.
Third embodiment
Hereinafter, a third embodiment of the present invention will be described. The third embodiment of the present invention assumes an audio conference system including a touch panel. Hereinafter, the difference from the first embodiment of the present invention will be mainly described.
FIG. 17 is a block diagram of a hardware configuration of an internal noise removing device in an audio conference system including a touch panel according to the third embodiment of the present invention.
In the third embodiment, an audio conference system including an internal noise removal device 1700, a microphone array 1701, a touch position recognition device 1704, a touch panel 1705, an audio transmission device 1708, an audio reception device 1709, a DA conversion device 1710, and a speaker 1711 will be described.
The internal noise removal device 1700 includes an AD conversion device 1702, a central processing unit 1703, a storage device 1706, and a volatile memory 1707.
The AD converter 1702 converts the analog signal input from the microphone array 1701 into a digital signal that can be processed by the central processing unit 1703. In the example illustrated in FIG. 17, an analog signal is input from the microphone array 1701 to the AD converter 1702.
The central processing unit 1703 executes various programs loaded into the volatile memory 1707. Specifically, the central processing unit 1703 executes a process of removing internal noise from the digital signal converted by the AD converter 1702 and extracting only the internal noise removal sound.
In the third embodiment, the internal noise removal sound is a sound (touch panel removal sound) obtained by removing the sound generated when the user operates the touch panel 1705. The extracted internal noise removal sound (touch panel removal sound) is transmitted to the voice transmission device 1708.
The storage device 1706 stores a program for removing internal noise and data related to internal noise. The program stored in the storage device 1706 is the same as that in the first embodiment. The volatile memory 1707 is used to secure work memory during program execution.
The touch position recognition device 1704 detects information about which position on the touch panel 1705 has been operated and when. The detected information is transmitted to the central processing unit 1703.
The voice transmission device 1708 transmits the internal noise-removed sound received from the central processing unit 1703 to the voice conference call destination.
The voice receiving device 1709 receives a voice signal sent from the voice conference destination and transmits the received voice signal to the central processing unit 1703. The central processing unit 1703 transmits the received audio signal to the DA converter 1710.
The DA converter 1710 converts the received audio signal into an analog audio signal and transmits the analog audio signal to the speaker 1711.
The speaker 1711 reproduces the analog audio signal transmitted from the DA converter 1710. Note that an analog audio signal reproduced from the speaker 1711 (hereinafter referred to as a speaker reproduction signal) is collected by the microphone array 1701. In this case, the speaker reproduction signal included in the sound collected by the microphone array 1701 is removed by acoustic echo canceller processing executed by the central processing unit 1703.
The internal noise removal apparatus 1700 may include at least one of a microphone array 1701, a touch position recognition apparatus 1704, a touch panel 1705, an audio transmission apparatus 1708, an audio reception apparatus 1709, a DA conversion apparatus 1710, and a speaker 1711. The AD conversion device 1702 or the storage device 1706 may be provided outside the internal noise removal device 1700.
FIG. 18 is an explanatory diagram illustrating an example of an internal noise signal corresponding to the operation sound of the touch panel 1705 according to the third embodiment of this invention.
An internal noise signal is issued when each touch position on the touch panel 1705 is operated. The internal noise signal includes information that identifies, for each touch position, which position on the touch panel 1705 produced the operation sound.
Specifically, a touch position name 1800 is defined for the internal noise signal corresponding to the generation timing 1810. The touch position name 1800 is an identifier for identifying an operation sound for each touch position on the touch panel 1705. The generation timing 1810 indicates a timing at which internal noise corresponding to the touch position name 1800 is generated.
FIG. 19 is an explanatory diagram illustrating an example of a user use scene in the audio conference system including the touch panel 1705 according to the third embodiment of this invention.
The sound of operating the touch panel 1902 is collected by the microphone array 1901 together with the sound emitted by the user in the audio conference system.
Fourth embodiment
The fourth embodiment of the present invention will be described below. The fourth embodiment of the present invention assumes a robot (see FIG. 23) having a voice recognition function. Hereinafter, the difference from the first embodiment of the present invention will be mainly described.
Note that the robot 2201 of the fourth embodiment includes the internal noise removal device 100. Since the hardware configuration and processing configuration of the internal noise removal apparatus 100 are the same as those in the first embodiment, the description thereof is omitted.
FIG. 20 is a block diagram of a configuration of speech recognition processing including internal noise removal processing according to the fourth embodiment of this invention.
The internal noise removal apparatus 100 performs AD conversion processing 2002 on the audio signals collected by the microphone array 2204 for voice recognition, and converts them into digital audio signals. The digital audio signal is output to the internal noise removal process 303.
The internal noise removal apparatus 100 executes an internal noise removal process 303, removes internal noise contained in the digital voice signal, and extracts only the voice of a person who is the target of voice recognition. The extracted voice is output to the voice recognition 2004.
In the speech recognition 2004, a general feature extraction process such as MFCC is executed, and a Viterbi decoding process is executed using the extracted features and an acoustic model trained in advance to recognize what was uttered. The internal noise removal apparatus 100 outputs the recognition result and ends the process.
Fifth embodiment
The fifth embodiment of the present invention will be described below. The fifth embodiment of the present invention shows a modification of the noise covariance matrix and target sound covariance matrix estimation method and the noise suppression filter adaptation method. Hereinafter, the difference from the first embodiment of the present invention will be mainly described.
In the fifth embodiment, the apparatus configuration is the same as the hardware configuration and processing configuration of the first embodiment, and a description thereof will be omitted.
FIG. 21 is a block diagram showing details of the internal noise removal processing 303 according to the fifth embodiment of this invention.
The internal noise removal apparatus 100 performs multi-channel frequency analysis 2101 on the audio signal collected by the microphone array 101 for each frame, and converts it into a frequency domain signal. The converted frequency domain signal is output to the sound source direction estimation 2102 for each frequency.
The internal noise removal apparatus 100 performs sound source direction estimation 2102 on the frequency domain signal to identify the direction of the sound source. As the sound source direction estimation 2102, for example, the GCC-PHAT method or the delay-and-sum array method, both based on the phase difference between microphones, can be used.
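To make the GCC-PHAT option concrete, the following Python sketch estimates the time difference of arrival between two microphone signals and converts it into an arrival angle. It is a broadband, two-microphone illustration under a far-field assumption, whereas the specification estimates the direction for each frame and each frequency. The function names gcc_phat_delay and delay_to_angle, the speed of sound of 343 m/s, and the microphone spacing parameter are assumptions introduced only for this example.

```python
import numpy as np


def gcc_phat_delay(x1, x2, fs):
    """Estimate by how many seconds x2 lags x1 using GCC-PHAT."""
    n = len(x1) + len(x2)             # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12    # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Reorder so that index 0 corresponds to a lag of -max_shift samples.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs


def delay_to_angle(tau, mic_distance, c=343.0):
    """Convert a delay in seconds into an arrival angle in radians (far field)."""
    return float(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0)))
```

The estimated angle can then be compared with the preset target sound direction to decide whether a component is treated as target sound or as noise.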
In the sound source direction estimation 2102, the internal noise removal apparatus 100 sets a target sound direction in advance, and determines whether the sound source direction matches a predetermined target sound direction for each frame and each frequency.
When it is determined that the sound source direction matches the target sound direction, the internal noise removal apparatus 100 performs target sound adaptation 2103, treating the sound of the components (frame and frequency) that satisfy the condition as the target sound. Specifically, the internal noise removal apparatus 100 updates the target sound covariance matrix R_s(f) using Equation (5).
When the sound source direction does not match the target sound direction, the internal noise removal apparatus 100 performs noise adaptation 2104, treating the sound of the components (frame and frequency) that do not satisfy the condition as noise. Specifically, the internal noise removal apparatus 100 updates the noise covariance matrix R_b(f) using Equation (13).

In the internal noise addition 2105, the internal noise removal apparatus 100 adds the noise covariance matrix R_b(f) to each noise covariance matrix corresponding to the internal noise signal.
In the filter adaptation 2106, the internal noise removal apparatus 100 substitutes the target sound covariance matrix and each noise covariance matrix to which R_b(f) has been added into Equation (6) to generate a noise suppression filter.
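The adaptation and addition steps of FIG. 21 can be sketched in Python as follows. Equations (5) and (13) are not reproduced in this section, so the exponential smoothing update used here is an assumed stand-in rather than the update defined in the specification, and the smoothing factor alpha and the function names adapt_covariances and add_internal_noise are hypothetical. Filter generation by Equation (6) is deliberately left out.

```python
import numpy as np


def adapt_covariances(x, is_target, R_s, R_b, alpha=0.95):
    """Update R_s or R_b for one frame at one frequency.

    x: complex vector with one value per microphone.
    is_target: True when the estimated direction matches the target direction.
    The smoothing form is an assumption, not Equation (5) or Equation (13).
    """
    outer = np.outer(x, np.conj(x))  # instantaneous covariance x x^H
    if is_target:
        R_s = alpha * R_s + (1.0 - alpha) * outer  # target sound adaptation 2103
    else:
        R_b = alpha * R_b + (1.0 - alpha) * outer  # noise adaptation 2104
    return R_s, R_b


def add_internal_noise(R_b, internal_noise_covs):
    """Internal noise addition 2105: add R_b to every stored covariance matrix."""
    return [R_q + R_b for R_q in internal_noise_covs]
```

Each matrix returned by add_internal_noise would then be passed, together with R_s, to the filter adaptation 2106.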
According to the embodiments of the present invention, the internal noise removal apparatus 100 generates a plurality of noise covariance matrices according to internal noise attributes such as the type of internal noise and the generation timing, selects the plurality of noise covariance matrices corresponding to the internal noise that has been generated, generates a noise suppression filter from each selected noise covariance matrix, and selects an appropriate noise suppression filter from the plurality of noise suppression filters. As a result, noise can be appropriately removed even when it is non-stationary noise whose sound quality changes depending on the operating state of the actuator.
Further, operation sounds other than those of the actuator, such as the operation sounds of the keyboard 1305 or the touch panel 1705, can also be accurately removed.
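The selection criterion stated in the abstract and the claims, namely choosing the noise suppression filter whose output has the smallest volume, can be illustrated with the following Python sketch. It assumes that each candidate filter is a complex weight vector applied as y = w^H x to the microphone spectra at a single frequency and that volume is measured as the mean output power over a block of frames. These conventions and the function name select_filter are assumptions made for this example.

```python
import numpy as np


def select_filter(filters, X):
    """Return the index and output of the filter with the smallest output power.

    filters: complex array of shape (K, M), one candidate filter per row.
    X: complex array of shape (M, T), microphone spectra for T frames.
    """
    Y = np.conj(filters) @ X                 # (K, T): y_k = w_k^H x per frame
    power = np.mean(np.abs(Y) ** 2, axis=1)  # output volume of each candidate
    k = int(np.argmin(power))
    return k, Y[k]
```

Because every candidate filter is generated so as to pass the target sound, the candidate with the smallest output is the one that suppressed the most noise, which is the rationale for the minimum-volume criterion.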

Claims

A noise removal device that removes noise from sound collected by a microphone array composed of a plurality of microphones,
The sound collected by the microphone array is input to the noise removing device as an analog signal,
The noise removing device comprises a microprocessor, a storage device connected to the microprocessor, a memory connected to the microprocessor, and an AD conversion device that is connected to the microprocessor and converts the analog signal into a digital signal,
The storage device stores a noise suppression filter generation program for generating a noise suppression filter for removing noise included in the sound collected by the microphone array, and a noise removal program for removing, using the noise suppression filter, the noise included in the sound collected by the microphone array,
The noise removing device includes:
Based on the noise included in the sound collected by the microphone array, a plurality of the noise suppression filters are generated,
Each noise suppression filter is allowed to act on the digital signal converted by the AD converter, and the noise suppression filter that minimizes the volume of the digital signal from which noise has been removed is selected.
A noise removing device that removes noise from a digital signal input from the AD converter using the selected noise suppression filter.

The noise removal device according to claim 1,
The noise removing apparatus, wherein the noise removed from the digital signal input from the AD conversion apparatus is an operation sound of a device that performs a predetermined operation included in the sound collected by the microphone array.

The noise removal device according to claim 2,
The noise has a plurality of attributes,
The noise removing device includes:
Based on a digital signal corresponding to each attribute of noise generated from the device, a clustering process is performed to generate a plurality of first noise covariance matrices for each noise attribute generated from the device,
A noise removing apparatus that generates the plurality of noise suppression filters using each of the first noise covariance matrices.

The noise removal device according to claim 3,
The noise removing device is connected to the device via a control unit that controls the device,
Information indicating the attribute of noise generated from the device is input from the control unit to the noise removal device,
The noise removing device includes:
Selecting the plurality of first noise covariance matrices corresponding to the noise attribute based on the input noise attribute;
A noise removal apparatus that generates the plurality of noise suppression filters using each of the selected first noise covariance matrices.

The noise removal device according to claim 4,
Generating a target covariance matrix from the digital signal input from the AD converter;
A noise removal apparatus that generates the plurality of noise suppression filters using the generated target covariance matrix and each of the selected first noise covariance matrices.

The noise removal device according to claim 3,
The attribute of the noise generated from the device includes at least one of information for identifying the source of the noise, a state of the noise source when the noise is generated, or a timing at which the noise is generated. The noise removal apparatus characterized by the above-mentioned.

The noise removal device according to claim 3,
Estimating the sound source direction of the sound collected by the microphone array;
Generating a second noise covariance matrix for removing sound arriving from other than the target sound source direction as noise from the sound arriving from the target sound source direction;
Generating a plurality of third noise covariance matrices by adding the second noise covariance matrix and each of the first noise covariance matrices;
A noise removal apparatus that generates the plurality of noise suppression filters using each of the third noise covariance matrices.

A noise removal method in a noise removal apparatus for removing noise from sound collected by a microphone array composed of a plurality of microphones,
The sound collected by the microphone array is input to the noise removing device as an analog signal,
The noise removing device comprises a microprocessor, a storage device connected to the microprocessor, a memory connected to the microprocessor, and an AD conversion device that is connected to the microprocessor and converts the analog signal into a digital signal,
The storage device stores a noise suppression filter generation program for generating a noise suppression filter for removing noise from the sound collected by the microphone array, and a noise removal program for removing, using the noise suppression filter, noise from the sound collected by the microphone array,
The method
A first step of generating a plurality of the noise suppression filters based on noise included in the sound collected by the microphone array;
A second step in which the noise removing device operates the respective noise suppression filters on the digital signal to select the noise suppression filter with the smallest volume of the digital signal from which noise has been removed;
The noise removal method includes: a third step of removing noise from the digital signal input from the AD converter using the selected noise suppression filter.

The noise removal method according to claim 8, comprising:
The noise removed from the digital signal input from the AD converter is an operation sound of a device that performs a predetermined operation included in the sound collected by the microphone array.

The noise removal method according to claim 9, comprising:
The noise has a plurality of attributes,
The first step includes
A fourth step in which the noise removal apparatus performs a clustering process based on a digital signal corresponding to each attribute of noise generated from the device, and generates a plurality of first noise covariance matrices for each attribute of noise generated from the device;
And a fifth step of generating the plurality of noise suppression filters using each of the first noise covariance matrices.

The noise removal method according to claim 10, comprising:
The noise removing device is connected to the device via a control unit that controls the device,
Information indicating the attribute of noise generated from the device is input from the control unit to the noise removal device,
The fourth step includes
A sixth step in which the noise removing device selects the plurality of first noise covariance matrices corresponding to the attribute of the noise generated from the device, based on the input attribute of the noise generated from the device;
And a seventh step of generating the plurality of noise suppression filters using each of the selected first noise covariance matrices.

The noise removal method according to claim 11, comprising:
The fifth step includes
An eighth step in which the noise removing device generates a target covariance matrix from the digital signal input from the AD converter;
And a ninth step in which the noise removal apparatus generates the plurality of noise suppression filters using the generated target covariance matrix and each first noise covariance matrix selected in the sixth step.

The noise removal method according to claim 10, comprising:
The attribute of the noise generated from the device includes at least one of information for identifying the source of the noise, a state of the noise source when the noise is generated, or a timing at which the noise is generated. A noise removal method characterized by the above.

The noise removal method according to claim 10, comprising:
Further, a tenth step in which the noise removing device estimates a sound source direction of sound collected by the microphone array;
An eleventh step in which the noise removing device generates a second noise covariance matrix for removing, as noise, sound arriving from a direction other than the target sound source direction from sound arriving from the target sound source direction;
A twelfth step in which the noise removing device generates a plurality of third noise covariance matrices by adding the second noise covariance matrix and the first noise covariance matrix;
And a thirteenth step of generating the plurality of noise suppression filters using each of the third noise covariance matrices.