JP2010271712A

JP2010271712A - Sound data processing device and sound data processing method

Info

Publication number: JP2010271712A
Application number: JP2010101023A
Authority: JP
Inventors: Kazuhiro Nakadai; 一博中臺; Ince Goekhan; インジュ・ギョカン; Tobias Rodemann; トビアス・ローデマン; Koji Tsujino; 広司辻野
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2009-05-22
Filing date: 2010-04-26
Publication date: 2010-12-02
Anticipated expiration: 2030-04-26
Also published as: JP5535746B2

Abstract

PROBLEM TO BE SOLVED: To process sound data including a signal sound so as to reduce noise caused by a mechanical device.

SOLUTION: A sound data processing device includes an operating state acquisition part 101 that acquires an operating state of the mechanical device; a sound data acquisition part 103 that acquires sound data corresponding to the acquired operating state; and a database 105 that stores, as templates, various operating states of the mechanical device per unit time together with the associated sound data. The device also includes a database retrieval part 107 that retrieves from the database 105 the template whose operating state is nearest to the acquired operating state, and a template subtraction part 109 that subtracts the sound data of that template from the acquired sound data to reduce the noise caused by the mechanical device.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a sound data processing device and a sound data processing method that process sound data including a signal sound so as to reduce noise generated by a mechanical device such as a robot.

For example, a robot that performs automatic speech recognition while moving needs a function for suppressing noise caused by its own motion, that is, self-noise (see, for example, Patent Document 1).

Sound source localization and sound source separation have been studied for noise reduction in general, but that work is not directed to automatic speech recognition under self-noise (see, for example, Non-Patent Document 1). Conventional noise reduction methods such as spectral subtraction often fail to work well in practice (see, for example, Non-Patent Document 2). Performing spectral subtraction separately for each motion is also known, but covering a large number of motion types in this way is practically impossible. Furthermore, because the sources of a robot's self-noise lie in the near field, the performance of conventional far-field noise reduction methods degrades significantly.

Thus, no sound data processing device or method has been developed that processes sound data including a signal sound so as to efficiently reduce noise generated by a mechanical device, such as the self-noise of a robot.

Patent Document 1: JP 2008-122927 A

Non-Patent Document 1: I. Hara, F. Asano, H. Asoh, J. Ogata, N. Ichimura, Y. Kawai, F. Kanehiro, H. Hirukawa, and K. Yamamoto, "Robust speech interface based on audio and video information fusion for humanoid HRP-2," in Proc. IEEE/RSJ International Joint Conference on Robots and Intelligent Systems (IROS), pp. 2404-2410, (2004)

Non-Patent Document 2: S. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, No. 2, (1979)

Accordingly, there is a need for a sound data processing device and a sound data processing method that process sound data including a signal sound so as to efficiently reduce noise generated by a mechanical device, such as the self-noise of a robot.

A sound data processing device according to one aspect of the present invention processes sound data including a signal sound so as to reduce noise generated by a mechanical device. The device includes an operating state acquisition unit that acquires an operating state of the mechanical device, a sound data acquisition unit that acquires sound data corresponding to the acquired operating state, and a database that stores, as templates, various operating states of the mechanical device per unit time together with the corresponding sound data. The device further includes a database search unit that searches the database for the template whose operating state is closest to the acquired operating state, and a template subtraction unit that subtracts the sound data of that template from the acquired sound data to reduce the noise generated by the mechanical device.

In the sound data processing device according to this aspect, the templates stored in the database each represent the sound data corresponding to one of various operating states of the mechanical device per unit time. Among them, the sound data of the template whose operating state is closest to the acquired operating state is used as a predicted value of the sound data for the acquired operating state. Subtracting this predicted value from the acquired sound data therefore efficiently reduces the noise generated by the mechanical device in accordance with its operating state.

A sound data processing device according to one embodiment of the present invention further includes a multi-channel noise reduction unit that reduces noise based on multi-channel sound data, and an output selection unit that selects, based on the acquired operating state, either the output of the template subtraction unit or the output of the multi-channel noise reduction unit.

According to this embodiment, whichever of the two outputs has the higher noise reduction effect can be selected in accordance with the acquired operating state.

In a sound data processing device according to one embodiment of the present invention, the mechanical device is a robot.

According to this embodiment, noise generated by the robot can be reduced efficiently in accordance with the operating state of the robot.

In a sound data processing device according to one embodiment of the present invention, the operating state of the robot is represented by data on the angle, angular velocity, and angular acceleration of the joint motors.

According to this embodiment, using the angle, angular velocity, and angular acceleration data of the joint motors makes it possible to grasp the operating state of the robot easily and reliably.

In a sound data processing device according to one embodiment of the present invention, the template whose operating state is closest to the acquired operating state is determined by distance in the three-dimensional space formed by the angle, angular velocity, and angular acceleration data of a motor.

According to this embodiment, using distance in that three-dimensional space makes it possible to determine the closest template easily and reliably.

In a sound data processing device according to one embodiment of the present invention, the sound data of the template is represented by a frequency spectrum.

According to this embodiment, representing the sound data by a frequency spectrum allows the data to be stored efficiently.

A sound data processing device according to one embodiment of the present invention further includes a database creation unit that creates the database by collecting various operating states of the mechanical device per unit time and the sound data corresponding to those operating states.

According to this embodiment, the workload of creating the database is greatly reduced.

A sound data processing method according to one aspect of the present invention processes sound data including a signal sound so as to reduce noise generated by a mechanical device. The method includes a step of acquiring an operating state of the mechanical device and a step of acquiring sound data corresponding to the acquired operating state. The method further includes a step of retrieving, from a database that stores as templates various operating states of the mechanical device per unit time together with the corresponding sound data, the sound data of the template whose operating state is closest to the acquired operating state, and a step of subtracting the sound data of that template from the acquired sound data to obtain an output in which the noise generated by the mechanical device is reduced.

In the sound data processing method according to this aspect, the templates stored in the database each represent the sound data corresponding to one of various operating states of the mechanical device per unit time. Among them, the sound data of the template whose operating state is closest to the acquired operating state is used as a predicted value of the sound data for the acquired operating state. Subtracting this predicted value from the acquired sound data therefore efficiently reduces the noise generated by the mechanical device in accordance with its operating state.

A sound data processing method according to one embodiment of the present invention further includes a step of selecting, based on the acquired operating state, either the output in which noise has been reduced by template subtraction or the output in which noise has been reduced using multi-channel sound data.

According to this embodiment, whichever of the two outputs has the higher noise reduction effect can be selected in accordance with the acquired operating state.

FIG. 1 is a diagram showing the configuration of the template subtraction noise reduction unit of a sound data processing device according to one embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a sound data processing device according to one embodiment of the present invention.
FIG. 3 is a diagram showing the structure of the database.
FIG. 4 is a flowchart showing the procedure for creating the database.
FIG. 5 is a flowchart showing the procedure for noise reduction using template subtraction.
FIG. 6 is a diagram showing the relationship between the SN ratio (signal-to-noise ratio) and the word correct rate (WCR) when "head motion" noise is present.
FIG. 7 is a diagram showing the relationship between the SN ratio and the WCR when "arm motion" noise is present.
FIG. 8 is a diagram showing the relationship between the SN ratio and the WCR when both "head motion" and "arm motion" noise are present.

FIG. 1 is a diagram showing the configuration of the template subtraction noise reduction unit 100 of a sound data processing device according to one embodiment of the present invention. The mechanical device in this embodiment is a robot. The template subtraction noise reduction unit 100 includes an operating state acquisition unit 101 that acquires the operating state of the robot, a sound data acquisition unit 103 that acquires sound data containing signal and noise, a database 105 that stores, as templates, various operating states of the mechanical device per unit time together with the corresponding sound data, and a database creation unit 111 that creates the database. A template thus represents the sound data (acoustic signal) produced in one of the robot's various operating states during a unit time. The template subtraction noise reduction unit 100 further includes a database search unit 107 that searches the database 105 for the template whose operating state is closest to the acquired operating state, and a template subtraction unit 109 that subtracts the sound data of that template from the acquired sound data to reduce the noise generated by the mechanical device.

The operating state acquisition unit 101 is connected to the angle sensors of the robot's joint motors and acquires angle data from them. The operating state of the robot is represented by the angle, angular velocity, and angular acceleration of each joint motor. The operating state acquisition unit 101 differentiates the acquired angle data to obtain the angular velocity data and the angular acceleration data.
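The differentiation step can be illustrated with a minimal sketch. It assumes a backward finite difference over uniformly sampled angles; the text does not say which numerical scheme is used, and the function name `differentiate` and the sample values are hypothetical. The 5 ms period is the sampling interval the description gives for the database-building step.

```python
def differentiate(samples, dt):
    """Backward finite difference of a uniformly sampled sequence.

    The first output is 0.0 because no earlier sample exists (an assumption
    of this sketch, not something stated in the description).
    """
    return [0.0] + [(b - a) / dt for a, b in zip(samples, samples[1:])]


DT = 0.005  # 5 ms sampling period

# Hypothetical angle readings (radians) from one joint motor's angle sensor.
angles = [0.00, 0.01, 0.03, 0.06]

velocities = differentiate(angles, DT)         # angular velocity
accelerations = differentiate(velocities, DT)  # angular acceleration
```

Differentiating twice amplifies sensor noise, so a practical system would probably smooth the angle data first; that step is omitted here.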

The sound data acquisition unit 103 is connected to a microphone 201 installed on the robot and acquires the sound data collected by the microphone 201. The sound data acquisition unit 103 may have a background noise reduction function using, for example, MCRA (I. Cohen, "Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSSP-27, No. 2, (1979)).

The data structure of the database 105 and the functions of the database creation unit 111, the database search unit 107, and the template subtraction unit 109 are described in detail later.

FIG. 2 is a diagram showing the configuration of a sound data processing device 150 according to one embodiment of the present invention. The sound data processing device 150 includes: the template subtraction noise reduction unit 100 described with reference to FIG. 1; a multi-channel noise reduction unit 121 that reduces noise using the conventional geometric source separation method (Geometric Source Separation, GSS; L. C. Parra and C. V. Alvino, "Geometric Source Separation: Merging Convolutive Source Separation with Geometric Beamforming," in IEEE Trans. Speech Audio Process., vol. 10, No. 6, pp. 352-362, (2002)); sound feature extraction units 123 and 125 that extract sound features from the noise-processed sound data; an output selection unit 127 that selects either the output of the template subtraction noise reduction unit 100 or the output of the multi-channel noise reduction unit 121; and a speech recognition unit 129 that performs speech recognition using the selected output.

The multi-channel noise reduction unit 121 acquires sound data from eight microphones 201 ... 203 installed on the robot's head, localizes the sound sources, performs sound source separation using the localized positions, and then performs post-filtering (Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 901-904, (2002)). The post-filtering reduces, for example, stationary noise such as background noise, and non-stationary noise caused by energy leaking between the output channels of the separation stage for the individual sources. The multi-channel noise reduction unit 121 may be implemented with any other configuration, as long as it uses multiple channels and can separate directional sound sources.

The output selection unit 127 receives information on the robot's operating state from the operating state acquisition unit 101 of the template subtraction noise reduction unit 100. Based on that information, it selects the output of whichever of the template subtraction noise reduction unit 100 and the multi-channel noise reduction unit 121 reduces noise more efficiently in that operating state, and sends that output to the speech recognition unit 129. The relationship between the robot's operating state and the performance of the two noise reduction units is described later.

FIG. 3 is a diagram showing the structure of the database 105.

FIG. 4 is a flowchart showing the procedure for creating the database 105.

When the database 105 is created, the robot executes a continuous sequence made up of many motions, with pauses of less than 1 second between individual motions, while the operating state acquisition unit 101 acquires the operating state and the sound data acquisition unit 103 acquires the sound data. In an "arm motion", the whole arm is moved randomly within the space it can reach; in a "leg motion", the robot steps in place and walks short distances; in a "head motion", the head is rotated randomly (elevation [-30°, 30°], azimuth [-90°, 90°]).

In step S1010 of FIG. 4, the operating state acquisition unit 101 acquires the operating state of the robot for a predetermined time.

As described above, the operating state of the robot is represented by the angle $\theta$, the angular velocity $\dot{\theta}$, and the angular acceleration $\ddot{\theta}$ of each joint motor. If the number of joints of the robot is $J$, the feature vector representing the operating state is

$$\mathbf{f}(k) = \left[\theta_1(k),\ \dot{\theta}_1(k),\ \ddot{\theta}_1(k),\ \ldots,\ \theta_J(k),\ \dot{\theta}_J(k),\ \ddot{\theta}_J(k)\right] \qquad (1)$$

where $k$ denotes time. The values of the angle $\theta$, the angular velocity $\dot{\theta}$, and the angular acceleration $\ddot{\theta}$ are acquired every 5 milliseconds and normalized to $[-1, 1]$.

Specifically, the robot uses two motors for head motion, five motors for the motion of each leg, and four motors for the motion of each arm. Since 20 motors are used in total, $J$ is 20.
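As a rough illustration of how such a feature vector might be assembled, the sketch below concatenates the angle, angular velocity, and angular acceleration of each of the J = 20 joint motors and scales every value to [-1, 1]. Linear min-max scaling with clamping is an assumption, as are the limit values in `LIMITS` and the function names; the description only states that the values are normalized to [-1, 1].

```python
def normalize(value, lo, hi):
    """Linearly map value from [lo, hi] to [-1, 1], clamping out-of-range input."""
    value = max(lo, min(hi, value))
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def feature_vector(joints, limits):
    """Build a 3J-dimensional feature vector.

    joints: one (angle, velocity, acceleration) tuple per joint motor.
    limits: (lo, hi) ranges for angle, velocity, acceleration (hypothetical).
    """
    vec = []
    for state in joints:
        for value, (lo, hi) in zip(state, limits):
            vec.append(normalize(value, lo, hi))
    return vec


J = 20  # 2 head motors + 2 * 5 leg motors + 2 * 4 arm motors
LIMITS = ((-3.14, 3.14), (-10.0, 10.0), (-100.0, 100.0))  # hypothetical ranges

joints = [(0.0, 5.0, -100.0)] * J  # hypothetical readings, identical per joint
vec = feature_vector(joints, LIMITS)
```

The resulting vector has 3J = 60 components, which matches the 3J-dimensional space used for the template search.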

In step S1020 of FIG. 4, the sound data acquisition unit 103 acquires sound data for a predetermined time. Specifically, it acquires the sound data corresponding to the robot operating state described above, that is, the sound data corresponding to motor noise, and represents it by the frequency spectrum

$$\mathbf{D}(k) = \left[D(1, k),\ D(2, k),\ \ldots,\ D(F, k)\right] \qquad (2)$$

where $k$ denotes time and $F$ denotes the frequency range. The frequency range is 0 kHz to 8 kHz divided into 256 bins. Sound data is acquired every 10 milliseconds. As noted above, the sound data acquisition unit 103 may use, for example, data whose background noise has been reduced with MCRA as the sound data corresponding to motor noise.

In step S1030 of FIG. 4, the database creation unit 111 receives the feature vector $\mathbf{f}(k)$ representing the operating state from the operating state acquisition unit 101, and receives the frequency spectrum $\mathbf{D}(k)$ of the sound data corresponding to that operating state from the sound data acquisition unit 103.

The feature vectors representing the operating state and the frequency spectra of the sound data carry time tags. A template is therefore generated by combining a feature vector and a frequency spectrum whose time tags match. The database 105 shown in FIG. 3 is created as the set of templates generated in this way. As described above, the period of the feature vector representing the operating state is 5 milliseconds, so the period of a template is also 5 milliseconds.
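The pairing by time tag can be sketched as follows. This is a simplified illustration, assuming exact tag equality and integer millisecond tags; the function name `build_templates` and the sample data are hypothetical, and the text does not specify how tags from streams with different periods are reconciled.

```python
def build_templates(features, spectra):
    """Pair feature vectors and spectra whose time tags match exactly.

    features, spectra: dicts mapping a time tag (ms) to a feature vector
    or a frequency spectrum. Returns a list of (feature, spectrum) templates.
    """
    common_tags = sorted(features.keys() & spectra.keys())
    return [(features[tag], spectra[tag]) for tag in common_tags]


# Hypothetical streams: feature vectors every 5 ms, spectra every 10 ms.
features = {0: [0.0, 0.1], 5: [0.1, 0.2], 10: [0.2, 0.3], 15: [0.3, 0.4]}
spectra = {0: [1.0, 2.0], 10: [1.5, 2.5], 20: [2.0, 3.0]}

templates = build_templates(features, spectra)
```

Only tags present in both streams yield a template, so here the tags 0 and 10 produce two templates and the unmatched entries are dropped.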

In the present invention, a large number of templates, each representing the sound data (acoustic signal) produced in one of the robot's various operating states during a unit time (5 milliseconds in the example above), are collected to build the database. Then, while the robot is operating, the sound data of the stored template whose operating state is closest to the operating state acquired at a given moment is used as the predicted value of the sound data for that moment. The invention thus rests on the idea that a large set of such unit-time templates can cover the prediction of the sound data for any operating state of the robot. This idea is based on the following assumptions.

1) The motor noise at a given moment depends on the angle, angular velocity, and angular acceleration of the particular motor.
2) At any moment, similar combinations of operating states (joint states) produce similar noise frequency spectra.
3) At any moment, the superposition of the noises of the individual joint motors equals the total noise at that moment.

In step S1040 of FIG. 4, it is determined whether enough templates have been stored in the database 105. For example, the database may be judged to hold enough templates once templates have been stored for a motion sequence that includes a series of "arm motions", a series of "leg motions", and a series of "head motions".

With the above procedure, a database that can be used to predict the sound data for any operating state of the robot can be created easily.

FIG. 5 is a flowchart showing the procedure for noise reduction using template subtraction.

In step S2010 of FIG. 5, the operating state acquisition unit 101 acquires the operating state (feature vector) of the robot.

In step S2020 of FIG. 5, the sound data acquisition unit 103 acquires sound data.

In step S2030 of FIG. 5, the database search unit 107 receives the acquired operating state (feature vector) from the operating state acquisition unit 101 and searches the database 105 for the template whose operating state is closest to the acquired operating state.

Here, if the number of joints of the robot is $J$, the feature vector of an operating state corresponds to a point in a $3J$-dimensional space. Let $\mathbf{f}_i$ denote the feature vector of the operating state of an arbitrary template in the database 105, and let $\mathbf{f}(k)$ denote the feature vector of the acquired operating state. Searching for the template whose operating state is closest to the acquired one then amounts to finding the template whose feature vector $\mathbf{f}_i$ minimizes the distance in the $3J$-dimensional Euclidean space

$$d\left(\mathbf{f}_i, \mathbf{f}(k)\right) = \left\lVert \mathbf{f}_i - \mathbf{f}(k) \right\rVert .$$
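The nearest-template search described above can be sketched as a plain linear scan over the database using the Euclidean distance. The template layout (a `(feature_vector, spectrum)` pair), the function name `nearest_template`, and the three-component sample vectors are assumptions made for illustration; the description does not state how the search is implemented or accelerated.

```python
import math


def nearest_template(database, query):
    """Return the (feature_vector, spectrum) template closest to query.

    Closeness is the Euclidean distance between feature vectors; the
    database is scanned linearly.
    """
    return min(database, key=lambda template: math.dist(template[0], query))


# Hypothetical database with 3-component feature vectors for brevity.
database = [
    ([0.0, 0.0, 0.0], [0.1, 0.2]),
    ([0.5, 0.5, 0.5], [0.3, 0.4]),
    ([1.0, -1.0, 0.0], [0.5, 0.6]),
]

feature, spectrum = nearest_template(database, [0.4, 0.6, 0.5])
```

A linear scan costs time proportional to the number of templates per query; with a large database, a spatial index such as a k-d tree might be preferable, though that is outside what the text describes.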

In step S2040 of FIG. 5, the template subtraction unit 109 receives the acquired sound data from the sound data acquisition unit 103 and the frequency spectrum of the retrieved template from the database search unit 107. The template subtraction unit 109 then subtracts the template's frequency spectrum, which is the predicted value of the motor noise, from the frequency spectrum of the acquired sound data. Denoting the frequency spectrum of the acquired sound data by $X(\omega, k)$ and that of the template by $\hat{D}(\omega, k)$, the frequency spectrum of the useful signal $\hat{S}(\omega, k)$ is obtained as

$$\hat{S}(\omega, k) = X(\omega, k) - \hat{D}(\omega, k) \qquad (3)$$

Because the template spectrum $\hat{D}(\omega, k)$ is only a predicted value of the motor noise, the useful-signal spectrum $\hat{S}(\omega, k)$ still contains residual motor noise. Template subtraction may therefore instead be performed with a modified expression, Equation (4), that has two parameters. Here, $\alpha$ is an overestimation factor that allows a trade-off between perceptual signal distortion and the noise reduction level, and $\beta$ is the spectral floor, which reduces the influence of sharp peaks and valleys in the frequency spectrum. As an example, specific values are assigned to $\alpha$ and $\beta$.
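One common form of spectral subtraction with an overestimation factor and a spectral floor is sketched below, to show how α and β typically enter such a rule. This is an assumption, not the expression from the patent, which is not reproduced in this text: each bin is set to the observed value minus α times the template value, unless that falls below β times the template value, in which case the floor is used. The function name, the per-bin magnitude representation, and the values of `alpha` and `beta` are hypothetical.

```python
def template_subtract(observed, template, alpha, beta):
    """Per-bin subtraction with overestimation factor alpha and floor beta.

    observed, template: magnitude spectra of equal length.
    Each output bin is max(x - alpha * d, beta * d).
    """
    result = []
    for x, d in zip(observed, template):
        residual = x - alpha * d  # subtract the (scaled) predicted noise
        floor = beta * d          # lower bound tied to the noise estimate
        result.append(residual if residual > floor else floor)
    return result


# Hypothetical 3-bin spectra and parameter values.
observed = [1.0, 0.5, 0.2]
template = [0.4, 0.4, 0.4]
cleaned = template_subtract(observed, template, alpha=1.0, beta=0.1)
```

With these numbers, the third bin would become negative after subtraction, so it is replaced by the floor value instead. Raising `alpha` removes more noise at the cost of more signal distortion, which is the trade-off the description attributes to α.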

In step S2050 of FIG. 5, the output selection unit 127 receives information on the robot's operating state from the operating state acquisition unit 101 of the template subtraction noise reduction unit 100 and, based on that information, selects the output of either the template subtraction noise reduction unit 100 or the multi-channel noise reduction unit 121. Let $\dot{\theta}_{\mathrm{head}}$ be the velocity of the head's up-down or left-right swinging motion, or of its tilting motion. When

$$\left|\dot{\theta}_{\mathrm{head}}\right| > \varepsilon \qquad (5)$$

holds, the output of the template subtraction noise reduction unit 100 may be selected; otherwise, the output of the multi-channel noise reduction unit 121 may be selected. The reason for selecting the output of the template subtraction noise reduction unit 100 when head motion is present is explained later. The positive constant $\varepsilon$ is used in relation (5) instead of 0 to handle the situation in which the head has stopped moving but the angle sensor of a joint motor still reports very small position differences. After selecting an output, the output selection unit 127 sends it to the speech recognition unit 129, which performs speech recognition.
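The selection rule can be sketched as a threshold test on the head joint velocities. Treating the head as moving when any head-motor velocity exceeds ε in magnitude is an assumption of this sketch, as are the function name `select_output`, the default value of `epsilon`, and the placeholder outputs.

```python
def select_output(head_velocities, ts_output, gss_output, epsilon=0.01):
    """Pick the template-subtraction output while the head is moving.

    head_velocities: angular velocities of the head joint motors.
    epsilon: small positive threshold (hypothetical value) that ignores
    sensor jitter reported after the head has actually stopped.
    """
    head_moving = any(abs(v) > epsilon for v in head_velocities)
    return ts_output if head_moving else gss_output


# Hypothetical usage with placeholder outputs.
moving = select_output([0.3, 0.0], "template subtraction", "multi-channel")
stopped = select_output([0.0, 0.002], "template subtraction", "multi-channel")
```

The second call shows why ε is positive: a residual velocity of 0.002 from sensor jitter is below the threshold, so the head is treated as stationary and the multi-channel output is chosen.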

A speech recognition experiment using the speech processing device of this embodiment is described below. In the experiment, the input to speech recognition is a noise signal, consisting of self-noise and background noise, mixed with clean speech utterances. The utterances were taken from a Japanese word database containing 236 words spoken by four male and four female speakers. The loudspeaker is fixed at the front (0°) throughout the experiment. The recordings were made in a room measuring 4.0 m × 7.0 m × 3.0 m with a reverberation time of 0.2 seconds. Speech recognition results are reported as word correct rates (WCR).

Before the noise and the speech were mixed, the clean speech was amplified on the basis of the segmental SNR [equation not reproduced in this text] so as to obtain exact amounts of noise and speech energy for the various SNR (signal-to-noise ratio) conditions.

Here, J is the number of segments containing speech activity, and s(n) and d(n) are the n-th discrete speech sample and noise sample, respectively.
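As an illustration of how clean speech can be scaled to a target segmental SNR, the sketch below assumes the usual definition: the average, over the J speech-active segments, of 10·log10 of the ratio of speech energy to noise energy within each segment. The segment length `seg_len`, the per-segment activity flags `active`, and both function names are hypothetical. The sketch also assumes that every active segment has non-zero speech and noise energy.

```python
import math

def segmental_snr_db(speech, noise, seg_len, active):
    """Mean per-segment SNR in dB over segments flagged as speech-active."""
    per_segment = []
    for j, is_active in enumerate(active):
        if not is_active:
            continue  # only the J speech-active segments contribute
        s = speech[j * seg_len:(j + 1) * seg_len]
        d = noise[j * seg_len:(j + 1) * seg_len]
        speech_energy = sum(x * x for x in s)
        noise_energy = sum(x * x for x in d)
        per_segment.append(10.0 * math.log10(speech_energy / noise_energy))
    return sum(per_segment) / len(per_segment)

def gain_for_target_snr(speech, noise, seg_len, active, target_db):
    """Amplitude gain for the speech that yields the target segmental SNR."""
    current_db = segmental_snr_db(speech, noise, seg_len, active)
    # Scaling the speech by g shifts every segment's SNR by 20*log10(g) dB.
    return 10.0 ** ((target_db - current_db) / 20.0)
```

Because a single gain shifts every segment by the same number of decibels, multiplying the clean speech by the returned gain before mixing gives the target segmental SNR under this definition.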

FIG. 6 shows the relationship between the SNR (signal-to-noise ratio) and the word correct rate (WCR) in the presence of “head movement” noise. It plots the WCR obtained without any noise reduction unit, with the multi-channel noise reduction unit, and with the template subtraction noise reduction unit. At every SNR, the WCR obtained with the template subtraction noise reduction unit is higher than in the other two cases.

FIG. 7 shows the relationship between the SNR and the WCR in the presence of “arm movement” noise. It plots the WCR obtained without any noise reduction unit, with the multi-channel noise reduction unit, and with the template subtraction noise reduction unit. Except at low SNR, the WCR obtained with the template subtraction noise reduction unit is lower than the WCR obtained with the multi-channel noise reduction unit.

FIG. 8 shows the relationship between the SNR and the WCR in the presence of both “head movement” and “arm movement” noise. It plots the WCR obtained without any noise reduction unit, with the multi-channel noise reduction unit, and with the template subtraction noise reduction unit. At every SNR, the WCR obtained with the template subtraction noise reduction unit is higher than in the other two cases.

Table 1 lists, for various types of noise at an SNR of −5 dB, the WCR obtained without any noise reduction unit, with the multi-channel noise reduction unit, and with the template subtraction noise reduction unit. The WCR values in Table 1 are given in percent.

As can be seen from FIGS. 6 to 8 and Table 1, in the presence of “head movement” noise or “head and arm movement” noise, the WCR obtained with the template subtraction noise reduction unit is higher than the WCR obtained with the multi-channel noise reduction unit. In contrast, in the presence of “arm movement” noise, the WCR obtained with the multi-channel noise reduction unit is higher than the WCR obtained with the template subtraction noise reduction unit.

The reason is as follows. The arms are located far from the loudspeaker placed in front of the robot. Therefore, in the presence of “arm movement” noise, noise reduction by sound source separation using the multiple microphones mounted on the robot's head works particularly efficiently in the multi-channel noise reduction unit. Likewise, in the presence of “leg movement” noise, noise reduction by sound source separation using multiple microphones works efficiently.

In the presence of “head movement” noise, on the other hand, the noise from the head motors propagates inside the head, close to the microphones, with strong reverberation. A strong noise source in the near field of the microphones makes the propagation pattern extremely complex. As a result, this noise source degrades the quality of sound source separation in the multi-channel noise reduction unit.

The template subtraction noise reduction unit, in contrast, does not model the noise on the basis of its directivity or diffuseness. Instead, it uses a template predicted from the database on the basis of the operating state. Consequently, in the presence of “head movement” noise, it yields better results than sound source separation.
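The template prediction described above can be sketched as a nearest-neighbour lookup. The sketch assumes that each database entry pairs an operating state, given as (angle, angular velocity, angular acceleration), with a noise spectrum, and that closeness is measured by plain, unnormalized Euclidean distance. The distance measure, the data layout, and the function name `nearest_template` are assumptions made for illustration.

```python
def nearest_template(state, templates):
    """Return the (state, spectrum) entry whose state is closest to `state`.

    `templates` is a list of (state_vector, noise_spectrum) pairs, where a
    state vector is (angle, angular_velocity, angular_acceleration).
    Assumption: squared Euclidean distance without feature normalization.
    """
    def squared_distance(entry):
        stored_state = entry[0]
        return sum((a - b) ** 2 for a, b in zip(stored_state, state))
    return min(templates, key=squared_distance)
```

The spectrum of the returned entry would then serve as the noise estimate that is subtracted from the observed spectrum. In practice the three quantities have different units, so some form of scaling would probably be needed before distances are compared.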

These results show that the template subtraction noise reduction unit is advantageous in the presence of “head movement” noise or “head and arm movement” noise, whereas the multi-channel noise reduction unit is advantageous in the presence of “arm movement” noise, “leg movement” noise, or a combination of the two. The output selection unit 127 therefore selects the output of the template subtraction noise reduction unit 100 when there is head movement and the output of the multi-channel noise reduction unit 121 when there is none.

DESCRIPTION OF SYMBOLS: 100 ... template subtraction noise reduction unit, 101 ... operating state acquisition unit, 103 ... sound data acquisition unit, 105 ... database, 107 ... database search unit, 109 ... template subtraction unit, 111 ... database creation unit

Claims

1. A sound data processing device for processing sound data including a signal sound so as to reduce noise generated by a mechanical device, the device comprising:
an operating state acquisition unit that acquires an operating state of the mechanical device;
a sound data acquisition unit that acquires sound data corresponding to the acquired operating state;
a database that stores, as templates, various operating states of the mechanical device per unit time and the corresponding sound data;
a database search unit that searches the database for the template of the operating state closest to the acquired operating state; and
a template subtraction unit that subtracts the sound data of the template of the operating state closest to the acquired operating state from the acquired sound data, thereby reducing the noise generated by the mechanical device.

2. The sound data processing device according to claim 1, further comprising a multi-channel noise reduction unit that reduces noise based on multi-channel sound data, and an output selection unit that selects, based on the acquired operating state, either the output of the template subtraction unit or the output of the multi-channel noise reduction unit.

3. The sound data processing device according to claim 1, wherein the mechanical device is a robot.

4. The sound data processing device according to claim 3, wherein the operating state of the robot is represented by data on the angle, angular velocity, and angular acceleration of a joint motor.

5. The sound data processing device according to claim 4, wherein the template of the operating state closest to the acquired operating state is determined by a distance in a three-dimensional space formed by the motor angle, angular velocity, and angular acceleration data.

6. The sound data processing device according to claim 1, wherein the sound data of the template is represented by a frequency spectrum.

7. The sound data processing device according to claim 1, further comprising a database creation unit that creates the database by collecting various operating states of the mechanical device per unit time and the sound data corresponding to those operating states.

8. The sound data processing device according to claim 1, wherein the sound data processing device is used for speech recognition.

9. A sound data processing method for processing sound data including a signal sound so as to reduce noise generated by a mechanical device, the method comprising:
acquiring an operating state of the mechanical device;
acquiring sound data corresponding to the acquired operating state;
searching a database, which stores as templates various operating states of the mechanical device per unit time and the corresponding sound data, for the sound data of the template of the operating state closest to the acquired operating state; and
subtracting the sound data of the template of the operating state closest to the acquired operating state from the acquired sound data to obtain an output in which the noise generated by the mechanical device is reduced.

10. The sound data processing method according to claim 9, further comprising selecting, based on the acquired operating state, either the output in which noise has been reduced by template subtraction or an output in which noise has been reduced using multi-channel sound data.