JP2012058314A - Acoustic processing system and machine employing the same

Info

Publication number: JP2012058314A
Application number: JP2010198815A
Authority: JP
Inventors: Yohei Kawaguchi; 洋平川口; Masato Togami; 真人戸上
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-09-06
Filing date: 2010-09-06
Publication date: 2012-03-22
Anticipated expiration: 2030-09-06
Also published as: JP5451562B2

Abstract

Problem: To provide an acoustic processing system that, for the safety of people around a machine, extracts the voice of a person at the position to be extracted and instantaneously extracts voices useful for avoiding danger.
Solution: The acoustic processing system includes a sound input unit 201 comprising a plurality of microphones that pick up sound; a risk level calculation unit 206 that calculates a risk level associated with contact with a surrounding person or object caused by operation of the machine; a sound extraction unit 203 that takes the signal output from the sound input unit 201 as input and outputs a separated signal according to the risk level calculated by the risk level calculation unit 206; and a sound output unit 219 that outputs the separated signal output from the sound extraction unit 203.
Selected drawing: FIG. 2

Description

The present invention relates to acoustic processing technology suited to letting an operator or driver of a relatively large machine, such as a construction machine, vehicle, or work machine, grasp the situation of people around the machine. In particular, it relates to an acoustic processing system suited to the safety of people around the machine, and to technology that is effective when applied to a machine using that system.

In relatively large machines such as construction machines, vehicles, and work machines, the operator or driver (hereinafter, the operator) must constantly grasp the situation of people around the machine and avoid danger as it arises, for the safety of those people. One important source of information about the situation of people around the machine is the voices they utter.

Suppose that microphones are installed outside the machine to pick up the voices of surrounding people, and that the picked-up sound is presented to the operator so that the operator can grasp their situation. The sound picked up by the microphones contains not only the voices of surrounding people but also engine noise, machine drive noise, excavation noise, and other sounds that accompany machine operation. It is therefore necessary to extract only the voices of surrounding people from the picked-up sound and present them to the operator.

Sound source separation using a plurality of microphones (a microphone array) makes it possible to extract only the sound arriving from a specific position. However, the following two problems remain.

First, sound source separation requires that the position from which the voice is to be extracted, that is, the position of the person, be specified. For example, a sound source separation method based on position estimation that assumes sparsity (for example, Patent Document 1) adapts its filter by treating the specified extraction position as the target source position and all other positions as interfering source positions. The position must therefore be specified. There is also a technique called blind source separation, which extracts the sound of each source without the source positions being specified, but even then the problem remains of deciding which of the multiple resulting acoustic signals is the one that should be extracted.

Second, there is a trade-off between the "accuracy" of sound source separation and the filter adaptation time. Accuracy here means how close the extracted sound is to the sound of the original target source. In general, adaptive methods for high-accuracy extraction (for example, the independent component analysis of Non-Patent Document 1) cannot adapt their filters from an instantaneous input signal alone, so the operator cannot grasp the situation of surrounding people and decide whether to avoid danger. (Hereinafter, "instantaneous" means a time sufficiently shorter than the time from when the sound is presented until the operator carries out a danger-avoidance action.)

On the other hand, there are sound source separation algorithms that can extract sound using only an instantaneous input signal (for example, the binary masking of Non-Patent Document 2). Their accuracy is generally low and noise leaks into the output, so it is difficult for the operator to make out what the surrounding people are saying. There is also the problem that the operator is constantly exposed to residual noise that was not separated out.

In addition, to reconcile real-time processing with separation accuracy, there is a method that selects, depending on the situation, between the independent component analysis described above and binary masking based on volume differences (for example, Patent Document 2). Patent Document 2 describes an embodiment in which the selection is made according to the degree of convergence of the separation matrix of the independent component analysis.
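As an illustration of why binary masking needs only the current frame, the following is a minimal sketch of a volume-difference mask for two microphone channels. It is a simplified stand-in, not the method of Non-Patent Document 2 or Patent Document 2; the function name `binary_mask_separate` and the two-channel setup are assumptions made for this example.

```python
import numpy as np

def binary_mask_separate(X1, X2):
    """Split two STFT channels with a volume-difference binary mask.

    X1, X2: complex arrays of shape (freq_bins, frames), one per microphone.
    Each time-frequency bin is assigned to the channel where it is louder,
    so no filter adaptation over past frames is required.
    """
    mask1 = np.abs(X1) >= np.abs(X2)   # True where channel 1 dominates
    Y1 = np.where(mask1, X1, 0.0)      # bins attributed to the source near mic 1
    Y2 = np.where(~mask1, X2, 0.0)     # bins attributed to the source near mic 2
    return Y1, Y2
```

Because every bin is kept or discarded outright, bins in which both sources are active are assigned wholly to one output, which is one way to see the low accuracy and residual noise described above.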

Patent Document 1: JP 2007-47427 A
Patent Document 2: JP 2007-33825 A

Non-Patent Document 1: T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, "Blind separation of binaural sound mixtures using SIMO-model-based independent component analysis," ICASSP 2004, vol. 4, pp. 113-116, 2004.
Non-Patent Document 2: O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, July 2004.
Non-Patent Document 3: M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs," ICASSP 2007, vol. I, pp. 117-120, 2007.

In Patent Document 2, the advantage of selecting on the basis of the degree of convergence is stability: the separation accuracy never falls below that of binary masking. In the present invention, where the safety of surrounding people is paramount, the more necessary danger avoidance is, the more instantaneous the extraction must be. This problem cannot be solved by the invention of Patent Document 2, which emphasizes the stability of separation accuracy. Nor can that invention solve the problem, described above, of specifying the position to be extracted.

The present invention has been made to solve these problems. Its representative object is to provide an acoustic processing system that, for the safety of people around a machine, extracts the voice of a person at the position to be extracted and instantaneously extracts voices useful for avoiding danger.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

A representative one of the inventions disclosed in this application is outlined briefly as follows.

A representative acoustic processing system comprises: a sound input unit comprising a plurality of microphones that pick up sound; a risk level calculation unit that calculates a risk level associated with contact with a surrounding person or object caused by operation of the machine; a sound extraction unit that takes the signal output from the sound input unit as input and outputs a separated signal according to the risk level calculated by the risk level calculation unit; and a sound output unit that outputs the separated signal output from the sound extraction unit. The system may further have the following features.

The sound extraction unit consists of a plurality of sound source separation units, each of which takes one of the positions of relatively high risk as its extraction position. Each sound source separation unit uses a method capable of instantaneous extraction when the risk level of its extraction position is high, and a method capable of high-accuracy extraction when the risk level of its extraction position is low.
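The assignment of extraction positions and methods to the sound source separation units can be sketched as follows. This is only an illustration under stated assumptions: the names `SeparationUnit` and `assign_units`, the single threshold `high_risk_threshold`, and the method labels are hypothetical and do not appear in the source, which does not specify how "high" and "low" risk are delimited.

```python
from dataclasses import dataclass

@dataclass
class SeparationUnit:
    position: tuple   # extraction position assigned to this unit
    risk: float       # risk level at that position
    method: str       # "instantaneous" or "high_accuracy" (hypothetical labels)

def assign_units(risk_by_position, num_units, high_risk_threshold):
    """Give each unit one of the riskiest positions and pick its method."""
    # Rank positions from highest to lowest risk.
    ranked = sorted(risk_by_position.items(), key=lambda kv: kv[1], reverse=True)
    units = []
    for position, risk in ranked[:num_units]:
        # High risk: favor speed. Lower risk: favor separation accuracy.
        if risk >= high_risk_threshold:
            method = "instantaneous"
        else:
            method = "high_accuracy"
        units.append(SeparationUnit(position, risk, method))
    return units
```

For example, with two units and a threshold of 0.8, positions with risk levels 0.9, 0.5, and 0.2 would yield one instantaneous unit at the 0.9 position and one high-accuracy unit at the 0.5 position.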

The risk level is calculated from the motion state of the machine and the result of detecting the positions of people. The motion state of the machine is estimated by a machine motion state estimation unit from information from sensors installed on the work machine or from machine operation signals. People are detected by combining the result of speech/non-speech discrimination with the result of video-based moving object detection. Speech/non-speech discrimination is realized by a sound source position estimation unit that estimates sound source positions from the signal output by the sound input unit, and a speech/non-speech discrimination unit that discriminates speech from non-speech on the basis of the sound source positions output by the sound source position estimation unit. Moving object detection is realized by a video input unit comprising one or more cameras, such as visible light cameras or infrared cameras, and a moving object detection unit that detects moving objects on the basis of the video output by the video input unit. In addition, according to the risk level of each position, the sound source position estimation unit changes its estimation method and the moving object detection unit changes its detection method.

The system further has a video output unit that displays video according to the risk level; an external output sound generation unit that generates, on the basis of the risk level, output sound directed outside the machine; an external sound output unit that outputs the sound generated by the external output sound generation unit; and a machine control unit that controls the operation of the machine on the basis of the risk level.

The effect obtained by a representative one of the inventions disclosed in this application is described briefly as follows.

The representative acoustic processing system makes it possible, for the safety of people around a machine, to extract the voice of a person at the position to be extracted and to instantaneously extract voices useful for avoiding danger.

FIG. 1 is a diagram showing an example of the hardware configuration of the acoustic processing system according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the block configuration of the acoustic processing system according to Embodiment 1 of the present invention.
FIG. 3 is a diagram showing an example of the block configuration of the sound input unit shown in FIG. 2.
FIG. 4 is a diagram showing an example of the block configuration of the sound source position estimation unit shown in FIG. 2.
FIG. 5 is a diagram showing an example of the block configuration of the moving object detection unit shown in FIG. 2.
FIG. 6 is a diagram showing an example of the block configuration of the sound extraction unit shown in FIG. 2.
FIG. 7 is a diagram showing an example of the data structure of the frequency-domain signal Xf(f, τ) in a given frame τ in FIG. 2.
FIG. 8 is a diagram showing an example of the block configuration in FIG. 2 when method 2 selected by a sound source separation unit is a minimum variance beamformer adapted on the basis of sparsity.
FIG. 9 is a flowchart showing an example of the processing flow of the sound extraction unit shown in FIG. 2.
FIG. 10 is a diagram showing an example of the block configuration of the acoustic processing system according to Embodiment 3 of the present invention.
FIG. 11 is a diagram showing an example of the block configuration of the acoustic processing system according to Embodiment 4 of the present invention.
FIG. 12 is a flowchart showing an example of the SPIRE algorithm in the sound source position estimation unit shown in FIG. 2.
FIG. 13 is a diagram showing an example of the exterior when the acoustic processing system according to Embodiment 1 of the present invention is applied to a construction machine.

Embodiments of the present invention are described in detail below with reference to the drawings, taking as an example an acoustic processing system integrated with a construction machine. In all drawings for describing the embodiments, the same members are in principle given the same reference numerals, and repeated description of them is omitted.

<Embodiment 1>
Embodiment 1 of the present invention is described below with reference to FIGS. 1 to 9, 12, and 13.

FIG. 1 is a diagram showing an example of the hardware configuration of the acoustic processing system according to Embodiment 1 of the present invention.

The hardware of the acoustic processing system 100 according to the present embodiment comprises microphone arrays 101-1 to 101-M, a speaker array 102-1 to 102-S, visible light cameras 103-1 to 103-A, infrared cameras 104-1 to 104-B, a microphone 105, headphones 106, an A/D-D/A converter 107, a central processing unit 108, a volatile memory 109, a storage medium 110, an image display device 111, audio cables 114-1 to 114-M, 115-1 to 115-S, 116, and 117, a monitor cable 118, digital cables 119, 120-1 to 120-A, and 121-1 to 121-B, and so on. The acoustic processing system 100 is integrated with a construction machine comprising a work machine 112, a machine operation input unit 113, and so on.

The microphone arrays 101-1 to 101-M are a group of microphones mounted on the outside of the construction machine, each array consisting of N microphones. The speaker array 102-1 to 102-S is a group of S speakers 102-1 to 102-S mounted on the outside of the construction machine.

The visible light cameras 103-1 to 103-A are a group of visible light cameras mounted on the outside of the construction machine. The infrared cameras 104-1 to 104-B are a group of infrared cameras mounted on the outside of the construction machine.

The microphone 105 is a microphone worn by the operator. The headphones 106 are headphones worn by the operator.

The A/D-D/A converter 107 converts the signals output from the microphone arrays 101-1 to 101-M and the signal output from the microphone 105 into digital data, and at the same time outputs analog sound pressure signals to the speaker array 102-1 to 102-S and the headphones 106.

The central processing unit 108 processes the output of the A/D-D/A converter 107. The volatile memory 109 temporarily stores data used in the arithmetic processing of the central processing unit 108. The storage medium 110 stores information such as programs. The image display device 111 displays information and images from the arithmetic processing of the central processing unit 108.

The audio cables 114-1 to 114-M connect the microphone arrays 101-1 to 101-M to the A/D-D/A converter 107. The audio cables 115-1 to 115-S connect the speaker array 102-1 to 102-S to the A/D-D/A converter 107. The audio cable 116 connects the microphone 105 to the A/D-D/A converter 107. The audio cable 117 connects the headphones 106 to the A/D-D/A converter 107.

The monitor cable 118 connects the image display device 111 to the central processing unit 108.

The digital cable 119 connects the A/D-D/A converter 107 to the central processing unit 108. The digital cables 120-1 to 120-A connect the visible light cameras 103-1 to 103-A to the central processing unit 108. The digital cables 121-1 to 121-B connect the infrared cameras 104-1 to 104-B to the central processing unit 108.

The work machine 112 is a construction machine having an arm and the like. The machine operation input unit 113 is the part through which the various operations of the construction machine are input.

The hardware of the acoustic processing system 100 configured as described above operates as follows.

The sound pressure data output by the microphone arrays 101-1 to 101-M is sent to the A/D-D/A converter 107 over the audio cables 114-1 to 114-M. The A/D-D/A converter 107 converts the sound pressure data from each of the microphone arrays 101-1 to 101-M into digital sound pressure data, with the conversion timing synchronized across the signals. The converted digital sound pressure data is sent over the digital cable 119 to the central processing unit 108, which applies acoustic signal processing to it. The processed digital sound pressure data is sent back to the A/D-D/A converter 107 over the digital cable 119, converted into analog sound pressure data, and output from the headphones 106 over the audio cable 117.

The digital sound pressure data X picked up by the microphone arrays 101-1 to 101-M and sent to the central processing unit 108 contains a mixture of the voices of workers outside the work machine 112 and noise emitted by the work machine 112, such as engine noise and arm drive noise. The central processing unit 108 calculates a risk level H for each position on the basis of the digital sound pressure data X, the image data VI obtained from the visible light cameras 103-1 to 103-A, the image data II obtained from the infrared cameras 104-1 to 104-B, the operation signal obtained from the machine operation input unit 113, and the speed information held by the work machine 112. The risk level H is stored in the volatile memory 109. On the basis of the risk level H, the central processing unit 108 changes the sound source position estimation method and the moving object detection method. It also takes positions of relatively high risk as sound extraction positions; among these, it extracts sound from positions of particularly high risk using a method capable of instantaneous extraction, and from positions of low risk using a method capable of high-accuracy extraction. The extracted signal Y is sent over the digital cable 119 to the A/D-D/A converter 107, converted into an analog signal, and output from the headphones 106 over the audio cable 117.

The per-position risk level H stored in the volatile memory 109 is converted into an image by the central processing unit 108 and output from the image display device 111 over the monitor cable 118.

The voice signal picked up by the microphone 105 is sent over the audio cable 116 to the A/D-D/A converter 107, converted into digital sound pressure data, and input to the central processing unit 108 over the digital cable 119. Directional filters for the speaker array 102-1 to 102-S are stored in advance in the storage medium 110, one for each position toward which the directivity is steered. The central processing unit 108 selects the directional filter that steers the directivity toward a position where the risk level H is relatively high and convolves it with the digital sound pressure data to generate multi-channel digital signal data. This multi-channel digital signal data is input to the A/D-D/A converter 107 over the digital cable 119, converted into multi-channel analog signals, and output from the speaker array 102-1 to 102-S over the audio cables 115-1 to 115-S.
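The filter selection and convolution step can be sketched as below. This is a minimal sketch under assumptions: the function name `render_to_speakers`, the dictionary layout of the stored filters, and the rule of choosing the single highest-risk position are all hypothetical, since the source says only that a position of relatively high risk is chosen. Each directional filter is modeled as one FIR filter per speaker.

```python
import numpy as np

def render_to_speakers(voice, filters_by_position, risk_by_position):
    """Steer the operator's voice toward the riskiest position.

    voice: 1-D array of digital sound pressure samples from the operator.
    filters_by_position: maps a position to an array of shape (S, L),
        one length-L FIR filter per speaker (assumed storage layout).
    risk_by_position: maps a position to its risk level H.
    """
    # Assumed selection rule: take the position with the highest risk level.
    target = max(risk_by_position, key=risk_by_position.get)
    filters = filters_by_position[target]
    # Convolve the voice with each speaker's filter to get S output channels.
    channels = np.stack([np.convolve(voice, h) for h in filters])
    return target, channels
```

Each row of `channels` would then be sent to one speaker of the array after D/A conversion.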

The central processing unit 108 controls the work machine 112 according to the risk level H, including the type and speed of its movement and the type and speed of its operations.

A USB cable or the like is used for the digital cable 119. USB cables, LAN cables, or the like are used for the digital cables 120-1 to 120-A and 121-1 to 121-B.

FIG. 13 is a diagram showing an example of the exterior when the acoustic processing system 100 according to the present embodiment is applied to a construction machine. It is a schematic view of the construction machine seen from above.

In the example of FIG. 13, the construction machine comprises a cabinet 13001, an engine section 13002, an arm section 13003, and so on. The microphone arrays 101-1 to 101-4 are placed at the four outer corners of the construction machine. The operator works the machine from inside the cabinet 13001.

Without the present invention, for example, external sound is barely audible inside the cabinet 13001. Moreover, the construction machine itself has noise sources such as the engine section 13002 and the arm section 13003, so even if the sound picked up by the microphone arrays 101-1 to 101-4 is listened to as is, the voices of surrounding people are buried in that noise and barely audible. The present invention solves these problems.

FIG. 2 is a diagram showing an example of the block configuration of the acoustic processing system 100 according to the present embodiment. The block configuration shown in FIG. 2 is a software functional configuration realized when the central processing unit 108 shown in FIG. 1 reads and executes a program stored in the storage medium 110. Some of the components, however, include parts of the hardware configuration shown in FIG. 1.

The acoustic processing system 100 according to the present embodiment comprises: a sound input unit 201; a sound source position estimation unit 202 connected to the sound input unit 201; a sound extraction unit 203 connected to the sound input unit 201; a speech/non-speech discrimination unit 204 connected to the sound source position estimation unit 202; a person detection unit 205 connected to the speech/non-speech discrimination unit 204; a risk level calculation unit 206 that is connected to the person detection unit 205 and feeds the sound source position estimation unit 202 and the sound extraction unit 203; a machine sensor input unit 207; a machine motion state estimation unit 209 that is connected to the machine sensor input unit 207 and feeds the risk level calculation unit 206; a visible light input unit 210; an infrared input unit 211; a moving object detection unit 212 that is connected to the visible light input unit 210, the infrared input unit 211, and the risk level calculation unit 206 and feeds the person detection unit 205; a video output unit 213 connected to the person detection unit 205 and the risk level calculation unit 206; an operator voice input unit 215; an external output sound generation unit 216 connected to the operator voice input unit 215 and the risk level calculation unit 206; an external sound output unit 217 connected to the external output sound generation unit 216; a machine operation control unit 218 connected to the risk level calculation unit 206; a sound output unit 219 connected to the sound extraction unit 203; and a machine operation input unit 221 that feeds the machine motion state estimation unit 209.
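The connections listed above can be summarized as a small directed graph, which makes it easy to check what feeds each block. The sketch below is an interpretation, not part of the source: the edge directions are inferred from the connection wording, and the block identifiers and the helper `inputs_of` are hypothetical names chosen for this example.

```python
# Directed edges: source block -> blocks it feeds.
# Directions are inferred from the description; they are an assumption.
EDGES = {
    "sound_input_201": ["source_localization_202", "sound_extraction_203"],
    "source_localization_202": ["speech_discrimination_204"],
    "speech_discrimination_204": ["person_detection_205"],
    "person_detection_205": ["risk_calculation_206", "video_output_213"],
    "risk_calculation_206": [
        "source_localization_202",
        "sound_extraction_203",
        "moving_object_detection_212",
        "video_output_213",
        "external_sound_generation_216",
        "machine_control_218",
    ],
    "machine_sensor_input_207": ["motion_state_estimation_209"],
    "machine_operation_input_221": ["motion_state_estimation_209"],
    "motion_state_estimation_209": ["risk_calculation_206"],
    "visible_light_input_210": ["moving_object_detection_212"],
    "infrared_input_211": ["moving_object_detection_212"],
    "moving_object_detection_212": ["person_detection_205"],
    "operator_voice_input_215": ["external_sound_generation_216"],
    "external_sound_generation_216": ["external_sound_output_217"],
    "sound_extraction_203": ["sound_output_219"],
}

def inputs_of(block):
    """Return the sorted names of the blocks that feed the given block."""
    return sorted(src for src, dsts in EDGES.items() if block in dsts)
```

Under this reading, `inputs_of("risk_calculation_206")` lists the machine motion state estimation unit and the person detection unit, which matches the statement that the risk level is calculated from the machine's motion state and the detected positions of people.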

The speech/non-speech discrimination unit 204 and the machine motion state estimation unit 209 use the machine dimensions 208. The sound source position estimation unit 202 and the sound extraction unit 203 use information on the microphone arrangement 214. The moving object detection unit 212 uses the camera projection matrix 220.

The main software functions of the acoustic processing system 100 configured as described above (some components include hardware) are as follows.

The sound input unit 201 is a functional unit consisting of a plurality of microphones that pick up sound; details are given later with reference to FIG. 3. The sound source position estimation unit 202 is a functional unit that estimates the sound source position from the signal output by the sound input unit 201, or from the signal output by the sound extraction unit 203. It also changes its estimation method according to the per-position risk level output by the risk level calculation unit 206; details are given later with reference to FIG. 4. The sound extraction unit 203 is a functional unit that takes the signal output by the sound input unit 201 as input and outputs a separated signal according to the risk level calculated by the risk level calculation unit 206. It has a plurality of sound source separation units; each sets its extraction position according to the risk level and also changes its separation method according to the risk level. Details are given later with reference to FIG. 6.

The speech/non-speech discrimination unit 204 is a functional unit that discriminates speech from non-speech based on the sound source position output by the sound source position estimation unit 202. The person detection unit 205 is a functional unit that detects the position of a person based on the discrimination result output by the speech/non-speech discrimination unit 204. The person detection unit 205 also performs person detection based on the signal output by the moving object detection unit 212.

The risk level calculation unit 206 is a functional unit that calculates the risk of the machine's operation bringing it into contact with surrounding people or objects. It calculates a risk level for each position. It calculates the risk level based on the motion state output by the machine motion state estimation unit 209, and also based on the person position detection result output by the person detection unit 205. The machine motion state estimation unit 209 is a functional unit that estimates the motion state of the machine based on information from sensors installed on the machine or on machine operation signals.

The video input unit consists of the visible light input unit 210 and the infrared input unit 211, and is a functional unit made up of one or more cameras, each a visible light camera or an infrared camera. The moving object detection unit 212 is a functional unit that detects moving objects based on the video output by the video input unit. It also changes its detection method according to the per-position risk level output by the risk level calculation unit 206; details are given later with reference to FIG. 5. The video output unit 213 is a functional unit that displays video based on the risk level output by the risk level calculation unit 206.

The external output sound generation unit 216 is a functional unit that generates sound directed outside the machine based on the risk level output by the risk level calculation unit 206. The external sound output unit 217 is a functional unit that outputs the sound generated by the external output sound generation unit 216.

The machine operation control unit 218 is a functional unit that controls the operation of the machine based on the risk level output by the risk level calculation unit 206. The sound output unit 219 is a functional unit that outputs the separated signal output by the sound extraction unit 203.

The main software functional units of the acoustic processing system 100 are described in detail below.

FIG. 3 shows an example of the block configuration of the sound input unit 201. The sound input unit 201 comprises a multi-channel AD converter 301, a multi-channel frame processing unit 302, a multi-channel short-time frequency analysis unit 303, and the like. The multi-channel AD converter 301 is included in the A/D-D/A conversion device 107.

In the sound input unit 201, the multi-channel analog sound pressure data obtained from the microphone arrays 1011 to 101M is converted by the multi-channel AD converter 301 into digital sound pressure data x_11(t) to x_MN(t), where t is the discrete time in units of the sampling period. The converted digital sound pressure data x_11(t) to x_MN(t) is passed to the multi-channel frame processing unit 302.

In the multi-channel frame processing unit 302, the samples x_ij(t) from t = τs to t = τs + F_s − 1 are copied into Xf_ij(t, τ) for t = 0 to t = F_s − 1. Here τ is the frame index; it is incremented by 1 each time the processing from the multi-channel frame processing unit 302 through the sound output unit 219 completes. s is the frame shift, the number of samples by which successive frames are offset. F_s is the frame size, the number of samples processed at once in each frame. i is the microphone array index (1, …, M) and j is the microphone index (1, …, N).
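The framing step for a single channel can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment; the function name `frame_signal` and the example sizes are assumptions made for this sketch.

```python
def frame_signal(x, frame_size, frame_shift):
    """Cut one channel into frames.

    Frame tau holds x[tau * s : tau * s + F_s], where s is the frame
    shift and F_s is the frame size, as in the description above.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frames = []
    tau = 0
    while tau * frame_shift + frame_size <= len(x):
        start = tau * frame_shift
        frames.append(x[start:start + frame_size])
        tau += 1
    return frames


# Hypothetical example: 10 samples, F_s = 4, s = 2 (50 % overlap).
samples = list(range(10))
frames = frame_signal(samples, frame_size=4, frame_shift=2)
```

With a frame shift smaller than the frame size, consecutive frames overlap, which is what the later overlap-and-add step in the sound output unit relies on.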

Xf_ij(t, τ) is then passed to the multi-channel short-time frequency analysis unit 303, which removes the DC component, applies a window (such as a Hamming, Hanning, or Blackman window), and then applies a short-time Fourier transform to convert each channel into the frequency-domain signal Xf_ij(f, τ). Let F be the number of frequency bins. For a given frame τ, Xf_ij(f, τ) has the data structure shown in FIG. 7. The frequency-domain signal Xf_ij(f, τ) is sent to the sound source position estimation unit 202 and the sound extraction unit 203.
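The per-frame analysis (DC cut, window, Fourier transform) can be sketched for one channel as below. The sketch assumes a periodic Hanning window and keeps only the non-negative frequency bins; both choices, and the name `short_time_spectrum`, are assumptions of this illustration rather than details given in the description. A direct DFT is used for clarity; a practical system would use an FFT.

```python
import cmath
import math


def short_time_spectrum(frame):
    """Return the DFT bins 0..n/2 of one DC-removed, Hanning-windowed frame."""
    n = len(frame)
    mean = sum(frame) / n
    centered = [v - mean for v in frame]  # DC component cut
    # Periodic Hanning window (assumed; Hamming or Blackman would also fit).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n) for t in range(n)]
    windowed = [c * w for c, w in zip(centered, window)]
    spectrum = []
    for f in range(n // 2 + 1):
        acc = 0j
        for t, v in enumerate(windowed):
            acc += v * cmath.exp(-2j * math.pi * f * t / n)
        spectrum.append(acc)
    return spectrum


# Hypothetical test tone: exactly 4 cycles in a 32-sample frame.
tone = [math.cos(2.0 * math.pi * 4 * t / 32) for t in range(32)]
magnitudes = [abs(v) for v in short_time_spectrum(tone)]
peak_bin = magnitudes.index(max(magnitudes))
```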

FIG. 4 shows an example of the block configuration of the sound source position estimation unit 202. The sound source position estimation unit 202 comprises per-frequency direction estimation units 4011 to 401M, a direction estimation integration unit 402, and the like.

First, the per-frequency direction estimation unit 401i estimates the direction of arrival θ_i(f) of sound for each frequency index f from the multi-channel frequency-domain signals Xf_i1(f, τ) to Xf_iN(f, τ) of one microphone array 101i. When the microphone array has two microphone elements, θ is estimated by [Equation 1].

Here, ρ(f, τ) is the phase difference between the input signals of the two microphone elements at frame τ and frequency index f. freq(f) is the frequency (Hz) of frequency index f and is calculated by [Equation 2].

Here, F_S is the sampling rate of the A/D converter, d is the physical spacing (m) between the two microphone elements, and c is the speed of sound (m/s). Strictly speaking, the speed of sound varies with temperature and the density of the medium, but it is normally fixed at a single value such as 340 m/s. Under the "sparseness" assumption described earlier, the noise removal processing here can apply the same processing independently to each time-frequency point, so the time-frequency suffix (f, τ) is omitted below.
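[Equation 1] and [Equation 2] are not reproduced in this text. The sketch below shows one common far-field form that is consistent with the symbols defined above, namely θ = arcsin(ρ·c / (2π·freq(f)·d)) and freq(f) = f·F_S / (FFT size). Both formulas, the sign convention, and the function names are assumptions of this sketch and may differ from the equations of the embodiment.

```python
import math


def bin_frequency(f_index, sampling_rate, fft_size):
    """Frequency in Hz of bin f_index (assumed form of [Equation 2])."""
    return f_index * sampling_rate / fft_size


def direction_from_phase(rho, freq_hz, spacing_m, speed_of_sound=340.0):
    """Arrival angle in radians from a two-microphone phase difference.

    Assumed far-field model: rho = 2*pi*freq*d*sin(theta)/c.
    The argument of asin is clamped to guard against rounding or noise.
    """
    s = rho * speed_of_sound / (2.0 * math.pi * freq_hz * spacing_m)
    return math.asin(max(-1.0, min(1.0, s)))


# Hypothetical check: synthesize the phase for a 30 degree source.
true_theta = math.radians(30.0)
rho = 2.0 * math.pi * 1000.0 * 0.05 * math.sin(true_theta) / 340.0
estimated_theta = direction_from_phase(rho, 1000.0, 0.05)
```

The estimate is unambiguous only while the spacing d stays below half a wavelength, which is the limitation the SPIRE algorithm described next is meant to lift.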

When the microphone array has three or more microphone elements, the direction can be calculated with high accuracy by the SPIRE algorithm (see Non-Patent Document 3). The SPIRE algorithm likewise applies the same processing independently to each time-frequency point, based on the "sparseness" assumption described earlier. FIG. 12 shows a flowchart of the SPIRE algorithm.

First, the SPIRE algorithm reads the arrangement of the microphone elements (S1201). Next, it selects the microphone elements that make up each microphone pair, where each pair consists of two microphone elements (S1202). It is desirable to choose the pairs so that the spacing between the two elements differs from pair to pair.

Next, the SPIRE algorithm sorts the microphone pairs in ascending order of spacing and stores them in a microphone pair queue (S1203). Here l is the index identifying a microphone pair, with l = 1 the pair with the shortest spacing and l = L the pair with the longest spacing. The algorithm then tests whether the number of elements in the queue is 0 (S1204). While it is not 0 (S1204-No), steps S1205 and S1206 described next are repeated.

That is, the microphone pair l with the shortest spacing is read from the queue and removed from it (S1205). In the phase difference estimation that follows, an integer n_l satisfying [Equation 3] is first found for the pair l that was read (S1206). Because the range bounded by the inequalities has width 2π, exactly one solution always exists. Then [Equation 4] is executed.

Before this processing is performed for l = 1, [Equation 5] is set as the initial value. After S1205 and S1206 have been repeated P times and the queue is empty (S1204-Yes), the direction θ(f, τ) is calculated from the phase difference according to [Equation 6] (S1207).

Here, d_l is the spacing between the microphone elements of the l-th microphone pair.

The accuracy of sound source direction estimation is known to improve as the microphone spacing increases. However, when the spacing exceeds half the wavelength of the signal whose direction is being estimated, a single direction can no longer be determined from the inter-microphone phase difference, because two or more directions share the same phase difference (spatial aliasing). The SPIRE method includes a mechanism that, among the two or more candidate directions produced by a long spacing, selects the one closest to the direction obtained with a short spacing. It therefore has the advantage of estimating the sound source direction with high accuracy even at spacings long enough to cause spatial aliasing.
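The stepwise disambiguation described above can be sketched as follows. [Equation 3] to [Equation 6] are not reproduced in this text, so this is only one plausible reading: the phase of each longer pair is unwrapped by the integer multiple of 2π that brings it closest to the phase predicted from the previous, shorter pair. The sketch further assumes that the shortest pair is free of aliasing and uses the same far-field model as a two-microphone estimate; the function names are hypothetical.

```python
import math


def wrap_phase(phi):
    """Wrap an angle to (-pi, pi], as an observed phase difference would be."""
    return math.atan2(math.sin(phi), math.cos(phi))


def spire_like_direction(wrapped_phases, spacings, freq_hz, speed_of_sound=340.0):
    """Estimate the arrival angle from several pairs, shortest spacing first."""
    pairs = sorted(zip(spacings, wrapped_phases))  # ascending spacing (S1203)
    d_prev, rho_prev = pairs[0]  # assumed alias-free starting point
    for d, rho in pairs[1:]:
        predicted = rho_prev * d / d_prev  # phase scales with spacing
        n = round((predicted - rho) / (2.0 * math.pi))  # unique integer n_l
        rho_prev = rho + 2.0 * math.pi * n  # restored (unwrapped) phase
        d_prev = d
    s = rho_prev * speed_of_sound / (2.0 * math.pi * freq_hz * d_prev)
    return math.asin(max(-1.0, min(1.0, s)))


# Hypothetical check: 4 kHz source at 40 degrees, three pair spacings.
true_theta = math.radians(40.0)
spacings = [0.27, 0.03, 0.09]
phases = [wrap_phase(2.0 * math.pi * 4000.0 * d * math.sin(true_theta) / 340.0)
          for d in spacings]
estimated_theta = spire_like_direction(phases, spacings, 4000.0)
```

In this example the 0.09 m and 0.27 m pairs are both spatially aliased at 4 kHz, yet the final angle is computed from the longest pair, which is where the accuracy gain comes from.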

The direction estimation results θ_i(f, τ) output by the per-frequency direction estimation units 4011 to 401M are input to the direction estimation integration unit 402. [Equation 7] yields a position histogram h(p, τ) whose value is larger at position indices p where a sound source is present.

Here, if [Equation 8] is used, in which the additions of [Equation 7] are thinned out according to the risk map data H(p, τ) calculated in the previous frame, the position histogram can be calculated so that it tracks high-risk positions more responsively.

The speech/non-speech discrimination unit 204 determines, from the position histogram h(p, τ) input from the sound source position estimation unit 202, a speech/non-speech discrimination map v(p, τ) that indicates the presence or absence of speech at each position p. For the discrimination, h(p, τ) is regarded as the noisy speech signal of a person at position p, noise is estimated by MCRA (minima controlled recursive averaging), and a general algorithm is then applied, such as the discrimination scheme of [Equation 9] based on the input signal-to-noise ratio (a posteriori SNR) γ(p, τ). The choice of algorithm makes no essential functional difference.

In addition, the computational cost can be reduced by always setting v(p, τ) to 0 for positions p inside the machine, determined from the machine dimensions 208. The speech/non-speech discrimination map v(p, τ) is sent to the person detection unit 205.

The visible light input unit 210, consisting of the visible light cameras 1031 to 103A, sends visible light image data VI to the moving object detection unit 212.

The infrared input unit 211, consisting of the infrared cameras 1041 to 104B, sends infrared image data II to the moving object detection unit 212.

FIG. 5 shows an example of the block configuration of the moving object detection unit 212. The moving object detection unit 212 comprises a background difference / inter-frame difference calculation unit 501, a body surface detection unit 502, a visual cone intersection calculation unit 503, and the like.

Based on the visible light image data VI_1 to VI_A, the background difference / inter-frame difference calculation unit 501 computes images EI_1 to EI_A in which object regions have been extracted from each image by background subtraction and inter-frame differencing. Based on the infrared image data II_1 to II_B, the body surface detection unit 502 computes images BI_1 to BI_B in which high-temperature pixel regions of each image have been extracted as body surface regions. The visual cone intersection calculation unit 503 back-projects the visual cones of the object regions in EI_1 to EI_A and of the body surface regions in BI_1 to BI_B into three-dimensional space using the camera projection matrix 220. Among the three-dimensional regions where the fields of view of the cameras intersect, obtained by [Equation 10], the moving object existence map e(p, τ) is updated as in [Equation 11] for the regions where the visual volumes intersect.

Here, w_e is […]. In addition, if [Equation 12] is used, in which the back-projection of [Equation 10] is thinned out according to the risk map data H(p, τ) calculated in the previous frame, the moving object existence map e(p, τ) tracks high-risk positions more responsively.

The person detection unit 205 calculates the person detection map d(p, τ) by [Equation 13] from the speech/non-speech discrimination map v(p, τ) and the moving object existence map e(p, τ). Here, w_v is a weighting coefficient between 0 and 1 inclusive.

The machine sensor input unit 207 consists of sensors such as the machine's speedometer and the hydraulic pressure sensors of the machine's arm, and outputs the sensor signals as a vector C(t) = (c_1(t), …, c_Ω(t)).

The machine motion state estimation unit 209 obtains the three-dimensional position P_k(t) of each small part z_k from the machine dimensions 208, where k (k = 1, …, K) is the part index. A table giving the vector V(t) = (V_1(t), …, V_K(t)) of the velocities V_k(t) of the small parts z_k for each pair of the sensor signal vector C(t) and the vector P(t) = (P_1(t), …, P_K(t)) is assumed to be stored in the storage medium 110 in advance. This table can easily be obtained by simulation at design time. The velocity V_k(t) of each small part z_k is obtained from this table.

Furthermore, an operation signal μ(t) is obtained from the machine operation input unit 221. A table of the accelerations A(t) = (A_1(t), …, A_K(t)) corresponding to each pair of the operation signal μ(t) and P(t) is likewise stored, so that the acceleration A_k(t) of each small part z_k is obtained from μ(t). [Equation 14] then gives the predicted position P_k(t + Δt) of the small part z_k at time t + Δt. Finally, [Equation 15] gives the map g(p, t) of the shortest time until contact.
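[Equation 14] is not reproduced in this text. Given a position, a velocity, and an acceleration for each small part, a natural candidate is the constant-acceleration prediction P_k(t + Δt) = P_k(t) + V_k(t)·Δt + ½·A_k(t)·Δt². The sketch below implements that form as an assumption; the actual equation of the embodiment may differ, and `predict_position` is a hypothetical name.

```python
def predict_position(position, velocity, acceleration, dt):
    """Constant-acceleration prediction of a 3-D point after dt seconds.

    Assumed form: P + V*dt + 0.5*A*dt^2, applied per coordinate.
    """
    return tuple(
        p + v * dt + 0.5 * a * dt * dt
        for p, v, a in zip(position, velocity, acceleration)
    )


# Hypothetical small part: moving along x, accelerating along y.
predicted = predict_position(
    position=(0.0, 0.0, 0.0),
    velocity=(1.0, 0.0, 0.0),
    acceleration=(0.0, 2.0, 0.0),
    dt=0.5,
)
```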

The risk level calculation unit 206 calculates the risk map H(p, τ) by [Equation 16] from the person detection map d(p, τ) input from the person detection unit 205 and the shortest-time-to-contact map g(p, t) input from the machine motion state estimation unit 209. Here, ε and ν are suitable constants.

The video output unit 213 presents the person detection map d(p, τ) and the risk map H(p, τ) superimposed on each other.

The sound extraction unit 203 calculates the extracted signal Yf(f, τ) from the frequency-domain signals Xf_11(f, τ) to Xf_MN(f, τ) input from the sound input unit 201 and the risk map H(p, τ).

FIG. 6 shows an example of the block configuration of the sound extraction unit 203. The sound extraction unit 203 comprises an extraction direction selection unit 601, sound source separation units 6021 to 602R, a mixing unit 603, and the like.

First, the extraction direction selection unit 601 sorts H(p, τ) over all position indices p and takes the top R positions p_1 to p_R as the extraction positions. The sound source separation units 6021 to 602R correspond to the extraction positions p_1 to p_R, respectively. FIG. 9 shows a flowchart of the r-th sound source separation unit 602r (for example, 602R).
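The selection of the top R positions can be sketched as below, with the risk map held as a flat list indexed by position index p. The list representation and the name `select_extraction_positions` are assumptions of this sketch.

```python
def select_extraction_positions(risk_map, r):
    """Return the r position indices with the highest risk, highest first."""
    order = sorted(range(len(risk_map)), key=lambda p: risk_map[p], reverse=True)
    return order[:r]


# Hypothetical risk map over four positions, R = 2.
extraction_positions = select_extraction_positions([0.1, 0.9, 0.3, 0.7], r=2)
```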

In S901, the processing branches on whether H(p_r, τ) > T_h or H(p_r, τ) ≤ T_h. When the risk level is high, H(p_r, τ) > T_h (S901-Yes), speed is judged to be especially important, and in S902 method 1, a method capable of instantaneous extraction, is selected. Method 1 may be, for example, binary masking: the direction θ(f, τ) is obtained for each frequency index by a direction estimation algorithm such as SPIRE described above, and a frequency component is kept when its direction coincides with the extraction position p_r and set to 0 when it does not.

In contrast, when the risk level is relatively low, H(p_r, τ) ≤ T_h (S901-No), high-precision extraction is judged to be needed for smooth communication, and in S903 method 2, a method capable of high-precision extraction, is selected.
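The branch in S901 and the binary masking of method 1 can be sketched as follows. The description says a component is kept when its direction "coincides with" the extraction position without giving a criterion; the angular tolerance used here, like the function names, is an assumption of this sketch.

```python
def choose_method(risk, threshold):
    """Return 1 (instantaneous) if risk > T_h, otherwise 2 (high precision)."""
    return 1 if risk > threshold else 2


def binary_mask(spectrum, directions, target_direction, tolerance):
    """Keep bins whose estimated direction is within tolerance of the target.

    spectrum and directions are per-frequency lists for one frame;
    all other bins are set to 0.
    """
    return [
        x if abs(theta - target_direction) <= tolerance else 0j
        for x, theta in zip(spectrum, directions)
    ]


# Hypothetical frame with three bins; the middle bin comes from elsewhere.
masked = binary_mask(
    spectrum=[1 + 1j, 2 + 0j, 3j],
    directions=[0.10, 0.80, 0.12],
    target_direction=0.10,
    tolerance=0.05,
)
```

Because the mask is recomputed from the current frame alone, it needs no adaptation time, which is why it suits the high-risk branch.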

FIG. 8 shows an example of the block configuration when method 2 is a minimum variance beamformer with sparseness-based adaptation. In detail, method 2 comprises a target sound / noise separation unit 801, a target sound steering vector update unit 802, a noise covariance matrix update unit 803, a filter update unit 804, and a filter multiplication unit 805. The description below follows FIG. 8.

As in the binary masking described above, the target sound / noise separation unit 801 uses the direction θ(f, τ) obtained for each frequency index by the direction estimation algorithm to separate the input, as in [Equation 17], into a target sound signal X_des(f, τ) and a noise signal X_int(f, τ). X_des(f, τ) is sent from the target sound / noise separation unit 801 to the target sound steering vector update unit 802. X_int(f, τ) is sent from the target sound / noise separation unit 801 to the noise covariance matrix update unit 803.

The target sound steering vector update unit 802 updates the target sound steering vector a(f, τ) = [a_0(f, τ), …, a_{M−1}(f, τ)]^T based on [Equation 18]. Here, γ_s is a suitable constant parameter that is at least 0 and less than 1. For stability, the update may of course be performed only when |X_des_i(f, τ)| is sufficiently large.

The noise covariance matrix update unit 803 updates the noise covariance matrix R(f, τ) based on [Equation 19]. Here, X_int(f, τ) = [X_int_0(f, τ), …, X_int_{M−1}(f, τ)]^T, and γ_n is a suitable constant parameter that is at least 0 and less than 1. For stability, the update may of course be performed only when |X_int(f, τ)| is sufficiently large.

The filter update unit 804 calculates the filter w(f, τ) from the target sound steering vector a(f, τ) and the noise covariance matrix R(f, τ) based on [Equation 20]. Here, γ_w is a suitable constant parameter that is at least 0 and less than 1.
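[Equation 20] is not reproduced in this text. For orientation, the sketch below computes the textbook minimum variance (MVDR) filter w = R⁻¹a / (aᴴR⁻¹a) for two channels at one frequency, and applies it as y = wᴴx. This standard form is swapped in as an assumption; it omits any smoothing with γ_w that the embodiment's equation may include, and the two-channel restriction and the function names are choices made only to keep the example dependency-free.

```python
import cmath


def mvdr_weights_2ch(a, R):
    """Textbook MVDR weights for 2 channels: R^-1 a / (a^H R^-1 a)."""
    (r00, r01), (r10, r11) = R
    det = r00 * r11 - r01 * r10
    # Closed-form inverse of the 2x2 noise covariance matrix.
    inv = [[r11 / det, -r01 / det], [-r10 / det, r00 / det]]
    ra = [inv[0][0] * a[0] + inv[0][1] * a[1],
          inv[1][0] * a[0] + inv[1][1] * a[1]]
    denom = a[0].conjugate() * ra[0] + a[1].conjugate() * ra[1]
    return [ra[0] / denom, ra[1] / denom]


def apply_filter(w, x):
    """Beamformer output y = w^H x for one time-frequency point."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


# Hypothetical steering vectors for the target (a) and one interferer (b).
a = [1 + 0j, cmath.exp(-0.5j)]
b = [1 + 0j, cmath.exp(1.0j)]
# Noise covariance: interferer b b^H plus a small diagonal term.
R = [[1.01 + 0j, b[0] * b[1].conjugate()],
     [b[1] * b[0].conjugate(), 1.01 + 0j]]
w = mvdr_weights_2ch(a, R)
target_gain = apply_filter(w, a)
interferer_gain = apply_filter(w, b)
```

In this textbook form the target direction passes with unit gain while the interferer described by R is strongly attenuated.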

Finally, the filter multiplication unit 805 multiplies Xf(f, τ) = [Xf_0(f, τ), …, Xf_{M−1}(f, τ)]^T by the filter w(f, τ) based on [Equation 21], which yields the signal Yf(f, τ) from which the sound arriving from the designated direction has been removed.

In this example, method 2 uses a minimum variance beamformer with sparseness-based adaptation, but method 2 may instead use ICA (independent component analysis), another high-precision extraction technique. Because ICA uses higher-order statistics, it needs a speech signal of about several seconds to adapt, which makes instantaneous extraction difficult, but it can extract with high precision. Also, although only two methods, method 1 and method 2, are selected between and executed in this example, there may be three or more methods, selected and executed according to the risk level.

The mixing unit 603 mixes the frequency-domain signals output by the sound source separation units 6021 to 602R and outputs the extracted signal Yf(f, τ).

The frequency-domain frame signal Yf(f, τ) calculated by the above procedure is sent to the sound output unit 219, where an inverse FFT converts it into the time-domain signal y(t, τ). The frames y(t, τ) are overlapped at each frame period, added, and multiplied by the reciprocal of the window function to give y(t), which is output from the headphones 106 via D/A conversion.

Based on the risk map H(p, τ), the external output sound generation unit 216 selects a filter that gives the speaker array directivity toward a position p_r where H(p, τ) is large. The speech signal input from the operator voice input unit 215, which consists of the operator-side microphone 105, is multiplied by this filter to generate a multi-channel signal, and the external sound output unit 217 outputs it from the speaker array 1021 to 102S via D/A conversion.

The machine operation control unit 218 decelerates or stops the machine when the risk map H(p, τ) is very large for some p.

以上説明した本実施の形態における音響処理システムによれば、以下のような効果を得ることができる。
（１）危険度算出部２０６で位置ごとに危険度を算出し、音抽出部２０３でその危険度が高い位置を抽出位置として自動的に選択するので、安全性のために音声を抽出すべきである、危険度が高い位置に存在する人物の音声を抽出することが可能である。
（２）音抽出部２０３において、危険度が高い位置を抽出位置とする音源分離ユニットほど瞬時的に抽出可能な方式を選択するので、危険度が高い位置の人物の音声はリアルタイムで抽出される。これにより、オペレータは瞬時的に危険回避を行うことができる。
（３）音抽出部２０３において、相対的に危険度が低い位置を抽出位置とする音源分離ユニットは高精度な分離方式を選択するので、残留騒音が少ない抽出音声を出力する。これにより、オペレータは周囲の人物の音声の内容を認識することができ、さらに外部向け音出力部２１７を介してオペレータと周囲の人物の間で円滑な会話が可能である。
（４）危険度算出部２０６が算出した位置ごとの危険度に応じて、音源位置推定部２０２が推定方式を変え、動体検出部２１２が検出方式を変えることにより、危険度の高い位置に対する計算を優先的に行い、危険度の低い位置に対する計算の頻度を下げることができるので、オペレータの迅速な行動が必要である危険度が高い位置ほど、危険度算出の更新が短縮される。
According to the sound processing system of the present embodiment described above, the following effects can be obtained.
(1) The risk level calculation unit 206 calculates a risk level for each position, and the sound extraction unit 203 automatically selects positions with a high risk level as extraction positions. The system can therefore extract the voice of a person at a high-risk position, which is the voice that should be extracted for safety.
(2) In the sound extraction unit 203, the higher the risk level at a sound source separation unit's extraction position, the more that unit selects a method capable of instantaneous extraction, so the voice of a person at a high-risk position is extracted in real time. As a result, the operator can avoid danger immediately.
(3) In the sound extraction unit 203, a sound source separation unit whose extraction position has a relatively low risk level selects a high-accuracy separation method and therefore outputs extracted speech with little residual noise. As a result, the operator can recognize what the surrounding people are saying, and a smooth conversation between the operator and the surrounding people is possible via the external sound output unit 217.
(4) According to the risk level for each position calculated by the risk level calculation unit 206, the sound source position estimation unit 202 changes its estimation method and the moving object detection unit 212 changes its detection method. Computation for high-risk positions can thus be prioritized and the frequency of computation for low-risk positions reduced. Consequently, the higher the risk level of a position, where quick action by the operator is required, the shorter the update interval of the risk level calculation.
(5) Because the risk level is presented visually as video on the video output unit 213, danger can be avoided even when the operator cannot rely on hearing for some reason, for example while talking on a telephone or radio.
(6) Because the external sound output unit 217 outputs sound with its directivity steered toward high-risk positions, it can alert people around the machine even in an environment where machine noise makes hearing difficult.
(7) When the risk level is high, the machine operation control unit 218 controls the machine itself as an emergency measure to avoid danger, so an accident may be avoided even when the operator's avoidance decision does not come in time.
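
Effects (1) to (3) can be illustrated with a minimal sketch. It assumes that the per-position risk levels are already available as a dictionary, that the sound extraction unit has a fixed number of sound source separation units, and that a single threshold decides between an instantaneous method and a high-accuracy method. The function name, the method labels, and the threshold value are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the two kinds of separation method.
INSTANTANEOUS = "instantaneous method"
HIGH_ACCURACY = "high-accuracy method"


def assign_separation_units(
    risk: Dict[str, float], num_units: int, threshold: float
) -> List[Tuple[str, str]]:
    """Pick extraction positions and a separation method for each unit.

    The positions with the highest risk level become extraction positions
    (effect (1)). A unit whose position exceeds the threshold uses the
    instantaneous method (effect (2)); the others use the high-accuracy
    method (effect (3)).
    """
    # Sort positions from highest to lowest risk level.
    ranked = sorted(risk.items(), key=lambda item: item[1], reverse=True)
    return [
        (position, INSTANTANEOUS if level > threshold else HIGH_ACCURACY)
        for position, level in ranked[:num_units]
    ]


if __name__ == "__main__":
    example_risk = {"front": 0.9, "left": 0.4, "rear": 0.1}
    for position, method in assign_separation_units(example_risk, 2, 0.5):
        print(position, "->", method)
```

With two units and a threshold of 0.5, the sketch assigns the front position to the instantaneous method and the left position to the high-accuracy method, and leaves the low-risk rear position unextracted.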

<Embodiment 2>
Embodiment 2 of the present invention will be described below with reference to FIG. 6, described above.

In Embodiment 1, an example was described in which the r-th sound source separation unit 602r (for example, 602R) of the sound extraction unit 203 switches its method for each position. The present embodiment is an example applied to a configuration in which the method is switched not for each position but only according to time.

According to the sound processing system of the present embodiment with this configuration, the following effect is obtained in addition to the effects of Embodiment 1. Even in a configuration in which, for example, all sound source separation units select method 1 when H(p, τ) > T_h holds for some p, sound at times with a high risk level can be extracted in real time, and sound at times with a low risk level can be extracted with high accuracy.
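
The time-only switching rule can be sketched as follows. The sketch assumes that the risk levels H(p, τ) for one time τ are given as a list over positions p and that exactly one alternative to method 1 exists. The label for that alternative and the function name are hypothetical.

```python
from typing import List, Sequence

METHOD_1 = "method 1"
# Hypothetical label for the method used when no position exceeds T_h.
HIGH_ACCURACY = "high-accuracy method"


def select_methods_at_time(
    risk_at_tau: Sequence[float], t_h: float, num_units: int
) -> List[str]:
    """Return the method used by every separation unit at one time tau.

    If H(p, tau) > T_h holds for at least one position p, all units use
    method 1. Otherwise all units use the high-accuracy method. The choice
    does not depend on which position a unit is extracting.
    """
    any_high_risk = any(level > t_h for level in risk_at_tau)
    method = METHOD_1 if any_high_risk else HIGH_ACCURACY
    return [method] * num_units


if __name__ == "__main__":
    print(select_methods_at_time([0.2, 0.7, 0.1], t_h=0.5, num_units=3))
    print(select_methods_at_time([0.2, 0.3, 0.1], t_h=0.5, num_units=3))
```

Because the decision is shared by all units, a single high-risk position is enough to put the whole sound extraction unit into its real-time mode for that time.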

<Embodiment 3>
Embodiment 3 of the present invention will be described below with reference to FIG. 10. FIG. 10 is a diagram illustrating an example of the block configuration of the sound processing system according to the present embodiment.

Compared with Embodiment 1, the present embodiment does not include the visible light input unit 210, the infrared input unit 211, the moving object detection unit 212, the video output unit 213, the operator voice input unit 215, the external output sound generation unit 216, the external sound output unit 217, the machine operation control unit 218, or the camera projection matrix 220.

That is, as shown in FIG. 10, the sound processing system according to the present embodiment includes a sound input unit 201, a sound source position estimation unit 202, a sound extraction unit 203, a speech/non-speech discrimination unit 204, a person detection unit 205, a risk level calculation unit 206, a machine sensor input unit 207, a machine motion state estimation unit 209, a sound output unit 219, a machine operation input unit 221, and the like. Each functional unit has the same function as in Embodiment 1.

According to the sound processing system of the present embodiment with this configuration, the following effects (1) to (4) can be obtained, that is, the effects of Embodiment 1 other than (5) to (7).
(1) The risk level calculation unit 206 calculates a risk level for each position, and the sound extraction unit 203 automatically selects positions with a high risk level as extraction positions. The system can therefore extract the voice of a person at a high-risk position, which is the voice that should be extracted for safety.
(2) In the sound extraction unit 203, the higher the risk level at a sound source separation unit's extraction position, the more that unit selects a method capable of instantaneous extraction, so the voice of a person at a high-risk position is extracted in real time. As a result, the operator can avoid danger immediately.
(3) In the sound extraction unit 203, a sound source separation unit whose extraction position has a relatively low risk level selects a high-accuracy separation method and therefore outputs extracted speech with little residual noise. As a result, the operator can recognize what the surrounding people are saying.
(4) The sound source position estimation unit 202 changes its estimation method according to the risk level for each position calculated by the risk level calculation unit 206. Computation for high-risk positions can thus be prioritized and the frequency of computation for low-risk positions reduced. Consequently, the higher the risk level of a position, where quick action by the operator is required, the shorter the update interval of the risk level calculation.
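
Effect (4) amounts to spending computation where the risk level is high. One possible way to picture it is a per-frame schedule in which high-risk positions are re-estimated every frame and low-risk positions only every few frames. The threshold, the slow period of four frames, and the function name are assumptions made for this sketch; the specification does not state them.

```python
from typing import Dict, List


def positions_to_update(
    frame_index: int,
    risk: Dict[str, float],
    threshold: float,
    slow_period: int = 4,
) -> List[str]:
    """List the positions whose estimate is recomputed in this frame.

    Positions above the threshold are updated every frame. The remaining
    positions are updated only once every `slow_period` frames, which
    lowers the computation frequency for low-risk positions.
    """
    due = []
    for position, level in risk.items():
        period = 1 if level > threshold else slow_period
        if frame_index % period == 0:
            due.append(position)
    return due


if __name__ == "__main__":
    example_risk = {"front": 0.9, "rear": 0.1}
    for frame in range(5):
        print(frame, positions_to_update(frame, example_risk, threshold=0.5))
```

In this sketch the front position is refreshed in every frame, while the rear position is refreshed only in frames 0 and 4, so the update interval is shortest where the operator may need to act quickly.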

<Embodiment 4>
Embodiment 4 of the present invention will be described below with reference to FIG. 11. FIG. 11 is a diagram illustrating an example of the block configuration of the sound processing system according to the present embodiment.

Compared with Embodiment 3, the present embodiment additionally omits the sound source position estimation unit 202, the speech/non-speech discrimination unit 204, and the person detection unit 205.

That is, as shown in FIG. 11, the sound processing system according to the present embodiment includes a sound input unit 201, a sound extraction unit 203, a risk level calculation unit 206, a machine sensor input unit 207, a machine motion state estimation unit 209, a sound output unit 219, a machine operation input unit 221, and the like. Each functional unit has the same function as in Embodiment 1.

According to the sound processing system of the present embodiment with this configuration, the following effects (1) to (3) can be obtained, that is, the effects of Embodiment 3 other than (4).
(1) Even without a person detection unit, the risk level calculation unit 206 calculates a risk level for each position, and the sound extraction unit 203 automatically selects positions with a high risk level as extraction positions. The system can therefore extract the voice of a person at a high-risk position, which is the voice that should be extracted for safety.
(2) In the sound extraction unit 203, the higher the risk level at a sound source separation unit's extraction position, the more that unit selects a method capable of instantaneous extraction, so the voice of a person at a high-risk position is extracted in real time. As a result, the operator can avoid danger immediately.
(3) In the sound extraction unit 203, a sound source separation unit whose extraction position has a relatively low risk level selects a high-accuracy separation method and therefore outputs extracted speech with little residual noise. As a result, the operator can recognize what the surrounding people are saying.

The invention made by the present inventors has been described above in concrete terms based on the embodiments. Needless to say, the present invention is not limited to these embodiments and can be modified in various ways without departing from its gist.

For example, the embodiments above describe a configuration in which the sound processing system is integrated with a construction machine. However, the present invention is not limited to construction machines and can be applied as is to general vehicles, work machines, and the like.

The sound processing system of the present invention relates to sound processing technology suited to letting an operator or driver of a relatively large machine, such as a construction machine, a vehicle, or a work machine, grasp the situation of people around the machine. In particular, it is applicable to a sound processing system suited to the safety of people around the machine and to a machine using that system.

DESCRIPTION OF SYMBOLS
100 ... sound processing system, 101_1 to 101_M ... microphone array, 102_1 to 102_S ... speaker array, 103_1 to 103_A ... visible light cameras, 104_1 to 104_B ... infrared cameras, 105 ... microphone, 106 ... headphones, 107 ... A/D-D/A converter, 108 ... central processing unit, 109 ... volatile memory, 110 ... storage medium, 111 ... image display device, 112 ... work machine, 113 ... machine operation input unit, 114_1 to 114_M, 115_1 to 115_S, 116, 117 ... audio cables, 118 ... monitor cable, 119, 120_1 to 120_A, 121_1 to 121_B ... digital cables,
201 ... sound input unit, 202 ... sound source position estimation unit, 203 ... sound extraction unit, 204 ... speech/non-speech discrimination unit, 205 ... person detection unit, 206 ... risk level calculation unit, 207 ... machine sensor input unit, 208 ... machine dimensions, 209 ... machine motion state estimation unit, 210 ... visible light input unit, 211 ... infrared input unit, 212 ... moving object detection unit, 213 ... video output unit, 214 ... microphone arrangement, 215 ... operator voice input unit, 216 ... external output sound generation unit, 217 ... external sound output unit, 218 ... machine operation control unit, 219 ... sound output unit, 220 ... camera projection matrix, 221 ... machine operation input unit,
301 ... multi-channel A/D converter, 302 ... multi-channel frame processing unit, 303 ... multi-channel short-time frequency analysis unit,
401_1 to 401_M ... per-frequency direction estimation units, 402 ... direction estimation integration unit,
501 ... background difference / inter-frame difference calculation unit, 502 ... body surface detection unit, 503 ... visual cone intersection calculation unit,
601 ... extraction direction selection unit, 602_1 to 602_R ... sound source separation units, 603 ... mixing unit,
801 ... target sound / noise separation unit, 802 ... target sound steering vector update unit, 803 ... noise covariance matrix update unit, 804 ... filter update unit, 805 ... filter multiplication unit,
13001 ... cabinet, 13002 ... engine unit, 13003 ... arm unit.

Claims

A sound processing system comprising:
a sound input unit composed of a plurality of microphones that collect sound;
a risk level calculation unit that calculates a risk level associated with contact with a surrounding person or object due to the operation of a machine;
a sound extraction unit that receives the signal output from the sound input unit and outputs a separated signal according to the risk level calculated by the risk level calculation unit; and
a sound output unit that outputs the separated signal output from the sound extraction unit.

The sound processing system according to claim 1,
wherein the risk level calculation unit calculates the risk level for each position.

The sound processing system according to claim 1 or 2,
wherein the sound extraction unit includes a plurality of sound source separation units, and
the plurality of sound source separation units set extraction positions according to the risk level.

The sound processing system according to claim 3,
wherein each sound source separation unit changes its separation method according to the risk level.

The sound processing system according to claim 4,
further comprising a machine motion state estimation unit that estimates a motion state of the machine based on information from a sensor installed in the machine or on a machine operation signal,
wherein the risk level calculation unit calculates the risk level based on the motion state output from the machine motion state estimation unit.

The sound processing system according to claim 5, further comprising:
a sound source position estimation unit that estimates a sound source position from the signal output from the sound input unit;
a speech/non-speech discrimination unit that discriminates between speech and non-speech based on the sound source position output by the sound source position estimation unit; and
a person detection unit that detects a person position based on the discrimination result output by the speech/non-speech discrimination unit,
wherein the risk level calculation unit calculates the risk level based on the person position detection result output by the person detection unit.

The sound processing system according to claim 5, further comprising:
a sound source position estimation unit that estimates a sound source position from the signal output by the sound extraction unit;
a speech/non-speech discrimination unit that discriminates between speech and non-speech based on the sound source position output by the sound source position estimation unit; and
a person detection unit that detects a person position based on the discrimination result output by the speech/non-speech discrimination unit,
wherein the risk level calculation unit calculates the risk level based on the person position detection result output by the person detection unit.

The sound processing system according to claim 7, further comprising:
a video input unit composed of one or more cameras, such as visible light cameras or infrared cameras; and
a moving object detection unit that detects a moving object based on the video output from the video input unit,
wherein the person detection unit detects a person based on a signal output from the moving object detection unit.

The sound processing system according to claim 8,
wherein the sound source position estimation unit changes its estimation method based on the risk level for each position output by the risk level calculation unit.

The sound processing system according to claim 8 or 9,
wherein the moving object detection unit changes its detection method based on the risk level for each position output by the risk level calculation unit.

The sound processing system according to any one of claims 1 to 10,
further comprising a video output unit that displays video based on the risk level output by the risk level calculation unit.

The sound processing system according to any one of claims 1 to 11, further comprising:
an external output sound generation unit that generates an output sound directed outside the machine based on the risk level output by the risk level calculation unit; and
an external sound output unit that outputs the output sound generated by the external output sound generation unit.

The sound processing system according to any one of claims 1 to 12,
further comprising a machine operation control unit that controls the operation of the machine based on the risk level output by the risk level calculation unit.

A machine using the sound processing system according to claim 1.