JP6881267B2

JP6881267B2 - Controls, converters, control methods, conversion methods, and programs

Info

Publication number: JP6881267B2
Application number: JP2017233062A
Authority: JP
Inventors: 弘章伊藤; 豪入江; 京介西田; 歩相名神山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2021-06-02
Anticipated expiration: 2037-12-05
Also published as: JP2019103011A

Description

この発明は、話者の方向を推定する技術に関する。 The present invention relates to a technique for estimating the direction of a speaker.

音声認識を利用した音声対話エージェントやロボット対話等のアプリケーションにおいて、ロボットに到来した音が対話に関係あるか否か、を判別することは、円滑な対話を実現する上で重要である。 In applications such as voice dialogue agents and robot dialogues that use voice recognition, it is important to determine whether or not the sound that arrives at the robot is related to the dialogue in order to realize a smooth dialogue.

例えば特許文献１や特許文献２に記載された従来技術では、複数のマイクロホンで集音された信号に基づきある複数の方向毎に分離した信号を生成し、分離後の信号のパワーを算出し、ある時点で最大のパワーとなる方向を対話に関係のある方向とし、その方向の音を強調して集音するように指向性集音を実施する。 For example, in the prior art described in Patent Document 1 and Patent Document 2, a signal separated for each of a plurality of directions is generated based on a signal collected by a plurality of microphones, and the power of the separated signal is calculated. The direction of maximum power at a certain point is set as the direction related to the dialogue, and directional sound collection is performed so as to emphasize and collect the sound in that direction.

従来の話者方向決定装置の機能構成を図１に示す。図１の話者方向決定装置９は、方向別前処理部９１と方向別パワー算出部９２と到来方向選択部９３とを備える。方向別前処理部９１は、複数のマイクロホンで集音された音信号に基づきある複数の方向毎に分離した信号を生成する。方向別パワー算出部９２は、分離後の音信号から方向毎のパワーを算出する。到来方向選択部９３は、方向毎のパワーからある時点で最大のパワーとなる方向を対話に関係のある方向として選択する。指向性集音部８は、複数のマイクロホンで集音された音信号のうち到来方向選択部９３が選択した到来方向の音を強調して集音する。 FIG. 1 shows the functional configuration of a conventional speaker direction determination device. The speaker direction determination device 9 of FIG. 1 includes a direction-specific preprocessing unit 91, a direction-specific power calculation unit 92, and an arrival direction selection unit 93. The direction-specific preprocessing unit 91 generates signals separated for each of a plurality of directions based on sound signals collected by a plurality of microphones. The direction-specific power calculation unit 92 calculates the power for each direction from the separated sound signals. The arrival direction selection unit 93 selects, from the power for each direction, the direction with the maximum power at a given time as the direction related to the dialogue. The directional sound collecting unit 8 emphasizes and collects, from the sound signals collected by the plurality of microphones, the sound in the arrival direction selected by the arrival direction selection unit 93.

特開２００５−６４９６８号公報 Japanese Unexamined Patent Publication No. 2005-64968
特開２００１−３０９４８３号公報 Japanese Unexamined Patent Publication No. 2001-309483

しかしながら、従来の技術では音のパワーのみを手掛かりとしているため、目的とする音源と、対話に無関係な音源とが存在する場合には、どちらが目的とする音源か見分けることができず、無関係な音源側を強調してしまうといった誤動作を起こす可能性がある。例えば、複数人に囲まれたロボットが対話を行うシーンを想定すると、周囲の会話のように対話と無関係な話者に反応してしまうといった誤動作を起こしてしまい、対話が成立しないことがある。 However, since the conventional technology uses only the power of sound as a clue, when there is a target sound source and a sound source unrelated to dialogue, it is not possible to distinguish which is the target sound source, and the unrelated sound source. There is a possibility of malfunction such as emphasizing the side. For example, assuming a scene in which a robot surrounded by a plurality of people has a dialogue, a malfunction such as reacting to a speaker unrelated to the dialogue like a surrounding conversation may occur, and the dialogue may not be established.

この発明の目的は、上記のような点に鑑みて、対話に無関係な音源が存在する場合に、その対話に無関係な情報を排除することで、誤動作を防止することができる話者方向決定技術を実現することである。 In view of the above, an object of the present invention is to realize a speaker direction determination technique that, when a sound source unrelated to the dialogue is present, can prevent malfunction by excluding the information unrelated to the dialogue.

上記の課題を解決するために、この発明の変換装置は、複数のマイクロホンにより収音された音響信号から推定されたマイクロホンアレイを基準とした所望の音源の方向であるマイクロホン方向を、カメラにより撮影された画像における座標であるカメラ座標に変換する変換装置であって、マイクロホン方向を変換規則によりカメラ座標に変換する変換部を含み、変換規則は、少なくとも３個のスピーカからなる放音部から発せられ、少なくとも３個のスピーカのうちいずれのスピーカから発せられたか識別できるよう設定された音響信号と、少なくとも３個のスピーカの個々の位置を検知できるよう放音部が撮影された画像と、を関連付けることで得られたものである。 To solve the above problem, the conversion device of the present invention converts a microphone direction, which is the direction of a desired sound source relative to a microphone array as estimated from acoustic signals picked up by a plurality of microphones, into camera coordinates, which are coordinates in an image captured by a camera. The conversion device includes a conversion unit that converts the microphone direction into camera coordinates according to a conversion rule. The conversion rule is obtained by associating acoustic signals, which are emitted from a sound emitting unit consisting of at least three speakers and are set so that the speaker that emitted each signal can be identified, with an image of the sound emitting unit captured so that the individual positions of the at least three speakers can be detected.

この発明の話者方向決定技術では、音の到来方向推定に加えて、画像認識を利用した話者方向推定を行うことで、対話に無関係な情報を排除することができる。これにより、この発明の話者方向決定技術によれば、対話に無関係な音源が存在する場合であっても、誤動作を防止することができる。 In the speaker direction determination technique of the present invention, information irrelevant to dialogue can be eliminated by estimating the speaker direction using image recognition in addition to estimating the arrival direction of sound. Thereby, according to the speaker direction determination technique of the present invention, it is possible to prevent a malfunction even when a sound source unrelated to the dialogue exists.

図１は、従来の話者方向決定装置の機能構成を例示する図である。 FIG. 1 is a diagram illustrating the functional configuration of a conventional speaker direction determination device.
図２は、第一実施形態の話者方向決定装置の機能構成を例示する図である。 FIG. 2 is a diagram illustrating the functional configuration of the speaker direction determination device of the first embodiment.
図３は、第一実施形態の話者方向決定方法の処理手続きを例示する図である。 FIG. 3 is a diagram illustrating the processing procedure of the speaker direction determination method of the first embodiment.
図４は、話者方向推定結果とカメラ画像の校正を説明するための図である。 FIG. 4 is a diagram for explaining the calibration between the speaker direction estimation result and the camera image.
図５は、画像認識結果を利用したスコアの算出を説明するための図である。 FIG. 5 is a diagram for explaining the calculation of the score using the image recognition result.
図６は、変形例の変換装置の機能構成を例示する図である。 FIG. 6 is a diagram illustrating the functional configuration of the conversion device of the modification.
図７は、変形例の変換方法の処理手続きを例示する図である。 FIG. 7 is a diagram illustrating the processing procedure of the conversion method of the modification.
図８は、第二実施形態の話者方向決定装置の機能構成を例示する図である。 FIG. 8 is a diagram illustrating the functional configuration of the speaker direction determination device of the second embodiment.
図９は、第二実施形態の話者方向決定方法の処理手続きを例示する図である。 FIG. 9 is a diagram illustrating the processing procedure of the speaker direction determination method of the second embodiment.
図１０は、音声認識結果と画像認識結果を利用したスコアの算出を説明するための図である。 FIG. 10 is a diagram for explaining the calculation of the score using the voice recognition result and the image recognition result.
図１１は、第三実施形態の最適配置取得装置の機能構成を例示する図である。 FIG. 11 is a diagram illustrating the functional configuration of the optimum arrangement acquisition device of the third embodiment.
図１２は、スピーカの最適配置の取得方法を説明するための図である。 FIG. 12 is a diagram for explaining a method of acquiring the optimum arrangement of the speakers.
図１３は、スピーカの最適配置の表示方法を説明するための図である。 FIG. 13 is a diagram for explaining a method of displaying the optimum arrangement of the speakers.

以下、この発明の実施の形態について詳細に説明する。なお、図面中において同じ機能を有する構成部には同じ番号を付し、重複説明を省略する。 Hereinafter, embodiments of the present invention will be described in detail. In the drawings, the components having the same function are given the same number, and duplicate description will be omitted.

［第一実施形態］ [First Embodiment]
第一実施形態の話者方向決定装置は、対話ロボットなどが話者の方向を推定して指向性集音を実施する際に、雑音源などで方向推定を誤動作させないために、到来方向推定結果に基づき画像認識を実施することで、目的話者方向を決定する装置である。 The speaker direction determination device of the first embodiment determines the target speaker direction by performing image recognition based on the arrival direction estimation result, so that noise sources and the like do not cause the direction estimation to malfunction when a dialogue robot or similar system estimates the speaker direction and performs directional sound collection.

第一実施形態の話者方向決定装置１は、図２に示すように、M（≧2）個のマイクロホンが集音したM個の音声信号と少なくともK（≧1）個のカメラが撮像したK個の画像信号とを入力とし、その音声信号と画像信号とから推定した話者方向を指向性集音部８へ出力する。K個のカメラとM個のマイクロホンとは異なる位置に設置されることを想定しているが、例えば、カメラの筐体にマイクロホンを設置するなど同一とみなせる位置に設置されていてもよい。K個のカメラは、全天球カメラのように、カメラを中心として全周囲を撮影可能なカメラを用いてもよい。話者方向決定装置１は、到来方向推定部１１と測定座標補正部１２と画像認識部１３と話者方向推定部１４とを備える。この話者方向決定装置１が、図３に例示する各ステップの処理を行うことにより第一実施形態の話者方向決定方法が実現される。 As shown in FIG. 2, the speaker direction determination device 1 of the first embodiment takes as input M audio signals collected by M (≧ 2) microphones and K image signals captured by at least K (≧ 1) cameras, and outputs the speaker direction estimated from those audio signals and image signals to the directional sound collecting unit 8. The K cameras and the M microphones are assumed to be installed at different positions, but they may be installed at positions that can be regarded as the same, for example by mounting the microphones on the camera housing. Each of the K cameras may be a camera that can capture its entire surroundings, such as an omnidirectional camera. The speaker direction determination device 1 includes an arrival direction estimation unit 11, a measurement coordinate correction unit 12, an image recognition unit 13, and a speaker direction estimation unit 14. The speaker direction determination method of the first embodiment is realized by the speaker direction determination device 1 performing the processing of each step illustrated in FIG. 3.

話者方向決定装置１は、例えば、中央演算処理装置（CPU: Central Processing Unit）、主記憶装置（RAM: Random Access Memory）などを有する公知又は専用のコンピュータに特別なプログラムが読み込まれて構成された特別な装置である。話者方向決定装置１は、例えば、中央演算処理装置の制御のもとで各処理を実行する。話者方向決定装置１に入力されたデータや各処理で得られたデータは、例えば、主記憶装置に格納され、主記憶装置に格納されたデータは必要に応じて中央演算処理装置へ読み出されて他の処理に利用される。話者方向決定装置１の各処理部は、少なくとも一部が集積回路等のハードウェアによって構成されていてもよい。 The speaker direction determination device 1 is configured by loading a special program into a known or dedicated computer having, for example, a central processing unit (CPU), a main storage device (RAM: Random Access Memory), or the like. It is a special device. The speaker direction determination device 1 executes each process under the control of the central processing unit, for example. The data input to the speaker direction determination device 1 and the data obtained by each process are stored in, for example, the main storage device, and the data stored in the main storage device is read out to the central processing unit as needed. It is used for other processing. At least a part of each processing unit of the speaker direction determining device 1 may be configured by hardware such as an integrated circuit.

以下、図３を参照して、第一実施形態の話者方向決定装置１が実行する話者方向決定方法について説明する。 Hereinafter, the speaker direction determination method executed by the speaker direction determination device 1 of the first embodiment will be described with reference to FIG.

ステップＳ１１において、到来方向推定部１１は、まず、M個のマイクロホンからM個の音声信号を受信し、ビームフォーミング等の信号処理によりL（≧2）個の方向別音声信号に変換することで、各方向別音声信号から方向別のパワーを算出する。次に、到来方向推定部１１は、算出した方向別パワーを所定の閾値と比較し、その閾値を超えた方向を到来方向として推定する。そして、到来方向推定部１１は、到来方向の推定結果を測定座標補正部１２へ出力する。 In step S11, the arrival direction estimation unit 11 first receives M audio signals from the M microphones, converts them into L (≧ 2) direction-specific audio signals by signal processing such as beamforming, and calculates the power for each direction from each direction-specific audio signal. Next, the arrival direction estimation unit 11 compares the calculated direction-specific power with a predetermined threshold and estimates each direction whose power exceeds the threshold as an arrival direction. Then, the arrival direction estimation unit 11 outputs the arrival direction estimation result to the measurement coordinate correction unit 12.
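The thresholding in step S11 can be sketched as follows. This is a minimal illustration, assuming the beamformer has already produced the L direction-specific signals; the function names, the mean-square definition of power, and the sample values are assumptions for illustration and are not taken from the publication.

```python
def direction_powers(direction_signals):
    """Mean-square power of each direction-specific signal (assumed definition)."""
    return [sum(s * s for s in sig) / len(sig) for sig in direction_signals]


def estimate_arrival_directions(direction_signals, threshold):
    """Return the indices of all directions whose power exceeds the threshold."""
    powers = direction_powers(direction_signals)
    return [idx for idx, p in enumerate(powers) if p > threshold]


# Hypothetical beamformer outputs for L = 3 directions.
signals = [
    [0.5, -0.5, 0.5, -0.5],  # power 0.25
    [0.1, -0.1, 0.1, -0.1],  # power about 0.01
    [0.4, -0.4, 0.4, -0.4],  # power about 0.16
]
arrival = estimate_arrival_directions(signals, threshold=0.1)
```

Unlike the conventional device of FIG. 1, which keeps only the single direction of maximum power, this step can pass several candidate directions to the later image recognition stage.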

ステップＳ１２において、測定座標補正部１２は、到来方向推定部１１から到来方向推定結果（角度情報）を受信し、K個のカメラで撮影された画像上の座標系に合致するように予め算出しておいた変換行列を用い、到来方向推定結果をカメラと同一の座標系へと校正する。測定座標補正部１２は、校正した到来方向推定結果を画像認識部１３へ出力する。 In step S12, the measurement coordinate correction unit 12 receives the arrival direction estimation result (angle information) from the arrival direction estimation unit 11 and calculates in advance so as to match the coordinate system on the images taken by the K cameras. Using the transformation matrix that has been set up, the arrival direction estimation result is calibrated to the same coordinate system as the camera. The measurement coordinate correction unit 12 outputs the calibrated arrival direction estimation result to the image recognition unit 13.

図４を参照して、到来方向推定結果（角度情報）とカメラ画像の校正方法について説明する。校正するためには、マイクで観測された音声信号から算出される到来方向推定結果の二次元角度スペクトル上の点(θ, φ)（θは水平角、φは仰角を表す）と、カメラで撮影された画像上の画素(x, y)との変換行列を求めればよい。ここではカメラ画像の画素から二次元角度スペクトル上の点への変換方法を示す。二次元角度スペクトル上の点からカメラ画像の画素へ変換する場合は逆の計算を行えばよい。 With reference to FIG. 4, the arrival direction estimation result (angle information) and the calibration method of the camera image will be described. In order to calibrate, the points (θ, φ) on the two-dimensional angle spectrum of the arrival direction estimation result calculated from the voice signal observed by the microphone (θ represents the horizontal angle and φ represents the elevation angle) and the camera. The conversion matrix with the pixels (x, y) on the captured image may be obtained. Here, a conversion method from a pixel of a camera image to a point on a two-dimensional angular spectrum is shown. When converting from a point on the two-dimensional angle spectrum to a pixel of a camera image, the reverse calculation may be performed.

図４に示すように、３箇所以上の相異なる位置に校正用スピーカを設置する。各スピーカには、撮影された画像からそれぞれの校正用スピーカが区別可能なマーカー（例えば、「●」「■」「★」等の記号等）を貼り付けておく。また、各校正用スピーカから相異なる周波数帯域の音を発することで、画像上の画素と二次元角度スペクトルとの対応が取れるようにする。この校正用スピーカを用いて、マイク及びカメラにて同時に収音及び撮影することで得られる二次元角度スペクトル(θ_i, φ_i)とカメラ画像の画素(x_i, y_i)（iはスピーカのインデックスを表す）について、下記の式で表される変換行列を求める。ここで、a, b, c, d, e, fは到来方向推定結果の二次元角度スペクトルと画像上の画素の組から対応関係を求めた変換パラメータである。この変換パラメータが設定された3×3の行列が図４中の変換行列Kに該当する。 As shown in FIG. 4, calibration speakers are installed at three or more different positions. A marker (for example, a symbol such as "●", "■", or "★") that allows each calibration speaker to be distinguished in the captured image is attached to each speaker. In addition, each calibration speaker emits sound in a different frequency band so that pixels on the image can be matched to points on the two-dimensional angle spectrum. Using these calibration speakers, sound is picked up by the microphones and images are captured by the camera at the same time, and the transformation matrix represented by the following equation is obtained for the resulting two-dimensional angle spectrum points (θ_i, φ_i) and camera image pixels (x_i, y_i), where i is the speaker index. Here, a, b, c, d, e, and f are conversion parameters obtained from the corresponding pairs of angle spectrum points and image pixels. The 3 × 3 matrix in which these conversion parameters are set corresponds to the transformation matrix K in FIG. 4.
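The equation itself is not reproduced in this text. A form consistent with the description (a 3 × 3 matrix K holding the six parameters a to f and mapping a camera pixel to a point on the angle spectrum) would be the homogeneous affine transform below. The placement of the parameters within the matrix is an assumption.

```latex
\begin{pmatrix} \theta_i \\ \phi_i \\ 1 \end{pmatrix}
= K \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix},
\qquad
K = \begin{pmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{pmatrix}
```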

座標変換における自由度は、回転（１自由度）、平行移動（２自由度）、拡大縮小（２自由度）、せん断（１自由度）の合計６自由度とし、対応する角度スペクトルと画素の組を３つ以上得ることで、変換行列を一意に決定することができる。 The coordinate conversion has a total of 6 degrees of freedom: rotation (1 degree of freedom), translation (2 degrees of freedom), scaling (2 degrees of freedom), and shear (1 degree of freedom). By obtaining three or more corresponding pairs of angle spectrum points and pixels, the transformation matrix can be uniquely determined.
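Because each speaker contributes two equations (one for θ and one for φ), three non-collinear speakers give exactly the six equations needed for the six parameters, and more speakers give a least-squares fit. The sketch below illustrates this with a least-squares fit via the normal equations; it assumes the affine form with last row (0, 0, 1), and the function names and sample coordinates are hypothetical rather than the procedure of the publication.

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    scale = max(abs(x) for row in m for x in row)
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= 1e-12 * scale:  # crude singularity check
            raise ValueError("speaker pixels are (nearly) collinear")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(pixels, angles):
    """Least-squares K such that (theta, phi, 1)^T ~= K (x, y, 1)^T."""
    rows = [(x, y, 1.0) for x, y in pixels]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * ang[k] for r, ang in zip(rows, angles)) for i in range(3)]
        params.append(solve3(ata, atb))
    return [params[0], params[1], [0.0, 0.0, 1.0]]


def apply_affine(K, x, y):
    """Map a pixel (x, y) to an angle-spectrum point (theta, phi)."""
    return (K[0][0] * x + K[0][1] * y + K[0][2],
            K[1][0] * x + K[1][1] * y + K[1][2])


# Hypothetical measurements for three calibration speakers.
pixels = [(100.0, 100.0), (500.0, 120.0), (300.0, 400.0)]
angles = [(-30.0, 10.0), (30.0, 8.0), (0.0, -20.0)]
K = fit_affine(pixels, angles)
```

With exactly three pairs the fit reproduces the measured angles up to rounding; with more pairs it averages out measurement noise.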

ステップＳ１３において、画像認識部１３は、K個のカメラから画像信号を受信し、測定座標補正部１２から座標軸が校正された到来方向推定結果を受信する。画像認識部１３は、カメラから受信した画像から到来方向毎の画像を取得し、取得した画像に顔認識を実施することで、画像中の顔向きを検出し、画面全体における顔部分の占有率を算出する。画像認識部１３は、顔部分の画面占有率と顔向き検出結果とを話者方向推定部１４へ出力する。なお、顔向きの検出方法および画面占有率の算出方法については、下記参考文献１のような方法が利用可能である。 In step S13, the image recognition unit 13 receives the image signals from the K cameras, and receives from the measurement coordinate correction unit 12 the arrival direction estimation result whose coordinate axes have been calibrated. The image recognition unit 13 acquires an image for each arrival direction from the images received from the cameras and performs face recognition on the acquired image, thereby detecting the face orientation in the image and calculating the proportion of the entire screen occupied by the face. The image recognition unit 13 outputs the screen occupancy rate of the face and the face orientation detection result to the speaker direction estimation unit 14. For detecting the face orientation and calculating the screen occupancy rate, a method such as that of Reference 1 below can be used.
〔参考文献１〕新井啓之、伊藤直己、片岡香織、谷口行信、“画像処理による広告効果測定技術−人数計測技術・顔画像技術の応用”、NTT技術ジャーナル 2013.1、vol. 25、pp. 61-64、2013年 [Reference 1] Hiroyuki Arai, Naoki Ito, Kaori Kataoka, Yukinobu Taniguchi, "Advertising Effect Measurement Technology by Image Processing: Application of People Counting Technology and Face Image Technology", NTT Technical Journal 2013.1, vol. 25, pp. 61-64, 2013

ステップＳ１４において、話者方向推定部１４は、画像認識部１３から受信した到来方向毎の画面占有率および顔向き検出結果から、指向性集音の目的とする話者方向を推定する。話者方向の推定方法は決定論的でも確率的でも構わない。例えば、図５に示すように、画面占有率と顔向き検出結果からスコアを算出し、そのスコアが最も高い画像の方向を話者方向として決定する。例えば、図５の例では、正面を向いており画面占有率が高い図５（Ａ）が最もスコアが高く、正面以外を向いており画面占有率が低い図５（Ｄ）が最もスコアが低くなっていることがわかる。話者方向推定部１４は、決定した話者方向を指向性集音部８へ出力する。 In step S14, the speaker direction estimation unit 14 estimates the speaker direction that is the target of directional sound collection from the screen occupancy rate and face orientation detection result for each arrival direction received from the image recognition unit 13. The estimation method may be deterministic or probabilistic. For example, as shown in FIG. 5, a score is calculated from the screen occupancy rate and the face orientation detection result, and the direction of the image with the highest score is determined to be the speaker direction. In the example of FIG. 5, FIG. 5 (A), in which the face is turned to the front and the screen occupancy rate is high, has the highest score, and FIG. 5 (D), in which the face is turned away from the front and the screen occupancy rate is low, has the lowest score. The speaker direction estimation unit 14 outputs the determined speaker direction to the directional sound collecting unit 8.
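One deterministic scoring rule of this kind can be sketched as follows. The publication does not give a formula, so the product of occupancy and a linear "frontalness" term, the yaw-angle representation of face orientation, and the four sample candidates are all assumptions chosen only to reproduce the ordering described for FIG. 5 (A highest, D lowest).

```python
def face_score(occupancy, yaw_deg, max_yaw_deg=90.0):
    """Hypothetical score: screen occupancy (0..1) weighted by how frontal the face is."""
    frontalness = max(0.0, 1.0 - abs(yaw_deg) / max_yaw_deg)
    return occupancy * frontalness


def pick_speaker_direction(candidates):
    """Return the direction label of the candidate with the highest score."""
    best = max(candidates, key=lambda c: face_score(c["occupancy"], c["yaw_deg"]))
    return best["direction"]


# Hypothetical per-direction recognition results.
candidates = [
    {"direction": "A", "occupancy": 0.40, "yaw_deg": 0.0},   # frontal, large face
    {"direction": "B", "occupancy": 0.40, "yaw_deg": 60.0},  # turned away, large face
    {"direction": "C", "occupancy": 0.05, "yaw_deg": 0.0},   # frontal, small face
    {"direction": "D", "occupancy": 0.05, "yaw_deg": 60.0},  # turned away, small face
]
best = pick_speaker_direction(candidates)
```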

ステップＳ８において、指向性集音部８は、M個のマイクロホンが集音したM個の音声信号から、話者方向推定部１４から受け取った話者方向の音を強調して集音する。指向性集音部８は、例えば下記参考文献２に記載された指向性集音を行う。指向性集音部８は、話者方向の音が強調された強調音声を出力する。 In step S8, the directional sound collecting unit 8 emphasizes and collects, from the M audio signals collected by the M microphones, the sound in the speaker direction received from the speaker direction estimation unit 14. The directional sound collecting unit 8 performs, for example, the directional sound collection described in Reference 2 below. The directional sound collecting unit 8 outputs emphasized speech in which the sound in the speaker direction is emphasized.
〔参考文献２〕特開２００９−４４５８８号公報 [Reference 2] Japanese Unexamined Patent Publication No. 2009-44588

［変形例］ [Modification]
第一実施形態の話者方向決定装置１から測定座標補正部１２の処理のみを取り出した独立の変換装置を構成してもよい。変形例の変換装置１００は、図６に示すように、変換部１０を備える。この変換装置１００が、図７に例示する各ステップの処理を行うことにより変形例の変換方法が実現される。 An independent conversion device may be configured by extracting only the processing of the measurement coordinate correction unit 12 from the speaker direction determination device 1 of the first embodiment. As shown in FIG. 6, the conversion device 100 of the modification includes a conversion unit 10. The conversion method of the modification is realized by the conversion device 100 performing the processing of each step illustrated in FIG. 7.

変換装置１００は、マイクロホン方向を入力とし、そのマイクロホン方向をカメラで撮影した画像上の座標系へ変換したカメラ座標を出力する。マイクロホン方向とは、複数のマイクロホンにより収音された音響信号から推定されたマイクロホンアレイを基準とした所望の音源の方向である。カメラ座標とは、カメラにより撮影された画像における座標である。 The conversion device 100 takes the microphone direction as an input, and outputs the camera coordinates obtained by converting the microphone direction into the coordinate system on the image taken by the camera. The microphone direction is a direction of a desired sound source with reference to a microphone array estimated from acoustic signals picked up by a plurality of microphones. The camera coordinates are the coordinates in the image taken by the camera.

ステップＳ１０において、変換部１０は、第一実施形態と同様の変換行列を用いて、入力されたマイクロホン方向（角度情報）をカメラで撮影された画像の座標系へ校正し、そのカメラ座標を変換装置１００の出力として出力する。変換行列は、第一実施形態と同様のものであるため、少なくとも３個のスピーカからなる放音部から発せられ、少なくとも３個のスピーカのうちいずれのスピーカから発せられたか識別できるよう設定された音響信号と、少なくとも３個のスピーカの個々の位置を検知できるよう放音部が撮影された画像と、を関連付けることで得られたものである。 In step S10, the conversion unit 10 uses the same transformation matrix as in the first embodiment to calibrate the input microphone direction (angle information) to the coordinate system of the image captured by the camera, and outputs the resulting camera coordinates as the output of the conversion device 100. Because the transformation matrix is the same as in the first embodiment, it is obtained by associating acoustic signals, which are emitted from a sound emitting unit consisting of at least three speakers and are set so that the speaker that emitted each signal can be identified, with an image of the sound emitting unit captured so that the individual positions of the at least three speakers can be detected.
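Since the first embodiment describes the matrix in the pixel-to-angle direction and states that the opposite direction is obtained by the reverse calculation, the conversion unit 10 can be sketched as the inverse of an affine map. The sketch assumes the matrix has last row (0, 0, 1); the function name and the sample matrix (a 640 × 480 image at 0.1 degrees per pixel, centered on the array axis) are hypothetical.

```python
def angle_to_pixel(K, theta, phi):
    """Invert (theta, phi, 1)^T = K (x, y, 1)^T for an affine K with last row (0, 0, 1)."""
    (a, b, c), (d, e, f) = K[0], K[1]
    det = a * e - b * d
    if det == 0:
        raise ValueError("transformation matrix is not invertible")
    u, v = theta - c, phi - f          # remove the translation part
    x = (e * u - b * v) / det          # apply the inverse of the 2x2 linear part
    y = (-d * u + a * v) / det
    return x, y


# Hypothetical pixel-to-angle matrix; the y axis is flipped because image rows grow downward.
K = [[0.1, 0.0, -32.0],
     [0.0, -0.1, 24.0],
     [0.0, 0.0, 1.0]]
x, y = angle_to_pixel(K, theta=0.0, phi=0.0)  # direction straight ahead
```

In this example the straight-ahead direction lands at the image center, approximately pixel (320, 240).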

変形例の変換装置１００は、例えば、話者方向決定装置１の到来方向推定部１１が出力する到来方向推定結果を受け取って、その到来方向推定結果をカメラで撮影した画像上の座標に校正し、話者方向決定装置１の画像認識部１３へ返却する外部の装置として機能させることが可能である。また、マイクロホンで集音した音声の到来方向をカメラで撮影した画像上の座標に変換するような他の音声処理装置に応用することが可能である。 The conversion device 100 of the modified example receives, for example, the arrival direction estimation result output by the arrival direction estimation unit 11 of the speaker direction determination device 1, and calibrates the arrival direction estimation result to the coordinates on the image taken by the camera. , It is possible to function as an external device to be returned to the image recognition unit 13 of the speaker direction determination device 1. Further, it can be applied to other voice processing devices that convert the arrival direction of the sound collected by the microphone into the coordinates on the image taken by the camera.

［第二実施形態］ [Second Embodiment]
第二実施形態の話者方向決定装置２は、図８に示すように、第一実施形態と同様に到来方向推定部１１と測定座標補正部１２と画像認識部１３と話者方向推定部１４とを備え、さらに音声認識部２１を備える。この話者方向決定装置２が、図９に例示する各ステップの処理を行うことにより第二実施形態の話者方向決定方法が実現される。 As shown in FIG. 8, the speaker direction determination device 2 of the second embodiment includes, as in the first embodiment, an arrival direction estimation unit 11, a measurement coordinate correction unit 12, an image recognition unit 13, and a speaker direction estimation unit 14, and further includes a voice recognition unit 21. The speaker direction determination method of the second embodiment is realized by the speaker direction determination device 2 performing the processing of each step illustrated in FIG. 9.

以下、図９を参照して、第二実施形態の話者方向決定装置２が実行する話者方向決定方法について説明する。 Hereinafter, the speaker direction determination method executed by the speaker direction determination device 2 of the second embodiment will be described with reference to FIG. 9.

ステップＳ１１において、到来方向推定部１１は、第一実施形態と同様に、到来方向を推定し、測定座標補正部１２へ出力する。また同時に、M個のマイクロホンから受信したM個の音声信号を、到来方向毎に分離して音声認識部２１へ出力する。 In step S11, the arrival direction estimation unit 11 estimates the arrival direction as in the first embodiment and outputs it to the measurement coordinate correction unit 12. At the same time, it separates the M audio signals received from the M microphones for each arrival direction and outputs them to the voice recognition unit 21.

ステップＳ１２において、測定座標補正部１２は、第一実施形態と同様に、到来方向推定部１１から受信した到来方向推定結果をカメラと同一の座標系へと校正し、画像認識部１３へ出力する。校正するために用いる変換行列については第一実施形態と同様の方法で求めることができる。 In step S12, as in the first embodiment, the measurement coordinate correction unit 12 calibrates the arrival direction estimation result received from the arrival direction estimation unit 11 to the same coordinate system as the camera and outputs it to the image recognition unit 13. The transformation matrix used for calibration can be obtained by the same method as in the first embodiment.

ステップＳ１３において、画像認識部１３は、第一実施形態と同様に、測定座標補正部１２から受信した到来方向毎に顔向きの検出と画面占有率の算出を行い、その顔向き検出結果と画面占有率とを話者方向推定部１４へ出力する。顔向きの検出方法および画面占有率の算出方法は、第一実施形態と同様に上記参考文献１のような方法が利用可能である。 In step S13, as in the first embodiment, the image recognition unit 13 detects the face orientation and calculates the screen occupancy rate for each arrival direction received from the measurement coordinate correction unit 12, and outputs the face orientation detection result and the screen occupancy rate to the speaker direction estimation unit 14. For detecting the face orientation and calculating the screen occupancy rate, a method such as that of Reference 1 above can be used, as in the first embodiment.

ステップＳ２１において、音声認識部２１は、到来方向推定部１１から受信した到来方向毎に分離した音声信号に対して音声認識を実施し、到来方向毎の音声認識結果を得る。音声認識部２１は、得た音声認識結果を話者方向推定部１４へ出力する。 In step S21, the voice recognition unit 21 performs voice recognition on the voice signals separated for each arrival direction received from the arrival direction estimation unit 11, and obtains the voice recognition result for each arrival direction. The voice recognition unit 21 outputs the obtained voice recognition result to the speaker direction estimation unit 14.

ステップＳ１４において、話者方向推定部１４は、音声認識部２１から受信した到来方向毎の音声認識結果と、画像認識部１３から受信した到来方向毎の顔向き検出結果および画面占有率とに基づいて、指向性集音の目的とする話者方向を推定する。例えば、図１０（Ａ）に示すように、画面占有率が高く、顔向きが正面であり、特定の単語を発話している到来方向のスコアが高くなり、図１０（Ｂ）に示すように、それらの条件に合致しない到来方向のスコアが低くなるように設計することが考えられる。このとき、特定の単語は、対話のシナリオや音声認識のタスクから話者が発話することが想定される単語であり、音声認識結果にこれらの単語が含まれるほど高いスコアが与えられるように設計するとよい。話者方向推定部１４は、決定した話者方向を指向性集音部８へ出力する。 In step S14, the speaker direction estimation unit 14 estimates the speaker direction that is the target of directional sound collection, based on the voice recognition result for each arrival direction received from the voice recognition unit 21 and on the face orientation detection result and screen occupancy rate for each arrival direction received from the image recognition unit 13. For example, the score can be designed so that an arrival direction in which the screen occupancy rate is high, the face is turned to the front, and a specific word is being uttered receives a high score, as shown in FIG. 10 (A), and an arrival direction that does not meet those conditions receives a low score, as shown in FIG. 10 (B). Here, the specific words are words that the speaker is expected to utter given the dialogue scenario or the voice recognition task, and the score is preferably designed to be higher the more of these words the voice recognition result contains. The speaker direction estimation unit 14 outputs the determined speaker direction to the directional sound collecting unit 8.
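A score that also rewards expected words can be sketched as below. The multiplicative combination, the whitespace tokenization (which would not work for unsegmented Japanese transcripts), and all sample values are assumptions for illustration; the publication only states that more expected words should give a higher score.

```python
def combined_score(occupancy, yaw_deg, transcript, expected_words, max_yaw_deg=90.0):
    """Hypothetical score: visual score boosted by the number of expected words heard."""
    frontalness = max(0.0, 1.0 - abs(yaw_deg) / max_yaw_deg)
    tokens = set(transcript.lower().split())
    hits = len(tokens & expected_words)
    return occupancy * frontalness * (1 + hits)


def pick_speaker_direction(candidates, expected_words):
    """Return the arrival direction (in degrees) with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: combined_score(
            c["occupancy"], c["yaw_deg"], c["transcript"], expected_words
        ),
    )
    return best["direction"]


# Hypothetical recognition results for two arrival directions.
expected = {"hello", "robot", "weather"}
candidates = [
    {"direction": 30, "occupancy": 0.40, "yaw_deg": 0.0,
     "transcript": "hello robot what is the weather"},
    {"direction": 120, "occupancy": 0.10, "yaw_deg": 70.0,
     "transcript": "did you see the game last night"},
]
best = pick_speaker_direction(candidates, expected)
```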

この発明のポイントは、主に、１．音情報を利用した到来方向推定結果を基準に、画像情報及び言語情報を利用して話者方向を決定すること、２．画像情報では顔認識による画面占有率及び顔向き検出結果を利用し、言語情報では特定単語の発話検知結果を利用すること、の二点である。上記の点により、音のみでは捉えきれない目的とする話者方向を、音による到来方向推定の後段に画像情報や言語情報を用いた話者方向推定を行うことで、従来の方向推定で誤検知となっていた状況を回避でき、話者方向推定結果の頑健性が向上する。音声認識を利用した対話ロボットを利用する際に、周囲の話者などの雑音源が存在する環境でも、対話対象となる話者の発話のみを検出することができるため、利用シーンの拡大及びユーザ利便性が向上する。 The main points of the present invention are twofold: (1) the speaker direction is determined using image information and linguistic information, with the arrival direction estimation result based on sound information as the reference; and (2) the image information uses the screen occupancy rate and face orientation detection result obtained by face recognition, and the linguistic information uses the result of detecting utterances of specific words. Because speaker direction estimation using image and linguistic information is performed after the sound-based arrival direction estimation, a target speaker direction that cannot be captured by sound alone can be determined, situations that caused false detection in conventional direction estimation can be avoided, and the robustness of the speaker direction estimation result improves. When a dialogue robot that uses voice recognition is used, only the utterances of the speaker who is the dialogue partner can be detected even in an environment with noise sources such as surrounding speakers, which broadens the usage scenarios and improves user convenience.

［第三実施形態］ [Third Embodiment]
上記の実施形態では、予め用意された変換行列Kを用いて、音の到来方向をカメラの座標に変換していた。第三実施形態では変換行列Kを取得するために最適な校正用スピーカの配置を求める最適配置取得装置を説明する。校正用スピーカの配置を最適化することにより、カメラ側にとっては一般的にレンズ歪みや収差、交差ずれなどの影響を軽減することができるという効果がある。マイクロホン側にとっては各マイクロホンの感度誤差などの影響を軽減するために可能な範囲で多様な位置と角度に設置することが望ましいため、そうなるような配置を最適配置として求める。 In the above embodiments, the arrival direction of the sound is converted into camera coordinates using a transformation matrix K prepared in advance. The third embodiment describes an optimum arrangement acquisition device that obtains the optimum arrangement of the calibration speakers for acquiring the transformation matrix K. Optimizing the arrangement of the calibration speakers generally reduces, on the camera side, the influence of lens distortion, aberration, crossing deviation, and the like. On the microphone side, it is desirable to place the speakers at positions and angles that are as varied as possible in order to reduce the influence of sensitivity errors of the individual microphones, so an arrangement that satisfies this is obtained as the optimum arrangement.

第三実施形態の最適配置取得装置３は、図１１に示すように、M（≧2）個のマイクロホンが集音したM個の音声信号と少なくともK（≧1）個のカメラが撮像したK個の画像信号とを入力とし、その音声信号と画像信号とから計算した校正用スピーカの最適配置を画面に表示する。最適配置取得装置３は、第一角度差取得部３１と第二角度差取得部３２と距離取得部３３と最適配置計算部３４と最適配置表示部３５とを備える。この最適配置取得装置３が、後述の各ステップの処理を行うことにより第三実施形態の最適配置取得方法が実現される。 As shown in FIG. 11, the optimum arrangement acquisition device 3 of the third embodiment takes as input M audio signals collected by M (≧ 2) microphones and K image signals captured by at least K (≧ 1) cameras, and displays on a screen the optimum arrangement of the calibration speakers calculated from those audio signals and image signals. The optimum arrangement acquisition device 3 includes a first angle difference acquisition unit 31, a second angle difference acquisition unit 32, a distance acquisition unit 33, an optimum arrangement calculation unit 34, and an optimum arrangement display unit 35. The optimum arrangement acquisition method of the third embodiment is realized by the optimum arrangement acquisition device 3 performing the processing of each step described later.

第一角度差取得部３１は、M個のマイクロホンが集音したM個の音声信号に基づいて、M個のマイクロホンからなるマイクロホンアレイから各校正用スピーカを見たときの角度差を求める。第一角度差取得部３１は、求めた校正用スピーカの角度差を最適配置計算部３４へ出力する。マイクロホンアレイから各校正用スピーカを見たときの角度差が既知であれば、第一角度差取得部３１を備える必要はなく、最適配置取得装置３に既知の角度差が入力されるように構成すればよい。 The first angle difference acquisition unit 31 obtains, based on the M audio signals collected by the M microphones, the angle differences between the calibration speakers as seen from the microphone array composed of the M microphones. The first angle difference acquisition unit 31 outputs the obtained angle differences of the calibration speakers to the optimum arrangement calculation unit 34. If the angle differences between the calibration speakers as seen from the microphone array are known, the first angle difference acquisition unit 31 is not needed, and the optimum arrangement acquisition device 3 may be configured so that the known angle differences are input to it.

第二角度差取得部３２は、K個のカメラが撮像したK個の画像信号に基づいて、カメラから各校正用スピーカを見たときの角度差を求める。第二角度差取得部３２は、求めた校正用スピーカの角度差を最適配置計算部３４へ出力する。カメラから各校正用スピーカを見たときの角度差が既知であれば、第二角度差取得部３２を備える必要はなく、最適配置取得装置３に既知の角度差が入力されるように構成すればよい。 The second angle difference acquisition unit 32 obtains, based on the K image signals captured by the K cameras, the angle differences between the calibration speakers as seen from the camera. The second angle difference acquisition unit 32 outputs the obtained angle differences of the calibration speakers to the optimum arrangement calculation unit 34. If the angle differences between the calibration speakers as seen from the camera are known, the second angle difference acquisition unit 32 is not needed, and the optimum arrangement acquisition device 3 may be configured so that the known angle differences are input to it.

The distance acquisition unit 33 obtains the distances between the calibration speakers from the K image signals captured by the K cameras. The distance acquisition unit 33 outputs the obtained distances to the optimum placement calculation unit 34. If the distances between the calibration speakers are already known, the distance acquisition unit 33 is unnecessary, and the optimum placement acquisition device 3 may simply be configured to receive the known distances as input.

The optimum placement calculation unit 34 calculates the optimum placement of the calibration speakers based on the angle differences of the calibration speakers as seen from the microphone array, the angle differences of the calibration speakers as seen from the camera, and the distances between the calibration speakers. The optimum placement calculation unit 34 outputs the calculated optimum placement to the optimum placement display unit 35.

With reference to FIG. 12, the method by which the optimum placement calculation unit 34 calculates the optimum placement of the calibration speakers is described. In the example of FIG. 12, there are three calibration speakers, and their optimum placement is calculated with reference to a microphone array composed of three microphones and one camera. In the figure, the distances between the calibration speakers are denoted A-1 to A-3, the angle differences of the calibration speakers as seen from the camera are denoted B-1 to B-3, and the angle differences of the calibration speakers as seen from the microphone array are denoted C-1 to C-3. The optimum placement of the calibration speakers is obtained by maximizing the distances A-1 to A-3 between the calibration speakers, the angle differences B-1 to B-3 from the camera, and the angle differences C-1 to C-3 from the microphone array. As for the angle difference, taking B-1 and B-2 as an example and assuming that B-1 and B-2 are defined as vectors, the angle difference between them is arg(B-1) - arg(B-2).
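The maximization described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the text does not say how the three quantities are combined, so the sketch assumes a weighted sum over all speaker pairs, works in a two-dimensional plane, and chooses among a finite list of candidate placements. The function names (`bearing`, `angle_difference`, `placement_score`, `best_placement`) and the weights are hypothetical.

```python
import math
from itertools import combinations


def bearing(origin, point):
    """Direction (radians) of `point` as seen from `origin`, i.e. arg of the vector."""
    return math.atan2(point[1] - origin[1], point[0] - origin[0])


def angle_difference(origin, p, q):
    """Angular separation of p and q as seen from origin, wrapped into [0, pi]."""
    d = bearing(origin, p) - bearing(origin, q)
    d = (d + math.pi) % (2 * math.pi) - math.pi  # wrap into [-pi, pi)
    return abs(d)


def placement_score(speakers, camera, mic_array, w_dist=1.0, w_cam=1.0, w_mic=1.0):
    """Assumed objective: weighted sum of pairwise distances (A),
    angle differences from the camera (B), and from the microphone array (C)."""
    score = 0.0
    for p, q in combinations(speakers, 2):
        score += w_dist * math.dist(p, q)
        score += w_cam * angle_difference(camera, p, q)
        score += w_mic * angle_difference(mic_array, p, q)
    return score


def best_placement(candidates, camera, mic_array):
    """Return the candidate placement (a list of speaker positions) with the highest score."""
    return max(candidates, key=lambda s: placement_score(s, camera, mic_array))
```

Because distances and angles have different units, the weights would in practice have to be chosen to balance the three terms; the equal weights here are only a placeholder.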

The optimum placement display unit 35 outputs the optimum placement of the calibration speakers received from the optimum placement calculation unit 34 to an output unit (not shown) such as a screen. FIG. 13 is an example in which the optimum placement display unit 35 displays the optimum placement of each calibration speaker on the screen. FIG. 13 shows a screen on which the positions of the calibration speakers as actually installed (solid circles) and the optimum positions of the calibration speakers calculated by the optimum placement calculation unit 34 (dotted, shaded circles) are displayed over an image, captured by the camera, of the space in which the calibration speakers are installed. The actual and optimum positions of the calibration speakers are displayed by plotting each position in a three-dimensional space on the screen, for example with left-right as the x-axis, up-down as the y-axis, and depth as the z-axis. A Cartesian coordinate system is used in this example, but any coordinate system suited to the space in which the calibration speakers are placed, such as a cylindrical or spherical coordinate system, may be used for the display.
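For illustration, positions held in the Cartesian convention above (x left-right, y up-down, z depth) can be re-expressed in cylindrical or spherical form as sketched below. The choice of the vertical y-axis as the cylinder axis, the azimuth measured from the depth axis, and the function names are assumptions made for this sketch; the text does not fix them.

```python
import math


def to_cylindrical(x, y, z):
    """(x, y, z) -> (radius, azimuth, height), with the vertical y-axis as cylinder axis.
    Azimuth is 0 straight ahead along +z and grows toward +x."""
    radius = math.hypot(x, z)
    azimuth = math.atan2(x, z)
    return radius, azimuth, y


def to_spherical(x, y, z):
    """(x, y, z) -> (range, azimuth, elevation).
    Elevation is measured upward from the horizontal x-z plane."""
    rng = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(x, z)
    elevation = math.atan2(y, math.hypot(x, z))
    return rng, azimuth, elevation
```

A spherical form of this kind lines up naturally with a direction estimate expressed as two angles, which may make it convenient when comparing speaker positions with microphone-array directions.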

Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments, and it goes without saying that design changes made as appropriate without departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing contents of the functions that each device should have are described by a program. Executing this program on the computer realizes the various processing functions of each device on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of these processing contents may instead be realized in hardware.

1, 2, 9 Speaker direction determination device
3 Optimum placement acquisition device
8 Directional sound collecting unit
11 Arrival direction estimation unit
12 Measurement coordinate correction unit
13 Image recognition unit
14 Speaker direction estimation unit
21 Voice recognition unit
31 First angle difference acquisition unit
32 Second angle difference acquisition unit
33 Distance acquisition unit
34 Optimum placement calculation unit
35 Optimum placement display unit
91 Direction-specific preprocessing unit
92 Direction-specific power calculation unit
93 Arrival direction selection unit

Claims

A control device that controls a robot surrounded by a plurality of people so that the robot reacts only to a speaker who is trying to interact with it, the control device comprising:
a conversion unit that converts, according to a conversion rule, a microphone direction, which is the direction of a desired sound source relative to a microphone array composed of a plurality of microphones as estimated from an acoustic signal picked up by the microphone array, into camera coordinates, which are coordinates in an image taken by a camera; and
a directional sound collecting unit that recognizes the image and collects sound while emphasizing the direction in the image that has the highest score obtained based on screen occupancy and face orientation,
wherein the conversion rule is obtained by associating
an acoustic signal that is emitted from a sound emitting unit consisting of at least three speakers and is set so that it can be identified which of the at least three speakers emitted it, with
an image of the sound emitting unit taken so that the individual positions of the at least three speakers can be detected.

The control device according to claim 1,
wherein the at least three speakers are used to convert the microphone direction into the camera coordinates.

The control device according to claim 1 or 2,
wherein the conversion rule is obtained by associating a two-dimensional angle spectrum of the arrival direction estimated from the acoustic signal emitted by each speaker with the coordinates of that speaker on the image taken by the camera.

The control device according to any one of claims 1 to 3,
wherein, with i being the number of a speaker, (θ_i, φ_i) being a point on the two-dimensional angle spectrum of the arrival direction estimated from the acoustic signal emitted by the i-th speaker, (x_i, y_i) being the coordinates of the i-th speaker on the image taken by the camera, and a, b, c, d, e, and f being conversion parameters with six degrees of freedom obtained from the pairs of the points on the two-dimensional angle spectrum and the coordinates,
the conversion unit converts the microphone direction into the camera coordinates by calculating the following equation.
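The equation referred to in the preceding claim does not survive in this text. Purely as an illustration of how six parameters a to f can be obtained from speaker correspondences, the sketch below assumes an affine form, x = a·θ + b·φ + c and y = d·θ + e·φ + f, and solves it exactly from three (θ_i, φ_i) to (x_i, y_i) pairs. The affine form, the assignment of letters to terms, and the function names (`fit_conversion`, `mic_to_camera`) are assumptions, not the claimed equation.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve the 3x3 linear system m @ u = rhs by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("speaker directions are collinear; parameters are not determined")
    solution = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        solution.append(det3(mc) / d)
    return solution


def fit_conversion(angles, pixels):
    """Fit (a, b, c, d, e, f) of the assumed affine map from three
    (theta, phi) points and the matching (x, y) image coordinates."""
    m = [[theta, phi, 1.0] for theta, phi in angles]
    a, b, c = solve3(m, [x for x, _ in pixels])
    d, e, f = solve3(m, [y for _, y in pixels])
    return a, b, c, d, e, f


def mic_to_camera(params, theta, phi):
    """Apply the assumed affine map to one microphone direction."""
    a, b, c, d, e, f = params
    return a * theta + b * phi + c, d * theta + e * phi + f
```

Three speakers give exactly six equations for the six unknowns, which is consistent with the requirement of at least three speakers. Under this assumed form the fit fails when the three directions are collinear, so well-separated speakers give a better-conditioned system.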

A conversion device that converts a microphone direction, which is the direction of a desired sound source relative to a microphone array composed of a plurality of microphones as estimated from an acoustic signal picked up by the microphone array, into camera coordinates, which are coordinates in an image taken by a camera, the conversion device comprising:
a conversion unit that converts the microphone direction into the camera coordinates according to a conversion rule; and
an optimum placement calculation unit that obtains, as an optimum placement, the position of each speaker that maximizes the distances between at least three speakers, the angle difference of each speaker as seen from the camera, and the angle difference of each speaker as seen from the microphone array,
wherein the conversion rule is obtained by associating
an acoustic signal that is emitted from a sound emitting unit consisting of the at least three speakers and is set so that it can be identified which of the at least three speakers emitted it, with
an image of the sound emitting unit taken so that the individual positions of the at least three speakers can be detected.

The conversion device according to claim 5,
further comprising an optimum placement display unit that displays the actual placement of each speaker and the optimum placement of each speaker superimposed on the image taken by the camera.

A control method executed by a control device that controls a robot surrounded by a plurality of people so that the robot reacts only to a speaker who is trying to interact with it, wherein:
a conversion unit converts, according to a conversion rule, a microphone direction, which is the direction of a desired sound source relative to a microphone array composed of a plurality of microphones as estimated from an acoustic signal picked up by the microphone array, into camera coordinates, which are coordinates in an image taken by a camera;
a directional sound collecting unit recognizes the image and collects sound while emphasizing the direction in the image that has the highest score obtained based on screen occupancy and face orientation; and
the conversion rule is obtained by associating
an acoustic signal that is emitted from a sound emitting unit consisting of at least three speakers and is set so that it can be identified which of the at least three speakers emitted it, with
an image of the sound emitting unit taken so that the individual positions of the at least three speakers can be detected.

A conversion method executed by a conversion device that converts a microphone direction, which is the direction of a desired sound source relative to a microphone array composed of a plurality of microphones as estimated from an acoustic signal picked up by the microphone array, into camera coordinates, which are coordinates in an image taken by a camera, wherein:
a conversion unit converts the microphone direction into the camera coordinates according to a conversion rule;
an optimum placement calculation unit obtains, as an optimum placement, the position of each speaker that maximizes the distances between at least three speakers, the angle difference of each speaker as seen from the camera, and the angle difference of each speaker as seen from the microphone array; and
the conversion rule is obtained by associating
an acoustic signal that is emitted from a sound emitting unit consisting of the at least three speakers and is set so that it can be identified which of the at least three speakers emitted it, with
an image of the sound emitting unit taken so that the individual positions of the at least three speakers can be detected.

A program for operating a computer as the control device according to any one of claims 1 to 4.

A program for operating a computer as the conversion device according to claim 5 or 6.