JP2023053670A - Information processing device, information processing method, and program

Info

Publication number: JP2023053670A
Application number: JP2021162852A
Authority: JP
Inventors: 裕高瀬; Yutaka Takase; 哲哉皆川; Tetsuya Minagawa
Original assignee: Sony Group Corp
Current assignee: Sony Group Corp
Priority date: 2021-10-01
Filing date: 2021-10-01
Publication date: 2023-04-13
Also published as: WO2023054047A1

Abstract

To provide an information processing device, an information processing method, and a program capable of collecting sound emitted by a sound source with high quality. SOLUTION: An information processing device includes an information acquisition unit and a sound collection control unit. The information acquisition unit acquires sound source information that indicates a position of a sound source and a direction in which the sound source emits sound. From multiple sound collection devices arranged around the sound source and having a configurable sound collection direction, the sound collection control unit selects, on the basis of the sound source information, at least one target device to be used to collect sound emitted by the sound source. SELECTED DRAWING: Figure 4

Description

The present technology relates to an information processing device, an information processing method, and a program applicable to a sound collection system or the like.

In recent years, techniques for separating sound sources and collecting sound have been developed. For example, by selectively collecting sound emitted from a specific direction, a desired sound can be separated from a mixture of sounds. One known method of collecting sound from a designated direction is beamforming, which processes the outputs of a plurality of microphones arranged in an array to separate a sound source in a specific direction.

Patent Literature 1 describes a speech recognition system using beamforming technology. In this system, a human body is detected from an image captured around an array microphone. The direction in which the human body is present as viewed from the array microphone is set as the sound collection direction, and the direction in which no human body is present is set as the noise direction. Beamforming processing is then executed to separate the sound source in the sound collection direction (target sound) and the sound source in the noise direction (noise sound) from the output of the array microphone. Canceling the noise sound from the target sound enables highly accurate noise cancellation (paragraphs [0017], [0018], [0023], and [0024] of the specification, FIG. 3, and the like of Patent Literature 1).

Patent Literature 1: JP 2020-3724 A

Even if the noise sound can be canceled from the target sound as in Patent Literature 1, the desired sound quality may not be obtained depending on the direction in which the target sound is emitted. There is therefore a demand for a technique for collecting the target sound itself with higher quality.

In view of the above circumstances, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of collecting sound emitted by a sound source with high quality.

To achieve the above object, an information processing device according to an embodiment of the present technology includes an information acquisition unit and a sound collection control unit.
The information acquisition unit acquires sound source information indicating a position of a sound source and a direction in which the sound source emits sound.
Based on the sound source information, the sound collection control unit selects, from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection direction can be set, at least one target device to be used to collect the sound emitted by the sound source.

In this information processing device, at least one target device for collecting the sound of the sound source is selected from a plurality of sound collection devices arranged around the sound source. Each sound collection device is a device whose sound collection direction can be set, and sound source information indicating the position of the sound source and the direction in which the sound source emits sound is used to select the target device. This makes it possible to use a sound collection device suited to, for example, the position of the sound source and the direction in which the sound is emitted, so that the sound emitted by the sound source can be collected with high quality.

The sound collection control unit may set a sound collection direction of the target device based on the sound source information.

The sound collection control unit may set a direction from the target device toward the sound source as the sound collection direction of the target device.

The sound collection control unit may determine, with reference to the direction in which the sound source emits sound, a sound collection device capable of collecting the direct sound emitted by the sound source, and select that sound collection device as the target device.
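As a rough illustration of the two ideas above (pointing the beam from the device toward the source, and preferring devices that can receive direct sound), the following Python sketch works in 2D floor coordinates. The names `Device`, `beam_azimuth_deg`, and `select_target`, the 90° frontal half-angle used as the test for "can collect direct sound", and the ranking by off-axis angle and then distance are all assumptions made for illustration; the text above does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical record for one sound collection device on the floor plane."""
    name: str
    x: float
    y: float


def bearing_deg(from_xy, to_xy):
    """Angle of the vector from_xy -> to_xy, in degrees in [0, 360)."""
    dx, dy = to_xy[0] - from_xy[0], to_xy[1] - from_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def angle_diff_deg(a, b):
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def beam_azimuth_deg(device, source_xy):
    """Sound collection direction: from the device toward the sound source."""
    return bearing_deg((device.x, device.y), source_xy)


def select_target(devices, source_xy, emit_deg, frontal_half_angle_deg=90.0):
    """Pick a device lying in front of the source (assumed criterion for
    direct sound), preferring small off-axis angle, then short distance."""
    candidates = []
    for dev in devices:
        off_axis = angle_diff_deg(bearing_deg(source_xy, (dev.x, dev.y)), emit_deg)
        if off_axis < frontal_half_angle_deg:
            dist = math.hypot(dev.x - source_xy[0], dev.y - source_xy[1])
            candidates.append((off_axis, dist, dev))
    if not candidates:
        return None  # no device is in front of the source
    return min(candidates, key=lambda c: (c[0], c[1]))[2]
```

For a source at (2, 0) emitting toward 0° with devices at (0, 0) and (4, 0), only the device at (4, 0) lies in front of the source, and its beam would be steered back toward the source.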

The plurality of sound collection devices may be configured such that the sound collection direction of each can be set within an allocation range assigned according to its placement. In this case, the sound collection control unit may select, as the target device, the sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range.

When there is no sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range, the sound collection control unit may select, as the target device, the sound collection device that is capable of collecting sound along the direction in which the sound source emits sound and that is closest to the sound source.

The information acquisition unit may acquire the sound source information for each of a plurality of sound sources. In this case, the sound collection control unit may select a target device for each of the plurality of sound sources based on the sound source information of that sound source.

The sound collection control unit may select, as the target device, a sound collection device whose sound collection direction can be set so as to collect the direct sound emitted by the sound source being processed without collecting the direct sound emitted by any other sound source.

The information processing device may further include a sound collection processing unit that generates sound data representing the sound emitted by the sound source based on the output of the at least one target device.

The plurality of sound collection devices may include a plurality of candidate devices whose sound collection directions are set in advance. In this case, the sound collection control unit may select the target device from the plurality of candidate devices. The sound collection processing unit may keep candidate devices that are not selected as the target device on standby in a sound-collecting state.

The sound collection control unit may select a plurality of target devices from the plurality of sound collection devices for a single sound source.

The sound collection processing unit may generate the sound data of the sound source by synthesizing the data collected by the plurality of target devices.

The sound source may be a speaker. In this case, the direction in which the sound source emits sound may be the speaking direction of the speaker.

The information acquisition unit may estimate the speaking direction of the speaker by executing bone detection on the speaker based on image data in which the speaker is captured.

The information acquisition unit may detect a gesture of the speaker.
The sound collection processing unit may control the sound collection processing for collecting the voice of the speaker according to the gesture of the speaker.

The sound collection processing unit may execute the sound collection processing for the speaker with priority when a gesture of the speaker raising a hand is detected, and may stop the sound collection processing for the speaker when a gesture of the speaker covering the mouth with a hand is detected.
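The gesture rule above amounts to a small mapping from a detected gesture to a sound collection mode. A minimal sketch follows; the enum names `Gesture` and `CollectionMode` and the function `collection_mode_for` are hypothetical and serve only to make the rule concrete.

```python
from enum import Enum


class Gesture(Enum):
    """Hypothetical labels for gestures reported by the information acquisition unit."""
    NONE = "none"
    HAND_RAISED = "hand_raised"
    MOUTH_COVERED = "mouth_covered"


class CollectionMode(Enum):
    """Hypothetical states of the sound collection processing for one speaker."""
    NORMAL = "normal"
    PRIORITIZED = "prioritized"
    STOPPED = "stopped"


def collection_mode_for(gesture):
    """Raised hand -> prioritize this speaker; covered mouth -> stop collecting."""
    if gesture is Gesture.HAND_RAISED:
        return CollectionMode.PRIORITIZED
    if gesture is Gesture.MOUTH_COVERED:
        return CollectionMode.STOPPED
    return CollectionMode.NORMAL
```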

The sound collection processing unit may separate the voice of the speaker and the motion sound of the speaker from the data collected by the target device.

The sound collection device may be a microphone array in which a plurality of microphones are arranged. In this case, the sound collection direction may be the direction of a beam set by beamforming processing for the microphone array.

An information processing method according to an embodiment of the present technology is an information processing method executed by a computer system, and includes acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound.
Based on the sound source information, at least one target device to be used to collect the sound emitted by the sound source is selected from a plurality of sound collection devices that are arranged around the sound source and whose sound collection direction can be set.

A program according to an embodiment of the present technology causes a computer system to execute the following steps.
A step of acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound.
A step of selecting, based on the sound source information, at least one target device to be used to collect the sound emitted by the sound source from a plurality of sound collection devices that are arranged around the sound source and whose sound collection direction can be set.

FIG. 1 is a block diagram showing a configuration example of a sound collection system according to an embodiment of the present technology.
FIG. 2 is a schematic diagram showing a configuration example of a BF microphone.
FIG. 3 is a schematic diagram showing an example of a beam set for a BF microphone.
FIG. 4 is a schematic diagram showing a basic sound collection operation of the sound collection system.
FIG. 5 is a flowchart showing an operation example of the sound collection system.
FIG. 6 is a schematic diagram showing an arrangement example of BF microphones.
FIG. 7 is a schematic diagram showing an example of the speaking direction of a speaker.
FIG. 8 is a schematic diagram for explaining a sound collection operation for a plurality of speakers.
FIG. 9 is a schematic diagram showing an example of a sound collection operation using a plurality of BF microphones.
FIG. 10 is a schematic diagram showing an example of a sound collection operation when a speaker moves.
FIG. 11 is a schematic diagram for explaining voice synthesis processing.
FIG. 12 is a schematic diagram showing an example of a sound collection operation when a plurality of speakers move.
FIG. 13 is a schematic diagram showing an example of a sound collection operation that anticipates the speaking direction of a speaker.
FIG. 14 is a schematic diagram showing an example of a sound collection operation in response to a gesture.
FIG. 15 is a schematic diagram showing an example of a sound collection operation for collecting voice and motion sound.

Hereinafter, embodiments according to the present technology will be described with reference to the drawings.

[Configuration of sound collection system]
FIG. 1 is a block diagram showing a configuration example of a sound collection system according to an embodiment of the present technology. The sound collection system 100 is a system that collects the voice 5 of a speaker 1 located in a space targeted for sound collection and generates voice data 6 of the speaker 1. In this embodiment, the speaker 1 is an example of a sound source, and the voice 5 of the speaker 1 is the sound to be collected (target sound).
As shown in FIG. 1, the sound collection system 100 has a plurality of BF microphones M, a detection camera 10, a storage unit 11, and a controller 20.

The plurality of BF microphones M are sound collection devices, each capable of collecting sound from a specific direction using beamforming (BF) technology.
FIG. 1 schematically illustrates four BF microphones M1 to M4 as the plurality of BF microphones M. Note that the number of BF microphones M is not limited.
Here, beamforming is a technique of setting a beam extending in a specific direction from the BF microphone M and collecting sound waves arriving along the beam with high sensitivity. In this case, the direction in which the beam is set is the sound collection direction of the BF microphone M.
Each BF microphone M is arranged at a predetermined position set in the space where the speaker 1 is present. An arrangement example of the BF microphones M in the sound collection system 100 will be described later in detail.
In this way, each BF microphone M is a device that is arranged around the speaker 1, who is the sound source, and whose sound collection direction can be set. In this embodiment, the BF microphone M corresponds to a sound collection device.

FIG. 2 is a schematic diagram showing a configuration example of the BF microphone M. FIG. 3 is a schematic diagram showing an example of the beam 7 set for the BF microphone M.
The BF microphone M shown in FIG. 2 has a flat substrate 15 and a plurality of microphones 16 arranged on the substrate 15. That is, the BF microphone M is a microphone array in which a plurality of microphones 16 are arranged.
FIG. 2A is a plan view of the BF microphone M seen from a direction orthogonal to the substrate 15, and FIG. 2B is a side view of the BF microphone M seen from a direction parallel to the substrate 15.

The substrate 15 is a plate-shaped member that is circular in plan view, and has a first surface 17a and a second surface 17b opposite to the first surface 17a. The first surface 17a is the surface on which the plurality of microphones 16 are arranged. FIG. 2A is a plan view of the first surface 17a of the BF microphone M. In FIG. 2B, the upper surface of the substrate 15 in the drawing is the first surface 17a, and the lower surface of the substrate 15 in the drawing is the second surface 17b.
The plurality of microphones 16 are elements that generate electrical signals (sound signals) corresponding to sound waves. Each microphone 16 is configured as an omnidirectional microphone and detects sound waves with substantially constant sensitivity regardless of their direction of arrival. As the microphone 16, for example, a dynamic microphone or a condenser microphone is used.

In the example shown in FIG. 2B, each microphone 16 is arranged with its sound receiving portion, which receives sound waves, facing away from the substrate 15. In this case, the first surface 17a side is the sound receiving side of the BF microphone M. In this configuration, for example, a cover or the like for protecting each microphone 16 may be provided on the first surface 17a side.
The configuration is not limited to this, and the BF microphone M may be configured such that the second surface 17b side is the sound receiving side of the BF microphone M. In this case, a microphone hole penetrating from the first surface 17a to the second surface 17b is provided at the position of each microphone 16 on the substrate 15, and each microphone 16 is arranged with its sound receiving portion facing the microphone hole.

As shown in FIG. 2A, the BF microphone M is provided with eight microphones 16a to 16h. The microphones 16a to 16h are arranged so as to be rotationally symmetric about the center of the substrate 15 (substrate center C) on the first surface 17a. Accordingly, the angle (angular interval) between the two line segments connecting the substrate center C to two mutually adjacent microphones 16 is 45°.
In the following, the azimuth angle φ of the microphone 16a as viewed from the substrate center C is taken to be 0°. The azimuth angle φ is assumed to increase in the clockwise direction in FIG. 2A (the direction of rotation that keeps the substrate center C on the right-hand side). The azimuth angles at which the microphones 16a to 16h are arranged are therefore 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°.

The BF microphone M is typically used with the substrate 15 (the first surface 17a or the second surface 17b) arranged horizontally. The azimuth angles of the microphones 16a to 16h can therefore be treated as azimuth angles in the horizontal plane. Note that the posture of the BF microphone M is not limited. For example, the BF microphone M may be arranged tilted with respect to the horizontal plane.

The BF microphone M outputs the sound signals generated by the microphones 16a to 16h. That is, the multi-channel sound signals generated by the plurality of microphones 16a to 16h are the output of the BF microphone M.
Beamforming processing is executed on these sound signals by the controller 20 (sound collection processing unit 23), which will be described later.
In the beamforming processing, a beam 7 directed in a specific direction is set, and processing for collecting sound waves arriving along the beam 7 is performed. For example, the propagation delay (difference in arrival time) of sound waves arriving along the beam 7 at each of the microphones 16a to 16h is corrected. The delay-corrected signals are then added as appropriate to generate a signal in which the sound waves arriving along the beam 7 are emphasized. This makes it possible to selectively collect sound waves arriving along the beam 7.
Thus, the sound collection direction 3 of the BF microphone M is the direction of the beam 7 set in the beamforming processing for the BF microphone M.
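The delay correction and addition described above correspond to what is commonly called delay-and-sum beamforming. The sketch below illustrates it for eight microphones spaced at 45° on a circle, as in FIG. 2. It assumes a far-field plane wave, whole-sample delays, and a speed of sound of 343 m/s; the array radius, sampling rate, and function names are hypothetical, and a practical implementation would likely use fractional delays and per-channel weighting.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def mic_azimuths_deg(n_mics=8):
    """Azimuths of equally spaced microphones: 0, 45, ..., 315 degrees for 8."""
    return [k * 360.0 / n_mics for k in range(n_mics)]


def arrival_delays(beam_deg, radius_m, n_mics=8, c=SPEED_OF_SOUND):
    """Far-field arrival time (s) at each microphone relative to the array center.

    A microphone on the side facing the beam direction receives the wave
    earlier than the center, so its value is negative.
    """
    return [-(radius_m / c) * math.cos(math.radians(beam_deg - az))
            for az in mic_azimuths_deg(n_mics)]


def delay_and_sum(channels, beam_deg, radius_m, fs):
    """Align all channels for a wave arriving along beam_deg, then average."""
    delays = arrival_delays(beam_deg, radius_m, len(channels))
    latest = max(delays)
    # Hold each channel back so that it lines up with the latest arrival.
    shifts = [round((latest - d) * fs) for d in delays]
    n = len(channels[0])
    out = [0.0] * n
    for ch, s in zip(channels, shifts):
        for i in range(s, n):
            out[i] += ch[i - s]
    return [v / len(channels) for v in out]
```

Because only the difference between the beam azimuth and each microphone azimuth enters the delay, the result does not depend on whether azimuths are counted clockwise or counterclockwise, as long as both use the same convention. Steering toward the true arrival direction stacks the channels coherently, while steering elsewhere spreads them out in time.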

FIG. 3 schematically shows the range of the beam 7 set for the BF microphone M as a gray region. In the BF microphone M, the range of the beam 7 is the fan-shaped range that spreads from the substrate center C and is centered on the sound collection direction 3. The range of the beam 7 is defined by a sound collection azimuth angle A and a beam width β.

The sound collection azimuth angle A is the azimuth angle representing the center of the sound collection direction 3. For example, if the BF microphone M is regarded as a microphone having directivity in the sound collection direction 3, the sound collection azimuth angle A corresponds to the orientation of that directional microphone.
In the BF microphone M, arranging the eight microphones 16a to 16h in rotational symmetry makes it possible to set the sound collection azimuth angle A over the full 360°, that is, to form a beam in any direction over 360°. The BF microphone M shown in FIG. 2 can therefore be regarded as a beamforming microphone array that supports sound source azimuths over 360°.

The beam width β is an angle representing the directivity of the BF microphone M about the sound collection azimuth angle A. The smaller the beam width β, the higher the directivity. The larger the beam width β, the wider the range over which sound can be collected. In this embodiment, the beam width β is set to a constant value.
Note that the beam width β can also be made variable by increasing the scale of the BF microphone M, such as the number of microphones 16 and the diameter of the microphone array. In this case, for example, the beam width β may be changed according to the situation of the speaker 1 or the scene.

In this embodiment, the sound collection azimuth angle A is set so as to continuously follow the speaker 1, based on information on the position of the speaker 1 detected using an external sensor (the detection camera 10). Controlling the azimuth range of the beam 7 to A±β for the speaker 1 targeted for sound collection makes it possible to achieve high-quality collection of the voice 5 of the speaker 1, which is the target sound.
A method for setting the sound collection azimuth angle A will be described later in detail.
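Whether a given source azimuth falls inside the beam range A±β is easy to get wrong near the 0°/360° boundary. The following sketch shows one way to test it with wrap-around handled explicitly. It takes the A±β range literally as written above, and the function names are hypothetical.

```python
def angular_offset_deg(a_deg, b_deg):
    """Signed smallest rotation from b to a, in degrees in (-180, 180]."""
    d = (a_deg - b_deg) % 360.0
    return d - 360.0 if d > 180.0 else d


def in_beam(source_azimuth_deg, a_deg, beta_deg):
    """True if the source azimuth lies within A - beta .. A + beta."""
    return abs(angular_offset_deg(source_azimuth_deg, a_deg)) <= beta_deg
```

For example, with A = 350° and β = 20°, a source seen at 5° is inside the beam even though the two numbers differ by 345.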

Returning to FIG. 1, the detection camera 10 is a camera that captures images of the speaker 1, who is the sound source. The detection camera 10 is arranged, for example, facing the space where the speaker 1 is present, and captures the speaker 1 while the sound collection system 100 is operating.
As the detection camera 10, a digital camera equipped with an image sensor such as a CMOS or CCD sensor is used. A ranging camera capable of measuring depth, such as a stereo camera or a ToF camera, may also be used as the detection camera 10.
A single detection camera 10 may be used, or a plurality of detection cameras 10 may be used.

The storage unit 11 is a non-volatile storage device, for example an SSD (Solid State Drive) or an HDD (Hard Disk Drive). Any other non-transitory computer-readable recording medium may also be used.
As shown in FIG. 1, the storage unit 11 stores a control program 12, microphone information 13, and a voice database (voice DB 14).

The control program 12 is a program that controls the overall operation of the sound collection system 100.
The microphone information 13 is information about the plurality of BF microphones M. For example, the three-dimensional coordinates of the position at which each BF microphone M is arranged, the posture of each BF microphone M, and the like are stored as microphone information. This microphone information is referred to as appropriate when beamforming processing is executed. The type, model number, and the like of the BF microphone M may also be stored as the microphone information 13.
The voice DB 14 is a database in which the voice data 6 of the speaker 1 is recorded. For example, the voice data 6 generated by the controller 20 is recorded sequentially together with a label of the speaker 1. When there are a plurality of speakers 1, for example, the voice data 6 is recorded for each speaker 1.

The controller 20 controls the operation of each block of the sound collection system 100. The controller 20 has the hardware configuration required for a computer, such as a CPU and memory (RAM, ROM). Various kinds of processing are executed when the CPU loads the control program 12 stored in the storage unit 11 into the RAM and executes it.

The controller 20 is configured using a computer such as a PC. A PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or another device such as an ASIC (Application Specific Integrated Circuit), may also be used as the controller 20.

In this embodiment, the CPU of the controller 20 executes the control program 12 according to this embodiment, thereby realizing an image processing unit 21, a sound collection control unit 22, and a sound collection processing unit 23 as functional blocks. These functional blocks execute the information processing method according to this embodiment. Dedicated hardware such as an IC (integrated circuit) may be used as appropriate to implement each functional block.

The image processing unit 21 executes various kinds of image processing on the images captured by the detection camera 10 to generate sound source information. Here, the sound source information is information about a sound source targeted for sound collection by the sound collection system 100.
The sound source information includes information for identifying the sound source. For example, when a plurality of sound sources are targeted for sound collection, an ID or the like for identifying each sound source is generated as sound source information.
The sound source information also includes information indicating the position of the sound source and information indicating the direction in which the sound source emits sound. That is, information indicating the position at which and the direction in which the sound source emits sound is generated as sound source information.
Thus, the image processing unit 21 acquires sound source information indicating the position of the sound source and the direction in which the sound source emits sound. In this embodiment, the image processing unit 21 corresponds to an information acquisition unit that acquires sound source information.
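The three items of sound source information listed above (an identifier, a position, and an emission direction) could be carried in a small record such as the one sketched below. The class name `SoundSourceInfo`, its field names, and the choice of a single azimuth angle for the direction are assumptions made for illustration, not a format given in this description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class SoundSourceInfo:
    """Hypothetical container for the sound source information of one source."""
    source_id: str                # identifies the sound source, e.g. a speaker ID
    position: Tuple[float, ...]   # 2D floor coordinates or 3D coordinates
    emit_direction_deg: float     # azimuth in which the source emits sound


def index_by_source(infos: Iterable[SoundSourceInfo]) -> Dict[str, SoundSourceInfo]:
    """Group the latest information per source so each can be handled separately."""
    return {info.source_id: info for info in infos}
```

Keeping one record per source ID matches the case in which several sound sources are targeted at once and each needs its own device selection.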

本実施形態では、音源である発話者１を対象とした音源情報が生成される。
このため、音源を識別する情報は、発話者１を識別する情報（発話者１の名称やＩＤ等）となる。画像処理部２１では、検出カメラ１０を用いて発話者１を撮影した画像データから発話者１が識別される。発話者１の識別には、例えば画像認識技術を利用した個人識別等の処理が用いられる。 In this embodiment, sound source information is generated for the speaker 1 who is the sound source.
Therefore, the information identifying the sound source is the information identifying the speaker 1 (name, ID, etc. of the speaker 1). The image processing unit 21 identifies the speaker 1 from the image data of the speaker 1 captured by the detection camera 10 . To identify the speaker 1, processing such as individual identification using image recognition technology is used.

また音源の位置を示す情報は、発話者１の位置を示す情報となる。
画像処理部２１では、検出カメラ１０を用いて発話者１を撮影した画像データから発話者１の位置が算出される。発話者１の位置を示す情報は、発話者１がいる床面における２次元座標でもよいし、発話者１の頭部の３次元座標でもよい。
発話者１の位置を算出する方法は限定されない。 Information indicating the position of the sound source is information indicating the position of the speaker 1 .
The image processing unit 21 calculates the position of the speaker 1 from image data obtained by photographing the speaker 1 using the detection camera 10 . The information indicating the position of speaker 1 may be two-dimensional coordinates on the floor where speaker 1 is located, or may be three-dimensional coordinates of the head of speaker 1 .
A method for calculating the position of speaker 1 is not limited.

また音源が音を発する方向は、発話者１の発話方向である。発話方向は、例えば発話者１の頭部正面が向けられた方向である。音源情報には、このような発話者１の発話方向を示す情報（例えば発話者１の頭部の向き等を示す情報）が含まれる。
画像処理部２１では、検出カメラ１０を用いて発話者１を撮影した画像データに基づいて、発話者１に関するボーン検出（骨格推定）が実行され発話者１の発話方向が推定される。ボーン検出を用いることで、発話方向を精度よく推定することが可能である。また複数の発話者１が存在する場合であっても、各発話者１の発話方向を容易に推定可能である。
なお発話方向を検出する方法は、ボーン検出を用いた方法に限定されず、例えば頭部の向き等を推定可能な任意の方法が用いられてよい。 Also, the direction in which the sound source emits sound is the speaking direction of speaker 1 . The speaking direction is, for example, the direction in which the front of the head of the speaker 1 is directed. The sound source information includes such information indicating the speaking direction of the speaker 1 (for example, information indicating the orientation of the head of the speaker 1, etc.).
The image processing unit 21 performs bone detection (skeletal estimation) of the speaker 1 based on image data of the speaker 1 captured by the detection camera 10 to estimate the direction of speech of the speaker 1 . By using bone detection, it is possible to accurately estimate the direction of speech. Moreover, even when there are a plurality of speakers 1, the speaking direction of each speaker 1 can be easily estimated.
Note that the method of detecting the speech direction is not limited to the method using bone detection, and any method that can estimate the orientation of the head, for example, may be used.

例えば、発話者１が特定できている場合には、その発話者１の位置や発話方向が逐次算出される。また、複数の発話者１が存在する場合には、各発話者１が個別に識別され、発話者１ごとに音源情報（位置や発話方向）が算出される。
このように、集音システム１００では、検出カメラ１０と、画像処理部２１とにより、集音対象となる発話者１を識別し、発話者１の位置及び発話方向を検出する検出装置が構成される。 For example, when the speaker 1 has been identified, the position and speaking direction of that speaker 1 are calculated successively. When there are a plurality of speakers 1, each speaker 1 is individually identified, and sound source information (position and speaking direction) is calculated for each speaker 1.
As described above, in the sound collection system 100, the detection camera 10 and the image processing unit 21 constitute a detection device that identifies the speaker 1 to be sound-collected and detects the position and speaking direction of the speaker 1.

集音制御部２２は、集音システム１００による集音動作を制御する。
本実施形態では、集音制御部２２は、上記した音源情報に基づいて、音源（発話者１）の周辺に配置され集音方向３を設定可能な複数のＢＦマイクＭから、音源が発する音（発話者１の音声５）の集音に用いる少なくとも１つの対象マイク２５を選択する。
ここで対象マイク２５とは、集音対象となる発話者１の音声データ６の生成に使用されるＢＦマイクＭである。すなわち、対象マイク２５として選択されたＢＦマイクＭの出力が、音声データ６の元データとして用いられる。 The sound collection control unit 22 controls the sound collection operation of the sound collection system 100 .
In the present embodiment, based on the sound source information described above, the sound collection control unit 22 selects at least one target microphone 25 to be used for collecting the sound emitted by the sound source (the voice 5 of the speaker 1) from a plurality of BF microphones M that are arranged around the sound source (speaker 1) and whose sound collection direction 3 can be set.
Here, the target microphone 25 is the BF microphone M used to generate the voice data 6 of the speaker 1 who is the target of sound collection. That is, the output of the BF microphone M selected as the target microphone 25 is used as the original data of the voice data 6 .

対象マイク２５は、音源情報が示す発話者１の位置や発話方向をもとに選択される。
この処理では、例えば発話者１の音声５を十分な感度で検出することができるＢＦマイクＭが、対象マイク２５として選択される。選択されるＢＦマイクＭは１つでもよいし、複数でもよい。これにより、発話者１の状態にあった適切なＢＦマイクＭを対象マイク２５として選択することが可能となる。
図１に示す例では、ＢＦマイクＭ１が対象マイク２５に選択されている。 The target microphone 25 is selected based on the position and speaking direction of the speaker 1 indicated by the sound source information.
In this process, for example, the BF microphone M that can detect the voice 5 of the speaker 1 with sufficient sensitivity is selected as the target microphone 25 . One or a plurality of BF microphones M may be selected. As a result, it is possible to select the BF microphone M suitable for the state of the speaker 1 as the target microphone 25 .
In the example shown in FIG. 1, the BF microphone M1 is selected as the target microphone 25.

また本実施形態では、集音制御部２２は、音源情報に基づいて、対象マイク２５の集音方向３を設定する。すなわち、音源情報が示す発話者１の位置や発話方向をもとに、対象マイク２５のビーム７の方向が設定される。
この処理では、例えば発話者１の発話方向に沿った集音が可能となるように、集音方向３（ビーム７の方向）が設定される。これにより、発話方向２にあった適切な集音方向を設定することが可能となる。 Further, in this embodiment, the sound collection control unit 22 sets the sound collection direction 3 of the target microphone 25 based on the sound source information. That is, the direction of the beam 7 of the target microphone 25 is set based on the position and speaking direction of the speaker 1 indicated by the sound source information.
In this process, the sound collection direction 3 (the direction of the beam 7) is set so that the sound can be collected along the utterance direction of the speaker 1, for example. As a result, it is possible to set an appropriate sound collection direction that matches the utterance direction 2 .

なお、複数の発話者１が集音対象となる場合には、各発話者１の音源情報をもとに、各発話者１ごとに対象マイク２５が選択されその集音方向３が設定される。 When a plurality of speakers 1 are to be sound-collected, a target microphone 25 is selected and its sound collection direction 3 is set for each speaker 1, based on the sound source information of that speaker 1.

図１に示すように、集音制御部２２では、複数のＢＦマイクＭのうち対象マイク２５を指定する信号（音声選択信号）と、対象マイク２５に関する集音方向３を指定する信号（集音方向信号）とが生成される。
音声選択信号は、集音処理部２３に出力される。また対象マイク２５として選択されたＢＦマイクＭについては、集音方向信号が指定する方向にその集音方向３が設定される。
なお図１では、各ＢＦマイクＭに対して集音方向信号が出力される様子が模式的に図示されている。実際には、集音方向信号は、集音処理部２３に出力され、集音処理部２３により実行される対象マイク２５に関するビームフォーミング処理に用いられる。 As shown in FIG. 1, the sound collection control unit 22 generates a signal that designates the target microphone 25 among the plurality of BF microphones M (voice selection signal) and a signal that designates the sound collection direction 3 of the target microphone 25 (sound collection direction signal).
The voice selection signal is output to the sound collection processing unit 23. For the BF microphone M selected as the target microphone 25, the sound collection direction 3 is set to the direction specified by the sound collection direction signal.
Note that FIG. 1 schematically shows the sound collection direction signal being output to each BF microphone M. In practice, the sound collection direction signal is output to the sound collection processing unit 23 and used in the beamforming process for the target microphone 25 executed by the sound collection processing unit 23.

集音処理部２３は、少なくとも１つの対象マイク２５の出力に基づいて、発話者１が発する音声５を表す音声データ６を生成する。
上記したように対象マイク２５の出力は、対象マイク２５を構成する複数のマイク１６ａ～１６ｈが生成する音信号である。これらの音信号に対して、ビームフォーミング処理が実行され、発話者１の音声５を集音した音声データ６が生成される。本実施形態では、音声データ６は、音源が発する音を表す音データに相当する。
図１に示すように、集音処理部２３は、マイク切替部２７と、音声データ生成部２８とを有する。 The sound collection processing unit 23 generates voice data 6 representing the voice 5 uttered by the speaker 1 based on the output of at least one target microphone 25.
As described above, the output of the target microphone 25 consists of the sound signals generated by the plurality of microphones 16a to 16h that constitute the target microphone 25. A beamforming process is performed on these sound signals to generate voice data 6 that captures the voice 5 of the speaker 1. In this embodiment, the voice data 6 corresponds to sound data representing the sound emitted by a sound source.
As shown in FIG. 1 , the sound collection processing unit 23 has a microphone switching unit 27 and an audio data generation unit 28 .

マイク切替部２７は、音声選択信号に基づいて、複数のＢＦマイクＭから対象マイク２５を選択する。マイク切替部２７は、全てのＢＦマイクＭの出力を読み込むことが可能である。このうち、音声選択信号により対象マイク２５に指定されたＢＦマイクＭの出力が読み込まれる。従ってマイク切替部２７は、複数のＢＦマイクＭの出力のうち対象マイク２５の出力を読み込むことで、対象マイク２５を選択するとも言える。 The microphone switching unit 27 selects the target microphone 25 from the plurality of BF microphones M based on the voice selection signal. The microphone switching unit 27 can read the outputs of all the BF microphones M. Among them, the output of the BF microphone M designated as the target microphone 25 by the voice selection signal is read. Therefore, it can be said that the microphone switching unit 27 selects the target microphone 25 by reading the output of the target microphone 25 among the outputs of the plurality of BF microphones M.

なお図１に示すマイク切替部２７は、４つのＢＦマイクＭ１～Ｍ４のうち、単一のＢＦマイクＭを対象マイク２５として選択する切替スイッチとして模式的に図示されている。これに限定されず、マイク切替部２７は、４つのＢＦマイクＭ１～Ｍ４のうち、複数のＢＦマイクＭを対象マイク２５として選択することも可能である。 Note that the microphone switching unit 27 shown in FIG. 1 is schematically illustrated as a switching switch that selects a single BF microphone M as the target microphone 25 from among the four BF microphones M1 to M4. Without being limited to this, the microphone switching unit 27 can also select a plurality of BF microphones M as the target microphones 25 from among the four BF microphones M1 to M4.

音声データ生成部２８は、マイク切替部２７により読み込まれた対象マイク２５の出力（マイク１６ａ～１６ｈの音信号）にビームフォーミング処理を実行し音声データ６を生成する。
ビームフォーミング処理では、集音方向信号が指定する集音方向３にビーム７が設定される。そして設定されたビーム７に沿って到来する音波について、伝搬遅延を補正する処理や、補正後の音信号を加算する処理等が実行される。
またビームフォーミング処理の他にも、各音信号の強度を調整する処理や、ノイズを除去する処理等が実行されてもよい。 The audio data generation unit 28 generates audio data 6 by executing beamforming processing on the output of the target microphone 25 (sound signals of the microphones 16a to 16h) read by the microphone switching unit 27 .
In the beamforming process, the beam 7 is set in the sound collection direction 3 specified by the sound collection direction signal. Then, for sound waves arriving along the set beam 7, processing for correcting propagation delay, processing for adding sound signals after correction, and the like are executed.
In addition to the beamforming process, a process of adjusting the intensity of each sound signal, a process of removing noise, and the like may be performed.
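The delay correction and summation described above correspond to what is commonly called delay-and-sum beamforming. The following Python sketch illustrates that general technique under stated assumptions; it is not taken from the specification. The function names, the far-field (plane wave) model, the rounding of delays to whole samples, the speed of sound, and the angle convention (counterclockwise from the +x axis, pointing from the array toward the source) are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_delays(mic_xy, direction_deg, fs):
    """Integer sample delays that align a plane wave arriving from direction_deg.

    direction_deg points from the array toward the source (assumed convention:
    counterclockwise from the +x axis). Far-field propagation is assumed.
    """
    ux = math.cos(math.radians(direction_deg))
    uy = math.sin(math.radians(direction_deg))
    # Microphones closer to the source receive the wavefront earlier (larger lead).
    lead = [(x * ux + y * uy) / SPEED_OF_SOUND * fs for x, y in mic_xy]
    smallest_lead = min(lead)
    # Delay the early channels so that every channel lines up with the latest one.
    return [round(l - smallest_lead) for l in lead]


def delay_and_sum(signals, delays):
    """Shift each channel by its delay, then average the shifted channels."""
    n = len(signals[0])
    out = [0.0] * n
    for sig, d in zip(signals, delays):
        for i in range(d, n):
            out[i] += sig[i - d]
    return [v / len(signals) for v in out]


# Hypothetical two-microphone example: the spacing equals two samples of travel time.
fs = 16000
spacing = 2 * SPEED_OF_SOUND / fs
mics = [(0.0, 0.0), (spacing, 0.0)]
# An impulse from the +x side reaches the second microphone two samples earlier.
ch0 = [1.0 if i == 5 else 0.0 for i in range(10)]
ch1 = [1.0 if i == 3 else 0.0 for i in range(10)]
steered = delay_and_sum([ch0, ch1], steering_delays(mics, 0.0, fs))
away = delay_and_sum([ch0, ch1], steering_delays(mics, 180.0, fs))
```

When the beam is steered toward the source, the two impulses line up and add coherently; when it is steered the opposite way, they stay apart and the averaged peak is halved.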

音声データ生成部２８により生成された音声データ６は、所定の再生装置２９に出力される。あるいは、音声データ６は、記憶部１１に構成された音声ＤＢ１４に格納される。
なお、複数の発話者１が集音対象となる場合には、各発話者１ごとに選択された対象マイク２５の出力をもとに、各発話者１ごとに音声データ６が生成される。 The audio data 6 generated by the audio data generator 28 is output to a predetermined reproducing device 29 . Alternatively, the voice data 6 is stored in the voice DB 14 configured in the storage unit 11 .
Note that when a plurality of speakers 1 are to be sound-collected, voice data 6 is generated for each speaker 1 based on the output of the target microphone 25 selected for each speaker 1 .

図４は、集音システム１００の基本的な集音動作を示す模式図である。図４には、発話者１と、２つのＢＦマイクＭ１及びＭ２と、検出カメラ１０とが模式的に図示されている。
以下では、発話者１の位置をＱと記載し、ＢＦマイクＭ１及びＭ２の位置をそれぞれＰ１及びＰ２と記載する。また発話者１の発話方向２やＢＦマイクＭの集音方向３が水平面内の方向であるものとして説明を行う。図４には発話方向２及び集音方向３が、それぞれ白抜きの実線の矢印及び黒抜きの実線の矢印を用いて模式的に図示されている。
また、発話者１の発話方向２と、発話者１から見たＢＦマイクＭの方向とのなす角度を、ＢＦマイクＭの集音角度と記載する。 FIG. 4 is a schematic diagram showing the basic sound collection operation of the sound collection system 100. FIG. 4 schematically shows a speaker 1, two BF microphones M1 and M2, and a detection camera 10.
In the following, the position of speaker 1 is denoted as Q, and the positions of BF microphones M1 and M2 are denoted as P1 and P2, respectively. Also, the description will be made assuming that the speaking direction 2 of the speaker 1 and the sound collecting direction 3 of the BF microphone M are in the horizontal plane. In FIG. 4, the utterance direction 2 and the sound collection direction 3 are schematically illustrated using solid white arrows and solid black arrows, respectively.
Also, the angle formed by the utterance direction 2 of the speaker 1 and the direction of the BF microphone M viewed from the speaker 1 is referred to as the sound collection angle of the BF microphone M.
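The sound collection angle defined above can be computed from the positions Q and P and the utterance direction. The sketch below is only an illustration: the function name and the representation of the utterance direction as a 2D vector are assumptions, and P is assumed to differ from Q.

```python
import math


def sound_collection_angle(q, utterance_vec, p):
    """Angle in degrees (0 to 180) between the utterance direction and the direction from Q to P."""
    to_mic = (p[0] - q[0], p[1] - q[1])
    dot = utterance_vec[0] * to_mic[0] + utterance_vec[1] * to_mic[1]
    norm = math.hypot(*utterance_vec) * math.hypot(*to_mic)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Speaker at the origin facing +x (hypothetical coordinates).
angle_front = sound_collection_angle((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
angle_diagonal = sound_collection_angle((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
angle_behind = sound_collection_angle((0.0, 0.0), (1.0, 0.0), (-1.0, 0.0))
```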

図４では、発話者１は、図中の右側を向いている。従って、発話者１の発話方向２は、図中の右側に向かう方向となる。
また発話者１の正面から左側にずれた位置には、ＢＦマイクＭ１が配置されており、発話者１から見て右側にはＢＦマイクＭ２が配置されている。従って、ＢＦマイクＭ１の集音角度は、ＢＦマイクＭ２の集音角度よりも小さい。なお、発話者１から見て、ＢＦマイクＭ１の位置は、ＢＦマイクＭ２の位置よりも離れている。 In FIG. 4, speaker 1 faces to the right in the figure. Therefore, the utterance direction 2 of the speaker 1 is directed to the right side in the figure.
A BF microphone M1 is arranged at a position shifted to the left from the front of the speaker 1, and a BF microphone M2 is arranged on the right side as seen from the speaker 1. Therefore, the sound collection angle of the BF microphone M1 is smaller than the sound collection angle of the BF microphone M2. Note that, as viewed from the speaker 1, the BF microphone M1 is farther away than the BF microphone M2.

例えば検出カメラ１０により検出された発話者１の位置情報だけを用いて、発話者１の音声５を集音するためのＢＦマイクＭを選択する場合を考える。位置情報だけを参照した場合、例えば発話者１に最も近い位置にあるＢＦマイクＭ２が選択される。 For example, consider the case of selecting the BF microphone M for collecting the voice 5 of the speaker 1 using only the positional information of the speaker 1 detected by the detection camera 10 . If only the positional information is referred to, for example, the BF microphone M2 closest to the speaker 1 is selected.

ところで、図４に示すシーンでは、発話者１は、ＢＦマイクＭ２の方向を向いておらず、発話者１の発話方向２と、発話者１から見たＢＦマイクＭ２の方向（点Ｑから点Ｐ２に向かう方向）とのなす集音角度が９０°を超えている。
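例えば、発話位置(発話者１の口元)で発話された音声５を点音源とすると、発話者１自身が障害物となる。このため、ＢＦマイクＭ２は、口元で発せられた直接音ではなく回折音を集音することになる。 In the scene shown in FIG. 4, however, the speaker 1 is not facing the BF microphone M2, and the sound collection angle formed by the utterance direction 2 of the speaker 1 and the direction of the BF microphone M2 as seen from the speaker 1 (the direction from point Q toward point P2) exceeds 90°.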
For example, if the voice 5 uttered at the utterance position (the mouth of the speaker 1) is a point sound source, the speaker 1 itself becomes an obstacle. Therefore, the BF microphone M2 collects the diffracted sound rather than the direct sound emitted from the mouth.

ここで、直接音とは、障害物等によって遮られることなく、音源からＢＦマイクＭに到達する音声５である。
一方で、障害物によって遮られ障害物を回り込んで伝搬された音声５（障害物による回折を受けた音声５）は、回折音となる。例えば、集音角度が十分に大きくなると音声５の回折数が多くなり、その分だけ音声５の減衰量も大きくなる。 Here, the direct sound is the sound 5 that reaches the BF microphone M from the sound source without being blocked by obstacles or the like.
On the other hand, the sound 5 blocked by the obstacle and propagated around the obstacle (the sound 5 diffracted by the obstacle) becomes a diffracted sound. For example, when the sound collection angle is sufficiently large, the number of diffractions of the sound 5 increases, and the amount of attenuation of the sound 5 increases accordingly.

また図４に示すように、ＢＦマイクＭ２では、発話者１の左側から到来する環境雑音３０が直接集音される。従って、ＢＦマイクＭ２を用いて発話者１の音声５を集音する場合、目的音である音声５に比べ環境雑音３０の音量レベルが高くなる。 Further, as shown in FIG. 4, the BF microphone M2 directly picks up ambient noise 30 coming from the left side of the speaker 1 . Therefore, when the voice 5 of the speaker 1 is collected using the BF microphone M2, the volume level of the ambient noise 30 is higher than that of the voice 5, which is the target sound.

これに対し、図４に示すシーンでは、ＢＦマイクＭ１は、発話者１の正面近くに配置される。このため、発話方向２に対するＢＦマイクＭ１の集音角度は９０°未満となる。従って、ＢＦマイクＭ１を用いた場合、発話者１が発した直接音を集音可能となり、回折音を集音する場合に比べて音声５の減衰量を十分に抑制することができる。
またＢＦマイクＭ１は、環境雑音３０を直接集音することはない。これにより、発話者１の音声５の雑音レベルを十分に抑制することが可能である。 On the other hand, in the scene shown in FIG. 4, the BF microphone M1 is arranged near the front of the speaker 1 . Therefore, the sound collection angle of the BF microphone M1 with respect to the utterance direction 2 is less than 90°. Therefore, when the BF microphone M1 is used, the direct sound uttered by the speaker 1 can be collected, and the amount of attenuation of the sound 5 can be suppressed sufficiently compared to the case of collecting the diffracted sound.
Also, the BF microphone M1 does not directly pick up the environmental noise 30 . Thereby, the noise level of the speech 5 of the speaker 1 can be sufficiently suppressed.

そこで、集音システム１００では、検出カメラ１０で撮影した映像信号（画像データ）をもとに、画像処理部２１により発話者１の位置検出と同時に、発話者１のボーン検出が実行されその発話方向２が検出される。
このようにして得られた発話者１の位置Ｑ及び発話方向２の情報（音源情報）から、集音制御部２２により発話者１の音声５を集音するＢＦマイクＭ（対象マイク２５）が選択される。また集音制御部２２により対象マイク２５の集音方向３が設定される。 Therefore, in the sound collection system 100, based on the video signal (image data) captured by the detection camera 10, the image processing unit 21 detects the position of the speaker 1 and, at the same time, performs bone detection on the speaker 1 to detect the utterance direction 2.
From the information on the position Q and the utterance direction 2 of the speaker 1 obtained in this way (sound source information), the sound collection control unit 22 selects the BF microphone M (target microphone 25) that collects the voice 5 of the speaker 1. The sound collection control unit 22 also sets the sound collection direction 3 of the target microphone 25.

対象マイク２５を選択する処理では、音源である発話者１が音声５を発する発話方向２を基準として発話者１が発する直接音を集音可能なＢＦマイクＭが判定され、当該ＢＦマイクＭが対象マイク２５として選択される。
例えば発話方向２を中心とする所定の範囲に集音方向３を設定可能であるか否かを判定することで、直接音を集音可能であるか否かが判定される。例えば音源が発話者１である場合、発話方向２を中心として±９０°の範囲が、所定の範囲として設定される。
直接音を集音可能であるか否かを判定する方法は限定されず、例えば障害物の有無等に応じて判定されてもよい。
図４に示す例では、発話方向２から左側にずれて配置されたＢＦマイクＭ１が、直接音を集音可能であるとして、対象マイク２５として選択される。 In the process of selecting the target microphone 25, a BF microphone M capable of collecting the direct sound emitted by the speaker 1 is determined with reference to the utterance direction 2 in which the speaker 1, the sound source, emits the voice 5, and that BF microphone M is selected as the target microphone 25.
For example, by determining whether or not the sound collection direction 3 can be set within a predetermined range centered on the utterance direction 2, it is determined whether or not the direct sound can be collected. For example, when the sound source is the speaker 1, a range of ±90° centering on the speaking direction 2 is set as the predetermined range.
The method of determining whether direct sound can be collected is not limited, and determination may be made according to the presence or absence of an obstacle, for example.
In the example shown in FIG. 4, the BF microphone M1, which is displaced to the left from the speaking direction 2, is selected as the target microphone 25 because it can collect the direct sound.
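The ±90° criterion described above can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: the function names and coordinates are hypothetical, the utterance direction is given as a 2D vector, and an angle of exactly 90° is treated as not collectable, which the text does not specify. A positive dot product between the utterance direction and the direction from Q to the microphone is equivalent to a sound collection angle below 90°.

```python
def can_collect_direct_sound(q, utterance_vec, p):
    """True when microphone position p lies within ±90° of the utterance direction seen from q."""
    to_mic = (p[0] - q[0], p[1] - q[1])
    # A positive dot product means the angle between the two directions is below 90°.
    return utterance_vec[0] * to_mic[0] + utterance_vec[1] * to_mic[1] > 0.0


def select_target_mics(q, utterance_vec, mics):
    """Names of all microphones judged able to collect the direct sound."""
    return [name for name, p in mics.items() if can_collect_direct_sound(q, utterance_vec, p)]


# Hypothetical layout loosely modeled on FIG. 4: the speaker at Q faces right (+x).
# M1 is in front and slightly to the speaker's left; M2 is nearer but behind the speaker.
q = (0.0, 0.0)
facing = (1.0, 0.0)
mics = {"M1": (3.0, 1.0), "M2": (-0.5, -1.0)}
targets = select_target_mics(q, facing, mics)
```

In this layout a choice based on distance alone would pick M2, whereas the direction-based test selects M1.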

また集音方向３を設定する処理では、対象マイク２５から発話者１に向かう方向が対象マイク２５の集音方向３に設定される。これにより、発話者１が発する直接音を最も効率的に集音することが可能となる。
図４に示す例では、対象マイク２５であるＢＦマイクＭ１の位置Ｐ１から、発話者１の位置Ｑに向かう方向が、ＢＦマイクＭ１の集音方向３に設定される。またＢＦマイクＭ１のビーム７の範囲は、発話者１に向かう集音方向３を中心として±βの角度で広がる扇状の領域となる。 In the process of setting the sound collection direction 3, the direction from the target microphone 25 toward the speaker 1 is set as the sound collection direction 3 of the target microphone 25. As a result, the direct sound uttered by the speaker 1 can be collected most efficiently.
In the example shown in FIG. 4, the direction from the position P1 of the BF microphone M1, which is the target microphone 25, toward the position Q of the speaker 1 is set as the sound collection direction 3 of the BF microphone M1. Also, the range of the beam 7 of the BF microphone M1 is a fan-shaped area that spreads at an angle of ±β centering on the sound collecting direction 3 toward the speaker 1 .
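The direction from P to Q and the fan-shaped ±β beam region can be sketched as below. The function names are hypothetical, and the angle convention used here (counterclockwise from the +x axis) is an assumption chosen for the example, not a convention taken from the specification.

```python
import math


def collection_direction_deg(p, q):
    """Direction from microphone position p toward position q, in degrees.

    Assumed convention: counterclockwise from the +x axis, in [0, 360).
    """
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360.0


def in_beam(p, q, point, beta_deg):
    """True when `point` lies within ±beta_deg of the beam aimed from p at q."""
    center = collection_direction_deg(p, q)
    bearing = collection_direction_deg(p, point)
    # Wrap the difference into [-180, 180) before comparing with the half-width.
    diff = (bearing - center + 180.0) % 360.0 - 180.0
    return abs(diff) <= beta_deg


# Hypothetical positions: microphone at P1, speaker at Q.
p1 = (2.0, 0.0)
q = (0.0, 0.0)
direction = collection_direction_deg(p1, q)
```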

このように、集音システム１００には、特定方向からの音を集音可能な複数の集音装置（ＢＦマイクＭ）と、集音対象となる発話者１の位置Ｑ及び発話方向２を検出する機構（検出カメラ１０及び画像処理部２１）が設けられる。そして、集音制御部２２により発話者１の位置Ｑ及び発話方向２にあったＢＦマイクＭが選択され、集音処理部２３により発話者１の音声データ６が生成される。これにより、発話者１の音声５を品質よく集音することが可能となる。 In this way, the sound collection system 100 is provided with a plurality of sound collection devices (BF microphones M) capable of collecting sound from a specific direction, and a mechanism (the detection camera 10 and the image processing unit 21) that detects the position Q and the utterance direction 2 of the speaker 1 to be sound-collected. The sound collection control unit 22 then selects the BF microphone M suited to the position Q and the utterance direction 2 of the speaker 1, and the sound collection processing unit 23 generates the voice data 6 of the speaker 1. This makes it possible to collect the voice 5 of the speaker 1 with high quality.

例えば、発話者１の近くにある集音マイクを用いて集音を行うような会議システムでは、発話者１が集音マイクに背を向けていた場合、発話方向２とは反対の方向から集音を行うことになり、音量や音質が大幅に低下する可能性があった。例えばビームフォーミング技術を備えたマイクアレイを用いる場合でも同様の問題が発生する。 For example, in a conference system that collects sound using a sound collecting microphone near the speaker 1, if the speaker 1 has his or her back to the microphone, sound is collected from the direction opposite to the utterance direction 2, and the volume and sound quality could drop significantly. A similar problem occurs even when a microphone array with beamforming technology is used.

これに対して、本実施形態に係る集音システム１００では、複数のＢＦマイクＭから、発話者１の位置Ｑ及び発話方向２にあったＢＦマイクＭを選択して集音動作が実行される。
例えば映像コンテンツの制作現場等では、演者の正面から集音するようにマイクの位置を移動させている。また演者の正面から集音する場合に、その背後からくる雑音の混入が想定される場合には、マイクの指向範囲にノイズ源が入らないようにマイクの位置や姿勢を変化させて高音質な集音を実現している。
集音システム１００で行われる集音動作は、発話者１を正面から集音を出来るＢＦマイクＭを選択することで、上記した制作現場での集音方法と同様の効果を発揮するものである。 In contrast, in the sound collection system 100 according to the present embodiment, the sound collection operation is performed by selecting, from the plurality of BF microphones M, the BF microphone M suited to the position Q and the utterance direction 2 of the speaker 1.
For example, at a video content production site, the microphone is moved so that sound is collected from the front of the performer. When collecting sound from the front of the performer, if noise coming from behind the performer is expected to be mixed in, the position and orientation of the microphone are changed so that the noise source does not enter the microphone's directional range, thereby achieving high-quality sound collection.
The sound collection operation performed by the sound collection system 100 achieves the same effect as the production-site sound collection method described above by selecting a BF microphone M that can collect sound from the front of the speaker 1.

また集音システム１００では、集音動作が行われている間に、上記した画像処理部２１により所定のフレームレートで発話者１の音源情報（位置Ｑ及び発話方向２）を算出する処理が繰り返し実行される。従って画像処理部２１は、音源情報をモニタリングするともいえる。
また、集音制御部２２により、音源情報のモニタリング結果に応じて、対象マイク２５と対象マイク２５の集音方向とを指定する信号（音声選択信号及び集音方向信号）を動的に算出される。そして、集音処理部２３により、音声選択信号及び集音方向信号に基づいて、音声データ６が生成される。
これにより、各タイミングでの発話者１の位置や発話方向に応じて、動的に集音動作を行うことが可能となり、発話者１の音声５を常時高感度で集音することが可能となる。 Further, in the sound collection system 100, while the sound collection operation is being performed, the image processing unit 21 described above repeatedly executes the process of calculating the sound source information (the position Q and the utterance direction 2) of the speaker 1 at a predetermined frame rate. The image processing unit 21 can therefore be said to monitor the sound source information.
In addition, the sound collection control unit 22 dynamically calculates the signals that specify the target microphone 25 and its sound collection direction (the voice selection signal and the sound collection direction signal) according to the monitoring result of the sound source information. The sound collection processing unit 23 then generates the voice data 6 based on the voice selection signal and the sound collection direction signal.
As a result, sound can be collected dynamically according to the position and speaking direction of the speaker 1 at each point in time, and the voice 5 of the speaker 1 can always be collected with high sensitivity.

図５は、集音システムの動作例を示すフローチャートである。図６は、ＢＦマイクＭの配置例を示す模式図である。
図５に示す処理は、図６に示すように配置された４つのＢＦマイクＭ１～Ｍ４から集音に用いる対象マイク２５を選択する処理である。なお対象マイク２５についての集音方向を設定する処理や、対象マイク２５の出力から音声データ６を生成する処理等は、対象マイク２５を選択した後に適宜実行される。
また図５に示す処理は、集音動作が行われている間に所定のフレームレートで繰り返し実行されるループ処理である。 FIG. 5 is a flowchart showing an operation example of the sound collection system. FIG. 6 is a schematic diagram showing an arrangement example of the BF microphones M.
The processing shown in FIG. 5 is processing for selecting the target microphone 25 to be used for sound collection from the four BF microphones M1 to M4 arranged as shown in FIG. The process of setting the sound collection direction for the target microphone 25, the process of generating the audio data 6 from the output of the target microphone 25, and the like are appropriately executed after the target microphone 25 is selected.
The processing shown in FIG. 5 is loop processing that is repeatedly executed at a predetermined frame rate while the sound collection operation is being performed.

まず、図６に示すＢＦマイクＭの配置について説明する。ここでは、４つのＢＦマイクＭ１～Ｍ４が、正方形状の領域の４つの頂点にそれぞれ配置される。この正方形状の領域が、集音システム１００の集音対象領域４０である。ここでは、集音対象領域４０内の各点において、図中上方向の方位角を０°とし、時計回りの方向に方位角が増えるものとする。
ＢＦマイクＭ１は図中右上の頂点に配置され、ＢＦマイクＭ２は図中右下の頂点に配置され、ＢＦマイクＭ３は図中左下の頂点に配置され、ＢＦマイクＭ４は図中左上の頂点に配置される。 First, the arrangement of the BF microphones M shown in FIG. 6 will be described. Here, four BF microphones M1 to M4 are arranged at four vertices of a square area. This square area is the sound collection target area 40 of the sound collection system 100 . Here, at each point in the sound collection target area 40, the azimuth angle in the upward direction in the drawing is assumed to be 0°, and the azimuth angle increases in the clockwise direction.
The BF microphone M1 is placed at the upper right vertex in the figure, the BF microphone M2 at the lower right vertex, the BF microphone M3 at the lower left vertex, and the BF microphone M4 at the upper left vertex.

また本実施形態では、複数のＢＦマイクＭは、各々の配置に応じて割り当てられた割当範囲４１に集音方向３を設定可能なように構成される。
割当範囲４１は、例えば各ＢＦマイクＭが集音を担当する角度範囲であり、典型的には水平面における方位角度の範囲である。割当範囲４１は、各ＢＦマイクＭの位置や、集音対象領域４０の形状に合わせて適宜設定される。 Further, in this embodiment, the plurality of BF microphones M are configured so that the sound collection direction 3 can be set in the allocation range 41 allocated according to the arrangement of each.
The allocation range 41 is, for example, an angle range in which each BF microphone M is in charge of sound collection, and is typically a range of azimuth angles on a horizontal plane. The allocation range 41 is appropriately set according to the position of each BF microphone M and the shape of the sound collection target area 40 .

図６には、円弧状の矢印を用いてＢＦマイクＭ１の割当範囲４１が模式的に図示されている。ＢＦマイクＭ１の割当範囲４１は、ＢＦマイクＭ１を基準として１８０°から２７０°の範囲である。同様に、ＢＦマイクＭ２の割当範囲４１は、２７０°から３６０°の範囲であり、ＢＦマイクＭ３の割当範囲４１は、０°から９０°の範囲であり、ＢＦマイクＭ４の割当範囲４１は、９０°から１８０°の範囲である。
各ＢＦマイクＭは、少なくとも上記した割当範囲４１内に集音方向３を設定可能である。 FIG. 6 schematically illustrates the allocation range 41 of the BF microphone M1 using arc-shaped arrows. The allocation range 41 of the BF microphone M1 is a range from 180° to 270° with respect to the BF microphone M1. Similarly, the allocation range 41 of the BF microphone M2 ranges from 270° to 360°, the allocation range 41 of the BF microphone M3 ranges from 0° to 90°, and the allocation range 41 of the BF microphone M4 is It ranges from 90° to 180°.
Each BF microphone M can set the sound collection direction 3 at least within the allocation range 41 described above.
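The allocation ranges of FIG. 6 can be tabulated and checked as in the following sketch. The range values come from the description above; the unit-square coordinates (x to the right and y upward in the figure), the function names, and the handling of the 0°/360° boundary are assumptions made for illustration.

```python
import math

# Allocation ranges from the FIG. 6 description: (start, end) azimuth in degrees.
ALLOCATION_RANGES = {1: (180.0, 270.0), 2: (270.0, 360.0), 3: (0.0, 90.0), 4: (90.0, 180.0)}

# Hypothetical coordinates: unit square, x to the right and y upward in the figure.
MIC_POSITIONS = {1: (1.0, 1.0), 2: (1.0, 0.0), 3: (0.0, 0.0), 4: (0.0, 1.0)}


def azimuth_deg(src, dst):
    """Azimuth from src to dst: 0° is up in the figure, increasing clockwise."""
    return math.degrees(math.atan2(dst[0] - src[0], dst[1] - src[1])) % 360.0


def within_allocation(n, q):
    """True when the direction from BF microphone n to point q is inside its allocation range."""
    start, end = ALLOCATION_RANGES[n]
    az = azimuth_deg(MIC_POSITIONS[n], q)
    # 0° and 360° are the same direction; treat it as 360° for a range that ends there.
    if az == 0.0 and end == 360.0:
        az = 360.0
    return start <= az <= end


center = (0.5, 0.5)
center_seen_from_m1 = azimuth_deg(MIC_POSITIONS[1], center)
```

With these coordinates, every point inside the square falls within the allocation range of each of the four microphones, which matches the corner placement described for FIG. 6.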

図５に示すように、まず画像処理部２１により、検出カメラ１０が撮影した画像データから発話者１が検出される（ステップ１０１）。発話者１の検出には、例えば人物を検出する任意の画像処理が用いられる。この時、発話者１の識別が行われてもよい。 As shown in FIG. 5, the image processing unit 21 first detects the speaker 1 from the image data captured by the detection camera 10 (step 101). Any image processing for detecting a person, for example, is used to detect the speaker 1 . At this time, speaker 1 identification may be performed.

またステップ１０１では、発話者１が検出された場合、発話者１の位置座標が検出される。ここでは、集音対象領域４０における発話者１の位置Ｑの２次元座標（ｘｙ座標）が検出される。
またステップ１０１では、発話者１に対してボーン検出が実行され、発話者１の発話方向２が検出される。ここでは、集音対象領域４０における発話方向２の方位角度（正面角度）が検出される。 Also, in step 101, when the speaker 1 is detected, the position coordinates of the speaker 1 are detected. Here, the two-dimensional coordinates (xy coordinates) of the position Q of the speaker 1 in the sound collection target area 40 are detected.
Also, in step 101, bone detection is performed for speaker 1, and speech direction 2 of speaker 1 is detected. Here, the azimuth angle (frontal angle) of the speech direction 2 in the sound collection target area 40 is detected.

図７は、発話者１の発話方向２の一例を示す模式図である。
図７に示すように、発話者１の位置Ｑを基準に算出される。ここでは、発話者１の位置Ｑから見て、図中上方向の方位角を０°とする。また図中右方向の方位角を９０°とし、図中下方向の方位角を１８０°とし、図中左方向の方位角を２７０°とする。
発話者１の発話方向２、すなわち発話者１の正面角度θは、０°～３６０°の方位角度として算出される。例えば図７に示す発話方向２の角度θは、およそ１２０°である。 FIG. 7 is a schematic diagram showing an example of the utterance direction 2 of the speaker 1.
As shown in FIG. 7, the utterance direction 2 is calculated with reference to the position Q of the speaker 1. Here, as viewed from the position Q of the speaker 1, the azimuth angle of the upward direction in the figure is 0°. The azimuth angle of the rightward direction in the figure is 90°, that of the downward direction is 180°, and that of the leftward direction is 270°.
The utterance direction 2 of the speaker 1, that is, the frontal angle θ of the speaker 1 is calculated as an azimuth angle of 0° to 360°. For example, the angle θ of speech direction 2 shown in FIG. 7 is approximately 120°.

なお、発話者１の位置Ｑや発話方向２が検出できない場合には、各パラメータの検出ができない旨の情報が記録されてもよい。 If the position Q and the speech direction 2 of the speaker 1 cannot be detected, information indicating that each parameter cannot be detected may be recorded.

次に、発話方向２が検出可能であるか否かが判定される（ステップ１０２）。
例えば画像処理部２１により発話方向２が検出されない場合、発話方向２が検出できない状態であると判定され（ステップ１０２のＮｏ）、発話者１の位置Ｑ（ｘｙ座標）が取得可能であるか否かが判定される（ステップ１０３）。
例えば画像処理部２１により発話者１の位置Ｑが検出されない場合、発話者１の位置Ｑが検出できない状態であると判定され（ステップ１０３のＮｏ）、再度ステップ１０１が実行される。 Next, it is determined whether speech direction 2 is detectable (step 102).
For example, when the utterance direction 2 is not detected by the image processing unit 21, it is determined that the utterance direction 2 cannot be detected (No in step 102), and it is then determined whether the position Q (xy coordinates) of the speaker 1 can be acquired (step 103).
For example, when the position Q of the speaker 1 is not detected by the image processing unit 21, it is determined that the position Q of the speaker 1 cannot be detected (No in step 103), and step 101 is executed again.

一方で、発話者１の位置Ｑが検出された場合、発話者１の位置Ｑが検出可能な状態であると判定され（ステップ１０３のＹｅｓ）、発話者１の位置Ｑに最寄りのＢＦマイクＭが、対象マイク２５として選択される（ステップ１０４）。
このように、発話方向２が不明であるが、発話者１の位置Ｑがわかっている場合には、発話者１に直近にあるＢＦマイクＭ（図５ではＢＦマイク（Ｎ）と記載している）が選択される。なおＮはＢＦマイクＭを表すインデックスであり、Ｎ＝１、２、３、４である。
ステップ１０４で、対象マイク２５が選択されると、次のループ処理が実行される。 On the other hand, when the position Q of the speaker 1 is detected, it is determined that the position Q of the speaker 1 is detectable (Yes in step 103), and the BF microphone M closest to the position Q of the speaker 1 is selected as the target microphone 25 (step 104).
In this way, when the utterance direction 2 is unknown but the position Q of the speaker 1 is known, the BF microphone M nearest to the speaker 1 (denoted as BF microphone (N) in FIG. 5) is selected. Note that N is an index representing the BF microphone M, where N=1, 2, 3, 4.
When the target microphone 25 has been selected in step 104, the next iteration of the loop is executed.
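The fallback path of steps 103 and 104 can be sketched as follows. The function names, the use of `None` for an undetected position, and the unit-square microphone coordinates are assumptions made for illustration.

```python
import math


def select_nearest_mic(q, mic_positions):
    """Index N of the BF microphone closest to speaker position q (step 104)."""
    return min(mic_positions, key=lambda n: math.dist(q, mic_positions[n]))


def select_without_direction(q, mic_positions):
    """Fallback used when the utterance direction is unavailable.

    Returns None when the position is also unavailable (No in step 103), in which
    case the caller is expected to run detection again (back to step 101).
    """
    if q is None:
        return None
    return select_nearest_mic(q, mic_positions)


# Hypothetical coordinates for the four corner microphones of FIG. 6.
mic_positions = {1: (1.0, 1.0), 2: (1.0, 0.0), 3: (0.0, 0.0), 4: (0.0, 1.0)}
chosen = select_without_direction((0.8, 0.1), mic_positions)
```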

ステップ１０２に戻り、画像処理部２１により発話方向２が検出された場合、発話方向２が検出可能な状態であると判定され（ステップ１０２のＹｅｓ）、発話方向２に最も適したＢＦマイクＭの有無が判定される（ステップ１０５）。 Returning to step 102, when the utterance direction 2 is detected by the image processing unit 21, it is determined that the utterance direction 2 is detectable (Yes in step 102), and it is then determined whether there is a BF microphone M best suited to the utterance direction 2 (step 105).

ここで、発話方向２に最も適したＢＦマイクＭとは、発話方向２と割当範囲４１の中心方向とが対応しているＢＦマイクＭである。
このようなＢＦマイクＭを用いることで、割当範囲４１の中心に沿って到来する音声５を集音することが可能となる。この結果、効果的に音声５を強調することや、他のノイズを抑制するといった処理が可能となり、高品質な音声データ６を生成可能となる。
具体的には発話方向２の角度θが、以下の関係を満たすか否かが判定される。
θ＝９０°×Ｎ－４５° ・・・（１） Here, the BF microphone M most suitable for the utterance direction 2 is the BF microphone M for which the utterance direction 2 and the center direction of the allocation range 41 correspond.
By using such a BF microphone M, it is possible to collect the sound 5 arriving along the center of the allocation range 41. As a result, processing such as effectively emphasizing the voice 5 and suppressing other noise becomes possible, and high-quality voice data 6 can be generated.
Specifically, it is determined whether or not the angle θ of the speaking direction 2 satisfies the following relationship.
θ=90°×N-45° (1)

（１）式より、Ｎ＝１の場合、θ＝４５°となる。このθ＝４５°の発話方向２は、ＢＦマイクＭ１の割当範囲４１（１８０°から２７０°）の中心方向（２２５°）を１８０°回転させた方向であり、中心方向に沿ってＢＦマイクＭ１に進行する方向である。すなわち、θ＝４５°の発話方向２は、ＢＦマイクＭ１の割当範囲４１の中心方向と対応している。この場合、ＢＦマイクＭ１が、発話方向２に最も適したＢＦマイクＭとなる。
同様に、Ｎ＝２、３、４について、（１）式が満たされる場合には、ＢＦマイクＭ２、Ｍ３、及びＭ４が、それぞれ発話方向２に最も適したＢＦマイクＭとなる。 From equation (1), when N=1, θ=45°. This utterance direction 2 of θ=45° is the direction obtained by rotating the central direction (225°) of the allocation range 41 (180° to 270°) of the BF microphone M1 by 180°, that is, the direction that travels toward the BF microphone M1 along the central direction. In other words, the utterance direction 2 at θ=45° corresponds to the central direction of the allocation range 41 of the BF microphone M1. In this case, the BF microphone M1 is the most suitable BF microphone M for the speaking direction 2.
Similarly, for N=2, 3, and 4, BF microphones M2, M3, and M4 are the most suitable BF microphones M for speech direction 2, respectively, if equation (1) is satisfied.

なおステップ１０５では、（１）式によるθの判定に一定の幅αを持たせた処理が実行されてもよい。例えば、発話方向２の角度θが（９０°×Ｎ－４５°－α）≦θ≦（９０°×Ｎ－４５°＋α）を満たすか否かが、各Ｎについて判定される。このように、発話方向２と割当範囲４１の中心方向とが多少ずれていた場合であっても、高品質な音声データ６を生成可能である。 Note that in step 105, processing may be performed in which a certain width α is given to the determination of θ by equation (1). For example, it is determined for each N whether the angle θ of the speaking direction 2 satisfies (90°×N−45°−α)≦θ≦(90°×N−45°+α). In this way, even if the speech direction 2 deviates slightly from the central direction of the allocation range 41, high-quality voice data 6 can be generated.

（１）式を満たすＮが存在した場合（ステップ１０５のＹｅｓ）、（１）式を満たすＢＦマイク（Ｎ）が、発話方向２に最も適したＢＦマイクＭとして対象マイク２５に選択される（ステップ１０６）。
このように、本実施形態では、割当範囲４１の中心方向が発話方向２と対応しているＢＦマイクＭが対象マイク２５として選択される。これにより、発話者１の音声５を十分高い音質で集音するといったことが可能となる。
ステップ１０６で、対象マイク２５が選択されると、次のループ処理が実行される。 If there is an N that satisfies expression (1) (Yes in step 105), the BF microphone (N) that satisfies expression (1) is selected as the target microphone 25, as the BF microphone M most suitable for the utterance direction 2 (step 106).
Thus, in this embodiment, the BF microphone M whose allocation range 41 has a central direction corresponding to the speech direction 2 is selected as the target microphone 25. This makes it possible to collect the voice 5 of the speaker 1 with sufficiently high quality.
At step 106, when the target microphone 25 is selected, the following loop processing is executed.

ステップ１０５に戻り、（１）式を満たすＮが存在しない場合（ステップ１０５のＮｏ）、発話者１の位置Ｑのｘｙ座標から、発話者１に最寄りのＢＦマイクＭが検出される（ステップ１０７）。
例えば図６に示す例では、発話者１の発話方向２について（１）式を満たすＮは存在しないと判定され、発話者１に最も近いＢＦマイクＭ４（Ｎ＝４）が検出される。 Returning to step 105, if there is no N satisfying expression (1) (No in step 105), the BF microphone M closest to the speaker 1 is detected from the xy coordinates of the position Q of the speaker 1 (step 107).
For example, in the example shown in FIG. 6, it is determined that there is no N that satisfies the expression (1) for the speech direction 2 of the speaker 1, and the BF microphone M4 (N=4) closest to the speaker 1 is detected.

ステップ１０７で検出されたＢＦマイクＭについて、発話方向２に沿った集音が可能であるか否かが判定される（ステップ１０８）。ここで、発話方向２に沿った集音とは、発話方向２がビーム７の方向範囲に含まれた状態で行われる集音動作である。
図６を参照して説明したように、ここでは各ＢＦマイクＭが、９０°の割当範囲４１内で集音方向３を設定可能である。従って、Ｎ番目のＢＦマイクＭが設定可能な方位角の範囲は、９０°×（Ｎ－１）－βから、９０°×Ｎ＋βまでの範囲となる。
ステップ１０８では、発話者１に最も近いＢＦマイク（Ｎ）について、発話方向２の角度θが上記したビーム７を設定可能な範囲に収まるか否かが判定される。これは、以下の関係を満たすか否かを判定する処理である。
９０×（Ｎ－１）－β≦θ≦９０°×Ｎ＋β ・・・（２） For the BF microphone M detected in step 107, it is determined whether or not sound can be collected along the speaking direction 2 (step 108). Here, sound collection along the utterance direction 2 is a sound collection operation performed in a state where the utterance direction 2 is included in the direction range of the beam 7.
As explained with reference to FIG. 6, here each BF microphone M can set the sound collection direction 3 within the allocation range 41 of 90°. Therefore, the range of azimuth angles that can be set by the N-th BF microphone M is from 90°×(N−1)−β to 90°×N+β.
At step 108, for the BF microphone (N) closest to the speaker 1, it is determined whether or not the angle θ of the speech direction 2 falls within the above-described range in which the beam 7 can be set. This is a process of determining whether or not the following relationship is satisfied.
90°×(N−1)−β≦θ≦90°×N+β (2)

図６を参照して（２）式の判定について説明する。ここでは、ＢＦマイクＭ４（Ｎ＝４）が最寄りのＢＦマイクＭとして検出されているため、（２）式は、２７０－β≦θ≦３６０°＋βとなる。これは、ＢＦマイクＭ４の割当範囲４１に集音方向３を設定するという条件のもとで設定可能なビーム７の範囲に対応する。この範囲に、発話方向２の角度θが含まれているかどうかが判定される。
これにより、発話者１に最も近いＢＦマイクＭにおいて、発話方向２に沿った集音が可能であるかどうかがわかる。 The determination of expression (2) will be described with reference to FIG. 6. Here, since the BF microphone M4 (N=4) is detected as the nearest BF microphone M, expression (2) becomes 270°−β≦θ≦360°+β. This corresponds to the range of the beam 7 that can be set under the condition that the sound collection direction 3 is set within the allocation range 41 of the BF microphone M4. It is determined whether or not the angle θ of the speech direction 2 is included in this range.
Thus, it can be determined whether or not the BF microphone M closest to the speaker 1 can collect sound along the speaking direction 2.

（２）式が満たされる場合（ステップ１０８のＹｅｓ）、ステップ１０７で検出された最寄りのＢＦマイク（Ｎ）が対象マイク２５に選択される（ステップ１０９）。これにより、発話者１に最も近い位置から十分な感度で音声５を集音することが可能となる。
ステップ１０９で、対象マイク２５が選択されると、次のループ処理が実行される。 If the expression (2) is satisfied (Yes in step 108), the nearest BF microphone (N) detected in step 107 is selected as the target microphone 25 (step 109). This makes it possible to collect the voice 5 from the position closest to the speaker 1 with sufficient sensitivity.
At step 109, when the target microphone 25 is selected, the following loop processing is executed.

また（２）式が満たされない場合（ステップ１０８のＮｏ）、ステップ１０７で検出された最寄りのＢＦマイク（Ｎ）は対象マイク２５としては選択されない。この場合、次のＢＦマイク（Ｎ＋１）について、発話方向２に沿った集音が可能であるか否かが判定される（ステップ１１０）。
この処理では、発話方向２の角度θが以下の関係を満たすか否かが判定される。
９０×Ｎ＋β＜θ≦９０×（Ｎ＋１）＋β ・・・（３） Also, if expression (2) is not satisfied (No in step 108), the nearest BF microphone (N) detected in step 107 is not selected as the target microphone 25. In this case, it is determined whether or not the next BF microphone (N+1) can collect sound along the speech direction 2 (step 110).
In this process, it is determined whether or not the angle θ of the speaking direction 2 satisfies the following relationship.
90°×N+β<θ≦90°×(N+1)+β (3)

（３）式は、発話者１の最寄りのＢＦマイク（Ｎ）に隣接するＢＦマイク（Ｎ＋１）が、設定可能なビーム７の範囲のうち、ＢＦマイク（Ｎ）と重複しない範囲に発話方向２の角度θが含まれているかどうかを判定する条件式である。
図６に示す例では、最寄りのＢＦマイクＭ４であった。この場合ステップ１１０では、その次のＢＦマイクＭ１（Ｎ＝１）がＢＦマイクＭ４とは別に設定可能なビーム７の範囲を対象として判定処理が実行される。 Expression (3) is a conditional expression for determining whether the angle θ of the speech direction 2 falls within the part of the beam 7 range settable by the BF microphone (N+1), which is adjacent to the BF microphone (N) nearest to the speaker 1, that does not overlap the range of the BF microphone (N).
In the example shown in FIG. 6, the nearest microphone is the BF microphone M4. In this case, in step 110, the determination is performed on the range of the beam 7 that the next BF microphone M1 (N=1) can set separately from the BF microphone M4.

（３）式が満たされる場合（ステップ１１０のＹｅｓ）、最寄りのＢＦマイク（Ｎ）に隣接するＢＦマイク（Ｎ＋１）が対象マイク２５に選択される（ステップ１１１）。これにより、発話者１に２番目（又は３番目）に近い位置から十分な感度で音声５を集音することが可能となる。
ステップ１１１で、対象マイク２５が選択されると、次のループ処理が実行される。 If the expression (3) is satisfied (Yes in step 110), the BF microphone (N+1) adjacent to the nearest BF microphone (N) is selected as the target microphone 25 (step 111). This makes it possible to collect the voice 5 from a position second (or third) closest to the speaker 1 with sufficient sensitivity.
At step 111, when the target microphone 25 is selected, the following loop processing is executed.

また（３）式が満たされない場合（ステップ１１０のＮｏ）、最寄りのＢＦマイク（Ｎ）にＢＦマイク（Ｎ＋１）とは反対側で隣接するＢＦマイク（Ｎ－１）が対象マイク２５に選択される（ステップ１１２）。これにより、ＢＦマイク（Ｎ＋１）が選択された場合と同様に、発話者１に十分近い位置から十分な感度で音声５を集音することが可能となる。
ステップ１１２で、対象マイク２５が選択されると、次のループ処理が実行される。 Further, if the formula (3) is not satisfied (No in step 110), the BF microphone (N−1) adjacent to the nearest BF microphone (N) on the opposite side of the BF microphone (N+1) is selected as the target microphone 25. (step 112). As a result, as in the case where the BF microphone (N+1) is selected, it is possible to collect the voice 5 from a position sufficiently close to the speaker 1 with sufficient sensitivity.
At step 112, when the target microphone 25 is selected, the following loop processing is executed.

ステップ１０７～ステップ１１２で行われる処理は、発話方向２に沿った集音が可能なＢＦマイクＭを近い順番に検索して対象マイク２５に設定する処理である。このように、本実施形態では、発話方向２が割当範囲４１の中心方向に対応するＢＦマイクＭが存在しない場合、発話方向２に沿った集音が可能であり、音源との距離が最も近いＢＦマイクＭが対象マイクとして選択される。
これにより、可能な限り高い感度で音声５を集音することが可能なＢＦマイクＭを対象マイク２５に設定することが可能となる。この結果、音声データ６の音質を十分に向上することが可能となる。 The process performed in steps 107 to 112 searches, in order of proximity, for a BF microphone M capable of collecting sound along the speaking direction 2 and sets it as the target microphone 25. As described above, in the present embodiment, when there is no BF microphone M for which the speech direction 2 corresponds to the central direction of the allocation range 41, the BF microphone M that can collect sound along the speech direction 2 and is closest to the sound source is selected as the target microphone.
This makes it possible to set the BF microphone M capable of collecting the voice 5 with the highest possible sensitivity as the target microphone 25. As a result, the sound quality of the audio data 6 can be sufficiently improved.
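The branch structure of steps 102 to 112 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names, the `mic_positions` dictionary, the modulo-360 wrap-around in `in_arc`, and returning `None` when nothing can be selected are all hypothetical choices made for illustration.

```python
import math

NUM_MICS = 4  # N = 1, 2, 3, 4 as in the description


def in_arc(theta, lo, hi, lo_open=False):
    """True if angle theta (degrees) lies in the arc lo..hi, taken modulo 360.

    The wrap-around is an assumption; it lets 270-beta..360+beta cover angles near 0.
    """
    offset = (theta - lo) % 360.0
    if lo_open and offset == 0.0:
        return False
    return offset <= hi - lo


def select_target_mic(theta, position, mic_positions, alpha=0.0, beta=0.0):
    """Return the 1-based index N of the target microphone, or None.

    theta: speech direction in degrees, or None if it was not detected.
    position: (x, y) of the speaker, or None if it was not detected.
    mic_positions: {N: (x, y)} (hypothetical layout supplied by the caller).
    """
    if position is None:
        return None  # step 103 "No": acquisition has to be retried
    nearest = min(mic_positions, key=lambda n: math.dist(position, mic_positions[n]))
    if theta is None:
        return nearest  # step 104: direction unknown, position known
    # Step 105: equation (1), widened by the tolerance alpha.
    for n in range(1, NUM_MICS + 1):
        centre = 90.0 * n - 45.0
        if in_arc(theta, centre - alpha, centre + alpha):
            return n  # step 106
    n = nearest  # step 107
    # Step 108: expression (2) for the nearest microphone.
    if in_arc(theta, 90.0 * (n - 1) - beta, 90.0 * n + beta):
        return n  # step 109
    # Step 110: expression (3) for the next microphone (N+1).
    if in_arc(theta, 90.0 * n + beta, 90.0 * (n + 1) + beta, lo_open=True):
        return n % NUM_MICS + 1  # step 111 (index wraps 4 -> 1)
    return (n - 2) % NUM_MICS + 1  # step 112: microphone N-1 (index wraps 1 -> 4)
```

With a nearest microphone of N=4 and β=10°, a speech direction of 300° stays on microphone 4, 60° moves to microphone 1, and 200° falls through to microphone 3.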

図８は、複数の発話者１に対する集音動作について説明するための模式図である。以下では、集音対象領域４０に複数の発話者１が居る場合の集音動作について説明する。
ここでは、正方形状の集音対象領域４０の中心に置かれた机４３の周りに座っている４人の発話者１Ａ、１Ｂ、１Ｃ、及び１Ｄを対象として集音動作が行われものとする。発話者１Ａ、１Ｂ、１Ｃ、及び１Ｄは、集音対象領域４０の中心から見て図中の左上、右上、右下、及び左下に位置し、互いに向かい合うようにして会話をしている。
また集音対象領域４０の４つの頂点には、図６と同様にＢＦマイクＭ１～Ｍ４がそれぞれ配置される。 FIG. 8 is a schematic diagram for explaining the sound collection operation for a plurality of speakers 1. A sound collection operation when a plurality of speakers 1 are present in the sound collection target area 40 will be described below.
Here, it is assumed that the sound collection operation is performed for four speakers 1A, 1B, 1C, and 1D sitting around a desk 43 placed in the center of a square sound collection target area 40. Speakers 1A, 1B, 1C, and 1D are positioned at the upper left, upper right, lower right, and lower left in the drawing when viewed from the center of the sound collection target area 40, and are having a conversation facing each other.
BF microphones M1 to M4 are arranged at the four vertices of the sound collection target area 40, respectively, as in FIG.

複数の発話者１が集音対象となる場合、画像処理部２１は、複数の発話者１（音源）ごとに音源情報を取得する。
具体的には、集音対象領域４０を図示しない検出カメラ１０で撮影した画像データから、発話者１Ａ、１Ｂ、１Ｃ、及び１Ｄの各々について、各発話者１の位置と発話方向２とがそれぞれ検出される。 When a plurality of speakers 1 are targeted for sound collection, the image processing unit 21 acquires sound source information for each of the plurality of speakers 1 (sound sources).
Specifically, the position and the speaking direction 2 of each speaker 1 are detected for each of the speakers 1A, 1B, 1C, and 1D from image data of the sound collection target area 40 captured by the detection camera 10 (not shown).

各発話者１の音源情報が取得されると、集音制御部２２は、複数の発話者１ごとの音源情報に基づいて、複数の発話者１ごとに対象マイク２５をそれぞれ選択する。また集音制御部２２は、複数の発話者１ごとに選択された各対象マイク２５について、集音方向３をそれぞれ設定する。
図８に示す例では、発話者１Ａの対象マイク２５として、集音対象領域４０の右上に配置されたＢＦマイクＭ１が選択される。また、発話者１Ｂの対象マイク２５として、集音対象領域４０の左上に配置されたＢＦマイクＭ４が選択される。また、発話者１Ｃの対象マイク２５として、集音対象領域４０の左下に配置されたＢＦマイクＭ３が選択される。また、発話者１Ｄの対象マイク２５として、集音対象領域４０の右下に配置されたＢＦマイクＭ２が選択される。 When the sound source information of each speaker 1 is obtained, the sound collection control unit 22 selects the target microphone 25 for each of the speakers 1 based on the sound source information of each speaker 1 . The sound collection control unit 22 also sets the sound collection direction 3 for each of the target microphones 25 selected for each of the plurality of speakers 1 .
In the example shown in FIG. 8, the BF microphone M1 arranged at the upper right of the sound collection target area 40 is selected as the target microphone 25 of the speaker 1A. Also, the BF microphone M4 arranged at the upper left of the sound collection target area 40 is selected as the target microphone 25 of the speaker 1B. Also, the BF microphone M3 arranged at the lower left of the sound collection target area 40 is selected as the target microphone 25 of the speaker 1C. Also, the BF microphone M2 arranged at the lower right of the sound collection target area 40 is selected as the target microphone 25 of the speaker 1D.

例えば、発話者１Ａの音声５の集音に、発話者１Ａの直近に配置されたＢＦマイクＭ４を用いるとする。ここでは、発話者１Ａは、机を挟んで対峙している発話者１Ｂ及び発話者１Ｃのほうを向いて会話をしている。このため、発話者１Ａの発話方向２に対するＢＦマイクＭ４の集音角度は、９０°以上である。さらにＢＦマイクＭ４を用いて発話者１Ａの音声５を集音する場合、発話者１Ｂ及び１Ｃの発話方向２の９０°以内にビームフォーミングの集音方向３を設定することになる。
この結果、ＢＦマイクＭ４では、発話者１Ａの回折音と、発話者１Ｂ及び１Ｃの直接音とを集音することになり、発話者１Ａの音声５を選択的に集音することが難しくなる。 For example, suppose that the BF microphone M4 arranged in the immediate vicinity of the speaker 1A is used to collect the voice 5 of the speaker 1A. Here, speaker 1A is having a conversation while facing speaker 1B and speaker 1C facing each other across the desk. Therefore, the sound collection angle of the BF microphone M4 with respect to the speaking direction 2 of the speaker 1A is 90° or more. Furthermore, when collecting the voice 5 of the speaker 1A using the BF microphone M4, the sound collection direction 3 for beam forming is set within 90° of the speech direction 2 of the speakers 1B and 1C.
As a result, the BF microphone M4 collects the diffracted sound of the speaker 1A and the direct sounds of the speakers 1B and 1C, making it difficult to selectively collect the voice 5 of the speaker 1A.

これに対し、例えば図５を参照して説明した処理のように、発話方向２の情報を加味することで、発話者１Ａの音声を集音する対象マイク２５として、ＢＦマイクＭ１を選択することが可能である。ＢＦマイクＭ１を用いることで、発話者１Ａの直接音を集音することが可能となる。またＢＦマイクＭ１から発話者１Ａに向けて設定される集音方向３は、発話者１Ｂ及び１Ｃの音声５をほとんど集音しない。このように、発話者１Ｂ及び１Ｃをビームフォーミングの集音範囲外にすることが可能となるので、集音対象でない発話者１の影響を十分に抑えることが可能となる。 On the other hand, by taking the information on the speaking direction 2 into account, as in the processing described with reference to FIG. 5, for example, the BF microphone M1 can be selected as the target microphone 25 for collecting the voice of the speaker 1A. By using the BF microphone M1, it is possible to collect the direct sound of the speaker 1A. Also, the sound collection direction 3 set from the BF microphone M1 toward the speaker 1A hardly collects the voices 5 of the speakers 1B and 1C. In this way, since the speakers 1B and 1C can be placed outside the sound collection range of beamforming, it is possible to sufficiently suppress the influence of speakers 1 who are not the target of sound collection.

発話者１Ｂ～１Ｄに対して設定される対象マイク２５についても、上記と同様の効果を発揮することが可能である。これにより、複数の発話者１が居る場合であっても、各発話者１の音声５を個別にかつ良好な音質で集音することが可能となる。 The same effects as described above can be exhibited for the target microphones 25 set for the speakers 1B to 1D. As a result, even when there are a plurality of speakers 1, it is possible to collect the voice 5 of each speaker 1 individually and with good sound quality.

図９は、複数のＢＦマイクＭを用いた集音動作の一例を示す模式図である。
図９では、複数のＢＦマイクＭを使って一人の発話者１の音声を集音する例について説明する。この場合、集音制御部２２では、単一の音源（一人の発話者１）について、複数のＢＦマイクＭから複数の対象マイク２５が選択される。
ここでは、図６や図８と同様に４つのＢＦマイクＭ１～Ｍ４が正方形状の集音対象領域４０に配置される。 FIG. 9 is a schematic diagram showing an example of a sound collection operation using a plurality of BF microphones M.
In FIG. 9, an example of collecting the voice of one speaker 1 using a plurality of BF microphones M will be described. In this case, the sound collection control unit 22 selects a plurality of target microphones 25 from a plurality of BF microphones M for a single sound source (single speaker 1).
Here, as in FIGS. 6 and 8, four BF microphones M1 to M4 are arranged in a square sound collection target area 40.

図９に示す発話者１は、集音対象領域４０の中心よりも図中上側に位置した状態で、図中下側を向いて音声５を発している。このため、発話者１に近接するＢＦマイクＭ１やＭ４では、発話者１の直接音の集音が難しい。
このような場合、集音制御部２２により、集音対象領域４０において発話者１の正面側（発話方向２が向けられた側）にあるＢＦマイクＭ２及びＭ３がともに発話者１の対象マイクとして選択される。また集音処理部２３により、ＢＦマイクＭ２及びＭ３使って、発話者１の音声５が同時に集音され、各集音結果を加算（合成）して音声データ６が生成される。
このように２つのＢＦマイクＭ２及びＭ３を用いることで、遠距離集音時の集音レベルを向上することが可能となり、品質を低下させることなく発話者１の音声５を集音することが可能となる。 A speaker 1 shown in FIG. 9 is positioned above the center of the sound collection target area 40 in the figure, and is facing downward in the figure and uttering a voice 5. For this reason, it is difficult for the BF microphones M1 and M4, which are close to the speaker 1, to collect the direct sound of the speaker 1.
In such a case, the sound collection control unit 22 selects both the BF microphones M2 and M3, which are located on the front side of the speaker 1 (the side toward which the speech direction 2 is directed) in the sound collection target area 40, as the target microphones of the speaker 1. The sound collection processing unit 23 simultaneously collects the voice 5 of the speaker 1 using the BF microphones M2 and M3, and adds (synthesizes) the collected results to generate voice data 6.
By using the two BF microphones M2 and M3 in this way, the sound collection level during long-distance sound collection can be improved, and the voice 5 of the speaker 1 can be collected without deteriorating the quality.

図１０は、発話者１が移動する際の集音動作の一例を示す模式図である。図１１は、音声５の合成処理について説明するための模式図である。ここでは、図１０及び図１１を参照して、集音対象領域４０内を発話者１が移動する場合の対象マイク２５の選択動作について説明する。
発話者１は、集音対象領域４０の左上から中央右側を通って左下に向けて移動するものとする。図１０には、時刻Ｔ１、Ｔ２、Ｔ３、及びＴ４における発話者１の位置及び発話方向２が模式的に図示されている。またビーム７の範囲を表すグレーの色は各時刻に対応しており、色が濃いほど後の時刻に設定されたビーム７を表している。 FIG. 10 is a schematic diagram showing an example of the sound collection operation when the speaker 1 moves. FIG. 11 is a schematic diagram for explaining the process of synthesizing the voice 5. Here, the operation of selecting the target microphone 25 when the speaker 1 moves within the sound collection target area 40 will be described with reference to FIGS. 10 and 11.
It is assumed that the speaker 1 moves from the upper left of the target sound collection area 40 to the lower left through the right side of the center. FIG. 10 schematically shows the position and speech direction 2 of speaker 1 at times T1, T2, T3, and T4. The gray color representing the range of the beam 7 corresponds to each time, and the darker the color, the later the beam 7 is set.

例えば時刻Ｔ１では、発話者１は、集音対象領域４０の左上に位置し発話方向２は図中右側に向けられている。この場合、ＢＦマイクＭ１が対象マイク２５となり、発話者１に向けてビーム７が設定される。
時刻Ｔ２では、発話者１は、ＢＦマイクＭ１に接近しており発話方向２は図中右下に向けられている。この場合、ＢＦマイクＭ１とともに、ＢＦマイクＭ２が対象マイク２５として選択される。
時刻Ｔ３では、発話者１は、集音対象領域４０の中央右側に位置し発話方向２は図中下側に向けられている。この場合、ＢＦマイクＭ１は対象マイク２５から外されており、ＢＦマイクＭ２が対象マイク２５として選択される。
時刻Ｔ４では、発話者１は、ＢＦマイクＭ２に接近しており発話方向２は図中左下のＢＦマイクＭ３に向けられている。この場合、ＢＦマイクＭ２とともに、ＢＦマイクＭ３が対象マイク２５として選択される。 For example, at time T1, the speaker 1 is positioned at the upper left of the sound collection target area 40, and the speech direction 2 is directed to the right side in the figure. In this case, the BF microphone M1 becomes the target microphone 25, and the beam 7 is set toward the speaker 1.
At time T2, speaker 1 is approaching BF microphone M1 and speaking direction 2 is directed to the lower right in the figure. In this case, the BF microphone M2 is selected as the target microphone 25 along with the BF microphone M1.
At time T3, the speaker 1 is positioned on the right side of the center of the sound collection target area 40, and the speaking direction 2 is directed downward in the figure. In this case, the BF microphone M1 is removed from the target microphones 25, and the BF microphone M2 is selected as the target microphone 25.
At time T4, the speaker 1 is approaching the BF microphone M2, and the speaking direction 2 is directed toward the BF microphone M3 at the lower left in the figure. In this case, the BF microphone M3 is selected as the target microphone 25 together with the BF microphone M2.

このように、本実施形態では、発話者１の移動に伴い、複数のＢＦマイクＭを適宜切り替えて対象マイク２５が設定される。
また時刻Ｔ２やＴ４のように、２つのＢＦマイクＭで集音が可能な場合には、両方のＢＦマイクＭが対象マイク２５として設定され、そのデータを用いて音声データ６が合成される。すなわち、集音処理部２３では、複数の対象マイク２５により集音されたデータを合成して、発話者１の音声データ６が生成される。
以下では、時刻Ｔ２の場合を例に挙げて、対象マイク２５として選択された２つのＢＦマイクＭ１及びＭ２を用いて音声データ６を合成する方法について説明する。 As described above, in the present embodiment, the target microphone 25 is set by appropriately switching the plurality of BF microphones M as the speaker 1 moves.
Also, when sound can be collected by two BF microphones M, as at times T2 and T4, both BF microphones M are set as the target microphones 25, and the audio data 6 is synthesized using their data. That is, the sound collection processing unit 23 synthesizes the data collected by the plurality of target microphones 25 to generate the speech data 6 of the speaker 1.
A method of synthesizing the audio data 6 using the two BF microphones M1 and M2 selected as the target microphones 25 will be described below, taking the case of time T2 as an example.

図１１には、時刻Ｔ２における発話者１とＢＦマイクＭ１及びＭ２との配置関係が模式的に図示されている。
発話者１からＢＦマイクＭ１に向かう方向（ＱからＰ１に向かう方向）と発話方向２とのなす角度をγ₁と記載し、発話者１からＢＦマイクＭ２に向かう方向（ＱからＰ２に向かう方向）と発話方向２とのなす角度をγ₂と記載する。また、発話者１とＢＦマイクＭ１との距離（ＱとＰ１との距離）をＬ₁と記載し、発話者１とＢＦマイクＭ２との距離（ＱとＰ２との距離）をＬ₂と記載する。
（γ₁、γ₂、Ｌ₁、Ｌ₂）は、例えば画像処理部２１によるボーン検出及び人位置検出の各処理を用いてそれぞれ算出される。 FIG. 11 schematically shows the positional relationship between speaker 1 and BF microphones M1 and M2 at time T2.
The angle between the direction from speaker 1 to BF microphone M1 (the direction from Q to P1) and the utterance direction 2 is denoted as γ₁, and the angle between the direction from speaker 1 to BF microphone M2 (the direction from Q to P2) and the utterance direction 2 is denoted as γ₂. Also, the distance between speaker 1 and BF microphone M1 (distance between Q and P1) is denoted as L₁, and the distance between speaker 1 and BF microphone M2 (distance between Q and P2) is denoted as L₂.
(γ₁, γ₂, L₁, L₂) are each calculated using, for example, the bone detection and human position detection processing performed by the image processing unit 21.

ここで、発話者１の正面で集音を行った場合に、必要な発話レベルＡを集音可能な距離を、基準集音距離Ｌと記載する。
例えば、基準集音距離Ｌに対して、発話者１から距離Ｌ₁だけ離れた位置で集音するＢＦマイクＭ１の集音レベルＡ１は、以下の式で表される。
Ａ１＝Ａ×(Ｌ／Ｌ₁)² ・・・（４）
同様に、基準集音距離Ｌに対して、発話者１から距離Ｌ₂だけ離れた位置で集音するＢＦマイクＭ２の集音レベルＡ２は、以下の式で表される。
Ａ２＝Ａ×(Ｌ／Ｌ₂)² ・・・（５） Here, the distance at which the necessary speech level A can be collected when sound is collected directly in front of the speaker 1 is referred to as the reference sound collection distance L.
For example, the sound collection level A1 of the BF microphone M1, which collects sound at a position separated from the speaker 1 by a distance L₁, is expressed relative to the reference sound collection distance L by the following equation.
A1=A×(L/L₁)² (4)
Similarly, the sound collection level A2 of the BF microphone M2, which collects sound at a position separated from the speaker 1 by a distance L₂, is expressed relative to the reference sound collection distance L by the following equation.
A2=A×(L/L₂)² (5)
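Equations (4) and (5) share one inverse-square form, which can be illustrated with a short Python sketch. The function names are hypothetical, and the numbers in the usage note are illustrative values rather than data from the embodiment.

```python
def collected_level(speech_level, reference_distance, mic_distance):
    """Level picked up at mic_distance, per equations (4) and (5).

    speech_level is A, reference_distance is L, mic_distance is L1 or L2.
    """
    return speech_level * (reference_distance / mic_distance) ** 2


def compensated_level(collected, reference_distance, mic_distance):
    """Invert the relation above to recover A from a collected level."""
    return collected * (mic_distance / reference_distance) ** 2
```

For instance, with A = 1.0 and L = 1.0, a microphone at twice the reference distance collects a level of 0.25, and compensating that value for distance returns 1.0.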

また、ＢＦマイクＭ１及びＭ２の各出力を以下の式に従って合成する。
Ａ_mix＝sqrt｛(Ａ１×(Ｌ₁／Ｌ)²×cosγ)²＋(Ａ１×(Ｌ₁／Ｌ)²×cosγ)²｝
・・・（６）
ここでＡ_mixは、ＢＦマイクＭ１及びＭ２の各出力を合成した合成レベルである。
またsqrt｛｝は、｛｝内の値に対する平方根を意味する。
またγは、上記した（γ₁、γ₂）のどちらか一方である。 Also, each output of the BF microphones M1 and M2 is synthesized according to the following formula.
A_mix=sqrt{(A1×(L₁/L)²×cosγ)²+(A1×(L₁/L)²×cosγ)²}
... (6)
Here, A_mix is the synthesis level obtained by synthesizing the outputs of the BF microphones M1 and M2.
Also, sqrt{} means the square root of the value in {}.
γ is either one of (γ₁, γ₂) described above.

（４）及び（５）式より、必要な発話レベルＡは、以下のように表される。
Ａ＝Ａ１×(Ｌ₁／Ｌ)²＝Ａ２×(Ｌ₂／Ｌ)² ・・・（７）
従って、（６）式に従って合成される合成レベルＡ_mixは、Ａ_mix＝Ａとなる。
このように、（６）式を用いることで、合成レベルＡ_mixを常に発話レベルＡと同等のレベルとすることが可能となる。 From the equations (4) and (5), the required speech level A is expressed as follows.
A=A1×(L₁/L)²=A2×(L₂/L)² (7)
Therefore, the synthesis level A_mix synthesized according to equation (6) is A_mix=A.
Thus, by using equation (6), the synthesis level A_mix can always be kept at the same level as the speech level A.
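A worked check of the statement A_mix=A may help here. Equation (7) says that each microphone's distance-compensated level equals A. The steps below additionally assume, as one possible reading that the text does not confirm, that the two squared terms in (6) are the compensated levels of M1 and M2 weighted by cos γ and sin γ respectively, with A ≥ 0. Under that assumption the sum collapses through cos²γ + sin²γ = 1.

```latex
\begin{align*}
A_1\left(\frac{L_1}{L}\right)^{2} &= A_2\left(\frac{L_2}{L}\right)^{2} = A
  && \text{from (4), (5), (7)} \\
A_{\mathrm{mix}} &= \sqrt{\left(A\cos\gamma\right)^{2} + \left(A\sin\gamma\right)^{2}}
  && \text{assumed reading of (6)} \\
 &= A\sqrt{\cos^{2}\gamma + \sin^{2}\gamma} = A
\end{align*}
```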

また、（６）式のγは、例えば２つのＢＦマイクＭ（ここではＭ１及びＭ２）のうち、メインに集音を行うＢＦマイクＭ（主マイクアレイ）の発話方向２に対する集音角度である。
例えば、発話者１の位置Ｑ及び発話方向２をもとに、集音角度γが－９０°≦γ≦９０°となり、発話者１に近接する２つのＢＦマイクＭが対象マイク２５として選択される。また、選択された２つのＢＦマイクＭのうち、発話者１に近いほうが、メインに集音を行うＢＦマイクＭに設定され、その集音角度が（６）式のγとして用いられる。 In addition, γ in equation (6) is, for example, the sound collection angle, with respect to the utterance direction 2, of the BF microphone M that mainly collects sound (the main microphone array) out of the two BF microphones M (here, M1 and M2).
For example, based on the position Q of the speaker 1 and the speaking direction 2, the two BF microphones M that are close to the speaker 1 and whose sound collection angle γ satisfies −90°≦γ≦90° are selected as the target microphones 25. Of the two selected BF microphones M, the one closer to the speaker 1 is set as the BF microphone M that mainly collects sound, and its sound collection angle is used as γ in equation (6).
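The selection rule just described can be sketched in Python as follows. This is a hedged illustration only: the function names, the coordinate convention (angles in degrees, counter-clockwise from the +x axis), and the `mic_positions` layout are assumptions and do not come from the embodiment.

```python
import math


def collection_angle(speaker_pos, speech_dir_deg, mic_pos):
    """Signed angle (degrees) between the speech direction and the direction
    from the speaker to the microphone; 0 means the mic is straight ahead."""
    dx, dy = mic_pos[0] - speaker_pos[0], mic_pos[1] - speaker_pos[1]
    fx = math.cos(math.radians(speech_dir_deg))
    fy = math.sin(math.radians(speech_dir_deg))
    # atan2(cross, dot) gives the signed angle from the facing vector to the mic vector.
    return math.degrees(math.atan2(fx * dy - fy * dx, fx * dx + fy * dy))


def choose_targets(speaker_pos, speech_dir_deg, mic_positions):
    """Return (main, targets): up to two mics with |gamma| <= 90, nearest first.

    The nearest of the selected mics is treated as the main microphone.
    """
    in_front = [
        n for n, pos in mic_positions.items()
        if abs(collection_angle(speaker_pos, speech_dir_deg, pos)) <= 90.0
    ]
    in_front.sort(key=lambda n: math.dist(speaker_pos, mic_positions[n]))
    targets = in_front[:2]
    main = targets[0] if targets else None
    return main, targets
```

Re-running `choose_targets` as the speaker moves gives one way to observe the hand-over of the main microphone: once the nearer microphone's angle leaves the ±90° range, it drops out of `targets` and the remaining microphone becomes `main`.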

例えば、図１１に示す状況では、発話者１に近いＢＦマイクＭ１がメインに集音を行うＢＦマイクＭに設定され、その集音角度γ₁が（６）式のγとして用いられる。
また時刻Ｔ２以降に発話者１が移動して、γ₁＝９０°（またはγ₁＝－９０°）となった場合、メインに集音を行うＢＦマイクＭは、ＢＦマイクＭ２に切り替えられ、（６）式のγが集音角度γ₁に切り替えられる。
これにより、隣接するＢＦマイクＭの連続的な切替えを実現することが可能となる。この結果、不自然な音切れ等を発生させることなく、集音レベルの高い高品質な集音を継続して行うことが可能となる。 For example, in the situation shown in FIG. 11, the BF microphone M1 close to the speaker 1 is set as the BF microphone M that mainly collects sound, and its sound collection angle γ₁ is used as γ in equation (6).
Further, when the speaker 1 moves after time T2 and γ₁ becomes 90° (or −90°), the BF microphone M that mainly collects sound is switched to the BF microphone M2, and γ in equation (6) is switched to the sound collection angle γ₂.
This makes it possible to realize continuous switching between adjacent BF microphones M. As a result, it is possible to continuously perform high-quality sound collection at a high sound collection level without causing unnatural sound interruptions or the like.

図１２は、複数の発話者１が移動する際の集音動作の一例を示す模式図である。
図１２では、複数の発話者１が移動し、かつ各発話者１に対する集音動作が干渉する場合について説明する。
ここでは、集音対象領域４０内を２人の発話者１Ａ及び１Ｂが、図中の太い矢印に沿ってそれぞれ移動するものとする。図１２Ａ及び図１２Ｂには、時刻Ｔ１及び時刻Ｔ２での発話者１Ａ及び１Ｂの配置が模式的に図示されている。
また発話者１Ａの対象マイク２５のビーム７の範囲が薄いグレーの領域で示されており、発話者１Ｂの対象マイク２５のビーム７の範囲が濃いグレーの領域で示されている。また、ドットの領域は、比較のために示した仮想的なビーム７の範囲を表している。 FIG. 12 is a schematic diagram showing an example of a sound collection operation when a plurality of speakers 1 move.
FIG. 12 illustrates a case where a plurality of speakers 1 move and sound collection operations for each speaker 1 interfere.
Here, it is assumed that two speakers 1A and 1B move within the sound collection target area 40 along the thick arrows in the drawing. 12A and 12B schematically show the arrangement of speakers 1A and 1B at time T1 and time T2.
The range of the beam 7 of the target microphone 25 of the speaker 1A is indicated by a light gray area, and the range of the beam 7 of the target microphone 25 of the speaker 1B is indicated by a dark gray area. Also, the dot area represents the range of the virtual beam 7 shown for comparison.

図１２Ａでは、発話者１Ａは集音対象領域４０の左上の外周近くに位置し、発話者１Ａの発話方向２は図中右側を向いている。また発話者１Ｂは集音対象領域４０の中央下側の外周近くに位置し、発話者１Ｂの発話方向２は図中左上を向いている。 In FIG. 12A, the speaker 1A is positioned near the upper left outer periphery of the sound collection target area 40, and the speaking direction 2 of the speaker 1A faces the right side in the figure. Also, the speaker 1B is positioned near the outer periphery of the lower center of the sound collection target area 40, and the speech direction 2 of the speaker 1B is directed to the upper left in the figure.

図１２Ａに示す状況では、発話者１Ａの正面側にある直近のＢＦマイクＭ１で、発話者１Ａの音声５を集音してもその集音方向３（ビーム７ａの方向）に他者（発話者１Ｂ）が重ならない。このため、ＢＦマイクＭ１が発話者１Ａの対象マイク２５として選択され、発話者１Ａに向けてビーム７ａが設定される。
同様に、発話者１Ｂの正面側にある直近のＢＦマイクＭ３で、発話者１Ｂの音声５を集音してもその集音方向３（ビーム７ｃの方向）に他者（発話者１Ａ）が重ならない。このため、ＢＦマイクＭ３が発話者１Ｂの対象マイク２５として選択され、発話者１Ｂに向けてビーム７ｂが設定される。 In the situation shown in FIG. 12A, even if the voice 5 of the speaker 1A is collected by the nearest BF microphone M1 on the front side of the speaker 1A, the other person (speaker 1B) does not overlap the sound collection direction 3 (the direction of the beam 7a). Therefore, the BF microphone M1 is selected as the target microphone 25 for the speaker 1A, and the beam 7a is set toward the speaker 1A.
Similarly, even if the voice 5 of the speaker 1B is collected by the nearest BF microphone M3 on the front side of the speaker 1B, the other person (speaker 1A) does not overlap the sound collection direction 3 (the direction of the beam 7c). Therefore, the BF microphone M3 is selected as the target microphone 25 of the speaker 1B, and the beam 7c is set toward the speaker 1B.

なお、発話者１Ａに最も近い位置にあるＢＦマイクＭ４では、発話者１Ａにビーム７ｄを向けたとしても、発話者１Ａを背後から集音することになる。同様に、発話者１Ｂに最も近い位置にあるＢＦマイクＭ２では、発話者１Ｂにビーム７ｂを向けたとしても、発話者１Ｂを背後から集音することになる。従ってＢＦマイクＭ４のビーム７ｄや、ＢＦマイクＭ２のビーム７ｂでは、発話者１の直接音が集音できないため、音質が低下する可能性がある。 Even if the beam 7d is directed toward the speaker 1A, the BF microphone M4 located closest to the speaker 1A picks up sound from behind the speaker 1A. Similarly, the BF microphone M2 located closest to the speaker 1B picks up the sound of the speaker 1B from behind even if the beam 7b is directed toward the speaker 1B. Therefore, since the beam 7d of the BF microphone M4 and the beam 7b of the BF microphone M2 cannot collect the direct sound of the speaker 1, the sound quality may deteriorate.

図１２Ｂでは、発話者１Ａは集音対象領域４０の中心の右上に位置し、発話者１Ａの発話方向２は図中右下を向いている。また発話者１Ｂは集音対象領域４０の中心の左下に位置し、発話者１Ｂの発話方向２は図中上側を向いている。 In FIG. 12B, the speaker 1A is positioned at the upper right of the center of the sound collection target area 40, and the speech direction 2 of the speaker 1A is directed to the lower right in the figure. Also, the speaker 1B is positioned at the lower left of the center of the sound collection target area 40, and the speech direction 2 of the speaker 1B is directed upward in the figure.

図１２Ｂに示す状況では、図１２Ａと同様にＢＦマイクＭ１を用いて発話者１Ａを集音した場合、ＢＦマイクＭ１のビーム７ａ'上に、他者（発話者１Ｂ）が重なっている。また発話者１Ｂの発話方向２に対するＢＦマイクＭ１の集音角度が９０°以下であるため、ビーム７ａ'を用いた場合、発話者１Ｂが発する直接音が集音される可能性がある。
一方で、発話者１Ａの正面側にあるもう一方のＢＦマイクＭ２を用いて発話者１Ａを集音した場合、ＢＦマイクＭ２のビーム７ｂ'上に、他者（発話者１Ｂ）が重ならない。このため、図１２Ｂでは、ＢＦマイクＭ２が発話者１Ａの対象マイク２５として選択され、発話者１Ａに向けてビーム７ｂ'が設定される。これにより、発話者１Ａの音声５だけを高品質に集音することが可能である。 In the situation shown in FIG. 12B, when the speaker 1A is picked up using the BF microphone M1 as in FIG. 12A, the other person (speaker 1B) overlaps the beam 7a' of the BF microphone M1. Also, since the sound collection angle of the BF microphone M1 with respect to the speaking direction 2 of the speaker 1B is 90° or less, the direct sound emitted by the speaker 1B may be collected when the beam 7a' is used.
On the other hand, when the other BF microphone M2 in front of the speaker 1A is used to collect the sound of the speaker 1A, the other person (speaker 1B) does not overlap the beam 7b' of the BF microphone M2. Therefore, in FIG. 12B, the BF microphone M2 is selected as the target microphone 25 of the speaker 1A, and the beam 7b' is set toward the speaker 1A. This makes it possible to collect only the voice 5 of the speaker 1A with high quality.

図１２Ｂに示す発話者１Ｂについても同様に対象マイク２５が切り替えられる。例えば、
図１２Ａと同様にＢＦマイクＭ３を用いて発話者１Ｂを集音した場合、ＢＦマイクＭ３のビーム７ｃ'には、他者（発話者１Ａ）が重なっており、発話者１Ａが発する直接音が集音される可能性がある。
一方で、発話者１Ｂの正面側にあるＢＦマイクＭ４を用いて発話者１Ｂを集音した場合、ＢＦマイクＭ４のビーム７ｄ'上に、他者（発話者１Ａ）が重ならない。このため、図１２Ｂでは、ＢＦマイクＭ４が発話者１Ｂの対象マイク２５として選択され、発話者１Ｂに向けてビーム７ｄ'が設定される。これにより、発話者１Ｂの音声５だけを高品質に集音することが可能である。 The target microphone 25 is similarly switched for the speaker 1B shown in FIG. 12B.
For example, when the speaker 1B is picked up using the BF microphone M3 as in FIG. 12A, the other person (speaker 1A) overlaps the beam 7c' of the BF microphone M3, and the direct sound emitted by the speaker 1A may be collected.
On the other hand, when the BF microphone M4 in front of the speaker 1B is used to collect the sound of the speaker 1B, the other person (speaker 1A) does not overlap the beam 7d' of the BF microphone M4. Therefore, in FIG. 12B, the BF microphone M4 is selected as the target microphone 25 of the speaker 1B, and the beam 7d' is set toward the speaker 1B. This makes it possible to collect only the voice 5 of the speaker 1B with high quality.

As described above, in this embodiment, a BF microphone M whose sound collection direction 3 can be set so as to collect the direct sound emitted by the speaker 1 to be processed (the sound collection target), and not to collect the direct sound emitted by any other speaker 1, is selected as the target microphone 25.
As a result, for example, it is possible to generate voice data 6 in which the voice 5 uttered by the speaker 1 to be processed is selectively collected.
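The selection rule described above can be sketched in code. The following Python example is only an illustration under stated assumptions, not the implementation of the embodiment: positions are 2-D coordinates, a microphone is treated as able to collect a speaker's direct sound when it lies within 90° of that speaker's speaking direction, and a beam is modeled as a cone with a hypothetical half width `half_width_deg`. The function names and the tie-break by distance are likewise assumptions made for the example.

```python
import math

def angle_between(v1, v2):
    """Unsigned angle in degrees between two 2-D vectors (0 if either is zero)."""
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        return 0.0
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def sub(a, b):
    """Vector from point b to point a."""
    return (a[0] - b[0], a[1] - b[1])

def collects_direct_sound(mic_pos, speaker):
    """Assumption: direct sound reaches a mic within 90 deg of the speaking direction."""
    to_mic = sub(mic_pos, speaker["position"])
    return angle_between(speaker["direction"], to_mic) <= 90.0

def in_beam(mic_pos, target_pos, other_pos, half_width_deg):
    """True if other_pos lies inside the cone of the beam aimed from the mic at the target."""
    beam = sub(target_pos, mic_pos)
    return angle_between(beam, sub(other_pos, mic_pos)) <= half_width_deg

def select_target_mic(mics, target, others, half_width_deg=15.0):
    """Return the name of the nearest mic that hears the target but not the others."""
    best = None
    for name, pos in mics.items():
        if not collects_direct_sound(pos, target):
            continue
        leaks = any(
            in_beam(pos, target["position"], o["position"], half_width_deg)
            and collects_direct_sound(pos, o)
            for o in others
        )
        if leaks:
            continue
        dist = math.dist(pos, target["position"])
        if best is None or dist < best[0]:
            best = (dist, name)
    return None if best is None else best[1]

# Hypothetical layout loosely modeled on FIG. 12B.
mics = {"M1": (3.0, 0.0), "M2": (0.0, 4.0)}
speaker_a = {"position": (0.0, 0.0), "direction": (1.0, 1.0)}
speaker_b = {"position": (-2.0, 0.0), "direction": (1.0, 0.0)}
print(select_target_mic(mics, speaker_a, [speaker_b]))
```

In this layout the beam from M1 toward speaker A also covers speaker B, so M1 is rejected and M2 is chosen even though it is farther away.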

FIG. 13 is a schematic diagram showing an example of a sound collection operation that assumes the speaking direction 2 of the speaker 1.
FIG. 13 illustrates a sound collection operation in a situation where speech in a plurality of speaking directions 2 can be assumed and the speaking direction 2 switches relatively frequently.
Here, as an example, it is assumed that a remote conference is being held. Speakers 1A and 1B are seated on the left and right sides of the sound collection target area 40. A speaker 1C, who is a participant in the remote conference, is displayed on a monitor 44 provided at the upper center of the sound collection target area 40.

When a plurality of speaking directions 2 are assumed, a sound collection direction 3 corresponding to each assumed speaking direction 2 is set in advance for the corresponding BF microphone M. A BF microphone M for which the sound collection direction 3 has been set in advance becomes a candidate microphone 26, that is, a candidate for the target microphone 25.
In this way, the plurality of BF microphones M include a plurality of candidate microphones 26 for which the sound collection directions 3 are set in advance. In this embodiment, the candidate microphone 26 corresponds to a candidate device.

Focusing on the speaker 1A, two cases are assumed in the situation shown in FIG. 13: the case where the speaker 1A speaks toward the speaker 1C (the speaking direction 2 is directed upward) and the case where the speaker 1A speaks toward the speaker 1B (the speaking direction 2 is directed to the right).
In this case, the BF microphones M4 and M1 are set as candidate microphones 26 for collecting the voice 5 of the speaker 1A.
For example, a sound collection direction 3a is set for the BF microphone M4 in correspondence with the upward speaking direction 2a used when the speaker 1A speaks toward the speaker 1C. Similarly, a sound collection direction 3b is set for the BF microphone M1 in correspondence with the rightward speaking direction 2b used when the speaker 1A speaks toward the speaker 1B.

With the candidate microphones 26 set in this way, the sound collection operation for the speaker 1 is executed. Specifically, the sound collection control unit 22 selects the target microphone 25 from the plurality of candidate microphones 26. For example, the actual speaking direction 2 of the speaker 1 is monitored, and the target microphone 25 is selected from among the candidate microphones 26 according to the monitoring result.
In FIG. 13, it is assumed that the speaker 1A is speaking toward the speaker 1C. In this case, the BF microphone M4, for which the sound collection direction 3a corresponding to the speaking direction 2a is set, is selected as the target microphone 25. The voice 5 of the speaker 1A is then collected by the BF microphone M4 along the sound collection direction 3a.

In addition, the sound collection processing unit 23 causes any candidate microphone 26 that is not selected as the target microphone 25 to stand by in a sound collecting state. Here, standing by in a sound collecting state means, for example, continuing the sound collection processing (beamforming processing) in the background of the sound collection operation performed by the target microphone 25. Note that the voice data 6 generated during standby is deleted as appropriate.
In FIG. 13, since the BF microphone M4 is selected as the target microphone 25, the other candidate microphone 26, the BF microphone M1, stands by in a sound collecting state. At this time, the BF microphone M1 continues the sound collection operation in the sound collection direction 3b.
As a result, even when the speaking direction 2 changes suddenly, high-quality sound collection can be continued by switching to the candidate microphone 26 that has been standing by.

For example, in FIG. 13, since the speaker 1B is in the seat next to the speaker 1A, a conversation with the speaker 1B may start suddenly even while the speaker 1A is facing the main direction (speaking direction 2a) and talking with the speaker 1C. Therefore, if the BF microphone M1 is kept on standby in a sound collecting state for the adjacent-seat direction (speaking direction 2b) as described above, the voice 5 of the speaker 1A can be collected without cutting off the beginning of the utterance, even if the speaker 1A frequently and abruptly turns to talk with the speaker 1B in the adjacent seat.
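The candidate and standby mechanism can be illustrated with a short Python sketch. It is a simplified model under assumptions that are not part of the source: speaking directions are expressed as angles in degrees, each candidate microphone is assumed to deliver an already beamformed frame on every update, and the class name `CandidateSwitcher` and its methods are hypothetical.

```python
class CandidateSwitcher:
    """Keeps every candidate mic running and picks the one matching the speaking direction."""

    def __init__(self, candidates):
        # candidates: mic name -> speaking direction (degrees) assumed for that mic
        self.candidates = dict(candidates)
        self.target = None

    @staticmethod
    def angular_diff(a, b):
        """Smallest absolute difference between two angles in degrees."""
        return abs((a - b + 180.0) % 360.0 - 180.0)

    def update(self, speaking_direction_deg, frames):
        """Select the target mic for the monitored direction and return its frame.

        frames: mic name -> frame beamformed along that mic's preset direction.
        Frames from standby candidates are simply not returned (discarded).
        """
        self.target = min(
            self.candidates,
            key=lambda m: self.angular_diff(self.candidates[m], speaking_direction_deg),
        )
        return self.target, frames[self.target]

# Hypothetical setup modeled on FIG. 13: M4 covers "up" (90 deg), M1 covers "right" (0 deg).
switcher = CandidateSwitcher({"M4": 90.0, "M1": 0.0})
print(switcher.update(85.0, {"M4": "frame-from-M4", "M1": "frame-from-M1"}))
print(switcher.update(5.0, {"M4": "frame-from-M4", "M1": "frame-from-M1"}))
```

Because the standby microphone already produces frames before it is selected, the frame returned right after a change of direction does not depend on a start-up delay, which is the point of keeping candidates in the sound collecting state.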

FIG. 14 is a schematic diagram showing an example of a sound collection operation according to a gesture.
FIG. 14 illustrates a method of controlling the sound collection processing for the speaker 1 according to a gesture (specific action) of the speaker 1.
Here, the gesture of the speaker 1 is detected by the image processing unit 21. In this embodiment, the gesture of the speaker 1 is detected from information on the skeleton of the speaker 1 by using the bone detection function that detects the speaking direction 2 of the speaker 1. The gesture of the speaker 1 may be a static gesture (pose) or a dynamic gesture (movement).

FIGS. 14(a) to 14(c) schematically show the posture of the speaker 1 using the skeleton of the speaker 1. The skeleton of the speaker 1 is represented by a plurality of coordinate points 45. For example, the head of the speaker 1 is represented by a head coordinate point 45a and a neck coordinate point 45b. The right hand of the speaker 1 is represented by a pair 46R of coordinate points 45 representing the right wrist and the right palm, and the left hand of the speaker 1 is represented by a pair 46L of coordinate points 45 representing the left wrist and the left palm.
The present technology is not limited to this, and coordinate points 45 representing other parts such as the eyes, nose, and ears may also be used.

In this embodiment, the sound collection processing unit 23 controls the sound collection processing for collecting the voice 5 of the speaker 1 according to the gesture of the speaker 1.
Here, the sound collection processing is, for example, the series of processes required to collect the voice 5 of the speaker 1. In addition to the beamforming processing that generates the voice data 6, the sound collection processing includes the processing by which the image processing unit 21 detects the position Q and the speaking direction 2 of the speaker 1, and the processing by which the sound collection control unit 22 selects the target microphone 25 and sets the sound collection direction 3.
These processes are controlled according to the gesture of the speaker 1.

FIG. 14(a) shows the general posture of the speaker 1. The general posture is, for example, the normal posture of the speaker 1, in which the speaker stands upright with both hands lowered. Note that the general posture may be defined as the case where the left and right hands (pairs 46L and 46R) are positioned lower than, for example, the shoulder coordinate points 45.
When the general posture is detected, normal sound collection processing is executed for the speaker 1.

FIG. 14(b) shows a stop gesture for stopping sound collection. The stop gesture is a posture in which a hand is held in front of the mouth. When such a stop gesture, in which the speaker 1 covers the mouth with a hand, is detected, the sound collection processing for that speaker is stopped.
Here, the right hand (pair 46R) of the speaker 1 is detected at a position overlapping the region between the head coordinate point 45a and the neck coordinate point 45b. When such a gesture is detected, the speaker 1 is regarded as having covered the mouth, and the sound collection processing for the speaker 1 is stopped. This makes it possible, for example, to avoid collecting a conversation that the speaker 1 does not want to be collected.
Note that the sound collection processing being executed for the other speakers 1 continues as it is.

FIG. 14(c) shows a priority gesture for prioritizing sound collection. The priority gesture is a posture in which either the left or the right hand is held above the head. When such a priority gesture, in which the speaker 1 raises a hand, is detected, the sound collection processing for the speaker 1 is executed preferentially.
Here, the left hand (pair 46L) of the speaker 1 is detected at a position higher than the head coordinate point 45a. When such a gesture is detected, the speaker 1 is regarded as having raised a hand in order to speak, and sound collection processing that preferentially collects the speaker 1 (priority sound collection) is executed.
In the priority sound collection, for example, the accuracy of the beamforming processing for collecting the voice of the speaker 1 is raised. Alternatively, the detection accuracy of the speaking direction 2 and the like of the speaker 1 is raised. Conversely, the accuracy of the sound collection processing being executed for the other speakers 1 may be lowered. Processing such as collecting the voice 5 of the speaker 1 alone may also be executed. This makes it possible, for example, to collect the voice of a speaker 1 who wishes to speak with high quality.
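The three postures of FIG. 14 can be distinguished with a simple rule on the skeleton coordinate points. The Python sketch below is an illustration only, under assumptions not stated in the source: coordinates are 2-D with the y axis pointing upward, each hand is reduced to a single palm point, the horizontal tolerance `x_tolerance` is a hypothetical parameter, and the stop gesture is given precedence when both gestures appear at once.

```python
def classify_gesture(points, x_tolerance=0.1):
    """Classify a posture as 'stop', 'priority', or 'general'.

    points: dict with 'head', 'neck', 'left_hand', 'right_hand' -> (x, y),
    where y increases upward (an assumption of this sketch).
    """
    head, neck = points["head"], points["neck"]
    hands = [points["left_hand"], points["right_hand"]]

    def covers_mouth(hand):
        # Hand lies vertically between neck and head and horizontally near the head.
        return neck[1] <= hand[1] <= head[1] and abs(hand[0] - head[0]) <= x_tolerance

    if any(covers_mouth(h) for h in hands):
        return "stop"        # stop sound collection for this speaker
    if any(h[1] > head[1] for h in hands):
        return "priority"    # raised hand: collect this speaker preferentially
    return "general"         # normal sound collection

# Hypothetical skeleton samples (units are arbitrary).
general = {"head": (0.0, 1.7), "neck": (0.0, 1.5),
           "left_hand": (-0.3, 0.9), "right_hand": (0.3, 0.9)}
stop = dict(general, right_hand=(0.02, 1.6))
priority = dict(general, left_hand=(-0.3, 2.0))
print(classify_gesture(general), classify_gesture(stop), classify_gesture(priority))
```

A controller could map the returned label to the corresponding action, for example halting beamforming for that speaker on "stop" while leaving the processing for other speakers untouched.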

FIG. 15 is a schematic diagram showing an example of a sound collection operation for collecting voice and motion sound.
FIG. 15 illustrates a method of separating and collecting a gesture sound 8 that accompanies an action of the speaker 1, such as movement. In the following, footsteps generated when the speaker 1 moves are used as an example of the gesture sound. This processing is executed, for example, when movement of the speaker 1 is detected by bone detection or position detection. Note that the processing for separating the gesture sound 8 (footsteps) may be executed regardless of whether the speaker 1 is moving.

FIG. 15A is a schematic diagram showing the vertical spread of the beam 7 directed from the target microphone 25 (BF microphone M) toward the speaker 1. For example, the beam 7 set for the target microphone 25 spreads in the vertical direction as shown in FIG. 15A. Therefore, the target microphone 25 can collect not only the voice 5 of the speaker 1 but also the footsteps (gesture sound 8) generated at the feet of the speaker 1.
Accordingly, the voice data 6 generated from the output of the target microphone 25 contains both the voice 5 of the speaker 1 and the gesture sound 8.

In this embodiment, the sound collection processing unit 23 separates the voice 5 of the speaker 1 and the gesture sound 8 of the speaker 1 from the voice data 6 collected by the target microphone 25.
For example, by separating the speech component from the voice data 6, it is possible to generate gesture sound data or the like in which the gesture sound 8 (footsteps) of the speaker 1 is collected.

FIG. 15B is a block diagram showing a configuration example of the sound collection processing unit 23 that separates the gesture sound 8. In this sound collection processing unit 23, a sound source separation unit 35 is provided downstream of the voice data generation unit 28 described with reference to FIG. 1.
The sound source separation unit 35 removes the voice 5 of the speaker 1 from the voice data 6 generated using the target microphone 25 and extracts the gesture sound 8. To extract the gesture sound 8, adaptive sound source separation processing is used, in which parameters such as the separation frequency are changed according to the content of the data, the sound collection environment, and the like. Alternatively, a fixed bandpass filter (BPF) or the like matched to the characteristics of the gesture sound 8 may be used.

FIG. 15C is a schematic graph showing the frequency distribution of the sound collection level for the voice 5 and the gesture sound 8. The horizontal axis of the graph is frequency, and the vertical axis is sound collection level. The sound collection levels of the voice 5 and the gesture sound 8 are indicated by a solid line and a dash-dot line, respectively.
For example, the voice 5 is distributed in a relatively sharp peak centered at 1 kHz and has no frequency components in regions sufficiently higher (or lower) than 1 kHz. On the other hand, the gesture sound 8 shows a relatively broad distribution spread over a wider frequency range than the voice 5. That is, frequency components of the gesture sound 8 are also distributed in regions where the voice 5 has no frequency components.

As described above, the frequency components of the voice 5 are concentrated around 1 kHz. Therefore, the sound source separation unit 35 executes processing for removing frequency components around 1 kHz from the voice data 6. In this way, the sound source separation unit 35 regards the data from which the frequency components around 1 kHz have been removed as the gesture sound 8 (footsteps) and collects it.
In FIG. 15C, the frequency characteristics of a BPF that removes frequency components around 1 kHz are shown by a dashed line. By applying such a BPF to the voice data 6, gesture sound data is generated in which the voice 5 is removed and the gesture sound 8 is extracted.
In addition, the method of extracting the gesture sound 8 is not limited, and, for example, a sound source separation technique using machine learning or the like may be used as appropriate.
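The fixed-filter variant of this separation can be sketched as follows. The example is a minimal illustration, not the filter of the embodiment: the text refers to the filter as a BPF, whereas the sketch uses a second-order band-stop (notch) biquad as a stand-in, because the stated purpose is to remove the band around 1 kHz. The coefficient formulas are the standard notch design from the widely used Audio EQ Cookbook, and the sample rate, the quality factor `q`, and all function names are assumptions chosen for the example.

```python
import math

def notch_filter(samples, sample_rate, center_hz=1000.0, q=1.0):
    """Attenuate the band around center_hz with a second-order notch (biquad)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    # Normalized feed-forward (b) and feedback (a) coefficients.
    b = (1.0 / a0, -2.0 * cos_w0 / a0, 1.0 / a0)
    a = (-2.0 * cos_w0 / a0, (1.0 - alpha) / a0)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sample_rate, n):
    """Unit-amplitude sine tone of n samples."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

# A 1 kHz tone stands in for the voice 5, a 100 Hz tone for part of the gesture sound 8.
fs = 16000
voice_out = notch_filter(tone(1000.0, fs, 3200), fs)
step_out = notch_filter(tone(100.0, fs, 3200), fs)
print(rms(voice_out[1600:]), rms(step_out[1600:]))
```

After the start-up transient, the 1 kHz tone is almost fully suppressed while the 100 Hz tone passes nearly unchanged. Real speech is far broader than a single tone, which is consistent with the text also mentioning adaptive separation and machine-learning-based methods.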

The gesture sound 8 separated from the voice 5 (gesture sound data) is output to the reproduction device 29 or the storage unit 11, for example, as sound data on a track separate from that of the voice 5.
For example, in an application that reproduces the behavior of the speaker 1 at a remote location (a remote conference, a remote presentation, or the like), the sense of presence can be improved by reproducing the voice 5 and the gesture sound 8 separately.
Also, for example, when recording video content, the gesture sound 8 can be recorded on a track separate from the voice 5, which makes it possible to improve the quality of the content.

As described above, in the controller 20 according to the present embodiment, at least one target microphone 25 for collecting the voice 5 of the speaker 1 is selected from the plurality of BF microphones M arranged around the speaker 1, who is the sound source. Each BF microphone M is a device whose sound collection direction 3 can be set, and sound source information indicating the position Q of the speaker 1 and the speaking direction 2 in which the speaker 1 emits voice is used to select the target microphone 25. This makes it possible, for example, to use a BF microphone M suited to the position of the speaker 1 and the direction in which the voice 5 is emitted, so that the voice 5 emitted by the speaker 1 can be collected with high quality.

One conceivable method of collecting sound from a sound source is to use noise cancellation, which removes sounds other than the target sound. For example, Patent Literature 1 describes a noise cancellation method based on beamforming technology using a single microphone array. In this method, an image processing device separate from the microphone array is used to detect the arrangement of the persons to be collected, and a noise direction is set based on that arrangement. Noise is then canceled by subtracting the sound from the noise direction from the sound from the direction in which the sound collection target exists.

However, for example, when the person to be collected turns his or her back to the microphone array, the person's voice is collected from the side opposite to the speaking direction, so it is difficult in the first place to collect the target sound with high quality. In addition, depending on the positional relationship between the sound collection target and the noise source, the noise may be collected at a higher level than the target sound. In this case, the speech information that constitutes the target sound has to be extracted from the noise information, so the quality of the voice may deteriorate.

In this embodiment, the position Q and the speaking direction 2 of the sound source to be collected (speaker 1) are detected as sound source information. Based on this sound source information, a plurality of sound collection devices whose sound collection direction 3 can be set in an arbitrary direction are controlled to collect the voice 5 of the speaker 1. This makes it possible to individually and simultaneously collect the voices 5 emitted by a plurality of speakers 1 facing various directions.
Even if a plurality of speakers 1 speak at the same time, the voice data 6 of each speaker 1 can be collected as a separate object, one for each utterance. This facilitates handling of the voice data 6.

The method of selecting the target microphone 25 from the plurality of BF microphones M and setting its sound collection direction 3 is intended to create a situation in which the voice 5 of the speaker 1 can be collected with good sound quality. It can be said that this is a method of improving the sound quality of the original data at a stage before any noise is canceled.
Thus, since the sound collection method performed by the sound collection system 100 is not noise removal, it is possible to provide voice data 6 that can be heard clearly when reproduced.

<Other embodiments>
The present technology is not limited to the embodiments described above, and various other embodiments can be implemented.

The above describes a method of collecting sound by setting one beam 7 for each BF microphone M. The present technology is not limited to this, and, for example, a plurality of beams 7 (sound collection directions 3) can be set for one BF microphone M. As a result, even when there are more speakers 1 than BF microphones M, for example, high-quality sound collection can be achieved for each speaker 1.

In the configuration described with reference to FIG. 1, the beamforming processing is executed by the sound collection processing unit 23. Alternatively, for example, each BF microphone M may be configured to execute the beamforming processing itself. In this case, each BF microphone M executes beamforming processing that collects sound waves in the sound collection direction 3 specified by a sound collection direction signal, and each BF microphone M outputs voice data 6 for the sound collection direction 3. Even with such a configuration, the voice 5 of the speaker 1 can be collected with high quality.

As a sound collection device whose sound collection direction 3 can be set, a unidirectional microphone or the like may be used instead of the BF microphone M. In this case, for example, a large number of unidirectional microphones are arranged around the speaker 1. A unidirectional microphone having a sound collection direction 3 that matches the speaking direction 2 of the speaker 1 is then selected and used as the target microphone 25. Even with such a configuration, the voice 5 of the speaker 1 can be collected with high quality.

The above describes the case where the information processing method according to the present technology is executed by the computer (controller) of the sound collection system. However, the information processing method and the program according to the present technology may be executed by the computer of the sound collection system together with another computer capable of communicating with it via a network or the like.

That is, the information processing method and the program according to the present technology can be executed not only in a computer system configured by a single computer but also in a computer system in which a plurality of computers operate in conjunction with each other. In the present disclosure, a system means a set of multiple components (devices, modules (parts), and the like), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.

Execution of the information processing method and the program according to the present technology by a computer system includes, for example, both the case where the process of acquiring sound source information and the process of selecting a target microphone are executed by a single computer and the case where each process is executed by a different computer. Execution of each process by a given computer also includes causing another computer to execute part or all of the process and obtaining the result.

That is, the information processing method and the program according to the present technology can also be applied to a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.

It is also possible to combine at least two of the characteristic portions according to the present technology described above. That is, the various characteristic portions described in the embodiments may be combined arbitrarily, without distinction between the embodiments. The various effects described above are merely examples and are not limiting, and other effects may also be exhibited.

In the present disclosure, "same", "equal", "orthogonal", and the like are concepts that include "substantially the same", "substantially equal", "substantially orthogonal", and the like. For example, states that fall within a predetermined range (for example, a range of ±10%) with respect to "exactly the same", "exactly equal", "exactly orthogonal", and the like are also included.

Note that the present technology can also adopt the following configurations.
(1) an information acquisition unit that acquires sound source information indicating the position of a sound source and the direction in which the sound source emits sound;
A sound collection control unit that selects, based on the sound source information, at least one target device used for collecting sound emitted by the sound source from a plurality of sound collection devices arranged around the sound source and capable of setting a sound collection direction. An information processing device comprising and.
(2) The information processing device according to (1), wherein
the sound collection control unit sets the sound collection direction of the target device on the basis of the sound source information.
(3) The information processing device according to (2), wherein
the sound collection control unit sets the direction from the target device toward the sound source as the sound collection direction of the target device.
(4) The information processing device according to any one of (1) to (3), wherein
the sound collection control unit determines, on the basis of the direction in which the sound source emits sound, a sound collection device capable of collecting direct sound emitted by the sound source, and selects that sound collection device as the target device.
(5) The information processing device according to (4), wherein
each of the plurality of sound collection devices is configured such that its sound collection direction can be set within an allocation range allocated according to its arrangement, and
the sound collection control unit selects, as the target device, a sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range.
(6) The information processing device according to (5), wherein,
when there is no sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range, the sound collection control unit selects, as the target device, the sound collection device that is capable of collecting sound along the direction in which the sound source emits sound and that is closest to the sound source.
(7) The information processing device according to any one of (1) to (6), wherein
the information acquisition unit acquires the sound source information for each of a plurality of sound sources, and
the sound collection control unit selects the target device for each of the plurality of sound sources on the basis of the sound source information for each of the plurality of sound sources.
(8) The information processing device according to (7), wherein
the sound collection control unit selects, as the target device, a sound collection device whose sound collection direction can be set so as to collect direct sound emitted by the sound source to be processed and not to collect direct sound emitted by another sound source different from the sound source to be processed.
(9) The information processing device according to any one of (1) to (8), further comprising
a sound collection processing unit that generates, on the basis of an output of the at least one target device, sound data representing the sound emitted by the sound source.
(10) The information processing device according to (9), wherein
the plurality of sound collection devices includes a plurality of candidate devices whose sound collection directions are set in advance,
the sound collection control unit selects the target device from among the plurality of candidate devices, and
the sound collection processing unit keeps any candidate device not selected as the target device on standby in a sound collecting state.
(11) The information processing device according to (9) or (10), wherein
the sound collection control unit selects, for a single sound source, a plurality of target devices from among the plurality of sound collection devices.
(12) The information processing device according to (11), wherein
the sound collection processing unit combines the data collected by the plurality of target devices to generate the sound data of the sound source.
(13) The information processing device according to any one of (9) to (12), wherein
the sound source is a speaker, and
the direction in which the sound source emits sound is the speech direction of the speaker.
(14) The information processing device according to (13), wherein
the information acquisition unit estimates the speech direction of the speaker by executing bone detection on the speaker on the basis of image data obtained by capturing an image of the speaker.
(15) The information processing device according to (13) or (14), wherein
the information acquisition unit detects a gesture of the speaker, and
the sound collection processing unit controls, according to the gesture of the speaker, sound collection processing for collecting the voice of the speaker.
(16) The information processing device according to (15), wherein
the sound collection processing unit preferentially executes the sound collection processing for the speaker when a gesture of the speaker raising a hand is detected, and stops the sound collection processing for the speaker when a gesture of the speaker covering the mouth with a hand is detected.
(17) The information processing device according to any one of (13) to (16), wherein
the sound collection processing unit separates, from the data collected by the target device, the voice of the speaker and the sounds made by the speaker's movements.
(18) The information processing device according to any one of (1) to (17), wherein
each sound collection device is a microphone array in which a plurality of microphones are arranged, and
the sound collection direction is the direction of a beam set by beamforming processing for the microphone array.
(19) An information processing method executed by a computer system, the method comprising:
acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound; and
selecting, on the basis of the sound source information, at least one target device to be used to collect the sound emitted by the sound source from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection directions are settable.
(20) A program that causes a computer system to execute the steps of:
acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound; and
selecting, on the basis of the sound source information, at least one target device to be used to collect the sound emitted by the sound source from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection directions are settable.
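As a non-limiting illustration, the selection described in configurations (1) to (6) could be sketched as follows. This is only one possible reading, not the implementation of the present technology. The names `Source`, `Device`, and `select_target`, the two-dimensional geometry, and the thresholds `front_deg` and `center_tol_deg` are all assumptions introduced for this sketch. In particular, a device is treated here as able to collect direct sound when it lies within `front_deg` of the emission direction and the source lies inside its allocation range, and the emission direction is treated as "corresponding to the center direction" when it points head-on against that center direction.

```python
import math
from dataclasses import dataclass


@dataclass
class Source:
    x: float
    y: float
    emit_deg: float  # direction in which the source emits sound (degrees)


@dataclass
class Device:
    name: str
    x: float
    y: float
    center_deg: float      # center direction of the allocation range
    half_width_deg: float  # half-width of the allocation range


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def bearing(from_x, from_y, to_x, to_y):
    """Direction from one point toward another, in degrees [0, 360)."""
    return math.degrees(math.atan2(to_y - from_y, to_x - from_x)) % 360.0


def select_target(source, devices, front_deg=60.0, center_tol_deg=15.0):
    """Return (target device, beam direction), or None if no device qualifies."""
    # Devices assumed able to collect direct sound: roughly in front of the
    # source, with the source inside the device's allocation range.
    capable = []
    for d in devices:
        to_dev = bearing(source.x, source.y, d.x, d.y)
        to_src = bearing(d.x, d.y, source.x, source.y)
        if (angle_diff(source.emit_deg, to_dev) <= front_deg
                and angle_diff(d.center_deg, to_src) <= d.half_width_deg):
            capable.append(d)
    if not capable:
        return None
    # Prefer devices whose center direction faces the emission direction
    # head-on; otherwise fall back to all capable devices.
    centered = [d for d in capable
                if angle_diff(source.emit_deg, d.center_deg + 180.0) <= center_tol_deg]
    pool = centered or capable
    # Within the pool, take the device closest to the source.
    target = min(pool, key=lambda d: math.hypot(d.x - source.x, d.y - source.y))
    # Beam direction: from the target device toward the source.
    return target, bearing(target.x, target.y, source.x, source.y)


# Hypothetical layout: four devices on the walls of a 4 x 4 room.
mics = [
    Device("M1", 0.0, 2.0, 0.0, 45.0),
    Device("M2", 2.0, 4.0, 270.0, 45.0),
    Device("M3", 4.0, 2.0, 180.0, 45.0),
    Device("M4", 2.0, 0.0, 90.0, 45.0),
]
result = select_target(Source(2.0, 2.0, 0.0), mics)
if result is not None:
    target, beam_deg = result
    print(target.name, beam_deg)
```

In this sketch the fallback of configuration (6) is expressed by `pool = centered or capable`: when no capable device is head-on to the emission direction, the closest capable device is chosen instead. For a plurality of sound sources as in configuration (7), the same function could simply be called once per source.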

Ｍ、Ｍ１～Ｍ４…ＢＦマイク
１、１Ａ～１Ｄ…発話者
２…発話方向
３…集音方向
５…音声
１０…検出カメラ
１１…記憶部
１２…制御プログラム
１６…マイク
２０…コントローラ
２１…画像処理部
２２…集音制御部
２３…集音処理部
２５…対象マイク
２６…候補マイク
３５…音源分離部
４１…割当範囲
１００…集音システム
M, M1 to M4... BF microphone
1, 1A to 1D... Speaker
2... Speech direction
3... Sound collection direction
5... Voice
10... Detection camera
11... Storage unit
12... Control program
16... Microphone
20... Controller
21... Image processing unit
22... Sound collection control unit
23... Sound collection processing unit
25... Target microphone
26... Candidate microphone
35... Sound source separation unit
41... Allocation range
100... Sound collection system

Claims

An information processing device comprising:
an information acquisition unit that acquires sound source information indicating a position of a sound source and a direction in which the sound source emits sound; and
a sound collection control unit that selects, on the basis of the sound source information, at least one target device to be used to collect the sound emitted by the sound source from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection directions are settable.

The information processing device according to claim 1, wherein
the sound collection control unit sets the sound collection direction of the target device on the basis of the sound source information.

The information processing device according to claim 2, wherein
the sound collection control unit sets the direction from the target device toward the sound source as the sound collection direction of the target device.

The information processing device according to claim 1, wherein
the sound collection control unit determines, on the basis of the direction in which the sound source emits sound, a sound collection device capable of collecting direct sound emitted by the sound source, and selects that sound collection device as the target device.

The information processing device according to claim 4, wherein
each of the plurality of sound collection devices is configured such that its sound collection direction can be set within an allocation range allocated according to its arrangement, and
the sound collection control unit selects, as the target device, a sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range.

The information processing device according to claim 5, wherein,
when there is no sound collection device for which the direction in which the sound source emits sound corresponds to the center direction of the allocation range, the sound collection control unit selects, as the target device, the sound collection device that is capable of collecting sound along the direction in which the sound source emits sound and that is closest to the sound source.

The information processing device according to claim 1, wherein
the information acquisition unit acquires the sound source information for each of a plurality of sound sources, and
the sound collection control unit selects the target device for each of the plurality of sound sources on the basis of the sound source information for each of the plurality of sound sources.

The information processing device according to claim 7, wherein
the sound collection control unit selects, as the target device, a sound collection device whose sound collection direction can be set so as to collect direct sound emitted by the sound source to be processed and not to collect direct sound emitted by another sound source different from the sound source to be processed.

The information processing device according to claim 1, further comprising
a sound collection processing unit that generates, on the basis of an output of the at least one target device, sound data representing the sound emitted by the sound source.

The information processing device according to claim 9, wherein
the plurality of sound collection devices includes a plurality of candidate devices whose sound collection directions are set in advance,
the sound collection control unit selects the target device from among the plurality of candidate devices, and
the sound collection processing unit keeps any candidate device not selected as the target device on standby in a sound collecting state.

The information processing device according to claim 9, wherein
the sound collection control unit selects, for a single sound source, a plurality of target devices from among the plurality of sound collection devices.

The information processing device according to claim 11, wherein
the sound collection processing unit combines the data collected by the plurality of target devices to generate the sound data of the sound source.

The information processing device according to claim 9, wherein
the sound source is a speaker, and
the direction in which the sound source emits sound is the speech direction of the speaker.

The information processing device according to claim 13, wherein
the information acquisition unit estimates the speech direction of the speaker by executing bone detection on the speaker on the basis of image data obtained by capturing an image of the speaker.

The information processing device according to claim 13, wherein
the information acquisition unit detects a gesture of the speaker, and
the sound collection processing unit controls, according to the gesture of the speaker, sound collection processing for collecting the voice of the speaker.

The information processing device according to claim 15, wherein
the sound collection processing unit preferentially executes the sound collection processing for the speaker when a gesture of the speaker raising a hand is detected, and stops the sound collection processing for the speaker when a gesture of the speaker covering the mouth with a hand is detected.

The information processing device according to claim 13, wherein
the sound collection processing unit separates, from the data collected by the target device, the voice of the speaker and the sounds made by the speaker's movements.

The information processing device according to claim 1, wherein
each sound collection device is a microphone array in which a plurality of microphones are arranged, and
the sound collection direction is the direction of a beam set by beamforming processing for the microphone array.

An information processing method executed by a computer system, the method comprising:
acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound; and
selecting, on the basis of the sound source information, at least one target device to be used to collect the sound emitted by the sound source from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection directions are settable.

A program that causes a computer system to execute the steps of:
acquiring sound source information indicating a position of a sound source and a direction in which the sound source emits sound; and
selecting, on the basis of the sound source information, at least one target device to be used to collect the sound emitted by the sound source from among a plurality of sound collection devices that are arranged around the sound source and whose sound collection directions are settable.