JP5564873B2 - Sound collection processing device, sound collection processing method, and program

Info

Publication number: JP5564873B2
Application number: JP2009220467A
Authority: JP
Inventors: 智佳子松本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-09-25
Filing date: 2009-09-25
Publication date: 2014-08-06
Anticipated expiration: 2029-09-25
Also published as: JP2011071702A

Description

The embodiments discussed herein relate to audio signal processing techniques.

Several techniques are known for controlling the sound collection directivity of a microphone so that a speaker's utterance is collected clearly.

One such technique is used in a camera-integrated microphone for video conferencing, in which a camera whose angle of view equals the beam width of a single directional microphone is integrated with that microphone so that the directivity and the angle of view coincide. In this camera-integrated microphone, the image of the speaker's face is detected from the image captured by the camera. The camera is then steered so that the detected face image lies at the center of the captured image, which points the center of the microphone's sound collection directivity (hereinafter sometimes simply "directivity") at the speaker. For this camera-integrated microphone, a technique is known that controls the direction of the microphone's directivity according to the number of face images detected in the captured image. In this first technique, when the number is odd, the directivity is pointed at the speaker closest to the microphone; when the number is even, it is pointed between the closest speaker and the second-closest speaker, so that the voices of the speakers in the conference are captured accurately.

Another such technique controls the sharpness of the microphone's directivity based on the size, in the captured image, of the image of a person included in that captured image. In this second technique, when the image of the person is large in the captured image, it is judged that the photographer's intent is focused on the person, and the directivity of the microphone is sharpened so that the person's utterance is collected clearly. When the image of the person is small in the captured image, on the other hand, it is judged that the intent is to capture the whole surrounding environment including the person, and the directivity of the microphone is broadened so that ambient sound is collected along with the person's utterance.

In addition, as a technique related to the embodiments discussed herein, a third technique is known that, in an audio signal obtained by collecting sound from sound sources located in a plurality of directions, emphasizes the sound emitted by a sound source in a predetermined direction and suppresses ambient noise. In this technique, sound from sound sources located in a plurality of directions is collected by a plurality of microphones (a microphone array), and the time-domain audio signal output from each microphone is converted into a frequency-domain audio signal, for example by a Fourier transform. For these frequency-domain audio signals, the phase difference at the same frequency is calculated for each frequency, and the probability that a sound source exists in the predetermined direction is determined for each frequency based on that phase difference. Based on this probability, a suppression function that suppresses audio signal components originating from sound sources other than the one in the predetermined direction is obtained, and the frequency-domain audio signal is multiplied by this suppression function. The product is then restored to a time-domain signal, for example by an inverse Fourier transform, which yields an audio signal based on the sound source in the predetermined direction.

Patent Document 1: JP 2009-49734 A
Patent Document 2: JP 2009-65587 A
Patent Document 3: JP 2007-318528 A

When a plurality of speakers are present within the sound collection range of the microphone, controlling the direction of the directivity as in the first technique described above may fail to capture the voice of the person speaking if that person is not the one closest to the microphone. With the second technique described above, it is difficult to clearly collect simultaneous utterances by a plurality of persons.

The present invention has been made in view of the above problems, and the problem it seeks to solve is to clearly collect simultaneous sounds produced by a plurality of sound generators.

One of the sound collection devices described later in this specification includes sound collection processing means, acquisition means, and sound collection directivity range setting means. The sound collection processing means generates an output sound signal whose sound collection directivity is controlled, based on a plurality of collected sound signals picked up by a microphone array having a plurality of microphones whose relative positions are fixed. The acquisition means acquires an input of information on the number and arrangement of sound generators present within the sound collection range of the microphone array. The sound collection directivity range setting means sets, for each sound generator, the direction of the sound collection directivity pointed from the microphone array toward that sound generator, based on the arrangement information acquired by the acquisition means. In addition, the sound collection directivity range setting means sets, for each sound generator, the sharpness of the sound collection directivity pointed toward that sound generator, based on the number information acquired by the acquisition means. In a sound collection device having these means, the sound collection processing means generates and outputs an output sound signal whose sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting means.

One of the sound collection methods described later in this specification generates an output sound signal whose sound collection directivity is controlled, based on a plurality of collected sound signals picked up by a microphone array having a plurality of microphones whose relative positions are fixed.

In this method, an input of information on the number and arrangement of sound generators present within the sound collection range of the microphone array is first acquired. Next, the direction of the sound collection directivity pointed from the microphone array toward each sound generator is set for each sound generator based on the acquired arrangement information. Together with this, the sharpness of the sound collection directivity pointed toward each sound generator is set for each sound generator based on the acquired number information. An output sound signal whose sound collection directivity is controlled to the set direction and sharpness is then generated and output.

One of the programs described later in this specification causes a computer to generate an output sound signal whose sound collection directivity is controlled, based on a plurality of collected sound signals picked up by a microphone array having a plurality of microphones whose relative positions are fixed. When executed by the computer, this program causes the computer to perform an acquisition process, a sound collection directivity range setting process, and a sound collection process. The acquisition process acquires an input of information on the number and arrangement of sound generators present within the sound collection range of the microphone array. The sound collection directivity range setting process sets, for each sound generator, the direction of the sound collection directivity pointed from the microphone array toward that sound generator, based on the arrangement information acquired in the acquisition process. This process also sets, for each sound generator, the sharpness of the sound collection directivity pointed toward that sound generator, based on the number information acquired in the acquisition process. The sound collection process generates and outputs an output sound signal whose sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting process.

The sound collection device described later in this specification can clearly collect simultaneous sounds produced by a plurality of sound generators.

A first example of the configuration of a sound collection system.
An example of the frequency characteristic of the phase difference range of a collected sound signal between two microphones.
An example of the data that the face position detection system acquires from a captured image.
An explanatory diagram (part 1) of the setting of the sharpness of the sound collection directivity.
An explanatory diagram (part 2) of the setting of the sharpness of the sound collection directivity.
An explanatory diagram (part 3) of the setting of the sharpness of the sound collection directivity.
An explanatory diagram (part 1) of the extraction of sound generators to be excluded from the sound sources of the output sound.
An explanatory diagram (part 2) of the extraction of sound generators to be excluded from the sound sources of the output sound.
A second example of the configuration of a sound collection system.
The configuration of a computer operated as a sound collection device.
A flowchart illustrating the control processing executed by the computer.

First, FIG. 1 will be described. FIG. 1 illustrates a first example of the configuration of a sound collection system. This sound collection system includes a sound collection device 1, a microphone array 2, a camera 3, and a face position detection system 4.

The sound collection device 1 performs signal processing on the collected sound signals from the microphone array 2 and outputs an output sound whose sound collection directivity is controlled.
The microphone array 2 consists of a plurality of microphones arranged, for example, in a single horizontal row. The relative positions of the microphones constituting the microphone array 2 are fixed.

The sound collection device 1 includes a sound generator information acquisition unit 10, a sound collection directivity range setting unit 20, a sound collection processing unit 30, and an excluded sound generator extraction unit 40.
The sound generator information acquisition unit 10 acquires an input of information on the number and arrangement of sound generators present within the sound collection range of the microphone array 2. A sound generator may be anything that emits sound: for example, an animal such as a dog or a cat, a sound emitting device equipped with a speaker, or a machine that emits operating noise even though producing sound is not its intended purpose. In the sound collection system of FIG. 1, however, the sound generators present within the sound collection range of the microphone array 2 are assumed to be humans who speak, and the sound generator information acquisition unit 10 acquires, from the face position detection system 4, the input of information on the number and arrangement of humans present within the sound collection range of the microphone array 2.

The sound collection directivity range setting unit 20 sets, for each sound generator (a human in this embodiment), the direction of the sound collection directivity pointed from the microphone array 2 toward that sound generator, based on the arrangement information acquired by the sound generator information acquisition unit 10. The sound collection directivity range setting unit 20 further sets, for each sound generator (human), the sharpness of the sound collection directivity pointed from the microphone array 2 toward that sound generator, based on the number information acquired by the sound generator information acquisition unit 10. The method by which the sound collection directivity range setting unit 20 sets the direction and sharpness of the sound collection directivity will be described later.

The sound collection processing unit 30 generates an output sound (output audio) signal whose sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting unit 20, based on the plurality of collected sound signals picked up by the individual microphones constituting the microphone array 2. In this embodiment, the sound collection processing unit 30 performs this directivity control as follows, using the known method disclosed in Patent Document 3 mentioned above.

The sound collection processing unit 30 includes a directional sound reception processing unit 31 and an output audio signal generation unit 32.
The directional sound reception processing unit 31 first performs analog-to-digital conversion on each of the plurality of collected sound signals picked up by the microphone array 2 to obtain time-domain collected sound signal data. It then applies a time-to-frequency transform, such as a fast Fourier transform, to this data to obtain the frequency spectrum data of each collected sound signal.

Next, taking the frequency spectrum data of one of the collected sound signals as a reference, the directional sound reception processing unit 31 calculates, for each spectral frequency, the phase difference between the reference spectrum and the frequency spectrum data of each of the other collected sound signals.
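The transform and phase-difference steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of Patent Document 3: the function names `dft` and `phase_differences` are hypothetical, a naive discrete Fourier transform stands in for the fast Fourier transform named in the text, and the sign convention (phase of the other signal minus phase of the reference) is an assumption.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def phase_differences(reference, other):
    """Phase of `other` relative to `reference` at each frequency bin.

    Values lie in [-pi, pi]. Bins with almost no energy give
    meaningless phases and would need masking in practice.
    """
    ref_spec = dft(reference)
    oth_spec = dft(other)
    # The argument of o * conj(r) equals arg(o) - arg(r), wrapped.
    return [cmath.phase(o * r.conjugate()) for r, o in zip(ref_spec, oth_spec)]


# Example: a tone in bin 2 that reaches the second microphone 0.5 rad later.
N = 16
mic_ref = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
mic_other = [math.cos(2 * math.pi * 2 * t / N - 0.5) for t in range(N)]
diffs = phase_differences(mic_ref, mic_other)
```

With more than two microphones, `phase_differences` would be called once per non-reference microphone, giving one list of phase differences per microphone pair.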

Next, the directional sound reception processing unit 31 obtains the weighting values to be applied to the frequency spectrum of a collected sound signal in order to obtain a sound collection directivity with the direction and sharpness set by the sound collection directivity range setting unit 20. It then weights the frequency spectrum of the reference collected sound signal mentioned above by multiplying it by this weighting value at each spectral frequency. The weighting value for each spectral frequency is obtained, for example, as follows.

First, from the direction and sharpness of the sound collection directivity set by the sound collection directivity range setting unit 20, the directional sound reception processing unit 31 obtains the frequency characteristic of the range of inter-microphone phase differences that can arise in the collected sound signals when sound arriving from within that directivity range is picked up by the microphone array 2.

Here, FIG. 2 will be described. FIG. 2 is an example of the frequency characteristic of the phase difference range of a collected sound signal between two microphones. It shows the range of inter-microphone phase differences that can arise when sound arriving from within a sound collection directivity range whose sharpness is a beam width of ±θ_def is picked up by the two microphones. In FIG. 2, the horizontal axis is the frequency of the sound emitted by the sound source, and the vertical axis is the phase difference between the two microphones when that sound is picked up. The relationship between frequency and phase difference shown by this characteristic can be calculated geometrically, with the direction angle of the sound source, measured about the midpoint between the positions of the two microphones, as a parameter.
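One common way to carry out the geometric calculation mentioned above is the far-field plane-wave model, sketched below in Python. The model itself is an assumption, as are the function names, the 340 m/s speed of sound, the microphone spacing in the example, and the convention that the direction angle is measured from the perpendicular to the line joining the two microphones.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value


def phase_difference(freq_hz, mic_spacing_m, angle_rad):
    """Inter-microphone phase difference for a plane wave arriving at
    `angle_rad` from broadside (far-field model, an assumption)."""
    delay = mic_spacing_m * math.sin(angle_rad) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay


def phase_range(freq_hz, mic_spacing_m, center_rad, half_width_rad):
    """Phase-difference range for a beam of center +/- half_width.

    The beam is clamped to [-90, +90] degrees, where sin() is
    monotonic, so the range is spanned by the two beam edges.
    """
    lo = max(center_rad - half_width_rad, -math.pi / 2)
    hi = min(center_rad + half_width_rad, math.pi / 2)
    return (
        phase_difference(freq_hz, mic_spacing_m, lo),
        phase_difference(freq_hz, mic_spacing_m, hi),
    )


# Example: 1 kHz, 3.4 cm spacing, beam of +/-30 degrees about broadside.
lo_1k, hi_1k = phase_range(1000.0, 0.034, 0.0, math.radians(30))
lo_2k, hi_2k = phase_range(2000.0, 0.034, 0.0, math.radians(30))
```

Under this model the permitted phase-difference range widens in proportion to frequency, and a narrower beam gives a narrower range at every frequency.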

To obtain this frequency characteristic, tables showing the relationship between the sharpness of the sound collection directivity and the frequency characteristic of the inter-microphone phase difference range may, for example, be compiled into a database in advance for each direction of the sound collection directivity and stored in the directional sound reception processing unit 31. In that case, the directional sound reception processing unit 31 refers to this database and reads from the table the entry associated with the information set by the sound collection directivity range setting unit 20, thereby obtaining the frequency characteristic of the phase difference range of the collected sound signals.

Next, a weighting value based on the phase difference of each spectrum, to be applied to the frequency spectrum of the collected sound signal, is set for each spectral frequency. The weighting value applied to each spectrum is obtained as follows.

First, referring to the frequency characteristic of the phase difference range obtained earlier, the phase difference range at the frequency of the spectrum to be weighted is obtained from that characteristic.
Next, the weighting value is set based on the relationship between the phase difference calculated earlier for each spectral frequency and this phase difference range. For example, for a spectrum whose phase difference lies within the phase difference range and within a predetermined value of the center of the range, the weighting value is set to 1.0; for a spectrum whose phase difference lies outside the phase difference range, the weighting value is set to 0.0. For a spectrum whose phase difference lies within the phase difference range but at least the predetermined value away from the center of the range, the weighting value is set between 1.0 and 0.0 according to the distance from the center, for example by linear interpolation, so that it is continuous with the values above at the boundaries.
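The weighting rule above can be sketched in Python as follows. The function name `weight` and the parameter `flat_dist` (the "predetermined value" measured from the center of the range) are hypothetical names chosen for illustration.

```python
def weight(phase_diff, range_lo, range_hi, flat_dist):
    """Weighting value for one spectral bin.

    1.0 within `flat_dist` of the center of [range_lo, range_hi],
    0.0 at or beyond the edges of the range, and a linear ramp in
    between so that the value is continuous at both joins.
    """
    center = 0.5 * (range_lo + range_hi)
    half_width = 0.5 * (range_hi - range_lo)
    dist = abs(phase_diff - center)
    if dist <= flat_dist:
        return 1.0
    if dist >= half_width:
        return 0.0
    # Here flat_dist < dist < half_width, so the divisor is positive.
    return (half_width - dist) / (half_width - flat_dist)


# Example with a phase difference range of [-1.0, 1.0] rad and a
# flat region of 0.5 rad around the center.
w_center = weight(0.0, -1.0, 1.0, 0.5)
w_ramp = weight(0.75, -1.0, 1.0, 0.5)
w_outside = weight(2.0, -1.0, 1.0, 0.5)
```

In use, the range passed in would be the one obtained for the bin's own frequency, so the same phase difference can receive different weights at different frequencies.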

The weighting value applied to each spectrum is obtained as described above. This weighting value corresponds to what Patent Document 3 calls the "suppression function".
Alternatively, tables showing in advance the relationship between the phase difference, the phase difference range, and the weighting value described above may be compiled into a database for each spectral frequency and stored in the directional sound reception processing unit 31. In that case, the directional sound reception processing unit 31 refers to this database and reads from the table, and sets, the weighting value associated with the phase difference and the phase difference range of each spectrum.

In this embodiment, the weighting values obtained as described above, one set for each of the phase differences calculated earlier, are arithmetically averaged at each spectral frequency to obtain the per-frequency weighting value applied to the frequency spectrum of the reference collected sound signal.
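A minimal Python sketch of this averaging step, followed by the multiplication onto the reference spectrum, is shown below. The function names and the list-of-lists layout (one list of per-bin weights for each non-reference microphone) are assumptions made for illustration.

```python
def average_weights(weights_per_pair):
    """Average the per-bin weights over all microphone pairs.

    `weights_per_pair[i][k]` is the weight for pair i at bin k.
    """
    n_pairs = len(weights_per_pair)
    return [sum(column) / n_pairs for column in zip(*weights_per_pair)]


def apply_weights(reference_spectrum, weights):
    """Multiply each bin of the reference spectrum by its weight."""
    return [w * x for w, x in zip(weights, reference_spectrum)]


# Example: two microphone pairs, three frequency bins.
pair_weights = [
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]
avg = average_weights(pair_weights)
weighted = apply_weights([2 + 0j, 1j, 4 + 0j], avg)
```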

The output audio signal generation unit 32 applies, to the frequency spectrum of the collected sound signal weighted as described above by the directional sound reception processing unit 31, the inverse of the transform used in the directional sound reception processing unit 31 (for example, an inverse fast Fourier transform), converts it into time-domain audio signal data, and outputs it. This audio signal data is the output audio signal, generated from the plurality of collected sound signals picked up by the microphone array 2, whose sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting unit 20.
The sound collection processing unit 30 is configured as described above.
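The inverse transform step can be sketched in Python as follows. As an assumption for illustration, a naive inverse discrete Fourier transform stands in for the inverse fast Fourier transform named in the text, and the function names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Naive forward DFT (stand-in for an FFT)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Naive inverse DFT returning real time-domain samples.

    Only the real part is kept. For the imaginary part to vanish,
    the weights applied to bins k and N - k must be equal.
    """
    n = len(spectrum)
    return [
        sum(x * cmath.exp(2j * math.pi * k * t / n) for k, x in enumerate(spectrum)).real / n
        for t in range(n)
    ]


# Example: unit weights leave the signal unchanged, zero weights mute it.
signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 0.0]
spectrum = dft(signal)
restored = idft_real([1.0 * x for x in spectrum])
muted = idft_real([0.0 * x for x in spectrum])
```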

The excluded sound generator extraction unit 40 extracts, based on the arrangement information of the sound generators (humans in this embodiment) acquired by the sound generator information acquisition unit 10, the sound generators (humans) to be excluded from the sound sources of the output audio generated by the sound collection processing unit 30. When the excluded sound generator extraction unit 40 has performed this extraction, the sound collection directivity range setting unit 20 sets the direction and sharpness of the sound collection directivity pointed from the microphone array 2 toward the sound generators (humans) for each sound generator other than those extracted by the excluded sound generator extraction unit 40. The method by which the excluded sound generator extraction unit 40 extracts the sound generators to be excluded will be described later.
The sound collection device 1 includes the components described above.

The camera 3 repeatedly captures, at a fixed magnification and at predetermined time intervals, an image of the sound collection range of the microphone array 2. In this embodiment, the camera 3 is assumed to be placed at substantially the same position as the microphone array 2.

The face position detection system 4 (hereinafter simply "detection system 4") applies image processing to the image captured by the camera 3 to obtain information on the number and arrangement of sound generators (humans in this embodiment) present within the sound collection range of the microphone array 2. This information is input to the sound collection device 1 and acquired by the sound generator information acquisition unit 10.

The image processing performed by the detection system 4 will now be described.
The detection system 4 first detects images of human faces in the image captured by the camera 3. A well-known technique is used for this face detection. In this embodiment, the image of a partial region cut out of the captured image is compared with each of the face pattern images read from a face pattern database prepared in advance, and the degree of correlation between the two is calculated. This degree of correlation is calculated comprehensively based on, for example, the contour of the face, the relative positions of the eyes, nose, and mouth, and the color of the face. When the degree of correlation with some face pattern exceeds a predetermined value, the partial region is taken as a detected human face image. By performing this processing over the entire captured image while changing the position and size of the partial region, the number of face images included in the captured image and the position and size of each face image in the captured image are detected. Of these results, the number of face images included in the captured image is output from the detection system 4 as the information, obtained from an image of the sound collection range of the microphone array 2, on the number of sound generators present within that range.
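The scan over partial regions can be illustrated with the simplified Python sketch below. It is not the composite correlation described above: as a labeled substitution, plain normalized cross-correlation on grayscale values stands in for the score built from contour, feature positions, and color, only the position of the partial region is varied (not its size), and a single pattern replaces the face pattern database. All names are hypothetical.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized 2-D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den else 0.0  # flat patches score 0


def detect(image, template, threshold):
    """Return (top, left) of every window whose score exceeds threshold."""
    th, tw = len(template), len(template[0])
    hits = []
    for top in range(len(image) - th + 1):
        for left in range(len(image[0]) - tw + 1):
            patch = [row[left:left + tw] for row in image[top:top + th]]
            if ncc(patch, template) > threshold:
                hits.append((top, left))
    return hits


# Toy example: a 2x2 pattern embedded in a 4x4 image.
pattern = [[1, 0], [0, 1]]
image = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
hits = detect(image, pattern, 0.9)
```

On real images, neighboring windows over the same face would typically all exceed the threshold, so overlapping hits would need to be merged before counting faces.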

Next, based on the position in the captured image of each partial region detected in that image (that is, each face image), the detection system 4 obtains the direction angle from the microphone array 2 to the face shown in that partial region. To obtain this direction angle, a table created by actually measuring the relationship between the position of the partial region and the direction angle may, for example, be stored in the detection system 4 in advance. In that case, the detection system 4 refers to this table and reads from it the direction angle associated with the position of the partial region, thereby obtaining the direction angle from the microphone array 2 to the face.

Next, based on the size of each partial area detected from the captured image (that is, each face image), the detection system 4 obtains the distance from the microphone array 2 to the face shown in that partial area. To obtain this distance, for example, a table created by actually measuring the relationship between the size of the partial area and the distance can be stored in the detection system 4 in advance. In this case, the detection system 4 refers to this table and reads out the distance associated with the size of the partial area, thereby obtaining the distance from the microphone array 2 to the face.
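The table lookups described above (partial-area position to direction angle, partial-area size to distance) can be sketched as follows. This is only an illustration: the table values, the name `SIZE_TO_DISTANCE`, and the helper `lookup` are hypothetical, and the linear interpolation between entries and the clamping at the table ends are assumptions that the text does not specify.

```python
from bisect import bisect_left

# Hypothetical calibration table: (face-area width in pixels, distance in metres),
# sorted by width. The values are illustrative, not measured data.
SIZE_TO_DISTANCE = [(40, 4.0), (80, 2.0), (160, 1.0), (320, 0.5)]


def lookup(table, key):
    """Return the value associated with key in a sorted (key, value) table.

    Assumptions of this sketch: values between two entries are linearly
    interpolated, and keys outside the table are clamped to the end entries.
    """
    keys = [k for k, _ in table]
    i = bisect_left(keys, key)
    if i == 0:
        return table[0][1]
    if i == len(table):
        return table[-1][1]
    (k0, v0), (k1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (key - k0) / (k1 - k0)


distance = lookup(SIZE_TO_DISTANCE, 120)  # a face 120 pixels wide
```

The same helper could serve a position-to-direction-angle table by replacing the table contents.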

Next, the detection system 4 obtains the orientation of the face based on the positional relationship of the eyes, nose, and mouth shown in the face image detected from the captured image. In this process, for example, based on the positions of both eyes, the nose, and the mouth in the face image, the distance from the straight line passing through the nose position and the mouth position to the right-eye position and the distance from that line to the left-eye position are first obtained. Then, an angle indicating the orientation of the face is obtained from the ratio of these two distances. To obtain this angle, for example, a table created by actually measuring the relationship between this distance ratio and the angle can be stored in the detection system 4 in advance. In this case, the detection system 4 refers to this table and reads out the angle associated with the distance ratio obtained from the captured image, thereby obtaining the angle indicating the orientation of the face.
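The distance ratio used for the face orientation can be sketched as below. The function names and the use of pixel coordinates as (x, y) tuples are assumptions of this sketch; the conversion from the ratio to an angle would use the measured table mentioned in the text and is not shown.

```python
import math


def distance_to_line(point, a, b):
    """Perpendicular distance from point to the straight line through a and b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("nose and mouth positions coincide")
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def eye_distance_ratio(right_eye, left_eye, nose, mouth):
    """Ratio of the right-eye distance to the left-eye distance,
    both measured from the line through the nose and the mouth."""
    d_right = distance_to_line(right_eye, nose, mouth)
    d_left = distance_to_line(left_eye, nose, mouth)
    if d_left == 0:
        raise ValueError("left eye lies on the nose-mouth line")
    return d_right / d_left
```

For a face looking straight at the camera the two distances are roughly equal, so the ratio is close to 1; it moves away from 1 as the face turns.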

FIG. 3 shows the data that the detection system 4 acquires from the captured image as described above.
FIG. 3 shows a case where two persons (person A and person B) are within the photographing range of the camera 3 (that is, within the sound collection range of the microphone array 2). Through the above processing, the detection system 4 obtains the direction angle θ_A, distance d_A, and face angle θ2_A of person A and the direction angle θ_B, distance d_B, and face angle θ2_B of person B shown in FIG. 3.

Note that, instead of obtaining the direction angle θ_A and distance d_A of person A and the direction angle θ_B and distance d_B of person B, the detection system 4 may obtain from the captured image two-dimensional coordinate values (X_A, Y_A) and (X_B, Y_B) indicating the positions of person A and person B, respectively.
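As an illustration of how the two representations relate, a direction angle and distance can be converted to two-dimensional coordinates as sketched below. The axis convention (origin at the microphone array, Y pointing straight ahead, positive angles toward +X) and the name `to_xy` are assumptions of this sketch; the text does not define the coordinate system.

```python
import math


def to_xy(theta_deg, d):
    """Convert a direction angle (degrees) and a distance to (X, Y).

    Assumed convention: theta = 0 points straight ahead of the array (+Y),
    and positive theta turns toward +X.
    """
    t = math.radians(theta_deg)
    return d * math.sin(t), d * math.cos(t)


x_a, y_a = to_xy(-30.0, 2.0)  # a person 2 m away, 30 degrees to the negative side
```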

The detection system 4 outputs the direction angle θ, distance d, and face angle θ2 obtained as described above, from the image of the sound collection range of the microphone array 2, for each person whose face image is included in the captured image, as information on the arrangement of the sound generators present in the sound collection range.

As described above, each time the camera 3 captures an image of the sound collection range of the microphone array 2, the detection system 4 acquires from that image information on the number and arrangement of the sound generators present in the sound collection range and outputs it to the sound collection device 1. The information output from the detection system 4 is acquired by the sound generator information acquisition unit 10 of the sound collection device 1.
The sound collection system of FIG. 1 has the components described above.

Next, the method by which the sound collection directivity range setting unit 20 of the sound collection device 1 sets the direction and sharpness of the sound collection directivity will be described.

The sound collection directivity range setting unit 20 stores in advance an angle table that defines the relationship between the number of persons present in the sound collection range of the microphone array 2 and the angle indicating the sharpness of the sound collection directivity. In the present embodiment, this angle table associates the maximum value (the dullest value) θ_MAX (for example, 90°) of the sharpness of the sound collection directivity with the case where one person is present in the sound collection range, and the prescribed value θ_def (for example, 30°) with the case where two persons are present.

The sound collection directivity range setting unit 20 first acquires, from the sound generator information acquisition unit 10, the above-described information on the number of persons and the direction angle θ, distance d, and face angle θ2 of each person, as information on the number and arrangement of the sound generators (humans) present in the sound collection range of the microphone array 2. Here, the operation of the excluded sound generator extraction unit 40 is not considered.

Next, based on the information on the number of persons, the sound collection directivity range setting unit 20 sets the sharpness of the sound collection directivity directed toward each person. This setting process will be described with reference to FIG. 4.
When only one person is present in the sound collection range of the microphone array 2, the sound collection directivity range setting unit 20 sets, by this setting process, the sharpness of the sound collection directivity to an angle of ±θ_MAX centered on the direction of that person. In the example of (1) of FIG. 4, for person A located at a direction angle θ_A of 0°, the sound collection directivity range setting unit 20 sets the sharpness of the sound collection directivity to an angle of ±θ_MAX.

On the other hand, when two or more persons are present in the sound collection range, the sound collection directivity range setting unit 20 sets, by this setting process, the sharpness of the sound collection directivity to an angle of ±θ_def centered on the direction of each person. In the example of (2) of FIG. 4, for person A located at direction angle θ_A (< 0) and person B located at direction angle θ_B (> 0), the sound collection directivity range setting unit 20 sets the sharpness of the sound collection directivity to an angle of ±θ_def centered on the respective direction angle.
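The count-based setting described above can be sketched as follows. The constants use the example values given in the text (90° and 30°); the function name `directivity_ranges` and the representation of a range as a (low, high) pair in degrees are assumptions of this sketch.

```python
THETA_MAX = 90.0  # example value from the text: widest (dullest) half-angle
THETA_DEF = 30.0  # example value from the text: prescribed half-angle


def directivity_ranges(direction_angles, theta_max=THETA_MAX, theta_def=THETA_DEF):
    """Return one (low, high) angular range per person.

    One person: +/- theta_max around that person's direction angle.
    Two or more persons: +/- theta_def around each direction angle.
    """
    half = theta_max if len(direction_angles) == 1 else theta_def
    return [(theta - half, theta + half) for theta in direction_angles]


one_person = directivity_ranges([0.0])
two_persons = directivity_ranges([-40.0, 40.0])
```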

In this way, the sound collection directivity range setting unit 20 sets the sharpness of the sound collection directivity directed toward each sound generator based on the information on the number of sound generators present in the sound collection range of the microphone array 2. Because the sound collection processing unit 30 generates an output sound signal whose sound collection directivity is controlled to the sharpness set by the sound collection directivity range setting unit 20, this setting enables the sound collection device 1 to clearly collect sounds produced simultaneously by a plurality of sound generators.

Note that, in this setting process, the sound collection directivity range setting unit 20 may set the sharpness of the sound collection directivity directed toward each sound generator (human) based additionally on the arrangement information about the sound generators (humans) that the sound generator information acquisition unit 10 has acquired from the detection system 4.

For example, as shown in (1) of FIG. 5, the spacing between person A and person B present in the sound collection range of the microphone array 2 may be wider than in the case of (2) of FIG. 4, so that part of each sound collection directivity range extends beyond the range that the microphone array 2 can collect. That is, this is the case where
θ_A − θ_def < −θ_MAX
θ_B + θ_def > θ_MAX
(The distance from the microphone array 2 to person A and the distance from the microphone array 2 to person B are assumed to be the same.) In such a case, the sound collection directivity range setting unit 20 sets, by the setting process, the azimuth angles α and β indicating the angular ranges of the sound collection directivity for person A and person B, respectively, within the ranges given by the following expressions.
−θ_MAX < α < θ_A + θ_def
θ_B − θ_def < β < θ_MAX
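The clipping expressed by the two ranges above can be sketched as a small helper. The name `clip_range` and the (low, high) tuple representation are assumptions of this sketch; the angles are in degrees.

```python
def clip_range(center, half_width, theta_max):
    """Return (low, high) for a directivity range centered on `center`,
    clipped so that it stays within [-theta_max, +theta_max]."""
    low = max(center - half_width, -theta_max)
    high = min(center + half_width, theta_max)
    return low, high


# Person A at -70 degrees and person B at +70 degrees, with theta_def = 30
# and theta_MAX = 90 (the example values used earlier in the text).
alpha_range = clip_range(-70.0, 30.0, 90.0)
beta_range = clip_range(70.0, 30.0, 90.0)
```

With these inputs the range for person A is cut off at −θ_MAX and the range for person B at θ_MAX, which corresponds to the two expressions for α and β.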

On the other hand, as shown in (2) of FIG. 5, the spacing between person A and person B present in the sound collection range of the microphone array 2 may be narrower than in the case of (2) of FIG. 4, so that parts of the two sound collection directivity ranges overlap. (The distance from the microphone array 2 to person A and the distance from the microphone array 2 to person B are assumed to be the same.) That is, this is the case where
−θ_A + θ_B < 2θ_def
In such a case, the sound collection directivity range setting unit 20 sets, by the setting process, the angular ranges α and β of the sound collection directivity for person A and person B, respectively, within the ranges given by the following expressions.
θ_A − θ_def < α < (−θ_A + θ_B) / 2
(−θ_A + θ_B) / 2 < β < θ_B + θ_def
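The overlap test and the resulting split of the two ranges can be sketched as below. The overlap condition is taken directly from the text. The shared boundary, however, is an assumption of this sketch: the text writes it as (−θ_A + θ_B) / 2, whereas the code places it at the angular midpoint (θ_A + θ_B) / 2 between the two persons. The function names are hypothetical.

```python
def ranges_overlap(theta_a, theta_b, theta_def):
    """Overlap condition from the text: -theta_a + theta_b < 2 * theta_def."""
    return (-theta_a + theta_b) < 2 * theta_def


def split_ranges(theta_a, theta_b, theta_def):
    """Return ((low_a, high_a), (low_b, high_b)) for theta_a < theta_b.

    ASSUMPTION of this sketch: when the ranges overlap, they are separated at
    the angular midpoint (theta_a + theta_b) / 2, not at the expression
    printed in the text.
    """
    low_a, high_a = theta_a - theta_def, theta_a + theta_def
    low_b, high_b = theta_b - theta_def, theta_b + theta_def
    if ranges_overlap(theta_a, theta_b, theta_def):
        boundary = (theta_a + theta_b) / 2
        high_a = low_b = boundary
    return (low_a, high_a), (low_b, high_b)
```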

As described above, the sound collection directivity range setting unit 20 may set the sharpness of the sound collection directivity directed toward each sound generator based also on information on the spacing between the sound generators present in the sound collection range of the microphone array 2.

Furthermore, in this setting process, the sound collection directivity range setting unit 20 may set the sharpness of the sound collection directivity directed toward each sound generator based also on information on the distance between the sound generator and the microphone array 2, as follows.

The sound collection directivity range setting unit 20 stores in advance a reference value d_def for the distance between the microphone array 2 and a person present in its sound collection range. When the distance between a person and the microphone array 2 equals this reference distance d_def, the sound collection directivity range setting unit 20 sets the angle value indicating the sharpness of the sound collection directivity held in the angle table described above, as is, as the sharpness of the sound collection directivity directed toward that person.

On the other hand, in the example of FIG. 6, for person A, whose distance d_A from the microphone array 2 is shorter than the reference distance d_def, the sound collection directivity range setting unit 20 sets the sound collection directivity narrower, to an angular range of ±θ_def × (d_A / d_def) centered on the azimuth angle θ_A toward person A.

Also in the example of FIG. 6, for person B, whose distance d_B from the microphone array 2 is longer than the reference distance d_def, the sound collection directivity range setting unit 20 sets the sound collection directivity wider, to an angular range of ±θ_def × (d_B / d_def) centered on the azimuth angle θ_B toward person B.
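The distance-dependent scaling ±θ_def × (d / d_def) can be sketched as follows. The function name and the input checks are assumptions of this sketch, and no upper limit such as θ_MAX is applied because the text does not state one for this step.

```python
def scaled_half_width(theta_def, d, d_def):
    """Half-angle of the directivity range for a person at distance d:
    theta_def * (d / d_def), as given in the text."""
    if d_def <= 0 or d < 0:
        raise ValueError("distances must be positive")
    return theta_def * (d / d_def)


# With theta_def = 30 degrees and a reference distance of 2 m:
near = scaled_half_width(30.0, 1.0, 2.0)  # closer than the reference: narrower
far = scaled_half_width(30.0, 3.0, 2.0)   # farther than the reference: wider
```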

The sound collection directivity range setting unit 20 sets the sound collection directivity narrower as the distance from the microphone array 2 becomes shorter because good sound collection is possible at a short distance, so the narrower setting is intended to suppress noise other than the target sound. Conversely, the sound collection directivity is set wider as the distance becomes longer because, at a long distance, attenuation due to sound propagation becomes large in some sound collection frequency bands, and the wider setting captures as much of the sound as possible.

Note that the sound collection system according to the present embodiment is not aimed at significantly extending the distance over which sound can be collected, and therefore does not perform control that narrows the sound collection directivity in order to extend that distance.
The sound collection directivity range setting unit 20 sets the direction and sharpness of the sound collection directivity as described above.

Next, the method by which the excluded sound generator extraction unit 40 extracts the sound generators to be excluded from the sound sources of the output sound generated by the sound collection processing unit 30 will be described.
The excluded sound generator extraction unit 40 first acquires, from the sound generator information acquisition unit 10, the direction angle θ, distance d, and face angle θ2 of each person described with reference to FIG. 3, as information on the arrangement of the sound generators (humans) present in the sound collection range of the microphone array 2.

Next, based on this arrangement information, the excluded sound generator extraction unit 40 performs an extraction process that extracts the sound generators (humans) to be excluded from the sound sources of the output sound generated by the sound collection processing unit 30. This extraction process is described below.

First, FIG. 7 will be described. FIG. 7 shows a state in which, of the two persons (person A and person B) within the photographing range of the camera 3 (that is, within the sound collection range of the microphone array 2), person B is moving.

In the example of FIG. 7, among the arrangement information that the sound generator information acquisition unit 10 acquires from the detection system 4 each time the camera 3 captures an image, the values for person B change. The excluded sound generator extraction unit 40 calculates the amount of this change, more specifically, the amount of change in the position of person B between the times at which the images were captured (that is, the distance moved), obtained from the direction angle θ_B and distance d_B of person B. If this amount of change exceeds a predetermined threshold, the excluded sound generator extraction unit 40 judges that it is difficult to clearly collect the utterance of person B and extracts person B as a sound generator to be excluded from the sound sources of the output sound.
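The movement check can be sketched as follows: each (direction angle, distance) pair is converted to a position, and the distance between the positions in two successive captured images is compared with a threshold. The coordinate convention, the function names, and the threshold value in the example are assumptions of this sketch.

```python
import math


def position(theta_deg, d):
    """Position for a direction angle (degrees) and a distance.
    Assumed convention: theta = 0 is straight ahead of the array (+Y)."""
    t = math.radians(theta_deg)
    return d * math.sin(t), d * math.cos(t)


def moved_too_far(prev, curr, threshold):
    """prev and curr are (theta_deg, d) pairs from two successive images.
    Returns True when the moved distance exceeds the threshold."""
    (x0, y0), (x1, y1) = position(*prev), position(*curr)
    return math.hypot(x1 - x0, y1 - y0) > threshold


# Person B moved from 2.0 m to 3.0 m straight ahead between two images;
# the 0.5 m threshold is a hypothetical value.
exclude_b = moved_too_far((0.0, 2.0), (0.0, 3.0), 0.5)
```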

As described above, in the example of FIG. 7, the excluded sound generator extraction unit 40 extracts the sound generator to be excluded from the sound sources of the output sound signal based on the amount of change in the sound generator arrangement information acquired by the sound generator information acquisition unit 10.

Next, FIG. 8 will be described. FIG. 8 shows a state in which, of the two persons (person A and person B) within the photographing range of the camera 3 (that is, within the sound collection range of the microphone array 2), the face of person B is turned sideways with respect to the camera 3 (that is, the microphone array 2).

In the example of FIG. 8, the excluded sound generator extraction unit 40 looks at the face angle θ2_B of person B. If this value is outside a predetermined threshold range within which the face can be regarded as facing the camera 3 (that is, the microphone array 2), the excluded sound generator extraction unit 40 judges that it is difficult to clearly collect the utterance of person B and extracts person B as a sound generator to be excluded from the sound sources of the output sound.
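The face-angle check can be sketched as below. The convention that a face angle of 0° means facing the camera, the symmetric threshold range, and the 45° limit are all assumptions of this sketch; the text only says that a predetermined threshold range is used.

```python
def facing_array(face_angle_deg, limit_deg=45.0):
    """True when the face angle lies inside the assumed range [-limit, +limit]."""
    return -limit_deg <= face_angle_deg <= limit_deg


def select_excluded(face_angles, limit_deg=45.0):
    """Return the names of persons whose face angle is outside the range.
    face_angles maps a person's name to the face angle in degrees."""
    return [name for name, angle in face_angles.items()
            if not facing_array(angle, limit_deg)]


excluded = select_excluded({"A": 5.0, "B": 80.0})
```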

As described above, in the example of FIG. 8, the excluded sound generator extraction unit 40 extracts the sound generator to be excluded from the sound sources of the output sound signal based on the face orientation information contained in the arrangement information, acquired by the sound generator information acquisition unit 10, of the human who is the sound generator.

The excluded sound generator extraction unit 40 notifies the sound collection directivity range setting unit 20 of the sound generators extracted as described above. The sound collection directivity range setting unit 20 sets the direction and sharpness of the sound collection directivity directed from the microphone array 2 toward the sound generators based on the information on the number or arrangement of those sound generators other than the ones extracted by the excluded sound generator extraction unit 40.

Note that the excluded sound generator extraction unit 40 may also use other methods to extract the sound generators to be excluded from the sound sources of the output sound signal.
For example, when the detection system 4 outputs information on the presence or absence of mouth movement in each human face image detected from the captured image, the sound collection directivity range setting unit 20 can set the direction and sharpness of the sound collection directivity described above based on this information.

For example, the detection system 4 extracts the contour shape of the mouth (lips) from the face image detected in each of the images captured by the camera 3 at intervals of time, and then calculates the amount of change in this shape. If this amount of change exceeds a predetermined threshold, the detection system 4 determines that the mouth is moving; if it is below the threshold, the detection system 4 determines that the mouth is not moving. The detection system 4 outputs the determination result information on the mouth movement of each person detected from the captured images in this way to the sound collection device 1. This information is acquired by the sound generator information acquisition unit 10 of the sound collection device 1.
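One possible way to quantify the change in mouth contour is sketched below. The text does not define how the amount of change is measured, so the representation of a contour as an equal-length list of (x, y) points and the use of the mean point displacement are assumptions of this sketch, as are the function names.

```python
import math


def contour_change(prev, curr):
    """Mean displacement between corresponding points of two contours.
    ASSUMPTION: both contours have the same number of points in the same order."""
    if not prev or len(prev) != len(curr):
        raise ValueError("contours must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(prev, curr)) / len(prev)


def mouth_is_moving(prev, curr, threshold):
    """True when the amount of change exceeds the threshold."""
    return contour_change(prev, curr) > threshold


closed = [(0, 0), (2, 0), (1, 1)]
opened = [(0, 0), (2, 0), (1, 4)]
moving = mouth_is_moving(closed, opened, 0.5)
```

A change exactly equal to the threshold is treated here as no movement, which is a choice of the sketch rather than something the text states.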

Based on this determination result information acquired by the sound generator information acquisition unit 10, the excluded sound generator extraction unit 40 regards a person whose mouth has been determined not to be moving as a person who is not speaking, and extracts that person as a sound generator to be excluded from the sound sources of the output sound signal. The sound collection directivity range setting unit 20 sets the direction and sharpness of the sound collection directivity directed from the microphone array 2 toward the sound generators based on the information on the number or arrangement of those sound generators other than the ones extracted by the excluded sound generator extraction unit 40.

In this way, the excluded sound generator extraction unit 40 may extract the sound generators to be excluded from the sound sources of the output sound signal generated by the sound collection processing unit 30 based on the information, acquired by the sound generator information acquisition unit 10, on the presence or absence of human mouth movement.

Note that the above-described determination of each person's mouth movement may be performed by the excluded sound generator extraction unit 40 instead of the detection system 4.
In the sound collection system configured as shown in FIG. 1, the components operate as described above, which enables clear collection of sounds produced simultaneously by a plurality of sound generators.

Next, FIG. 9 will be described. FIG. 9 illustrates a second example of the configuration of the sound collection system. In FIG. 9, components that operate in the same way as in the first example illustrated in FIG. 1 are given the same reference numerals, and their detailed description is omitted.

Like the first example illustrated in FIG. 1, this second example of the sound collection system includes the sound collection device 1, the microphone array 2, the camera 3, and the face position detection system 4. However, the second example differs from the first in that the sound collection processing unit 30 of the sound collection device 1 includes a sound generation detection unit 33 between the directional sound reception processing unit 31 and the output sound signal generation unit 32.

The sound generation detection unit 33 detects whether each of the sound generators present in the sound collection range of the microphone array 2 is producing sound and notifies the output sound signal generation unit 32 of the detection result. The output sound signal generation unit 32 generates an output sound signal whose sound sources are only those sound generators, among the ones present in the sound collection range of the microphone array 2, for which the sound generation detection unit 33 has detected sound.

The sound generation detection unit 33 will be described further. The sound generation detection unit 33 detects whether each sound generator is producing sound based on the amplitude level of a predetermined frequency band in the output sound.
As described earlier, the directional sound reception processing unit 31 outputs the frequency spectrum of the collected sound signals, weighted so as to obtain sound collection directivity with the direction and sharpness set by the sound collection directivity range setting unit 20, and the output sound signal generation unit 32 converts this spectrum into time-domain sound signal data. The frequency spectrum output by the directional sound reception processing unit 31 is therefore the frequency spectrum of the output sound signal output from the output sound signal generation unit 32.

The sound generation detection unit 33 adds up the levels of the spectral components of this output sound signal that fall within the predetermined frequency band and takes the total as the amplitude level of the predetermined frequency band in the output sound. It then determines whether this amplitude level exceeds a predetermined threshold; if it does, the sound generation detection unit 33 determines that a sound generator is producing sound, and if it does not, the sound generation detection unit 33 determines that no sound generator is producing sound.

In the present embodiment, the frequency band for which the amplitude level is obtained is the frequency band of the human voice (approximately 300 to 3400 Hz). Alternatively, it may be the frequency band of the first formant in the frequency spectrum of the human voice (approximately 300 to 1000 Hz).
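The band-level detection can be sketched as follows. The representation of the spectrum as (frequency in Hz, level) pairs, the function names, and the threshold in the example are assumptions of this sketch; the default band limits are the 300 to 3400 Hz values given in the text.

```python
def band_level(spectrum, f_low, f_high):
    """Sum of the levels of the spectral components inside [f_low, f_high].
    spectrum is an iterable of (frequency_hz, level) pairs."""
    return sum(level for freq, level in spectrum if f_low <= freq <= f_high)


def sound_present(spectrum, threshold, f_low=300.0, f_high=3400.0):
    """True when the band level exceeds the threshold."""
    return band_level(spectrum, f_low, f_high) > threshold


# Illustrative spectrum; only the 500 Hz and 1000 Hz components are in band.
spectrum = [(100.0, 5.0), (500.0, 1.0), (1000.0, 2.0), (4000.0, 9.0)]
detected = sound_present(spectrum, threshold=2.5)
```

Passing `f_high=1000.0` would restrict the sum to the first-formant band mentioned as an alternative.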

Instead of detecting whether each sound generator is producing sound based on the amplitude level of a predetermined frequency band in the output sound, the sound generation detection unit 33 can also perform the detection as follows.

For example, the sound generation detection unit 33 extracts from this frequency spectrum the spectral components whose level is equal to or greater than a predetermined value and adds them up to obtain a total. It then determines whether this total exceeds a predetermined threshold; if it does, the sound generation detection unit 33 determines that a sound generator is producing sound, and if it does not, the sound generation detection unit 33 determines that no sound generator is producing sound. The sound generation detection unit 33 may detect whether each sound generator is producing sound in this way.

Alternatively, the sound generation detection unit 33 obtains the maximum level in this frequency spectrum. It then determines whether this maximum exceeds a predetermined threshold; if it does, the sound generation detection unit 33 determines that a sound generator is producing sound, and if it does not, the sound generation detection unit 33 determines that no sound generator is producing sound. The sound generation detection unit 33 may also detect whether each sound generator is producing sound in this way.
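The two alternative detection rules (sum of the components at or above a predetermined value, and the spectral maximum) can be sketched as below. The function names, the plain list of levels, and the numeric values are assumptions of this sketch.

```python
def sum_above_floor(levels, floor):
    """Sum of the spectral levels that are equal to or greater than floor."""
    return sum(level for level in levels if level >= floor)


def detect_by_sum(levels, floor, threshold):
    """First alternative: the sum of the extracted components exceeds the threshold."""
    return sum_above_floor(levels, floor) > threshold


def detect_by_peak(levels, threshold):
    """Second alternative: the maximum spectral level exceeds the threshold.
    An empty spectrum is treated as no sound (a choice of this sketch)."""
    return max(levels, default=0.0) > threshold


levels = [0.1, 0.2, 1.5, 2.5]
by_sum = detect_by_sum(levels, floor=1.0, threshold=3.0)
by_peak = detect_by_peak(levels, threshold=2.0)
```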

The sound generation detection unit 33 notifies the output sound signal generation unit 32 of the result, determined as described above, of whether a sound generator is producing sound. Based on the determination result notified from the sound generation detection unit 33, the output sound signal generation unit 32 outputs the output sound signal while it is determined that a sound generator is producing sound, and stops outputting the output sound signal while it is determined that no sound generator is producing sound.

Note that, instead of stopping the output of the output sound signal, the sound generation detection unit 33 may output silence. To reduce the unnatural impression caused by the sudden appearance of a silent section, white noise of a predetermined level may be output instead of silence, or the stationary noise that this sound collection system constantly generates may be output.
The sound collection system configured as shown in FIG. 9 operates as described above.

Note that the operation of the sound collection device 1 in the sound collection systems illustrated in FIG. 1 and FIG. 9, that is, the generation of an output sound signal whose sound collection directivity is controlled based on the plurality of sound collection signals picked up by the microphone array 2, can also be performed by a computer.

First, FIG. 10 will be described. FIG. 10 illustrates the configuration of a computer 50 that performs the operation of the sound collection device 1.
The computer 50 includes an MPU 51, a ROM 52, a RAM 53, a hard disk device 54, an input device 55, a display device 56, an interface device 57, and a recording medium drive device 58. These components are connected via a bus 59 and can exchange various data with one another under the management of the MPU 51.

The MPU (Micro Processing Unit) 51 is an arithmetic processing unit that controls the operation of the entire computer 50.
The ROM (Read Only Memory) 52 is a read-only semiconductor memory in which a predetermined basic control program is recorded in advance. By reading and executing this basic control program when the computer 50 starts up, the MPU 51 becomes able to control the operation of each component of the computer 50.

The RAM (Random Access Memory) 53 is a semiconductor memory that can be written and read at any time and that the MPU 51 uses as a working storage area, as needed, when executing various control programs.

The hard disk device 54 is a storage device that stores various control programs executed by the MPU 51 and various data. By reading and executing a predetermined control program stored in the hard disk device 54, the MPU 51 becomes able to perform the control processing described later. In the present embodiment, it is assumed that a database of tables is stored in the hard disk device 54 in advance, each table showing, for one direction (direction angle) of the sound collection directivity, the relationship between the sharpness (angle value) of the sound collection directivity and the frequency characteristics of the phase difference range between the microphones for the sound collection signals. It is also assumed that a database of tables, one for each spectral frequency, showing the relationship between the phase difference and phase difference range and the weighting setting values described above is stored in the hard disk device 54 in advance.

The input device 55 is, for example, a keyboard or a mouse. When operated by the user of the computer 50, it acquires the various information that the user inputs through that operation and sends the acquired input information to the MPU 51.

The display device 56 is, for example, a liquid crystal display, and displays various text and images according to display data sent from the MPU 51.
The interface device 57 manages the exchange of various data with the devices connected to the computer 50. More specifically, it receives the data sent from the detection system 4, performs analog-to-digital conversion of the sound collection signal output from each microphone of the microphone array 2, temporarily buffers the converted sound collection signal data, and transmits the output sound data to subsequent devices.

The recording medium drive device 58 is a device that reads the various control programs and data recorded on a portable recording medium 60. The MPU 51 can also perform the various control processes described later by reading a predetermined control program recorded on the portable recording medium 60 via the recording medium drive device 58 and executing it. Examples of the portable recording medium 60 include a CD-ROM (Compact Disc Read Only Memory) and a DVD-ROM (Digital Versatile Disc Read Only Memory).

To operate such a computer 50 as the sound collection device 1, a control program that causes the MPU 51 to perform the control processing described later is first created. The created control program is stored in advance in the hard disk device 54 or on the portable recording medium 60. A predetermined instruction is then given to the MPU 51 so that it reads and executes this control program. In this way, the MPU 51 functions as the sound generator information acquisition unit 10, the sound collection directivity range setting unit 20, the sound collection processing unit 30, and the excluded sound generator extraction unit 40, and the computer 50 can provide the functions of the sound collection device 1.

Next, FIG. 11 will be described. FIG. 11 is a flowchart illustrating the control processing performed by the MPU 51 in the computer 50 of FIG. 10.
In FIG. 11, when execution of this control processing starts, S101 first performs initial setting of the angle table, which defines the relationship between the number of sounding bodies within the sound collection range of the microphone array 2 and the angle indicating the sharpness of the sound collection directivity, and of the maximum sharpness of the sound collection directivity and the reference distance. In this process, the angle table, the maximum value θ_MAX of the sharpness of the sound collection directivity, and the reference value d_def of the distance between a sounding body and the microphone array 2 are acquired from the input device 55 and stored in the hard disk device 54. This process corresponds to the operation of the sound collection directivity range setting unit 20.

Next, in S102, it is determined whether the interface device 57 has received the data output by the detection system 4 that represents the number and arrangement of the sounding bodies within the sound collection range of the microphone array 2. When the MPU 51 determines that this data has been received (the determination result is Yes), it advances the process to S103. When the data has not been received (the determination result is No), the MPU 51 ends the control processing of FIG. 11.

Next, in S103, a sound generator information acquisition process is performed. In this process, the data from the detection system 4 that the interface device 57 has received, representing the number and arrangement of the sounding bodies within the sound collection range of the microphone array 2, is acquired from the interface device 57 and stored in a predetermined area of the RAM 53. This process corresponds to the operation of the sound generator information acquisition unit 10. When the detection system 4 also outputs data indicating whether the mouth of a person who is a sounding body is moving, the MPU 51 stores this data in a predetermined area of the RAM 53 as well.

Next, in S104, an excluded sound generator extraction process is performed. This process extracts, based on the arrangement information acquired by the sound generator information acquisition unit 10, the sounding bodies to be excluded from the sound sources of the output sound generated by the sound collection processing unit 30. This process corresponds to the operation of the excluded sound generator extraction unit 40.

In this process, the information on the number and arrangement of the sounding bodies stored in the predetermined area of the RAM 53 is first read out. Next, based on the read information, either the amount of change in the position of each sounding body is calculated or the face orientation of each person who is a sounding body is acquired. It is then determined whether the amount of change exceeds a predetermined threshold, or whether the face orientation is outside a predetermined threshold range. The amount of change in the position of a sounding body is calculated from the arrangement information stored in the RAM 53 by the most recent execution of S103 and the arrangement information stored in the RAM 53 by an earlier execution of S103.

When information on whether the mouth of a person who is a sounding body is moving is stored in the predetermined area of the RAM 53, this information is also read out and used to determine whether the mouth is moving.
If the calculated amount of change exceeds the predetermined threshold, if the face orientation is outside the predetermined threshold range, or if the mouth is not moving, the sounding body to which that condition applies is extracted as one to be excluded from the sound sources of the output sound signal.
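The exclusion test of S104 can be sketched as below. This is a minimal sketch under several assumptions: the `Source` record, its field names, the use of the direction angle alone as the "position", and the threshold values are all hypothetical, and the three criteria, which the description presents as alternatives, are simply combined here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    angle_deg: float       # direction angle from the latest S103 pass
    prev_angle_deg: float  # direction angle from an earlier S103 pass
    face_deg: float        # face orientation, 0 means facing the microphone array
    mouth_moving: Optional[bool] = None  # None when no mouth data is supplied


def is_excluded(s: Source, max_move_deg: float = 10.0, max_face_deg: float = 45.0) -> bool:
    """Return True when the sounding body should be dropped from the output sound sources."""
    moved = abs(s.angle_deg - s.prev_angle_deg) > max_move_deg  # position changed too much
    turned = abs(s.face_deg) > max_face_deg                     # face outside allowed range
    silent = s.mouth_moving is False                            # mouth known not to move
    return moved or turned or silent
```

A source for which no mouth information is available (`mouth_moving=None`) is judged only on position change and face orientation.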

Next, in S105, a target sounding body determination process is performed. This process corresponds to the operation of the sound collection directivity range setting unit 20. It updates the various sounding body information stored in the predetermined area of the RAM 53 by the most recent execution of S103, based on the exclusion information obtained in S104. In this update, the number of sounding bodies extracted in S104 is subtracted from the number of sounding bodies. The result of this subtraction, which is the number of target sounding bodies, is also assigned to a variable n. The information on position and on the presence or absence of mouth movement is deleted for the sounding bodies extracted in S104. The resulting information on the target sounding bodies is stored in another predetermined area of the RAM 53.
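The update performed in S105 amounts to filtering the stored records and counting what remains. In the sketch below the dictionary layout and the `id` key are hypothetical; the description does not specify how the records are stored in the RAM 53.

```python
def determine_targets(sources, excluded_ids):
    """Drop the excluded sounding bodies and return the remaining targets with their count n."""
    targets = [s for s in sources if s["id"] not in excluded_ids]
    n = len(targets)  # equals the original count minus the number of excluded sources
    return targets, n
```

The returned `n` plays the role of the variable n that drives the per-target processing from S107 onward.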

Next, in S106, a sound data reading process is performed. This process also corresponds to the operation of the sound collection directivity range setting unit 20. It reads the sound collection signal data output from each microphone of the microphone array 2, which is temporarily buffered in the interface device 57, and stores it collectively in a predetermined area of the RAM 53.

Next, in S107, it is determined whether the current value of the variable n is positive. When the MPU 51 determines that the value of n is positive (the determination result is Yes), it advances the process to S108. When the MPU 51 determines that the value of n is not positive (the determination result is No), it returns the process to S102 and executes the processing again for the next sound collection signal data buffered in the interface device 57.

The subsequent processes from S108 to S113 are executed for the n-th target sounding body among the target sounding bodies, that is, the sounding bodies within the sound collection range of the microphone array 2 other than those extracted in S104.

First, in S108, a sound collection directivity range setting process is performed. This process also corresponds to the operation of the sound collection directivity range setting unit 20. It sets the sharpness of the sound collection directivity directed toward the n-th target sounding body based on the number of target sounding bodies, and sets the direction of that directivity based on the arrangement information for the n-th target sounding body.

In this process, the number of target sounding bodies and the direction angle θ and distance d of the n-th target sounding body (a person) are first acquired from the RAM 53. Next, the angle table acquired in S101 is consulted to obtain the angle value associated with the number of target sounding bodies. This angle value and the direction angle θ represent the sharpness and the direction of the sound collection directivity, respectively.
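The table lookup in S108 can be sketched as follows. The table contents are placeholder values, and the fallback to the entry for the largest listed count is an assumption made here so that the sketch handles any count; the description only states that the table maps the number of sounding bodies to an angle.

```python
# Hypothetical angle table: number of target sounding bodies -> beam width in degrees.
ANGLE_TABLE = {1: 60.0, 2: 40.0, 3: 30.0}


def beam_for_target(count, direction_deg, table=ANGLE_TABLE):
    """Return (direction, sharpness) of the sound collection directivity for one target."""
    if count < 1:
        raise ValueError("at least one target sounding body is required")
    key = min(count, max(table))  # assumed fallback: reuse the last entry for larger counts
    return direction_deg, table[key]
```

For example, with two target sounding bodies and a target at a direction angle of -15 degrees, the sketch selects a 40 degree beam centered on -15 degrees.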

At this point, the sharpness of the sound collection directivity directed toward the n-th target sounding body may additionally be set based on the arrangement information of the sounding bodies (persons), as described earlier.

In this case, the direction angle data for the target sounding bodies adjacent to the n-th target sounding body is first acquired from the RAM 53. Then, using the acquired direction angle data and the maximum value θ_MAX of the sharpness of the sound collection directivity acquired in S101, the sharpness of the sound collection directivity directed toward the n-th target sounding body is set as described with reference to FIG. 5.

Furthermore, the sharpness of the sound collection directivity directed toward the n-th target sounding body may be set as described with reference to FIG. 6. In this case, the distance data for the target sounding bodies adjacent to the n-th target sounding body is first acquired from the RAM 53. Next, using the acquired distance data and the reference distance d_def acquired in S101, the sharpness of the sound collection directivity is set as described with reference to FIG. 6.

The angle value and the direction angle set by the sound collection directivity range setting process of S108, which represent the sharpness and the direction of the sound collection directivity, respectively, are stored in a predetermined area of the RAM 53.

Next, in S109, a directional sound reception process is performed. This process corresponds to the operation of the sound collection directivity reception processing unit 31 in the sound collection processing unit 30.
In this process, the sound collection signal data for each microphone stored in the RAM 53 in S106 is first read out, and a time-frequency transform (for example, a Fourier transform) is applied to each to obtain the frequency spectrum data of each sound collection signal. Next, taking the frequency spectrum data of one of the sound collection signals as the reference, the spectral phase difference between the reference and the frequency spectrum data of each of the other sound collection signals is calculated for each spectral frequency.
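The first half of S109 can be illustrated with a small, dependency-free sketch. A naive discrete Fourier transform is used here only for clarity (a fast Fourier transform would normally be used), and the function names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def phase_differences(reference, other):
    """Phase of `other` relative to `reference` for each bin, in radians within [-pi, pi]."""
    ref_spec, other_spec = dft(reference), dft(other)
    # Multiplying by the conjugate subtracts the reference phase bin by bin.
    return [cmath.phase(b * a.conjugate()) for a, b in zip(ref_spec, other_spec)]
```

Bins in which both signals carry almost no energy yield numerically arbitrary phase values, so in practice only bins with meaningful magnitude would be examined.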

Next, the angle value and the direction angle stored in the RAM 53 in S108 are read out. The database in the hard disk device 54 is then consulted, and the frequency characteristics of the phase difference range between the microphones for the sound collection signals, associated with the read angle value, are read from the table for the read direction angle. From this table, the phase difference range at each spectral frequency in the frequency spectrum data of each sound collection signal is obtained.

Next, the database in the hard disk device 54 is consulted, and the weighting value associated with the phase difference and the phase difference range of each spectral component is obtained from the table for the corresponding spectral frequency. The frequency spectrum of the reference sound collection signal is then weighted by multiplying it by this weighting value at each spectral frequency.
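The weighting step can be sketched as below. The actual weighting values come from tables that are not reproduced in this section, so the sketch substitutes a simple stand-in rule, stated here as an assumption: a gain of 1.0 when the phase difference lies inside the allowed range for that frequency, and `outside_gain` otherwise. All names are hypothetical.

```python
def apply_weights(ref_spectrum, phase_diffs, ranges, outside_gain=0.0):
    """Weight each bin of the reference spectrum according to its phase difference.

    ranges[k] is the (low, high) phase difference range allowed at bin k.
    """
    weighted = []
    for value, diff, (low, high) in zip(ref_spectrum, phase_diffs, ranges):
        gain = 1.0 if low <= diff <= high else outside_gain  # stand-in for the table lookup
        weighted.append(value * gain)
    return weighted
```

Components whose phase difference indicates an arrival direction outside the configured beam are attenuated, which is what gives the output its directivity.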

The subsequent processes of S110, S111, and S113 correspond to the operation of the sound generation detection unit 33 in the sound collection processing unit 30 of FIG. 9. Therefore, when the sound collection device 1 of FIG. 1 is implemented by the computer 50, S110, S111, and S113 need not be executed; S109 may be followed directly by S112 and then by S114, both described later.

First, in S110, a sound generation detection level acquisition process is performed. In this process, among the frequency spectrum weighted by the directional sound reception process of S109, the levels of the spectral components included in the predetermined frequency band described above are added, and the total is obtained as the amplitude level of the output sound in that frequency band. The amplitude level obtained in this way is treated as the sound generation detection level.
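The level calculation of S110 can be sketched as follows. The band limits in the default argument are placeholders only; the description refers to a predetermined frequency band without giving numeric limits in this section. The function name and the use of bin magnitudes as the "level" are likewise assumptions.

```python
def detection_level(weighted_spectrum, sample_rate, band_hz=(300.0, 1000.0)):
    """Sum the magnitudes of the bins whose frequency lies inside `band_hz`."""
    n = len(weighted_spectrum)
    low, high = band_hz
    level = 0.0
    for k in range(n // 2 + 1):      # non-negative frequencies only
        freq = k * sample_rate / n   # center frequency of bin k in Hz
        if low <= freq <= high:
            level += abs(weighted_spectrum[k])
    return level
```

The returned value is what S111 compares with the threshold to decide between S112 and S113.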

Next, in S111, it is determined whether the sound generation detection level obtained in S110 exceeds a predetermined threshold value. When the MPU 51 determines that the level exceeds the threshold (the determination result is Yes), it advances the process to S112; when it determines that the level does not exceed the threshold (the determination result is No), it advances the process to S113.

In S110 and S111, instead of detecting the presence or absence of sound from each sounding body based on the sound generation detection level obtained as described above, the detection can also be performed as follows.

For example, in S110 the MPU 51 extracts, from the weighted frequency spectrum, the spectral components that are equal to or greater than a predetermined value and adds them to obtain a total. In the following S111, the MPU 51 determines whether this total exceeds a predetermined threshold. If it determines that the total exceeds the threshold, it judges that the sounding body is producing sound and advances the process to S112; if it determines that the total does not exceed the threshold, it judges that the sounding body is not producing sound and advances the process to S113.

Alternatively, in S110 the MPU 51 obtains the maximum spectral value in the weighted frequency spectrum. In the following S111, the MPU 51 determines whether this maximum exceeds a predetermined threshold. If it determines that the maximum exceeds the threshold, it judges that the sounding body is producing sound and advances the process to S112; if it determines that the maximum does not exceed the threshold, it judges that the sounding body is not producing sound and advances the process to S113.

Even when S110 and S111 are performed in this way, the presence or absence of sound from each sounding body can be detected.
In S112, an output sound generation process is performed. This process corresponds to the operation of the output sound signal generation unit 32. In this process, the frequency spectrum of the sound collection signal weighted by the directional sound reception process of S109 is subjected to the inverse of the transform performed in that process (for example, an inverse fast Fourier transform), converted into time-domain sound signal data, and output. When the MPU 51 finishes S112, it advances the process to S114.
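The conversion back to the time domain in S112 can be sketched with a naive inverse discrete Fourier transform. This is for illustration only; the function name is hypothetical and a fast inverse transform would normally be used. The sketch keeps only the real part, which assumes the weighting was applied consistently to the positive and negative frequency bins.

```python
import cmath
import math


def inverse_dft(spectrum):
    """Naive inverse DFT that returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]
```

Applied to an unweighted spectrum, this reproduces the original samples up to rounding error.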

In S113, on the other hand, non-speech processing is performed. This process stops the output of the output sound signal. Instead of stopping the output, the MPU 51 may output silence data. Alternatively, the MPU 51 may output white noise data of a predetermined level, or data corresponding to the stationary noise that this sound collection system constantly generates.
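Two of the substitute outputs mentioned for S113 can be sketched as below. The function name, the `mode` strings, the default noise level, and the choice of uniformly distributed samples for the white noise are all assumptions made for illustration.

```python
import random


def non_speech_frame(length, mode="silence", noise_level=0.01, seed=None):
    """Return one frame of substitute output for an interval judged to contain no sound."""
    if mode == "silence":
        return [0.0] * length
    if mode == "white_noise":
        rng = random.Random(seed)
        # Independent uniform samples have a flat spectrum on average.
        return [rng.uniform(-noise_level, noise_level) for _ in range(length)]
    raise ValueError(f"unknown mode: {mode}")
```

Outputting the system's own stationary noise would require a recorded or estimated noise profile, which is outside the scope of this sketch.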

Next, in S114, the variable n is decremented, that is, 1 is subtracted from the current value of n and the result is assigned to n again. The process then returns to S107, and the processing is executed again based on the new value of n.
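The per-target loop formed by S107 through S114 can be summarized in a short sketch. The callables `beamform`, `detect`, `emit`, and `emit_non_speech` are hypothetical placeholders for the processing of S108 to S109, S110 to S111, S112, and S113, respectively.

```python
def process_targets(targets, beamform, detect, emit, emit_non_speech):
    """Run the S107 to S114 loop once over all target sounding bodies."""
    n = len(targets)                 # value assigned to n in S105
    while n > 0:                     # S107: continue while n is positive
        target = targets[n - 1]      # the n-th target sounding body
        spectrum = beamform(target)  # S108 and S109
        if detect(spectrum):         # S110 and S111
            emit(spectrum)           # S112
        else:
            emit_non_speech()        # S113
        n -= 1                       # S114
```

Because n counts down, the targets are visited from the last to the first; when n reaches zero the control processing returns to S102.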

By causing the MPU 51 to perform the above control processing, the computer 50 of FIG. 10 can function as the sound collection device 1.
The present invention is not limited to the embodiments described so far, and at the implementation stage it can be modified or combined in various ways without departing from its gist.

For example, in the embodiments described above the camera 3 is placed at substantially the same position as the microphone array 2, but the camera 3 and the microphone array 2 can also be placed at separate positions. In that case, for example, a conversion table that converts between the positions of the camera 3 and the microphone array 2 is prepared in the sound generator information acquisition unit 10 of the sound collection device 1. The sound generator information acquisition unit 10 then refers to this conversion table to convert the arrangement information (position, angle, and distance) that the position detection system 4 detects from the image captured by the camera 3 into arrangement information as seen from the position of the microphone array 2.
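The description uses a conversion table for this step. As an illustration of what such a table would encode, the sketch below performs the equivalent geometric conversion directly. It assumes that the camera and the microphone array face the same direction, that angles are measured from the forward axis with positive values to the right, and that the camera offset is given in the array's coordinate frame; these conventions and all names are hypothetical.

```python
import math


def camera_to_array(angle_deg, distance, cam_offset_x, cam_offset_y):
    """Convert (angle, distance) seen from the camera into (angle, distance) seen from the array."""
    theta = math.radians(angle_deg)
    # Position of the sounding body in the array frame: camera-relative position plus offset.
    x = distance * math.sin(theta) + cam_offset_x
    y = distance * math.cos(theta) + cam_offset_y
    return math.degrees(math.atan2(x, y)), math.hypot(x, y)
```

For instance, a sounding body straight ahead of a camera mounted one unit to the right of the array appears to the array at an angle to the right and at a greater distance.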

なお、以上までに説明した実施形態に関し、更に以下の付記を開示する。
（付記１）
相対位置が固定されている複数のマイクロフォンを備えたマイクアレイで収音した複数の収音信号に基づいて収音指向性を制御した出力音の信号を生成する収音処理手段と、
該マイクアレイの収音範囲内に存在する発音体の数及び配置の情報の入力を取得する取得手段と、
該マイクアレイから該発音体に向ける該収音指向性の向きを、該取得手段が取得した該発音体についての配置の情報に基づき該発音体の各々について設定すると共に、該発音体に向ける該収音指向性の鋭さを、該取得手段が取得した該発音体についての数の情報に基づき該発音体の各々について設定する収音指向性範囲設定手段と、
を有し、
該収音処理手段は、該収音指向性範囲設定手段が設定した向き及び鋭さに該収音指向性を制御した出力音の信号を生成して出力する、
ことを特徴とする収音装置。
（付記２）
該取得手段が取得する該マイクアレイの収音範囲内に存在する発音体の数及び配置の情報は、該マイクアレイの収音範囲を撮影した画像から得られたものであること特徴とする付記１に記載の収音装置。
（付記３）
該収音指向性範囲設定手段は、該発音体に向ける該収音指向性の鋭さを、該取得手段が取得した該発音体についての数の情報に基づくと共に、更に、該取得手段が取得した該発音体についての配置の情報にも基づき、該発音体の各々について設定することを特徴とする付記１又は２に記載の収音装置。
（付記４）
該収音指向性範囲設定手段が該発音体に向ける該収音指向性の鋭さを設定する基礎とする該発音体についての配置の情報は、該マイクアレイの収音範囲内に存在する発音体同士の配置間隔の情報であることを特徴とする付記３に記載の収音装置。
（付記５）
該収音指向性範囲設定手段が該発音体に向ける該収音指向性の鋭さを設定する基礎とする該発音体についての配置の情報は、該発音体と該マイクアレイとの距離の情報であることを特徴とする付記３に記載の収音装置。
（付記６）
該収音指向性範囲設定手段は、該発音体と該マイクアレイとの距離が長い場合と比較して、該距離が短い場合に収音指向性の鋭さをより狭い角度に設定することを特徴とする付記５に記載の収音装置。
（付記７）
該取得手段が取得した該発音体の配置の情報に基づいて、該収音処理手段が生成する出力音の音源から除外する発音体を抽出する除外発音体抽出手段を更に有し、
該収音指向性範囲設定手段は、該収音指向性の向き及び鋭さを、該発音体のうち該除外発音体抽出手段により抽出されたもの以外のものについての情報に基づき設定する、
こと特徴とする付記１から６のうちのいずれか一項に記載の収音装置。
（付記８）
該除外発音体抽出手段は、該取得手段が取得した該発音体の配置の情報の変化量に基づき、該収音処理手段が生成する出力音の信号の音源から除外する発音体を抽出する、
ことを特徴とする付記７に記載の収音装置。
（付記９）
該発音体は人間であり、
該除外発音体抽出手段は、該取得手段が取得した該人間の配置の情報のうちの該人間の顔の向きの情報に基づいて、該収音処理手段が生成する出力音声の信号の音源から除外する発音体を抽出する、
ことを特徴とする付記７に記載の収音装置。
（付記１０）
該発音体は人間であり、
該除外発音体抽出手段は、該取得手段が取得した該人間の口の動きの有無の情報に基づいて、該収音処理手段が生成する出力音声の信号の音源から除外する発音体を抽出する、
ことを特徴とする付記７に記載の収音装置。
（付記１１）
該マイクアレイの収音範囲内に存在する発音体による発音の有無を検出する発音検出手段を更に有し、
該収音処理手段は、該発音検出手段により発音が検出されているときの該出力音の信号を出力する、
ことを特徴とする付記１から１０のうちのいずれか一項に記載の収音装置。
（付記１２）
該発音検出手段は、該出力音における所定の周波数帯の振幅レベルに基づいて、該発音体による発音の有無を検出することを特徴とする付記１１に記載の収音装置。
（付記１３）
該発音体は人間であり、
該所定の周波数帯が、人間の第一フォルマントの周波数帯に設定されている、
ことを特徴とする付記１２に記載の収音装置。
（付記１４）
該収音処理手段は、該マイクアレイで収音した収音信号の周波数スペクトルに対し、該収音指向性範囲設定手段が設定した向き及び鋭さの該収音指向性を得るための重み付けをスペクトル毎に与え、該重み付けが与えられた周波数スペクトルを時間軸情報に変換することによって、該出力音の信号を生成し、
該発音検出手段は、該重み付けが与えられた周波数スペクトルにおいてスペクトルが所定値以上であるものについての該スペクトルの加算合計値に基づいて、該発音体による発音の有無を検出する、
ことを特徴とする付記１１に記載の収音装置。
（付記１５）
該収音処理手段は、該マイクアレイで収音した収音信号の周波数スペクトルに対し、該収音指向性範囲設定手段が設定した向き及び鋭さの該収音指向性を得るための重み付けをスペクトル毎に与え、該重み付けが与えられた周波数スペクトルを時間領域の音声信号に変換することによって、該出力音の信号を生成し、
該発音検出手段は、該重み付けが与えられた周波数スペクトルにけるスペクトルの最大値に基づいて、該発音体による発音の有無を検出する、
ことを特徴とする付記１１に記載の収音装置。
（付記１６）
相対位置が固定されている複数のマイクロフォンを備えたマイクアレイで収音した複数の収音信号に基づいて収音指向性を制御した出力音の信号を生成する収音方法であって、
該マイクアレイの収音範囲内に存在する発音体の数及び配置の情報の入力を取得し、
該マイクアレイから該発音体に向ける該収音指向性の向きを、取得された該発音体についての配置の情報に基づき該発音体の各々について設定すると共に、該発音体に向ける該収音指向性の鋭さを、取得された該発音体についての数の情報に基づき該発音体の各々について設定し、
設定された向き及び鋭さに該収音指向性を制御した出力音の信号を生成して出力する、
ことを特徴とする収音方法。
（付記１７）
相対位置が固定されている複数のマイクロフォンを備えたマイクアレイで収音した複数の収音信号に基づいて収音指向性を制御した出力音の信号の生成をコンピュータに行わせるためのプログラムであって、該コンピュータに実行させることによって、
該マイクアレイの収音範囲内に存在する発音体の数及び配置の情報の入力を取得する取得処理と、
該マイクアレイから該発音体に向ける該収音指向性の向きを、該取得処理で取得された該発音体についての配置の情報に基づき該発音体の各々について設定すると共に、該発音体に向ける該収音指向性の鋭さを、該取得処理により取得された該発音体についての数の情報に基づき該発音体の各々について設定する収音指向性範囲設定処理と、
該収音指向性範囲設定処理により設定された向き及び鋭さに該収音指向性を制御した出力音の信号を生成して出力する収音処理と、
を該コンピュータに行わせるためのプログラム。 In addition, the following additional remarks are disclosed regarding the embodiment described above.
(Appendix 1)
A sound collection device comprising:
sound collection processing means for generating an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed;
acquisition means for acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array; and
sound collection directivity range setting means for setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the arrangement information about the sounding bodies acquired by the acquisition means, and for setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed toward the sounding body based on the information on the number of the sounding bodies acquired by the acquisition means,
wherein the sound collection processing means generates and outputs an output sound signal in which the sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting means.
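The setting step of Appendix 1 can be illustrated with a minimal sketch. The function name `set_directivity`, the `Beam` record, the coordinate convention (array at the origin, front along +y), and the rule that divides a fixed coverage angle by the number of sounding bodies are all assumptions made for illustration; the appendix states only that the direction follows from the arrangement information and the sharpness from the number information.

```python
import math
from dataclasses import dataclass


@dataclass
class Beam:
    direction_deg: float  # direction of the directivity, measured from the array's front axis
    width_deg: float      # sharpness, expressed here as the angular width of the beam


def set_directivity(positions, coverage_deg=120.0, max_width_deg=60.0):
    """Return one Beam per sounding body.

    positions: list of (x, y) coordinates of the sounding bodies, with the
    microphone array at the origin and its front along +y (assumed convention).
    The width rule (coverage divided by count, capped) is hypothetical.
    """
    count = len(positions)
    if count == 0:
        return []
    # Sharpness depends on the number of sounding bodies: more bodies, narrower beams.
    width = min(max_width_deg, coverage_deg / count)
    # Direction depends on the arrangement: the angle toward each body.
    return [Beam(math.degrees(math.atan2(x, y)), width) for x, y in positions]
```

With three sounding bodies the sketch assigns each a beam one third of the assumed coverage angle wide, pointed at that body.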
(Appendix 2)
The sound collection device according to Appendix 1, wherein the information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array, acquired by the acquisition means, is obtained from an image capturing the sound collection range of the microphone array.
(Appendix 3)
The sound collection device according to Appendix 1 or 2, wherein the sound collection directivity range setting means sets, for each of the sounding bodies, the sharpness of the sound collection directivity directed toward the sounding body based on the information on the number of the sounding bodies acquired by the acquisition means and, further, on the arrangement information about the sounding bodies acquired by the acquisition means.
(Appendix 4)
The sound collection device according to Appendix 3, wherein the arrangement information about the sounding bodies on which the sound collection directivity range setting means bases the sharpness of the sound collection directivity directed toward the sounding body is information on the arrangement interval between sounding bodies present within the sound collection range of the microphone array.
(Appendix 5)
The sound collection device according to Appendix 3, wherein the arrangement information about the sounding bodies on which the sound collection directivity range setting means bases the sharpness of the sound collection directivity directed toward the sounding body is information on the distance between the sounding body and the microphone array.
(Appendix 6)
The sound collection device according to Appendix 5, wherein the sound collection directivity range setting means sets the sharpness of the sound collection directivity to a narrower angle when the distance between the sounding body and the microphone array is short than when the distance is long.
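The distance rule of Appendix 6 can be sketched as a monotonic mapping from distance to beam width. The function name `width_for_distance`, the linear interpolation, and every numeric default below are hypothetical; the appendix states only that a shorter distance yields a narrower angle than a longer one.

```python
def width_for_distance(distance_m, near_m=0.5, far_m=3.0,
                       narrow_deg=20.0, wide_deg=60.0):
    """Map the distance to a sounding body onto a beam width in degrees.

    Shorter distances give narrower beams, as in Appendix 6. The linear
    interpolation and all default values are assumptions for illustration.
    """
    if far_m <= near_m:
        raise ValueError("far_m must be greater than near_m")
    # Normalize the distance into [0, 1] and clamp values outside the range.
    t = (distance_m - near_m) / (far_m - near_m)
    t = min(1.0, max(0.0, t))
    return narrow_deg + t * (wide_deg - narrow_deg)
```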
(Appendix 7)
The sound collection device according to any one of Appendices 1 to 6, further comprising excluded sounding body extraction means for extracting, based on the arrangement information about the sounding bodies acquired by the acquisition means, a sounding body to be excluded from the sound sources of the output sound generated by the sound collection processing means,
wherein the sound collection directivity range setting means sets the direction and sharpness of the sound collection directivity based on information about the sounding bodies other than those extracted by the excluded sounding body extraction means.
(Appendix 8)
The sound collection device according to Appendix 7, wherein the excluded sounding body extraction means extracts the sounding body to be excluded from the sound sources of the output sound signal generated by the sound collection processing means based on the amount of change in the arrangement information about the sounding bodies acquired by the acquisition means.
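The extraction in Appendix 8 can be sketched as a comparison of positions between two observations. The appendix does not state here which amount of change leads to exclusion; the sketch assumes that a body whose position shifts by more than a threshold (for example, someone walking through the room) is excluded. The function name `extract_excluded`, the identifier-keyed dictionaries, and the threshold value are hypothetical.

```python
import math


def extract_excluded(prev_positions, curr_positions, max_shift=0.3):
    """Return the identifiers of sounding bodies to exclude.

    prev_positions, curr_positions: dicts mapping a body identifier to its
    (x, y) position in two successive observations (assumed representation).
    A body is excluded when it moved farther than max_shift between them;
    this direction of the rule is an assumption, not taken from the source.
    """
    excluded = set()
    for body_id, (x, y) in curr_positions.items():
        if body_id not in prev_positions:
            continue  # newly observed body: no change amount available yet
        px, py = prev_positions[body_id]
        if math.hypot(x - px, y - py) > max_shift:
            excluded.add(body_id)
    return excluded
```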
(Appendix 9)
The sound collection device according to Appendix 7, wherein the sounding body is a human, and
the excluded sounding body extraction means extracts the sounding body to be excluded from the sound sources of the output voice signal generated by the sound collection processing means based on information on the orientation of the human's face, which is included in the arrangement information about the human acquired by the acquisition means.
(Appendix 10)
The sound collection device according to Appendix 7, wherein the sounding body is a human, and
the excluded sounding body extraction means extracts the sounding body to be excluded from the sound sources of the output voice signal generated by the sound collection processing means based on information, acquired by the acquisition means, on the presence or absence of movement of the human's mouth.
(Appendix 11)
The sound collection device according to any one of Appendices 1 to 10, further comprising sound generation detection means for detecting the presence or absence of sound generation by a sounding body present within the sound collection range of the microphone array,
wherein the sound collection processing means outputs the output sound signal while sound generation is detected by the sound generation detection means.
(Appendix 12)
The sound collection device according to Appendix 11, wherein the sound generation detection means detects the presence or absence of sound generation by the sounding body based on the amplitude level of a predetermined frequency band in the output sound.
(Appendix 13)
The sound collection device according to Appendix 12, wherein the sounding body is a human, and
the predetermined frequency band is set to the frequency band of the human first formant.
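The band-level detection of Appendices 12 and 13 can be sketched with a direct discrete Fourier transform restricted to the band of interest. The function names `band_level` and `is_sounding`, the band limits of 300 Hz to 1000 Hz used as a stand-in for the first-formant band, and the threshold are assumptions; the appendices give no numeric values.

```python
import cmath
import math


def band_level(frame, sample_rate, band_hz):
    """Mean magnitude of the DFT bins of `frame` that fall inside band_hz."""
    n = len(frame)
    lo, hi = band_hz
    total = 0.0
    count = 0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if lo <= freq <= hi:
            # Direct DFT of bin k, normalized by the frame length.
            coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                        for i, s in enumerate(frame))
            total += abs(coeff) / n
            count += 1
    return total / count if count else 0.0


def is_sounding(frame, sample_rate, band_hz=(300.0, 1000.0), threshold=0.01):
    """Report sound generation when the band level reaches the threshold.

    The default band and threshold are hypothetical placeholders.
    """
    return band_level(frame, sample_rate, band_hz) >= threshold
```

A tone inside the assumed band triggers detection, while a tone well above the band or a silent frame does not.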
(Appendix 14)
The sound collection device according to Appendix 11, wherein the sound collection processing means generates the output sound signal by applying, to each spectrum of the frequency spectrum of the sound collection signals collected by the microphone array, a weight for obtaining the sound collection directivity having the direction and sharpness set by the sound collection directivity range setting means, and by converting the weighted frequency spectrum into time-axis information, and
the sound generation detection means detects the presence or absence of sound generation by the sounding body based on the sum of those spectra in the weighted frequency spectrum whose values are equal to or greater than a predetermined value.
(Appendix 15)
The sound collection device according to Appendix 11, wherein the sound collection processing means generates the output sound signal by applying, to each spectrum of the frequency spectrum of the sound collection signals collected by the microphone array, a weight for obtaining the sound collection directivity having the direction and sharpness set by the sound collection directivity range setting means, and by converting the weighted frequency spectrum into a time-domain sound signal, and
the sound generation detection means detects the presence or absence of sound generation by the sounding body based on the maximum value of the spectra in the weighted frequency spectrum.
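The two spectrum-based detection criteria of Appendices 14 and 15 can be sketched as follows. The input is assumed to be a list of magnitudes of the already weighted frequency spectrum; the function names and all threshold parameters are hypothetical, and how the decision is derived from the sum or the maximum (here, a simple comparison against a threshold) is an assumption.

```python
def detect_by_sum(weighted_spectrum, floor, sum_threshold):
    """Appendix 14 style: sum only the spectra at or above `floor`,
    then compare the sum against `sum_threshold` (assumed decision rule)."""
    total = sum(v for v in weighted_spectrum if v >= floor)
    return total >= sum_threshold


def detect_by_peak(weighted_spectrum, peak_threshold):
    """Appendix 15 style: compare the maximum spectrum value against
    `peak_threshold` (assumed decision rule). An empty spectrum yields False
    for any positive threshold."""
    return max(weighted_spectrum, default=0.0) >= peak_threshold
```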
(Appendix 16)
A sound collection method for generating an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed, the method comprising:
acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array;
setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the acquired arrangement information about the sounding bodies, and setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed toward the sounding body based on the acquired information on the number of the sounding bodies; and
generating and outputting an output sound signal in which the sound collection directivity is controlled to the set direction and sharpness.
(Appendix 17)
A program for causing a computer to generate an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed, the program, when executed by the computer, causing the computer to perform:
an acquisition process of acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array;
a sound collection directivity range setting process of setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the arrangement information about the sounding bodies acquired in the acquisition process, and of setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed toward the sounding body based on the information on the number of the sounding bodies acquired in the acquisition process; and
a sound collection process of generating and outputting an output sound signal in which the sound collection directivity is controlled to the direction and sharpness set in the sound collection directivity range setting process.

１収音装置
２マイクアレイ
３カメラ
４顔の位置検出システム
１０発音体情報取得部
２０収音指向性範囲設定部
３０収音処理部
３１指向性受音処理部
３２出力音声信号生成部
３３発音検出部
４０除外発音体抽出部
５０コンピュータ
５１ＭＰＵ
５２ＲＯＭ
５３ＲＡＭ
５４ハードディスク装置
５５入力装置
５６表示装置
５７インタフェース装置
５８記録媒体駆動装置
５９バス
６０可搬型記録媒体
DESCRIPTION OF SYMBOLS
1 Sound collection device
2 Microphone array
3 Camera
4 Face position detection system
10 Sounding body information acquisition unit
20 Sound collection directivity range setting unit
30 Sound collection processing unit
31 Directional sound reception processing unit
32 Output sound signal generation unit
33 Sound generation detection unit
40 Excluded sounding body extraction unit
50 Computer
51 MPU
52 ROM
53 RAM
54 Hard disk device
55 Input device
56 Display device
57 Interface device
58 Recording medium drive device
59 Bus
60 Portable recording medium

Claims

A sound collection device comprising:
sound collection processing means for generating an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed;
acquisition means for acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array; and
sound collection directivity range setting means for setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the arrangement information about the sounding bodies acquired by the acquisition means, and for setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed from the microphone array toward the sounding body based on the information on the number of the sounding bodies acquired by the acquisition means and, further, on arrangement information about the sounding bodies acquired by the acquisition means, the arrangement information consisting of at least one of information on the arrangement interval between sounding bodies present within the sound collection range of the microphone array or information on the distance between the sounding body and the microphone array,
wherein the sound collection processing means generates and outputs an output sound signal in which the sound collection directivity is controlled to the direction and sharpness set by the sound collection directivity range setting means.

The sound collection device according to claim 1, further comprising excluded sounding body extraction means for extracting, based on the arrangement information about the sounding bodies acquired by the acquisition means, a sounding body to be excluded from the sound sources of the output sound generated by the sound collection processing means,
wherein the sound collection directivity range setting means sets the direction and sharpness of the sound collection directivity based on information about the sounding bodies other than those extracted by the excluded sounding body extraction means.

The sound collection device according to claim 2, wherein the excluded sounding body extraction means extracts the sounding body to be excluded from the sound sources of the output sound signal generated by the sound collection processing means based on the amount of change in the arrangement information about the sounding bodies acquired by the acquisition means.

The sound collection device according to claim 3, wherein the sounding body is a human, and
the excluded sounding body extraction means extracts the sounding body to be excluded from the sound sources of the output voice signal generated by the sound collection processing means based on information on the orientation of the human's face, which is included in the arrangement information about the human acquired by the acquisition means.

The sound collection device according to any one of claims 1 to 4, further comprising sound generation detection means for detecting the presence or absence of sound generation by a sounding body present within the sound collection range of the microphone array,
wherein the sound collection processing means outputs the output sound signal while sound generation is detected by the sound generation detection means.

A sound collection method for generating an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed, the method comprising:
acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array;
setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the acquired arrangement information about the sounding bodies, and setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed from the microphone array toward the sounding body based on the acquired information on the number of the sounding bodies and, further, on acquired arrangement information about the sounding bodies, the arrangement information consisting of at least one of information on the arrangement interval between sounding bodies present within the sound collection range of the microphone array or information on the distance between the sounding body and the microphone array; and
generating and outputting an output sound signal in which the sound collection directivity is controlled to the set direction and sharpness.

A program for causing a computer to generate an output sound signal in which sound collection directivity is controlled, based on a plurality of sound collection signals collected by a microphone array including a plurality of microphones whose relative positions are fixed, the program, when executed by the computer, causing the computer to perform:
an acquisition process of acquiring input of information on the number and arrangement of sounding bodies present within the sound collection range of the microphone array;
a sound collection directivity range setting process of setting, for each of the sounding bodies, the direction of the sound collection directivity directed from the microphone array toward the sounding body based on the arrangement information about the sounding bodies acquired in the acquisition process, and of setting, for each of the sounding bodies, the sharpness of the sound collection directivity directed from the microphone array toward the sounding body based on the information on the number of the sounding bodies acquired in the acquisition process and, further, on arrangement information about the sounding bodies acquired in the acquisition process, the arrangement information consisting of at least one of information on the arrangement interval between sounding bodies present within the sound collection range of the microphone array or information on the distance between the sounding body and the microphone array; and
a sound collection process of generating and outputting an output sound signal in which the sound collection directivity is controlled to the direction and sharpness set in the sound collection directivity range setting process.