JPH1141577A - Speaker position detector - Google Patents

Speaker position detector

Info

Publication number
JPH1141577A
JPH1141577A JP9193630A JP19363097A
Authority
JP
Japan
Prior art keywords
speaker
sound source
sensor
map
person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
JP9193630A
Other languages
Japanese (ja)
Inventor
Hironori Kitagawa
博紀 北川
Naoji Matsuo
直司 松尾
Shigemi Osada
茂美 長田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to JP9193630A priority Critical patent/JPH1141577A/en
Publication of JPH1141577A publication Critical patent/JPH1141577A/en
Withdrawn legal-status Critical Current


Landscapes

  • Studio Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

PROBLEM TO BE SOLVED: To detect a speaker's position with high accuracy by judging it from two pieces of information: the detected sound source position and the detected person position. SOLUTION: A sound source position detection part 3-1 of a control part 3 prepares a sound source position map from the sound source information input from a microphone array 1. Based on the image information input by a sensor 2 for image input, a human body position detection part 3-2 prepares a human body position map. Each map partitions the range of space detectable by the microphone array 1 and the sensor 2 into domains and holds, for each domain, the calculated probability that the sound source or human body is present. A speaker position discrimination part 3-3 calculates the product of the probabilities in corresponding domains of the sound source and human body position maps and discriminates the domain having the largest product as the speaker position. Any one of an ultrasonic sensor, an infrared sensor, or a television camera is used as the sensor 2.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a new speaker position detecting device, used in video conferencing and the like, that improves the accuracy with which the speaker's position is detected. Video conferencing is currently spreading rapidly. In a video conference, a high-precision means of detecting the speaker's position is needed in order to point the camera at the speaker, focus the camera on the speaker, and pick up only the necessary speech.

[0002]

2. Description of the Related Art

In the speaker position detecting devices used in conventional video conference systems, a directional microphone is installed for each seat; when someone speaks, the person sitting in front of the microphone receiving the loudest input is determined to be the speaker. In another example, disclosed in Japanese Patent Application Publication No. JP-A-5-244587, a plurality of microphones arranged horizontally and radially at equal intervals determine the speaker's direction in the XZ plane; in addition, an image input from a camera is processed by an edge detection circuit to recognize contour patterns, pattern matching against pre-registered person shapes determines the person's position, and a motion detection circuit then detects lip movement to identify the speaker's position.

[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, among the conventional examples above, the method that fixes a one-to-one correspondence between microphones and people cannot cope with a moving speaker. The other conventional example requires the front of the speaker's face to be visible in order to detect lip movement, and its microphones can detect the sound source only as a direction in the XZ plane.

[0004] The object of the present invention is to detect the speaker's position with high accuracy, without requiring a one-to-one correspondence between microphones and people, even when the speaker's face is not visible from the front or the speaker is talking while moving.

[0005]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, the speaker position detecting device of the present invention comprises: means for processing input signals from a sensor such as a television camera, an ultrasonic sensor, or an infrared sensor to detect a person's position; means for processing input signals from a microphone array to detect a sound source's position; and means for determining the speaker's position by processing these two kinds of information together, so that a speaker at an arbitrary position can be detected.

[0006] In the above device, the speaker position detecting means may also obtain the position in the XY directions only from the image and the position in the Z direction only from the sound source. Since the image processing then takes place on a two-dimensional plane, the image processing time is shortened while the speaker's position can still be detected. Alternatively, the XYZ coordinates of the sound source position may be obtained first and only the sensor signals around that position processed; this makes it possible to omit creating the sound source position map and to skip the image processing for person detection outside the vicinity of the sound source, speeding up person position detection with almost no loss of speaker detection accuracy.

[0007] Furthermore, by mounting the sensor on a turntable, the sensor can be rotated toward the sound source position even when the speaker is in the sensor's blind spot, so that speakers in the blind spot can also be detected. The device may additionally be provided with a speaker-position calibration function, so that whether the speaker has been identified correctly can be checked on a display device and corrected when it is wrong.

[0008]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The basic configuration of the present invention is described with reference to FIG. 1. From the sound source information input by the microphone array 1, the sound source position detection unit 3-1 of the control unit 3 creates a sound source position map and passes it to the speaker position determination unit 3-3. From the image information input by the image input sensor 2, the person position detection unit 3-2 creates a person position map and passes it to the speaker position determination unit 3-3. Taking the horizontal, vertical, and depth directions as seen from the sensor to be the X, Y, and Z directions respectively, each map divides the range of space detectable by the microphone array and the sensor into cells at regular intervals along the X, Y, and Z axes and holds, for each cell, the computed probability that a sound source or a person is present there; the sound source position map and the person position map cover the same space. The speaker position determination unit 3-3 computes the product of the probabilities in each pair of corresponding cells of the two maps, determines the cell with the largest product to be the speaker's position, and passes the speaker position information to other equipment.
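The cell-wise product and arg-max described above can be sketched as follows (an illustration only, not part of the patent; Python with NumPy, the grid size, and the probability values are assumptions):

```python
import numpy as np

def locate_speaker(sound_map: np.ndarray, person_map: np.ndarray):
    """Fuse the two probability maps: multiply the sound-source and
    person probabilities cell by cell and return the cell with the
    largest product as the speaker's position."""
    assert sound_map.shape == person_map.shape
    joint = sound_map * person_map
    cell = np.unravel_index(np.argmax(joint), joint.shape)
    return tuple(int(i) for i in cell), float(joint[cell])

# Toy 4x3x2 grid. A loud cell with no person in it loses to a quieter
# cell that also contains a person.
sound = np.full((4, 3, 2), 0.01)
sound[1, 2, 0] = 0.9      # person speaking here
sound[3, 0, 1] = 0.5      # loud source (e.g. a loudspeaker), no person
person = np.full((4, 3, 2), 0.05)
person[1, 2, 0] = 0.8
cell, score = locate_speaker(sound, person)
print(cell)               # (1, 2, 0)
```

This is why the product is used rather than either map alone: a cell must be both acoustically active and occupied by a person to win.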

[0009] An outline of the processing of the present invention is described with reference to the flowchart of FIG. 2. First, in step S1, when there is speech input, the sound source position detection unit 3-1 analyzes the input signal of the microphone array 1, computes the sound source probability for each cell of the sound source position map, and completes the sound source map. Next, in step S2, the person position detection unit 3-2 applies image processing to the input signal from the sensor 2, computes the person probability for each cell of the person position map, and completes the person position map. Steps S1 and S2 both run continuously, and neither necessarily precedes the other. Once the two maps are complete, in step S3 the speaker position determination unit 3-3 computes the product of the corresponding cells of the sound source position map and the person position map and determines the cell with the largest product to be the speaker's position. Once the speaker's position has been identified, in step S4 the speaker position determination unit 3-3 passes it to other equipment.

[0010]

Embodiment 1

FIG. 3 shows an embodiment of the present invention. When speech occurs, the sound source position detection unit 3-1 divides the space within the detection range into cells at regular intervals, computes for each cell the probability that the sound source is present from the audio information supplied by the microphone array 1, and creates the sound source position map. The XYZ coordinates of the sound source can be obtained from the audio information from the microphone array 1.
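The patent does not say how the array recovers the source coordinates; a common technique is time-difference-of-arrival (TDOA) estimation between microphone pairs, sketched here under that assumption (the signals, sample rate, and cross-correlation approach are all illustrative):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximately, at room temperature

def tdoa_seconds(sig_a: np.ndarray, sig_b: np.ndarray, fs: float) -> float:
    """Time difference of arrival between two microphones, taken from
    the peak of the cross-correlation; positive means the sound
    reached mic A first."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_a) - 1)  # samples by which B trails A
    return lag / fs

# Synthetic check: the same click reaches mic B 5 samples after mic A.
fs = 8000.0
a = np.zeros(64); a[20] = 1.0
b = np.zeros(64); b[25] = 1.0
dt = tdoa_seconds(a, b, fs)
print(round(dt * fs))  # 5
# With mic spacing d, the bearing follows from sin(theta) = SPEED_OF_SOUND * dt / d;
# delays across several microphone pairs pin down X, Y, and Z together.
```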

[0011] The person position detection unit 3-2 continuously divides the space within the detection range into cells at regular intervals based on the information from the sensor 2, obtains the probability that a person is present in each cell, and creates the person position map. Person positions are detected using template matching. For the sensor 2, any one of an ultrasonic sensor 2-2, an infrared sensor 2-3, or a television camera 2-4 is used, and the sensor 2 is mounted on a turntable 4. When the ultrasonic sensor 2-2 is used, it is combined with an ultrasonic transmitter 2-1.
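The patent names template matching without detailing it; one standard formulation is normalized cross-correlation, sketched below (an assumption for illustration; the template shape and image are synthetic, and the scores would feed the per-cell person probabilities):

```python
import numpy as np

def match_scores(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Slide a template over a grayscale image and score every position
    with normalized cross-correlation in [-1, 1]; higher means a closer
    match to the registered person shape."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            out[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return out

# A 3x3 "head" template planted at row 2, column 4 of a flat background.
template = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
image = np.zeros((8, 10)); image[2:5, 4:7] = template
scores = match_scores(image, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print(best)  # (2, 4) — exactly where the template was planted
```

In practice a library routine (e.g. an optimized matcher) would replace the double loop; the brute-force version keeps the arithmetic visible.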

[0012] When the sound source position detection range is wider than the sensing range of the sensor, the speaker position determination unit 3-3 instructs the turntable control unit 3-4 to rotate the turntable 4, associates the rotation angle with the sensing range, and aligns the sound source position detection range with the sensor's sensing range, thereby eliminating the sensor's blind spots. The sound source position map and the person position map cover the same space, are divided into cells at regular intervals, and correspond cell for cell, one to one.
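The rotate-toward-the-source step can be sketched as follows (illustrative only; the coordinate convention, field-of-view width, and function names are assumptions, not the patent's):

```python
import math

def pan_angle_deg(x: float, z: float) -> float:
    """Horizontal angle (degrees) from the sensor's optical axis to a
    sound source, with X to the right and Z pointing away from the
    sensor; this is the angle to command the turntable to."""
    return math.degrees(math.atan2(x, z))

def turntable_command(source_xz, current_pan_deg, fov_deg=60.0):
    """Rotate only when the source lies outside the sensor's field of
    view, i.e. in a blind spot; otherwise leave the sensor where it is."""
    target = pan_angle_deg(*source_xz)
    if abs(target - current_pan_deg) > fov_deg / 2:
        return target            # new pan angle to command
    return current_pan_deg       # source already visible; do not move

cmd = turntable_command((1.0, 1.0), 0.0)   # source at ~45 degrees, outside +/-30
print(round(cmd))                          # 45
print(turntable_command((0.2, 2.0), 0.0))  # 0.0 — within the FOV, no rotation
```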

[0013] When the sound source position map and the person position map are both complete, the speaker position determination unit 3-3 computes the product of the two maps' probabilities for each cell and identifies the cell with the largest product as the speaker's position. When the speaker position determination unit 3-3 has the calibration function, the input image from the sensor and the part judged to be the speaker's position are shown on the display device 6, a person judges whether the identified position is valid, and the correct speaker position can be entered from the input device 5. The speaker position determination unit 3-3 passes the identified speaker position information to other equipment.

[0014] The other equipment might be, for example, a video conference system: based on the speaker position information, the system can select, switch, rotate, or zoom the conference's input cameras, emphasize only the speaker's voice, point the display toward the speaker, and so on. In the embodiment described above, the sound source position map and the person position map are created in full and the speaker probability is computed for every cell, but it is not necessary to compute the probability for every cell. For example, by first obtaining the sound source position from the microphone array and restricting the sensor's person detection to the map cells within a certain range of that position, processing can be sped up considerably with almost no loss of speaker detection accuracy. Conversely, one could detect the person positions with the sensor first and compute the sound source probabilities only around the detected person positions to identify the speaker; in that case, however, the time-consuming image processing by the sensor comes first, so little speed-up can be expected.
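The narrowing step above amounts to building a region-of-interest mask around the sound source's cell and running person detection only inside it; a sketch (the grid size, radius, and Chebyshev-distance neighborhood are illustrative choices):

```python
import numpy as np

def roi_mask(shape, center, radius):
    """Boolean mask of grid cells within `radius` cells (Chebyshev
    distance) of the sound source's cell — the only cells where the
    costly person detection needs to run."""
    grids = np.indices(shape)
    center = np.reshape(center, (-1,) + (1,) * len(shape))
    dist = np.max(np.abs(grids - center), axis=0)
    return dist <= radius

shape = (20, 20, 10)                  # full map: 4000 cells
mask = roi_mask(shape, center=(5, 7, 3), radius=2)
print(int(mask.sum()))                # 125 — a 5x5x5 neighborhood instead of 4000 cells
```

Cells outside the mask simply keep probability zero, so the product-and-arg-max fusion step is unchanged.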

[0015] As another method, instead of creating the maps by dividing three-dimensional space as described above, processing can be sped up by creating the person position map on the two-dimensional XY plane, obtaining the Z-axis direction from the sound source position map, and identifying their intersection as the speaker's position.
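Combining the 2-D person map with the 1-D depth estimate reduces to two small arg-max operations whose indices intersect (a sketch under assumed toy data, not the patent's implementation):

```python
import numpy as np

def speaker_from_2d_and_depth(person_xy: np.ndarray, sound_z: np.ndarray):
    """Cheaper variant: person detection yields a 2-D XY probability map,
    the microphone array yields a 1-D probability over depth cells (Z);
    the speaker cell is the intersection of the two maxima."""
    y, x = np.unravel_index(np.argmax(person_xy), person_xy.shape)
    z = int(np.argmax(sound_z))
    return int(x), int(y), z

person_xy = np.zeros((4, 6)); person_xy[1, 3] = 0.9   # person at x=3, y=1
sound_z = np.array([0.1, 0.2, 0.6, 0.1])              # source most likely at z=2
print(speaker_from_2d_and_depth(person_xy, sound_z))  # (3, 1, 2)
```

The trade-off is that the image-side work drops from a 3-D grid to one plane, at the cost of assuming the loudest depth and the most person-like XY cell belong to the same person.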

[0016]

EFFECTS OF THE INVENTION

In the present invention, judging from the two kinds of information together — the sound source position detected by the microphone array and the person position detected by processing the sensor's input signal — makes it possible to detect the speaker's position with higher accuracy than before, and at the same time to identify speakers who are moving or whose lips cannot be detected in the image, which was previously impossible. In addition, by mounting the sensor on a turntable, the speaker's position can be determined even when the speaker is in the sensor's blind spot, by rotating the turntable.

[Brief description of the drawings]

FIG. 1 is a basic configuration diagram of the present invention.

FIG. 2 is a flowchart outlining the processing.

FIG. 3 shows an embodiment of the present invention.

[Explanation of symbols]

1 Microphone array
2 Sensor for image input
2-1 Ultrasonic wave generator
2-2 Ultrasonic sensor
2-3 Infrared sensor
2-4 Television camera
3 Control unit
3-1 Sound source position detection unit
3-2 Person position detection unit
3-3 Speaker position determination unit
3-4 Camera control unit
4 Turntable
5 Input device such as a keyboard and mouse
6 Display device such as a display

Claims (3)

[Claims]

1. A speaker position detecting device comprising: means for processing an input signal from any one sensor such as a television camera, an ultrasonic sensor, or an infrared sensor to detect a person's position; means for processing an input signal from a microphone array to detect a sound source's position; and means for determining the speaker's position by processing the above two kinds of information together, whereby a speaker at an arbitrary position within the sensing range can be detected.
2. The speaker position detecting device according to claim 1, wherein, taking the horizontal direction within the sensing range as seen from the sensor to be X, the vertical direction to be Y, and the depth direction to be Z, the speaker position detecting means obtains the position in the XY directions from the image and the position in the Z direction from the sound source to detect the speaker's position.
3. The speaker position detecting device according to claim 1, wherein the sensor is made rotatable, so that even when a speaker is in the sensor's blind spot, the speaker in the blind spot can be detected by rotating the sensor based on the sound source position.
JP9193630A 1997-07-18 1997-07-18 Speaker position detector Withdrawn JPH1141577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP9193630A JPH1141577A (en) 1997-07-18 1997-07-18 Speaker position detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP9193630A JPH1141577A (en) 1997-07-18 1997-07-18 Speaker position detector

Publications (1)

Publication Number Publication Date
JPH1141577A true JPH1141577A (en) 1999-02-12

Family

ID=16311147

Family Applications (1)

Application Number Title Priority Date Filing Date
JP9193630A Withdrawn JPH1141577A (en) 1997-07-18 1997-07-18 Speaker position detector

Country Status (1)

Country Link
JP (1) JPH1141577A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000077537A1 (en) * 1999-06-11 2000-12-21 Japan Science And Technology Corporation Method and apparatus for determining sound source
WO2001095314A1 (en) * 2000-06-09 2001-12-13 Japan Science And Technology Corporation Robot acoustic device and robot acoustic system
WO2002008782A1 (en) * 2000-07-20 2002-01-31 Robert Bosch Gmbh Method for the acoustic localization of persons in an area of detection
US6516066B2 (en) 2000-04-11 2003-02-04 Nec Corporation Apparatus for detecting direction of sound source and turning microphone toward sound source
US6583723B2 (en) 2001-02-23 2003-06-24 Fujitsu Limited Human interface system using a plurality of sensors
JP2003189273A (en) * 2001-12-20 2003-07-04 Sharp Corp Speaker identifying device and video conference system provided with speaker identifying device
JP2004126941A (en) * 2002-10-02 2004-04-22 P To Pa:Kk Image display apparatus, image display method and program
JP2004126784A (en) * 2002-09-30 2004-04-22 P To Pa:Kk Image display apparatus, image display method and program
US6795558B2 (en) * 1997-06-26 2004-09-21 Fujitsu Limited Microphone array apparatus
JP2005141687A (en) * 2003-11-10 2005-06-02 Nippon Telegr & Teleph Corp <Ntt> Method, device, and system for object tracing, program, and recording medium
JP2006245725A (en) * 2005-03-01 2006-09-14 Yamaha Corp Microphone system
KR100754385B1 (en) 2004-09-30 2007-08-31 삼성전자주식회사 Apparatus and method for object localization, tracking, and separation using audio and video sensors
JP2008113164A (en) * 2006-10-30 2008-05-15 Yamaha Corp Communication apparatus
JP2008145574A (en) * 2006-12-07 2008-06-26 Nec Access Technica Ltd Sound source direction estimation device, sound source direction estimation method and robot device
JP2009517936A (en) * 2005-11-30 2009-04-30 ノエミ バレンズエラ ミリアム Method for recording and playing back sound sources with time-varying directional characteristics
JP2010010857A (en) * 2008-06-25 2010-01-14 Oki Electric Ind Co Ltd Voice input robot, remote conference support system, and remote conference support method
JP2010251916A (en) * 2009-04-13 2010-11-04 Nec Casio Mobile Communications Ltd Sound data processing device and program
US7852369B2 (en) * 2002-06-27 2010-12-14 Microsoft Corp. Integrated design for omni-directional camera and microphone array
JP2011071702A (en) * 2009-09-25 2011-04-07 Fujitsu Ltd Sound pickup processor, sound pickup processing method, and program
US8249298B2 (en) 2006-10-19 2012-08-21 Polycom, Inc. Ultrasonic camera tracking system and associated methods
JP2014511476A (en) * 2011-02-10 2014-05-15 アトラス・コプコ・インダストリアル・テクニーク・アクチボラグ Positioning system for determining the position of an object
JP2018501671A (en) * 2015-11-27 2018-01-18 シャオミ・インコーポレイテッド Camera head photographing angle adjusting method and apparatus

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6795558B2 (en) * 1997-06-26 2004-09-21 Fujitsu Limited Microphone array apparatus
WO2000077537A1 (en) * 1999-06-11 2000-12-21 Japan Science And Technology Corporation Method and apparatus for determining sound source
US7035418B1 (en) 1999-06-11 2006-04-25 Japan Science And Technology Agency Method and apparatus for determining sound source
US6516066B2 (en) 2000-04-11 2003-02-04 Nec Corporation Apparatus for detecting direction of sound source and turning microphone toward sound source
US7215786B2 (en) 2000-06-09 2007-05-08 Japan Science And Technology Agency Robot acoustic device and robot acoustic system
WO2001095314A1 (en) * 2000-06-09 2001-12-13 Japan Science And Technology Corporation Robot acoustic device and robot acoustic system
WO2002008782A1 (en) * 2000-07-20 2002-01-31 Robert Bosch Gmbh Method for the acoustic localization of persons in an area of detection
US7224809B2 (en) 2000-07-20 2007-05-29 Robert Bosch Gmbh Method for the acoustic localization of persons in an area of detection
US6583723B2 (en) 2001-02-23 2003-06-24 Fujitsu Limited Human interface system using a plurality of sensors
US6686844B2 (en) 2001-02-23 2004-02-03 Fujitsu Limited Human interface system using a plurality of sensors
JP2003189273A (en) * 2001-12-20 2003-07-04 Sharp Corp Speaker identifying device and video conference system provided with speaker identifying device
US7852369B2 (en) * 2002-06-27 2010-12-14 Microsoft Corp. Integrated design for omni-directional camera and microphone array
JP2004126784A (en) * 2002-09-30 2004-04-22 P To Pa:Kk Image display apparatus, image display method and program
JP2004126941A (en) * 2002-10-02 2004-04-22 P To Pa:Kk Image display apparatus, image display method and program
JP2005141687A (en) * 2003-11-10 2005-06-02 Nippon Telegr & Teleph Corp <Ntt> Method, device, and system for object tracing, program, and recording medium
JP4490076B2 (en) * 2003-11-10 2010-06-23 日本電信電話株式会社 Object tracking method, object tracking apparatus, program, and recording medium
KR100754385B1 (en) 2004-09-30 2007-08-31 삼성전자주식회사 Apparatus and method for object localization, tracking, and separation using audio and video sensors
US7536029B2 (en) 2004-09-30 2009-05-19 Samsung Electronics Co., Ltd. Apparatus and method performing audio-video sensor fusion for object localization, tracking, and separation
JP2006245725A (en) * 2005-03-01 2006-09-14 Yamaha Corp Microphone system
JP2009517936A (en) * 2005-11-30 2009-04-30 ノエミ バレンズエラ ミリアム Method for recording and playing back sound sources with time-varying directional characteristics
US8249298B2 (en) 2006-10-19 2012-08-21 Polycom, Inc. Ultrasonic camera tracking system and associated methods
JP2008113164A (en) * 2006-10-30 2008-05-15 Yamaha Corp Communication apparatus
JP2008145574A (en) * 2006-12-07 2008-06-26 Nec Access Technica Ltd Sound source direction estimation device, sound source direction estimation method and robot device
JP2010010857A (en) * 2008-06-25 2010-01-14 Oki Electric Ind Co Ltd Voice input robot, remote conference support system, and remote conference support method
JP2010251916A (en) * 2009-04-13 2010-11-04 Nec Casio Mobile Communications Ltd Sound data processing device and program
JP2011071702A (en) * 2009-09-25 2011-04-07 Fujitsu Ltd Sound pickup processor, sound pickup processing method, and program
JP2014511476A (en) * 2011-02-10 2014-05-15 アトラス・コプコ・インダストリアル・テクニーク・アクチボラグ Positioning system for determining the position of an object
JP2018501671A (en) * 2015-11-27 2018-01-18 シャオミ・インコーポレイテッド Camera head photographing angle adjusting method and apparatus
US10375296B2 (en) 2015-11-27 2019-08-06 Xiaomi Inc. Methods apparatuses, and storage mediums for adjusting camera shooting angle

Similar Documents

Publication Publication Date Title
JPH1141577A (en) Speaker position detector
EP1715717B1 (en) Moving object equipped with ultra-directional speaker
JP3195920B2 (en) Sound source identification / separation apparatus and method
US20020140804A1 (en) Method and apparatus for audio/image speaker detection and locator
CN109565629B (en) Method and apparatus for controlling processing of audio signals
JP2004514359A (en) Automatic tuning sound system
CN111918018B (en) Video conference system, video conference apparatus, and video conference method
KR101808714B1 (en) Vehicle Center Fascia Control Method Based On Gesture Recognition By Depth Information And Virtual Touch Sensor
US20140086551A1 (en) Information processing apparatus and information processing method
EP2031905A2 (en) Sound processing apparatus and sound processing method thereof
JP2023024471A (en) Information processor and method for processing information
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
US11514108B2 (en) Content search
KR20130046759A (en) Apparatus and method for recogniting driver command in a vehicle
CN113853529A (en) Apparatus, and associated method, for spatial audio capture
JPH11313272A (en) Video/audio output device
US20230186642A1 (en) Object detection method
JP2016161626A (en) Control device, program, and projection system
JPH09182044A (en) Television conference system
JP2000041228A (en) Speaker position detector
JP2001194141A (en) Method for detecting position of moving body
JP2003078818A (en) Telop device
US20230421984A1 (en) Systems and methods for dynamic spatial separation of sound objects
EP4250768A1 (en) An apparatus for mapping sound source direction
WO2023054047A1 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
A300 Application deemed to be withdrawn because no request for examination was validly filed

Free format text: JAPANESE INTERMEDIATE CODE: A300

Effective date: 20041005