JP2017168903A - Information processing apparatus, conference system, and method for controlling information processing apparatus

Info

Publication number: JP2017168903A
Application number: JP2016049714A
Authority: JP
Inventors: Tomoyuki Goto
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-03-14
Filing date: 2016-03-14
Publication date: 2017-09-21

Abstract

PROBLEM TO BE SOLVED: To provide an information processing apparatus having a speaker tracking function in which, depending on the state of speaker tracking, at least some of the microphones in a microphone array are not used for inputting the voices of speakers, so that those microphones can be used for another purpose.

SOLUTION: An information processing apparatus comprises: a microphone array 114 that includes a plurality of microphones; a speaker 115 that outputs voices; a sound source direction detection part 130 that detects the direction of voices input to the microphone array 114 on the basis of the voices input to the plurality of microphones; and a voice input control part 132 that, according to the detection result of the sound source direction detection part 130, designates at least one of the microphones as a microphone not used for collecting voices from speakers.

SELECTED DRAWING: Figure 2

Description

The present invention relates to an information processing apparatus, a conference system, and a method for controlling the information processing apparatus.

In recent years, conference systems (also called remote conference systems, teleconference systems, or video conference systems) have become widespread. Such a system connects terminal devices (also called conference terminals) installed at remote locations (sites) via a network such as the Internet to hold a remote conference (also called a teleconference or video conference).

A conference terminal in such a conference system is installed in a conference room or the like at each site and conducts a remote conference by exchanging images and voices of conference attendees with the conference terminal of the other party. Specifically, each conference terminal images the conference attendees with a camera, collects their voices with a microphone, and transmits the image data and audio data to the other party's conference terminal. At the same time, it receives image data and audio data transmitted from the other party's conference terminal, displays a conference screen using the received image data on a display unit (monitor), and outputs the audio data from a speaker.

It is known to use, as the microphone of a conference terminal, a microphone array (also referred to as voice input means) in which a plurality of microphones (also referred to as voice input units) for inputting voice are arranged. Also known is a conference terminal having a function (referred to as a speaker tracking function) that identifies the direction from which voice (that is, an utterance of a conference participant) is input, based on the differences in the time at which sound from the source reaches each microphone of the array (referred to as a sound source direction detection function or sound source direction detection process), thereby detecting which participant is actually speaking (referred to as the speaker) and imaging the speaker with a camera.

However, remote conferences are not always held in an environment free of ambient noise. When a conference terminal is installed, for example, in a space separated only by simple partitions or in a conference room with thin walls, noise from the surroundings that is unrelated to the remote conference may be picked up by the microphone of the conference terminal. In this case, the voices of the conference participants become difficult to hear at the other party's site.

To address this, Patent Document 1, for example, discloses a video conferencing apparatus that determines the arrangement of people based on an image captured by a camera and individually amplifies and sums the voices input from a plurality of microphones according to the determined arrangement. The apparatus thereby automatically sets microphone sound collection characteristics suited to the arrangement of the conference attendees, without user operation, and accurately collects their speech.

In a remote conference, participants are rarely distributed evenly around the microphone array, and voice usually arrives from only a few directions. Even in such cases, all the microphones of the microphone array have conventionally been used for collecting the speaker's voice, leaving room for improvement in the control of the sound collection characteristics of the microphone array.

Accordingly, an object of the present invention is to provide an information processing apparatus having a speaker tracking function in which, depending on the state of speaker tracking, at least some of the microphones of the microphone array are not used for inputting the speaker's voice, so that those microphones can be used for other purposes.

To achieve this object, an information processing apparatus according to the present invention comprises: voice input means including a plurality of voice input units; voice output means that outputs voice; sound source direction detection means that detects the direction from which voice is input to the voice input means, based on the voice input to the plurality of voice input units; and voice input control means that, according to the detection result of the sound source direction detection means, designates at least one of the voice input units as a voice input unit not used for collecting voice from a speaker.

According to the present invention, in an information processing apparatus having a speaker tracking function, at least some of the microphones of the microphone array are not used for inputting the speaker's voice, depending on the state of speaker tracking, so that those microphones can be used for other purposes.

FIG. 1 is a block diagram showing a configuration example of a video conference system.
FIG. 2 is a block diagram showing an example of the main internal configuration of a conference terminal.
FIG. 3 is an explanatory diagram of the sound source direction detection process.
FIG. 4 is an explanatory diagram (1) of the sound collection control of the microphone array.
FIG. 5 is an explanatory diagram (2) of the sound collection control of the microphone array.
FIG. 6 is an explanatory diagram of input signals in the noise cancellation process.
FIG. 7 is a flowchart showing an example of the sound collection control of the microphone array.
FIG. 8 is an explanatory diagram (3) of the sound collection control of the microphone array.

Hereinafter, the configuration according to the present invention will be described in detail based on the embodiment shown in FIGS. 1 to 8.

[First Embodiment]
(Conference system configuration)
The configuration of a video conference system, which is an embodiment of the conference system according to the present invention, will be described.

FIG. 1 is a block diagram illustrating a configuration example of the video conference system 1. As shown in FIG. 1, the video conference system 1 includes a server 3 and a plurality of conference terminals 5 (5-1, 5-2, 5-3, 5-4, ...), which are connected via a network N such as the Internet. A server computer, a workstation, or the like can be used as the server 3. As the conference terminal 5, a dedicated conference terminal device (information processing apparatus) or a general-purpose information processing apparatus such as a personal computer can be used.

The server 3 performs processing such as: monitoring whether a communication connection is established with each conference terminal 5; calling, at the start of a conference, the conference terminals 5 installed at the sites participating in the video conference (participating sites); and transferring the image data and audio data transmitted during the video conference from the conference terminal 5 of a participating site, whose communication connection was established in response to the call, to the conference terminals 5 of the other parties (the other participating sites).

Each conference terminal 5 is installed in a conference room or the like at a remote site and is operated by the attendees of the video conference. During a video conference, the conference terminal 5 at each participating site transmits to the server 3 the image data of the attendees captured by the camera 112 (described later) and the audio data of the attendees collected by the microphone array 114. It also receives the image data and audio data transmitted from the conference terminals 5 of the other participating sites and transferred by the server 3, displays the image data on the display 120 as a conference screen, and outputs the audio data from the speaker 115 (sound emission).

For example, in a video conference in which the three conference terminals 5-1 to 5-3 shown in FIG. 1 participate, the image data and audio data transmitted from the conference terminal 5-1 are transferred, under the control of the server 3, to the other parties, the conference terminals 5-2 and 5-3, but not to the conference terminal 5-4. Similarly, the image data and audio data transmitted from the conference terminals 5-2 and 5-3 are transferred under the control of the server 3 to their respective other parties (the conference terminals 5-1 and 5-3, and the conference terminals 5-1 and 5-2, respectively) and are not transferred to the conference terminal 5-4. In this way, in the video conference system 1, a video conference is held between participating sites where two or more conference terminals 5 with established communication connections to the server 3 are installed.
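The forwarding rule in this example can be sketched as follows. This is only an illustration under stated assumptions: the function name `forward_targets` and the representation of terminals as strings are hypothetical and do not appear in the publication, which does not describe how the server 3 is implemented.

```python
def forward_targets(sender, participants):
    """Return the terminals that should receive data sent by `sender`.

    Hypothetical helper: data from a participating terminal goes to every
    other participating terminal and to no one else.
    """
    if sender not in participants:
        return []  # a terminal outside the conference has no recipients
    return [terminal for terminal in participants if terminal != sender]


# Terminals 5-1 to 5-3 participate; terminal 5-4 does not.
participants = ["5-1", "5-2", "5-3"]
for sender in participants:
    print(sender, "->", forward_targets(sender, participants))
```

Terminal 5-4 never appears among the recipients because it is not in the participant list.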

(Conference terminal configuration)
FIG. 2 is a block diagram illustrating an example of the main internal configuration of the conference terminal 5. As shown in FIG. 2, the conference terminal 5 includes: a CPU (Central Processing Unit) 101 that controls the overall operation of the conference terminal 5; a ROM (Read Only Memory) 102 that stores programs used to drive the CPU 101, such as an IPL (Initial Program Loader); a RAM (Random Access Memory) 103 used as a work area for the CPU 101; a flash memory 104 that stores various data such as a terminal program, image data, and audio data; an SSD (Solid State Drive) 105 that controls reading and writing of various data from and to the flash memory 104 under the control of the CPU 101; a media drive 107 that controls reading and writing (storage) of data from and to a recording medium 106 such as a flash memory; an operation unit 108 operated, for example, to select a destination of the conference terminal 5; a power switch 109 for switching the power of the conference terminal 5 on and off; and a network I/F (Interface) 111 for transmitting data over the network N.

The operation unit 108 is implemented by input devices such as a keyboard, a mouse, a touch panel, and various switches, and outputs input data corresponding to operation inputs to the CPU 101.

The network I/F 111 performs data communication with external devices (for example, the server 3). It connects to the network N via a LAN and exchanges image data, audio data, and the like with the other party's conference terminal 5 through the server 3. As the network I/F 111, an interface suited to the connection mode can be adopted as appropriate, such as one that performs control conforming to 10Base-T, 100Base-TX, 1000Base-T, or the like and connects to Ethernet (registered trademark) (wired LAN), or one that performs control conforming to IEEE 802.11a/b/g/n (wireless LAN).

The conference terminal 5 also includes: a built-in camera 112 that images a subject under the control of the CPU 101 to obtain image data; an image sensor I/F 113 that controls driving of the camera 112; a built-in microphone array 114 that inputs voice; a built-in speaker 115 that outputs voice; a voice input/output I/F 116 that processes input and output of audio signals to and from the microphone array 114 and the speaker 115 under the control of the CPU 101; a display I/F 117 that transmits image data to an external display 120 under the control of the CPU 101; an external device connection I/F 118 for connecting various external devices; and a bus line 110, such as an address bus and a data bus, for electrically connecting the above components.

The camera 112, serving as imaging means, includes a lens and a solid-state image sensor that converts light into electric charge to digitize an image (video) of the subject. A CMOS (Complementary Metal Oxide Semiconductor) image sensor, a CCD (Charge Coupled Device) image sensor, or the like is used as the solid-state image sensor.

The camera 112 is for inputting images of the conference attendees. It captures the scene in the conference room and outputs the generated image data to the CPU 101 as needed. The camera 112 is controlled so that its imaging direction and imaging range are switched to follow the direction of the speaker detected using the microphone array 114.

The microphone array 114 is formed by arranging a plurality of microphones for inputting the voices of the conference attendees, and outputs the collected audio data of the attendees to the CPU 101 as needed.

The microphone array 114 may be a microphone array unit separate from the main body of the conference terminal 5, which includes the camera 112 and the speaker 115. In this case, the microphone array unit and the main body are connected by wire or wirelessly. This increases the freedom in positioning the microphone array.

The speaker 115 is voice output means that outputs the audio data input from the CPU 101.

External devices such as an external camera, an external microphone, and an external speaker can each be connected to the external device connection I/F 118 by a USB (Universal Serial Bus) cable or the like. For example, when an external camera is connected, the external camera may be operated in preference to the built-in camera 112 under the control of the CPU 101. Similarly, when an external microphone or an external speaker is connected, it may be driven in preference to the built-in microphone array 114 or the built-in speaker 115, respectively, under the control of the CPU 101.

The recording medium 106 is detachable from the conference terminal 5. In addition, the nonvolatile memory from and to which data is read and written under the control of the CPU 101 is not limited to the flash memory 104; an EEPROM (Electrically Erasable and Programmable ROM) or the like may be used.

Furthermore, the terminal program may be recorded, as a file in an installable or executable format, on a computer-readable recording medium such as the recording medium 106 and distributed. The terminal program may also be stored in the ROM 102 instead of the flash memory 104.

The display 120 is a display unit, such as an LCD, an EL display, or a CRT display, that displays images of the subject, operation icons, and the like. It displays various screens, such as a conference screen showing the image data input from the CPU 101. The display 120 is connected to the display I/F 117 by a cable 120c. The cable 120c may be a cable for analog RGB (VGA) signals, a cable for component video, or a cable for HDMI (registered trademark) (High-Definition Multimedia Interface) or DVI (Digital Visual Interface) signals.

The CPU 101 controls the overall operation of the conference terminal 5 by issuing instructions and transferring data to the units of the conference terminal 5, based on the image data input from the camera 112, the audio data input from the microphone array 114, the image data and audio data from the other party's conference terminal 5 input via the network I/F 111, the input data from the operation unit 108, and the programs and data recorded in the flash memory 104 or the like. For example, after a communication connection with the server 3 is established in response to a call from the server 3, the CPU 101 repeatedly performs, in parallel, processing for transmitting to the server 3 the image data input from the camera 112 and the audio data input from the microphone array 114, and processing for receiving the image data and audio data from the other party's conference terminal 5 transferred by the server 3.

Specifically, the CPU 101 encodes the image data input as needed from the camera 112 and the audio data input as needed from the microphone array 114 during the video conference and outputs the encoded data to the network I/F 111, thereby transmitting them to the server 3. The CPU 101 performs coding in accordance with standards such as H.264/AVC and H.264/SVC, for example.

In parallel with this, the CPU 101 receives, via the network I/F 111, the image data and audio data transmitted from the other party's conference terminal 5 and transferred by the server 3. The CPU 101 has a codec function that decodes the received image data and audio data and sends them to the display 120 and the speaker 115, respectively. The image and voice input at the other party's conference terminal 5 are thereby reproduced.

The CPU 101 also includes a sound source direction detection unit 130 that executes the sound source direction detection process based on the input from each microphone of the microphone array 114. FIG. 3 is an explanatory diagram of the sound source direction detection process performed by the sound source direction detection unit 130. The process identifies the direction from which voice is input, based on the differences in the time at which sound from the source reaches each microphone of the microphone array 114. For example, as shown in FIG. 3, when voice from a speaker S1 (the sound source) is input to four microphones (microphones 1 to 4), the voice input direction (that is, the direction of the speaker) can be detected based on the arrival time difference between microphones 1 and 2 (Δt1), between microphones 1 and 3 (Δt2), and between microphones 1 and 4 (Δt3). A known or new method can be applied as the sound source direction detection process.
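One way to turn three arrival time differences into a direction can be sketched as below. The publication leaves the method open ("known or new"), so this is not the method of the sound source direction detection unit 130; it is a simple grid search under assumptions chosen for illustration: a far-field (plane wave) source, microphones at hypothetical 2D coordinates on a 10 cm square, and a sound speed of 343 m/s. The names `expected_delays` and `estimate_direction` are likewise hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def expected_delays(mics, angle_deg):
    """Arrival time differences of mics[1:] relative to mics[0] for a
    far-field source located in direction `angle_deg` (plane-wave model)."""
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector toward the source
    x0, y0 = mics[0]
    # A microphone farther along the source direction hears the sound earlier.
    return [-((x - x0) * ux + (y - y0) * uy) / SPEED_OF_SOUND
            for x, y in mics[1:]]


def estimate_direction(mics, measured_delays, step_deg=1):
    """Grid search: the angle whose expected delays best match the measurement."""
    best_angle, best_err = None, float("inf")
    for angle in range(0, 360, step_deg):
        err = sum((e - m) ** 2
                  for e, m in zip(expected_delays(mics, angle), measured_delays))
        if err < best_err:
            best_angle, best_err = angle, err
    return best_angle


# Microphones 1 to 4 on a 10 cm square (assumed layout, in metres).
mics = [(0.05, 0.05), (-0.05, 0.05), (-0.05, -0.05), (0.05, -0.05)]
dt1, dt2, dt3 = expected_delays(mics, 30)  # simulated Δt1, Δt2, Δt3
print(estimate_direction(mics, [dt1, dt2, dt3]))
```

In practice the delays would first have to be measured from the microphone signals, for example by cross-correlation, which this sketch omits.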

The CPU 101 also includes an imaging range control unit 131 (imaging range control means) that controls the imaging range of the camera 112 based on the detection result of the sound source direction detection unit 130. For example, the camera 112 is mounted so that its imaging direction can be rotated, and the CPU 101 controls the rotation based on the detected direction of the speaker. Alternatively, the camera 112 may use a wide-angle lens so that all conference attendees fall within its field of view (angle of view), and the imaging range may be switched by digital processing based on the detected direction of the speaker.

The CPU 101 further includes a voice input control unit 132 that controls the sound collection characteristics of the microphone array 114 based on the detection result of the sound source direction detection unit 130. The voice input control unit 132 is described below.

(Voice input control)
FIG. 4 is an explanatory diagram of the sound collection control of the microphone array 114 performed by the voice input control unit 132. In the example shown in FIG. 4, the microphone array 114 consists of eight microphones A to H. The voice input control unit 132 steers the directivity of the microphones (microphone beam) based on the result of the sound source direction detection process (that is, the direction of the speaker). FIG. 4 shows an example in which a range D1 under directivity control is formed according to the input direction I1 of the voice from the speaker S1, and a range D2 under directivity control is formed according to the input direction I2 of the voice from the speaker S2.

Now consider the situation some time after a remote conference has started. Participants joining partway through, leaving their seats, or changing seats are not frequent events, so once the conference has been running for a while, it can be recognized that the microphone directivity of the microphone array 114 has been steered toward only a limited number of directions since the start.

In other words, the positions of the conference participants relative to the installed microphone array 114 are normally fixed, so the microphones of the microphone array 114 can be divided into those that should be used for inputting the speakers' voices in the remote conference and those that are not needed for that purpose. It can also be said that sound arriving from a direction in which no speaker is present is likely to be noise.

Focusing on this point, the present embodiment identifies the microphones of the microphone array 114 that need not be used for inputting the speakers' voices, and either turns off the input of these microphones or uses them for other purposes.

FIG. 5 is an explanatory diagram showing details of the sound collection control of the microphone array 114 in the situation shown in FIG. 4. In the example of FIG. 5, sound collection is controlled so that the microphones A, G, and H of the microphone array 114 are enabled for the input direction I1 of the voice from the speaker S1, and the microphones A, B, and C are enabled for the input direction I2 of the voice from the speaker S2.

That is, in the example of FIG. 5, the five microphones A, B, C, G, and H (also referred to as sound collection microphones 14a) are used for collecting voice from the speakers, while the three microphones D, E, and F are not needed for collecting voice from the speakers (and are also referred to as non-sound-collection microphones 14b). Sound arriving from the input direction I3 is likely to be noise.
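The split into sound collection microphones 14a and non-sound-collection microphones 14b can be sketched as follows. The geometry is an assumption made for illustration and is not taken from the drawings: microphones A to H are placed at 45 degree steps on a circle, the speakers S1 and S2 are at 315 and 45 degrees, and each detected direction enables the nearest microphone and its two neighbours. All function and variable names are hypothetical.

```python
# Assumed layout: microphone A at 0 degrees, B at 45 degrees, ..., H at 315 degrees.
MIC_ANGLES = {name: index * 45 for index, name in enumerate("ABCDEFGH")}


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)


def mics_for_direction(direction_deg, spread_deg=45):
    """Microphones enabled for one voice input direction."""
    return {name for name, angle in MIC_ANGLES.items()
            if angular_distance(angle, direction_deg) <= spread_deg}


def classify(speaker_directions):
    """Split the array into sound collection and non-sound-collection microphones."""
    collecting = set()
    for direction in speaker_directions:
        collecting |= mics_for_direction(direction)
    non_collecting = set(MIC_ANGLES) - collecting
    return sorted(collecting), sorted(non_collecting)


# Assumed directions I1 (speaker S1) and I2 (speaker S2).
collecting, non_collecting = classify([315, 45])
print("14a:", collecting)      # ['A', 'B', 'C', 'G', 'H']
print("14b:", non_collecting)  # ['D', 'E', 'F']
```

With these assumed angles the result matches the grouping described for FIG. 5; a different layout or spread would give a different split.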

Accordingly, the information processing apparatus (conference terminal 5) according to the present embodiment includes: voice input means (microphone array 114) including a plurality of voice input units (microphones A to H); voice output means (speaker 115) that outputs voice; sound source direction detection means (sound source direction detection unit 130) that detects the direction from which voice is input to the voice input means, based on the voice input to the plurality of voice input units; and voice input control means (voice input control unit 132) that, according to the detection result of the sound source direction detection means, designates at least one of the voice input units as a voice input unit not used for collecting voice from a speaker (non-sound-collection microphone 14b). The parentheses give the corresponding reference numerals and examples in the embodiment.

For example, the voice input control unit 132 turns off voice input for the non-sound-collection microphones 14b (microphones D, E, and F). This resolves the situation in which microphones of the microphone array 114 are left over unused, and it suppresses noise being input and then output from the connected conference terminal 5, which prevents the participants' voices from becoming hard to hear and enables smooth conversation.

Alternatively, the voice input control unit 132 performs control to divert the non-sound-collecting microphones 14b (microphones D, E, and F) to another use. For example, the non-sound-collecting microphones 14b are preferably used as noise-canceling microphones. Noise cancellation is a process that reduces noise by collecting the sound that constitutes noise and superimposing on it a signal of opposite phase so as to cancel the collected sound (noise). A known or new technique can be applied as the noise cancellation process.

FIG. 6 is an explanatory diagram of the input signals in the noise cancellation process. As shown in FIG. 6(A), the audio signal input to the microphone array 114 contains a voice input signal from the speaker whose voice is to be collected (main voice signal M) and a voice input signal caused by other noise and the like (noise signal N).

By a noise cancellation process that superimposes the noise cancellation signal C shown in FIG. 6(B) on the input signal shown in FIG. 6(A), a noise-reduced signal R in which the noise has been reduced is obtained, as shown in FIG. 6(C).
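The relationship among the signals M, N, C, and R in FIG. 6 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function name `cancel_noise`, the tone frequencies, and the sample rate are hypothetical, and the sketch assumes the idealized case in which the noise picked up by the non-sound-collecting microphones is identical to, and time-aligned with, the noise component of the input. A practical system would typically need adaptive filtering to estimate that component.

```python
import math

def cancel_noise(mixed, noise_reference):
    """Superimpose the phase-inverted noise reference (signal C) on the input."""
    cancel_signal = [-n for n in noise_reference]         # signal C: opposite phase of N
    return [x + c for x, c in zip(mixed, cancel_signal)]  # noise-reduced signal R

# Hypothetical test data: a 440 Hz tone as main voice signal M and a
# 50 Hz hum as noise signal N, both sampled at 8 kHz.
rate = 8000
main = [math.sin(2 * math.pi * 440 * t / rate) for t in range(80)]
noise = [0.3 * math.sin(2 * math.pi * 50 * t / rate) for t in range(80)]
mixed = [m + n for m, n in zip(main, noise)]  # input signal of FIG. 6(A)

reduced = cancel_noise(mixed, noise)
residual = max(abs(r - m) for r, m in zip(reduced, main))
```

Under these idealized assumptions, `residual` is at the level of floating-point rounding error, that is, R coincides with M.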

For example, in the example shown in FIG. 5, when voice arrives from the speaker S1 in the input direction I1, the voice is collected by the sound-collecting microphones 14a (microphones A, G, and H) of the microphone array 114, noise is collected by the non-sound-collecting microphones 14b (microphones D, E, and F), and the noise cancellation process is executed using the sound collected by the non-sound-collecting microphones 14b. At this time, the remaining microphones B and C are kept in a state in which voice input is possible, because the speaker S2 may start speaking.

Likewise, when voice arrives from the speaker S2 in the input direction I2, the voice is collected by the sound-collecting microphones 14a (microphones A, B, and C) of the microphone array 114, noise is collected by the non-sound-collecting microphones 14b (microphones D, E, and F), and the noise cancellation process is executed using the sound collected by the non-sound-collecting microphones 14b. At this time, the remaining microphones G and H are kept in a state in which voice input is possible, because the speaker S1 may start speaking.
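The per-direction role assignment described for FIG. 5 can be sketched as a simple lookup. This is only an illustration: the table `DIRECTION_TO_MICS`, the role labels, and the function `assign_roles` are hypothetical names, and the microphone groupings are copied from the FIG. 5 example rather than derived from any array geometry.

```python
MICS = "ABCDEFGH"

# Microphones facing each speaker direction, as in the FIG. 5 example.
DIRECTION_TO_MICS = {"I1": {"A", "G", "H"}, "I2": {"A", "B", "C"}}
# Non-sound-collecting microphones 14b, reused as the noise reference.
NOISE_MICS = {"D", "E", "F"}

def assign_roles(active_direction):
    """Return a role for every microphone while voice arrives from one direction."""
    roles = {}
    for mic in MICS:
        if mic in DIRECTION_TO_MICS[active_direction]:
            roles[mic] = "collect"          # picks up the current speaker
        elif mic in NOISE_MICS:
            roles[mic] = "noise_reference"  # feeds the noise cancellation process
        else:
            roles[mic] = "standby"          # kept ready for the other speaker
    return roles

roles_i1 = assign_roles("I1")
roles_i2 = assign_roles("I2")
```

With this table, microphones B and C are on standby while S1 speaks, and G and H are on standby while S2 speaks, matching the two cases described above.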

The sound collection control of the microphone array 114 described so far will now be explained with reference to the flowchart of FIG. 7. First, when a video conference is started by pressing the conference start button of the conference terminal 5 or by a conference call from the conference terminal 5 of the other party (S101), the sound source direction detection unit 130 detects the input direction of voice to the microphone array 114 (S102). The detection results are accumulated in a temporary storage device (S103).

Next, based on the accumulated detection results of the voice input direction, the voice input control unit 132 classifies the microphones of the microphone array 114 into sound-collecting microphones 14a and non-sound-collecting microphones 14b (S104). The criterion for this classification may be anything based on, for example, the detection ratio or the number of detections in the sound source direction detection process, and the classification method is not limited. For example, a microphone corresponding to a direction that has not been detected as a sound source direction for a predetermined time in the sound source direction detection process can be set as a non-sound-collecting microphone 14b, and the other microphones can be set as sound-collecting microphones 14a. As another example, a microphone corresponding to a direction that accounts for at least a predetermined ratio of the detections made in the sound source direction detection process can be set as a sound-collecting microphone 14a, and the other microphones can be set as non-sound-collecting microphones 14b.
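The two example criteria for step S104 can be sketched in Python as follows. The text leaves the classification method open, so this is only one possible reading: the function names, the direction-to-microphone table, the thresholds, and the sample history are all hypothetical.

```python
from collections import Counter

ALL_MICS = set("ABCDEFGH")
# Hypothetical mapping from detected directions to the microphones facing them.
DIRECTION_TO_MICS = {"I1": {"A", "G", "H"}, "I2": {"A", "B", "C"}, "I3": {"D", "E", "F"}}

def classify_by_ratio(detections, min_ratio):
    """Directions with at least min_ratio of all detections get collecting mics."""
    counts = Counter(detections)
    total = len(detections)
    collecting = set()
    for direction, count in counts.items():
        if total and count / total >= min_ratio:
            collecting |= DIRECTION_TO_MICS.get(direction, set())
    return collecting, ALL_MICS - collecting

def classify_by_idle_time(last_detected, now, idle_limit):
    """Directions not detected within idle_limit seconds lose their collecting mics."""
    collecting = set()
    for direction, mics in DIRECTION_TO_MICS.items():
        last = last_detected.get(direction)
        if last is not None and now - last <= idle_limit:
            collecting |= mics
    return collecting, ALL_MICS - collecting

# Accumulated detection results (S103): I3 appears only once, e.g. a door slam.
history = ["I1"] * 6 + ["I2"] * 5 + ["I3"]
ratio_result = classify_by_ratio(history, min_ratio=0.2)
idle_result = classify_by_idle_time({"I1": 95.0, "I2": 98.0, "I3": 10.0},
                                    now=100.0, idle_limit=30.0)
```

With this sample history both criteria yield microphones A, B, C, G, and H as sound-collecting microphones and D, E, and F as non-sound-collecting microphones, which is the split shown in FIG. 5.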

Next, the microphones determined in S104 to be non-sound-collecting microphones 14b are set as noise-canceling microphones (S105), and the noise cancellation process is executed using the sound (noise) input from the non-sound-collecting microphones 14b (S106).

As described above, the conference terminal 5 according to the present embodiment has a function of performing speaker tracking based on the input direction of voice to the microphone array 114. In addition, as the conference progresses, microphones that do not lie in any voice input direction (microphones facing directions in which no speaker is present) are no longer treated as microphones for speaker voice input, which makes it possible to use some of the microphones of the microphone array 114 for purposes other than voice input.

In conventional conference terminals, the emphasis for the sound collection range of the microphones and the imaging range of the camera was on collecting sound and capturing images over a wide range and conveying them to the other site. With the realization of the speaker tracking function, however, what is conveyed is the speech and video of each individual speaker rather than wide-range conference audio and video, and other sounds and image areas can now be treated as noise.

The conference terminal 5 according to the present embodiment takes advantage of the following observation: once the microphone array 114 has detected sound source directions and performed speaker tracking for a while, the positions of the conference participants become known, the directions in which the directivity of the microphone array 114 is pointed become fixed, and the microphones that are not used become apparent. Based on this, the microphones of the microphone array 114 are divided into microphones for obtaining the necessary voice and the other microphones.

By diverting the microphones other than those for obtaining the necessary voice to, for example, noise removal, each microphone of the microphone array 114 is used effectively and noise is reduced. This makes voice easier to hear at the conference terminal 5 at the connection destination and improves the conference environment.

[Second Embodiment]
Another embodiment of the conference terminal 5, which is an example of the information processing apparatus according to the present invention, is described below. Description of points that are the same as in the above embodiment is omitted as appropriate.

As sound collection control using the microphone array 114, the microphone that receives input at the highest sound pressure (voice level) among the microphones of the microphone array 114 may be set as the sound-collecting microphone 14a, and at least some of the other microphones may be set as non-sound-collecting microphones 14b.

FIG. 8 is an explanatory diagram of the sound collection control of the microphone array 114 by the voice input control unit 132 in the second embodiment. The example in FIG. 8 shows voice being input from the speaker S2 in the voice input direction I2.

In the present embodiment, among the microphones of the microphone array 114, the microphone B, which is closest to the speaker S2 and receives input at the highest sound pressure (voice level), is set as the sound-collecting microphone 14a. Input from the microphone located farthest from the sound-collecting microphone 14a (in the example of FIG. 8, the microphone F at the diagonally opposite position) can be judged highly likely to be noise, so this microphone is set as the non-sound-collecting microphone 14b and used as the noise-canceling microphone.

Although the example of FIG. 8 selects one microphone each as the sound-collecting microphone 14a and the non-sound-collecting microphone 14b, two or more microphones may be selected for each.
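The selection rule of the second embodiment can be sketched as follows. The sketch assumes, hypothetically, that the eight microphones A to H are evenly spaced around a circular array so that the farthest microphone is the one four positions away, and it uses RMS amplitude over one frame as a stand-in for sound pressure; the function names and sample values are illustrative only.

```python
import math

MICS = "ABCDEFGH"  # assumed evenly spaced around a circular array, in this order

def rms(samples):
    """Root-mean-square level of one frame, used here as the voice level."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def select_mics(frames):
    """Return (sound-collecting mic, non-sound-collecting mic) for one frame.

    frames maps each microphone name to a list of samples.
    """
    collecting = max(MICS, key=lambda m: rms(frames[m]))
    # On an evenly spaced circular array the farthest mic is diagonally opposite.
    opposite = MICS[(MICS.index(collecting) + len(MICS) // 2) % len(MICS)]
    return collecting, opposite

# Hypothetical frame: microphone B, nearest to speaker S2, is the loudest.
frames = {m: [0.1, -0.1, 0.1, -0.1] for m in MICS}
frames["B"] = [0.8, -0.8, 0.8, -0.8]
selected = select_mics(frames)
```

For this frame the sketch selects B for sound collection and F for noise cancellation, as in FIG. 8. Selecting two or more microphones for each role would amount to taking the top few levels and their opposite positions instead of a single maximum.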

The embodiments described above are preferred examples of carrying out the present invention, but the invention is not limited to them, and various modifications can be made without departing from the gist of the present invention.

1 Video conference system
3 Server
5 Conference terminal
101 CPU
102 ROM
103 RAM
104 Flash memory
105 SSD
106 Recording medium
107 Media drive
108 Operation unit
109 Power switch
110 Bus line
111 Network I/F
112 Camera
113 Image sensor I/F
114 Microphone array
14a Sound-collecting microphone
14b Non-sound-collecting microphone
115 Speaker
116 Audio input/output I/F
117 Display I/F
118 External device connection I/F
120 Display
120c Cable
130 Sound source direction detection unit
131 Imaging range control unit
132 Voice input control unit
N Network

Japanese Patent No. 5418327

Claims

An information processing apparatus comprising:
voice input means comprising a plurality of voice input units;
audio output means for outputting audio;
sound source direction detection means for detecting the input direction of voice to the voice input means based on the voice input to the plurality of voice input units; and
voice input control means for designating, according to a detection result of the sound source direction detection means, at least one of the voice input units as a voice input unit that is not used for collecting voice from a speaker.

The information processing apparatus according to claim 1, wherein the voice input control means designates at least one voice input unit corresponding to a direction that has not been detected as a voice input direction by the sound source direction detection means for a predetermined time as a voice input unit that is not used for collecting voice from a speaker.

The information processing apparatus according to claim 1, wherein the voice input control means uses a voice input unit that is not used for collecting voice from a speaker as a voice input unit for noise cancellation.

The information processing apparatus according to claim 1, wherein the voice input control means turns off voice input of a voice input unit that is not used for collecting voice from a speaker.

The information processing apparatus according to claim 1, wherein the sound source direction detecting means designates at least one voice input unit, including the voice input unit having the highest voice level among the voice inputs from the plurality of voice input units, as a voice input unit used for collecting voice from a speaker, and designates at least one of the other voice input units as a voice input unit that is not used for collecting voice from a speaker.

The information processing apparatus according to claim 5, wherein the sound source direction detecting means designates the voice input unit having the highest voice level among the voice inputs from the plurality of voice input units as the voice input unit used for collecting sound, and designates the voice input unit farthest from that voice input unit as a voice input unit that is not used for collecting sound.

The information processing apparatus according to claim 1, further comprising:
imaging means for imaging a predetermined range; and
an imaging range control unit that controls the imaging range of the imaging means according to a detection result of the sound source direction detection means.

A conference system comprising the information processing apparatus according to claim 1 as at least one of a plurality of conference terminals, wherein voice data is transmitted and received between the conference terminals.

A method for controlling an information processing apparatus that comprises voice input means comprising a plurality of voice input units and audio output means for outputting audio, the method comprising:
a sound source direction detection process of detecting the input direction of voice to the voice input means based on the voice input to the plurality of voice input units; and
a voice input control process of designating, according to a detection result of the sound source direction detection process, at least one of the voice input units as a voice input unit that is not used for collecting voice from a speaker.