JP2020036225A

JP2020036225A - Information processing apparatus, information processing method, and information processing program

Info

Publication number: JP2020036225A
Application number: JP2018161973A
Authority: JP
Inventors: 令治田中; Reiji Tanaka
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2020-03-05

Abstract

PROBLEM TO BE SOLVED: To detect a speech request from a terminal and perform speech control, without being affected by the connection method or terminal type, in a conference system that supports a plurality of connection methods and can be connected to a plurality of terminal types.
SOLUTION: An information processing apparatus according to the present invention decodes media signals received from a plurality of terminals, synthesizes one or more media signals selected from the plurality of media signals, and transmits the encoded synthesized media signal. The apparatus includes: speech request detecting means for analyzing each decoded media signal and detecting a predetermined speech request state on the basis of each media signal; and notification means for notifying, when the speech request state is detected by the speech request detecting means, a speech permission authority terminal of speech request information indicating that speech has been requested from the transmission source of the media signal in which the speech request state was detected.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus, an information processing method, and an information processing program, and can be applied to, for example, a conference system.

For example, a video conference system may provide a speech control function in which a participant sends a speech request to a chair terminal and the chair decides whether to allow the participant to speak (see Patent Document 1).

Conventionally, speech control in a video conference system relies on the event notification function available in the connection method (communication protocol) that the system adopts. For example, a conference terminal that wishes to request speech sends an event notification to the chair terminal, and the chair decides whether to permit the speech. When the chair permits it, the chair terminal notifies the conference server of a speech permission event.

Patent Document 1: JP 2006-033657 A

Many connection methods (communication protocols) are now available for video conference systems, and there are also many kinds of conference terminals. For example, a conference terminal may connect through a gateway to a conference server that uses a different connection method (communication protocol). A conference terminal may also be a mobile terminal such as a smartphone or tablet rather than a dedicated terminal. Video conference systems are therefore becoming able to accommodate a plurality of connection methods and a wide variety of conference terminals.

However, when terminals using different connection methods, or many different kinds of conference terminals, are connected to a video conference system, the following problem can arise: speech requests from conference terminals that use a different connection method (communication protocol), or from conference terminals of a different kind, cannot be notified to the chair terminal.

There is therefore a demand for an information processing apparatus, an information processing method, and an information processing program that, in a conference system supporting a plurality of connection methods and connectable to a plurality of terminal types, can detect a speech request from a terminal and perform speech control without being affected by the connection method or terminal type.

To solve this problem, an information processing apparatus according to a first aspect of the present invention decodes media signals received from each of a plurality of terminals, synthesizes one or more media signals selected from the plurality of media signals, and transmits the encoded synthesized media signal to the plurality of terminals. The apparatus is characterized by comprising: (1) speech request detecting means for analyzing each decoded media signal and detecting a predetermined speech request state on the basis of each media signal; and (2) notification means for notifying, when the speech request state is detected by the speech request detecting means, a preset speech permission authority terminal of speech request information indicating that speech has been requested from the transmission source of the media signal in which the speech request state was detected.

An information processing method according to a second aspect of the present invention decodes media signals received from each of a plurality of terminals, synthesizes one or more media signals selected from the plurality of media signals, and transmits the encoded synthesized media signal to the plurality of terminals. The method is characterized in that: (1) speech request detecting means analyzes each decoded media signal and detects a predetermined speech request state on the basis of each media signal; and (2) notification means, when the speech request state is detected by the speech request detecting means, notifies a preset speech permission authority terminal of speech request information indicating that speech has been requested from the transmission source of the media signal in which the speech request state was detected.

An information processing program according to a third aspect of the present invention decodes media signals received from each of a plurality of terminals, synthesizes one or more media signals selected from the plurality of media signals, and transmits the encoded synthesized media signal to the plurality of terminals. The program is characterized by causing a computer to function as: (1) speech request detecting means for analyzing each decoded media signal and detecting a predetermined speech request state on the basis of each media signal; and (2) notification means for notifying, when the speech request state is detected by the speech request detecting means, a preset speech permission authority terminal of speech request information indicating that speech has been requested from the transmission source of the media signal in which the speech request state was detected.

According to the present invention, a speech request from a terminal can be detected and speech control can be performed without being affected by the connection method or terminal type.

FIG. 1 is a block diagram showing the internal configurations of the speech request detection unit and the speech control operation unit according to the embodiment, and the relationship between them.
FIG. 2 is an overall configuration diagram of the conference system according to the embodiment.
FIG. 3 is an internal configuration diagram of the MCU according to the embodiment.
FIG. 4 is an explanatory diagram showing the overall operation of the audio signal processing unit according to the embodiment.
FIG. 5 is a flowchart showing the speech request detection process in the MCU according to the embodiment.
FIG. 6 is a block diagram showing the internal configuration of the synthesis unit of the image signal processing unit according to a modified embodiment.

(A) Main Embodiment
Embodiments of the information processing apparatus, information processing method, and information processing program according to the present invention are described in detail below with reference to the drawings.

This embodiment illustrates a case in which a video conference system is built using the present invention.

(A-1) Configuration of the Embodiment
(A-1-1) Overall Configuration
FIG. 2 is an overall configuration diagram of the conference system according to the embodiment.

In FIG. 2, the conference system 7 according to the embodiment comprises a conference server 1 and a plurality of (three in FIG. 2) conference terminals 5 (5-1 to 5-3).

The conference server 1 has a gatekeeper (GK) 2, which handles functions such as connection authorization and address translation for the conference terminals 5, and a multipoint control unit (MCU: Multipoint Control Unit, hereinafter "MCU") 3, which synthesizes the audio, video, and data obtained from conference terminals 5 at a plurality of sites and converts them into conference data.

Depending on the system configuration, the gatekeeper (GK) may be omitted in cases such as the following.

Such cases include, for example, a case where the conference server 1 does not use H.323 (described later), and a case where no intermediary is needed to store and manage the correspondence between IP addresses and telephone numbers when connecting to the destination conference terminal.

For example, when there are three or more participants (conference terminals 5), the conference system 7 uses a device called the MCU 3, which provides a conference server function. The MCU 3 receives audio and video from each conference terminal 5 (5-1 to 5-3), synthesizes the audio selected for speech, and distributes it to each conference terminal 5 (5-1 to 5-3). The MCU 3 also switches or composites video into the required format and distributes it to each conference terminal 5 (5-1 to 5-3).

The conference server 1 connects the plurality of conference terminals 5-1 to 5-3 and sets up a virtual conference. It receives the participants' audio, video, and data through the conference terminals 5-1 to 5-3 connected to the conference, synthesizes the video data and the audio according to the conference settings, and transmits the result to each of the conference terminals 5-1 to 5-3.

The conference server 1 can use a plurality of connection methods. Connection methods (communication protocols) for a conference system include, for example, SIP (Session Initiation Protocol), ITU-T Recommendation H.323, and Web conference services over the Internet (Internet telephone services).

Here, the Web conference service uses WebRTC (Web Real-Time Communication) and can interconnect with conference terminals using HTML and protocols such as HTTP, TCP/IP, and UDP/IP.

The conference server 1 supports a plurality of connection methods and can be connected to a wide variety of conference terminals 5.

In this embodiment, it is assumed, for example, that the conference terminals 5-1 and 5-2 use H.323 (or SIP) as the connection method (communication protocol) of the conference system and are participating in a conference on the conference server 1.

It is also assumed, for example, that the conference terminal 5-3 uses the connection method (communication protocol) of a Web conference service and is participating in the conference on the conference server 1. That is, the conference terminal 5-3 is connected to the conference server 1 via a gateway (GW) 6.

Each of the conference terminals 5-1 to 5-3 is used by a participant joining the conference at a site and is, for example, a device including a microphone, a speaker, a camera, a display, a control device, and the like. A dedicated conference terminal, a personal computer, a smartphone, a tablet terminal, a wearable terminal, a mobile terminal, or the like can be used as each of the conference terminals 5-1 to 5-3.

Each conference terminal 5 (5-1 to 5-3) may be, for example, a software terminal that provides various operation functions on a personal computer, or a terminal on which a hardware manufacturer has implemented its own operation method.

Although this embodiment illustrates a case in which three conference terminals 5-1 to 5-3 hold one conference, the number of conference terminals 5 taking part in one conference is not particularly limited.

[Speech control operation unit 4]
The speech control operation unit 4 is used to designate one or more participants as permitted to speak. It is one of the conference system functions provided by the conference server 1.

More specifically, one of the participants is chosen as the person who can grant other participants permission to speak (the speech permission decider), and the speech control operation unit 4 is enabled on that person's conference terminal 5.

In general, the chair of the conference is often the speech permission decider, so the conference terminal 5 of the speech permission decider is also called the "chair terminal" here. In FIG. 1, the conference terminal 5-1 is the chair terminal. The chair terminal is also called the "speech permission authority terminal".

This embodiment illustrates a case in which the speech control operation unit 4 is enabled by turning on a setting on the conference terminal 5 (5-1 to 5-3) of the speech permission decider, but the invention is not limited to this.

The chair terminal 5-1 can accept a speech request from the conference terminal 5 of a participant who wishes to speak. At the chair terminal 5-1 that has received the speech request, the speech permission decider decides whether to permit that participant to speak, taking the progress of the conference and other factors into account.

To permit the participant to speak, the speech permission decider designates that participant on the chair terminal 5-1 by a predetermined designation method. The chair terminal 5-1 then transmits information designating that participant's conference terminal 5 (speech designation information) to the MCU 3.

On receiving the speech designation information from the chair terminal 5-1, the MCU 3 adds the audio (and optionally the video) from the designated conference terminal 5 to the signals that are synthesized and transmitted to all conference terminals 5 (5-1 to 5-3). In other words, the voice of a participant who was not previously permitted to speak is now included in the synthesized conference audio, and the other participants can hear it. In the following, the authority to speak is also called the right to speak.
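The handling of speech designation information described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the class `FloorControl`, its method names, and the rule that designations from non-chair terminals are ignored are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class FloorControl:
    """Tracks which terminals hold the right to speak (hypothetical helper)."""

    chair_id: str
    permitted: set[str] = field(default_factory=set)

    def handle_designation(self, sender_id: str, target_id: str) -> bool:
        """Apply speech designation information sent by a terminal.

        Assumption: only the chair terminal may grant the right to speak.
        """
        if sender_id != self.chair_id:
            return False  # ignore designations from non-chair terminals
        self.permitted.add(target_id)
        return True

    def mix_targets(self, connected: list[str]) -> list[str]:
        """Terminals whose audio is included in the conference mix."""
        return [t for t in connected if t in self.permitted]


floor = FloorControl(chair_id="5-1", permitted={"5-1"})
floor.handle_designation("5-1", "5-3")
print(floor.mix_targets(["5-1", "5-2", "5-3"]))  # ['5-1', '5-3']
```

Keeping the permitted set in one place means the same state can drive both the mixing decision and the choice of which signals to examine for speech requests.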

Conventionally, a conference terminal is equipped with a speech request processing unit that transmits a speech request to the chair terminal when a participant wishes to speak. A speech request may be sent directly from the requesting terminal to the chair terminal or indirectly via the MCU; both cases are covered here.

However, once a system supports a plurality of connection methods and holds conferences among a wide variety of conference terminals, it is difficult to unify the way speech is requested because the connection methods (communication protocols) differ, and some conference terminals may have no speech request processing unit at all.

In this embodiment, therefore, the MCU 3 of the conference server 1 detects, on the basis of the audio signal received from each conference terminal 5 (5-1 to 5-3), a state in which a participant wishes to speak or has spoken, and notifies the speech control operation unit 4 of the chair terminal 5-1 of the speech request.
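As a rough illustration of this server-side approach, the sketch below examines one decoded PCM frame per terminal and notifies the chair when a terminal without the right to speak is producing audible sound. The RMS level test, the threshold value, and the names `rms_level`, `detect_and_notify`, and `notify_chair` are assumptions made for this example; the actual detection criteria are those of the speech request detection unit 13.

```python
import math
from typing import Callable, Sequence


def rms_level(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_and_notify(
    frames: dict[str, Sequence[int]],
    permitted: set[str],
    notify_chair: Callable[[str], None],
    threshold: float = 1000.0,  # hypothetical level threshold
) -> list[str]:
    """Notify the chair of each non-permitted terminal whose frame is loud enough."""
    requesters = []
    for terminal_id, frame in frames.items():
        if terminal_id in permitted:
            continue  # this terminal already holds the right to speak
        if rms_level(frame) >= threshold:
            notify_chair(terminal_id)
            requesters.append(terminal_id)
    return requesters


frames = {"5-1": [3000, -3000], "5-2": [10, -12], "5-3": [2500, -2600]}
detect_and_notify(frames, permitted={"5-1"}, notify_chair=print)  # prints 5-3
```

Because the check runs on decoded audio inside the server, it does not depend on which protocol or terminal type delivered the signal.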

(A-1-2) Internal Configuration of the MCU 3
FIG. 3 is an internal configuration diagram of the MCU 3 according to the embodiment.

In FIG. 3, the MCU 3 according to the embodiment broadly comprises an audio signal processing unit 10, a video signal processing unit 20, and a control unit 30.

Although the MCU 3 is illustrated as having the audio signal processing unit 10 and the video signal processing unit 20, it may also have a data signal processing unit and the like. The MCU 3 is built by installing a processing program (for example, the information processing program according to this embodiment) on an information processing apparatus such as a personal computer. The MCU 3 is not limited to a single information processing apparatus; the processing units in FIG. 3 may be distributed across a plurality of information processing apparatuses. Some or all of the components of the audio signal processing unit 10 and the video signal processing unit 20 may be implemented in hardware.

[Control unit 30]
The control unit 30 is a processing unit or device that controls the functions of the audio signal processing unit 10 and the video signal processing unit 20 of the MCU 3 and governs functions such as holding conferences and encoding. The control unit 30 also controls the bandwidth of the communication line between the conference server 1 and each conference terminal 5 (5-1 to 5-3).

[Audio signal processing unit 10]
The audio signal processing unit 10 decodes the audio (encoded audio signals) received from each of the conference terminals 5-1 to 5-3, synthesizes the audio selected for synthesis on the basis of an instruction from the control unit 30, encodes the synthesized signal, and transmits it to the conference terminals 5-1 to 5-3.

The audio signal processing unit 10 has audio receiving units 11 (11-1 to 11-3), decoding units 12 (12-1 to 12-3), a synthesis unit 17, an encoding unit 15, and audio transmission units 16 (16-1 to 16-3).

Because the MCU 3 may host one or more conferences, it virtually creates one or more conference rooms on the conference system, matching the number of conferences being held.

For each conference room hosted on the MCU 3, as many audio receiving units 11, decoding units 12, and audio transmission units 16 are created as there are conference terminals 5 connected to that room.

Each audio receiving unit 11 (11-1 to 11-3) receives an audio signal from its corresponding conference terminal 5 and supplies the received audio signal to the decoding unit 12 (12-1 to 12-3).

Each decoding unit 12 (12-1 to 12-3) decodes the audio signal received from the audio receiving unit 11 (11-1 to 11-3) and supplies it to the synthesis unit 17. Between the MCU 3 and each conference terminal 5 (5-1 to 5-3), audio signals are compression-encoded before transmission and reception in order to reduce the amount of information and conserve bandwidth. The compression encoding method for the audio signal is not particularly limited. For example, G.711 (an ITU-T standard), G.722, MPEG-4 AAC-LD, and the like can be supported; more specifically, PCM (pulse code modulation), ADPCM (adaptive differential PCM), and the like can be used.

The decoding units 12 (12-1 to 12-3) always decode the audio signals received from the conference terminals 5 and supply them to the synthesis unit 17. That is, the decoding units 12 decode not only the audio signals from conference terminals 5 that have been given the right to speak (are permitted to speak) but also the audio signals from conference terminals 5 that have not been given the right to speak (are not permitted to speak).

On receiving the decoded audio signals (PCM signals) from the decoding units 12 (12-1 to 12-3), the synthesis unit 17 selects one or more audio signals to be synthesized, synthesizes the selected audio signals, and supplies the result to the encoding unit 15. The synthesis unit 17 has a speech request detection unit 13 and a synthesis target audio selection unit 14.

On receiving the decoded audio signals (PCM signals) from the decoding units 12 (12-1 to 12-3), the speech request detection unit 13 detects, on the basis of the audio signal from each conference terminal 5, whether a participant is requesting to speak. The speech request detection unit 13 is described in detail later.

The synthesis target audio selection unit 14 selects, from the audio signals (PCM signals) decoded by the decoding units 12 (12-1 to 12-3), the audio signals to be synthesized, and supplies the audio signals selected for synthesis to the encoding unit 15.
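Selecting and synthesizing decoded PCM signals can be sketched as below. The text does not specify the mixing arithmetic, so sample-wise summation with clipping to the 16-bit range is an assumption, as are the function name `mix_pcm` and its arguments.

```python
def mix_pcm(signals: dict[str, list[int]], selected: list[str]) -> list[int]:
    """Sum the selected 16-bit PCM signals sample by sample, clipping to int16.

    `signals` maps a terminal id to its decoded frame; `selected` lists the
    terminal ids chosen as synthesis targets (both names are hypothetical).
    """
    chosen = [signals[t] for t in selected if t in signals]
    if not chosen:
        return []
    length = min(len(s) for s in chosen)  # mix only the overlapping samples
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in chosen)
        mixed.append(max(-32768, min(32767, total)))  # clip to the int16 range
    return mixed


decoded = {"5-1": [100, 30000], "5-2": [7, 7], "5-3": [-50, 30000]}
print(mix_pcm(decoded, ["5-1", "5-3"]))  # [50, 32767]
```

Note that the signal from terminal 5-2 is decoded but left out of the mix, which mirrors the case of a terminal without the right to speak.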

The encoding unit 15 receives the audio signal synthesized from the signals chosen by the synthesis target audio selection unit 14 of the synthesis unit 17, encodes it, and supplies the encoded audio signal to the audio transmission units 16 (16-1 to 16-3).

Any number of encoding units 15 may be created for one conference. This is because the conference server 1 of this embodiment supports a plurality of connection methods (communication protocols), and different communication protocols may use different encoding methods. Accordingly, as many encoding units 15 as there are encoding methods to be supported may be created in accordance with an instruction from the control unit 30.
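The idea of creating one encoding unit per encoding method, rather than one per terminal, can be sketched as follows. The function `encode_per_codec` and the codec labels are hypothetical, and the two stand-in encoders merely pack samples with different byte orders; they are not implementations of G.711, G.722, or any real codec.

```python
import struct
from typing import Callable

Encoder = Callable[[list[int]], bytes]


def encode_per_codec(
    mixed: list[int],
    terminal_codecs: dict[str, str],
    encoders: dict[str, Encoder],
) -> dict[str, bytes]:
    """Encode the mixed signal once per distinct codec, then fan out per terminal."""
    encoded_by_codec = {
        codec: encoders[codec](mixed) for codec in set(terminal_codecs.values())
    }
    return {t: encoded_by_codec[c] for t, c in terminal_codecs.items()}


# Stand-in encoders (assumption, not real codecs): little- and big-endian packing.
ENCODERS: dict[str, Encoder] = {
    "codec-a": lambda pcm: struct.pack(f"<{len(pcm)}h", *pcm),
    "codec-b": lambda pcm: struct.pack(f">{len(pcm)}h", *pcm),
}

out = encode_per_codec(
    [1, -2],
    {"5-1": "codec-a", "5-2": "codec-a", "5-3": "codec-b"},
    ENCODERS,
)
print(out["5-1"] == out["5-2"], out["5-1"] != out["5-3"])  # True True
```

With three terminals and two encoding methods, only two encoding passes are needed, which is the saving the paragraph above points to.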

The encoding methods applied to the audio signal were covered in the description of the decoding unit 12, so their description is omitted here.

Each audio transmission unit 16 (16-1 to 16-3) transmits the audio signal encoded by the encoding unit 15 to its corresponding conference terminal 5 (5-1 to 5-3).

[Video signal processing unit 20]
The video signal processing unit 20 decodes the video (encoded video signals) received from each of the conference terminals 5-1 to 5-3, converts the video format and composites the video on the basis of an instruction from the control unit 30, and transmits the result to the conference terminals 5-1 to 5-3.

The video signal processing unit 20 has video receiving units 21 (21-1 to 21-3), decoding units 22 (22-1 to 22-3), a synthesis unit 27, an encoding unit 25, and video transmission units 26 (26-1 to 26-3).

For each conference hosted on the MCU 3, as many video receiving units 21, decoding units 22, and video transmission units 26 are created as there are conference terminals 5 connected to it.

Each video receiving unit 21 (21-1 to 21-3) receives a video signal from its corresponding conference terminal 5 and supplies the received video signal to the decoding unit 22 (22-1 to 22-3). Between the MCU 3 and each conference terminal 5 (5-1 to 5-3), video signals are compression-encoded before transmission and reception in order to reduce the amount of information and conserve bandwidth. The compression encoding method for the video signal is not particularly limited. For example, H.261, H.263, MPEG-2, MPEG-4, H.264, H.265, and the like can be used.

Each decoding unit 22 (22-1 to 22-3) decodes the video signal received from the video receiving unit 21 (21-1 to 21-3) and supplies it to the synthesis unit 23.

The synthesis unit 23 converts the video signals from the decoding units 22 (22-1 to 22-3) into the required video format, composites the video, and supplies it to the encoding unit 25. Normally, one synthesis unit 23 is created for one conference, but a plurality of synthesis units 23 may be created for one conference.

The encoding unit 25 encodes the video signal from the synthesis unit 23 and supplies it to the video transmission units 26 (26-1 to 26-3).

Any number of encoding units 25 may be created for one conference. This is because the conference server 1 of this embodiment supports a plurality of connection methods (communication protocols), and different communication protocols may use different encoding methods. Accordingly, as many encoding units 25 as there are encoding methods to be supported may be created in accordance with an instruction from the control unit 30.

The encoding schemes applied to the video signal were already covered in the detailed description of the decoding unit 22, so the description is omitted here.

映像送信部２６（２６−１〜２６−３）は、符号化部２５による符号化された映像信号を、対応する会議端末５（５−１〜５−３）に送信する。 The video transmission unit 26 (26-1 to 26-3) transmits the video signal encoded by the encoding unit 25 to the corresponding conference terminal 5 (5-1 to 5-3).

(A-1-3) Internal Configuration of the Speech Request Detection Unit 13
FIG. 1 is a block diagram showing the internal configurations of the speech request detection unit 13 and the speech control operation unit 4 according to the embodiment, and the relationship between the two.

In FIG. 1, the speech request detection unit 13 includes a speech request target selection unit 131, a level detection unit 132, a DTMF detection unit 133, a voice recognition detection unit 134, a keyword storage unit 135, and a speech request detection unit 136.

Here, the "speech request state" means a state, found by analyzing media data for which speaking is not permitted (an audio signal in this embodiment), in which a participant is speaking or wants to speak.

Based on the speech permission information from the control unit 30 of the MCU 3, the speech request target selection unit 131 selects, from among the audio signals from the decoding units 12 (12-1 to 12-3), the audio signals to be examined for the speech request state. That is, the speech request target selection unit 131 selects the audio signals of participants who are not permitted to speak.

More specifically, the speech permission information contains the identification information of the conference terminals 5 that are currently permitted to speak. Based on the speech permission information, the speech request target selection unit 131 therefore selects the audio signals of the conference terminals 5 other than those permitted to speak. In this way, the audio of participants other than those holding the floor is selected during the conference, and from the selected audio it can be detected whether a participant is speaking or wants to speak.
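The selection described above amounts to filtering the decoded signals by terminal identifier. The following is a minimal sketch; the function name `select_request_targets` and the dictionary layout of the decoded audio are assumptions made for illustration.

```python
def select_request_targets(decoded_audio, permitted_ids):
    """Return the decoded signals of terminals NOT permitted to speak.

    `decoded_audio` maps a terminal ID to its decoded samples;
    `permitted_ids` plays the role of the speech permission information.
    """
    permitted = set(permitted_ids)
    return {tid: pcm for tid, pcm in decoded_audio.items() if tid not in permitted}


decoded = {"5-1": [0, 1, 2], "5-2": [3, 4, 5], "5-3": [6, 7, 8]}
targets = select_request_targets(decoded, permitted_ids=["5-1"])
```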

The level detection unit 132 detects the speech request state based on the audio level of the audio signal selected by the speech request target selection unit 131. The audio level can be expected to rise when a participant speaks. The level detection unit 132 therefore monitors the audio level of the audio signal and treats a rise in the audio level as the speech request state. More specifically, the level detection unit 132 may monitor the time-series change of the audio level of each conference terminal 5 and determine the speech request state when the slope of that change exceeds a threshold, that is, when the level rises steeply. Alternatively, the speech request state may be determined when the audio level of a conference terminal 5 exceeds one or more thresholds.
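The two criteria above (absolute level and slope of the level over time) can be sketched as follows. The embodiment does not define how the audio level is measured; using the RMS value of each frame is an assumption, as are the function names and the threshold values.

```python
import math


def rms_level(frame):
    """Root-mean-square level of one frame of samples (assumed level measure)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_by_level(frames, level_threshold=None, slope_threshold=None):
    """Return True if any frame exceeds the level threshold, or if the level
    rises by more than `slope_threshold` between consecutive frames."""
    levels = [rms_level(f) for f in frames]
    if level_threshold is not None and any(lv > level_threshold for lv in levels):
        return True
    if slope_threshold is not None:
        for prev, cur in zip(levels, levels[1:]):
            if cur - prev > slope_threshold:
                return True
    return False


quiet = [0] * 160
loud = [1000] * 160
spoke = detect_by_level([quiet, quiet, loud], level_threshold=500)
```

Either criterion can be disabled by leaving its threshold as `None`, which mirrors the text's "may" wording for the two alternatives.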

The DTMF detection unit 133 detects the speech request state when it detects a DTMF (dual-tone multi-frequency) signal, that is, a signal tone combining frequency bands, in the audio signal selected by the speech request target selection unit 131. For example, when the conference terminals 5 are equipped with DTMF senders, this can be realized by having the participants agree in advance to send some DTMF signal when requesting to speak. A participant requesting to speak then operates the DTMF sender, an audio signal containing the DTMF signal is transmitted to the MCU 3, and the DTMF detection unit 133 detects the DTMF signal and thereby the speech request state.
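The embodiment does not state how the DTMF signal is detected in the decoded audio. One widely used technique is the Goertzel algorithm, which measures the signal power at each of the eight standard DTMF frequencies. The sketch below assumes 8 kHz sampling and an illustrative acceptance ratio of 0.1; the function names and these parameter values are hypothetical.

```python
import math

LOW_FREQS = (697, 770, 852, 941)        # standard DTMF row frequencies (Hz)
HIGH_FREQS = (1209, 1336, 1477, 1633)   # standard DTMF column frequencies (Hz)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Squared magnitude of the component of `samples` at `freq`."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_dtmf(samples, sample_rate=8000, min_ratio=0.1):
    """Return the DTMF key present in `samples`, or None."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return None
    scale = len(samples) * energy  # an ideal dual tone yields about 0.25 per tone

    def strongest(freqs):
        ratios = [goertzel_power(samples, f, sample_rate) / scale for f in freqs]
        best = max(range(len(freqs)), key=ratios.__getitem__)
        return best, ratios[best]

    row, row_ratio = strongest(LOW_FREQS)
    col, col_ratio = strongest(HIGH_FREQS)
    if row_ratio < min_ratio or col_ratio < min_ratio:
        return None  # both a row tone and a column tone must be present
    return KEYPAD[row][col]


def dtmf_tone(key, sample_rate=8000, n=400):
    """Generate a synthetic tone for `key` (helper for trying the detector)."""
    for r, keys in enumerate(KEYPAD):
        if key in keys:
            f1, f2 = LOW_FREQS[r], HIGH_FREQS[keys.index(key)]
            return [math.sin(2 * math.pi * f1 * i / sample_rate)
                    + math.sin(2 * math.pi * f2 * i / sample_rate)
                    for i in range(n)]
    raise ValueError(key)


key = detect_dtmf(dtmf_tone("5"))
```

Because the detector works on decoded samples, it does not depend on how a terminal signals DTMF at the protocol level, which matches the stated aim of being independent of the connection method.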

The voice recognition detection unit 134 performs voice recognition processing on the audio signal selected by the speech request target selection unit 131 and detects the speech request state based on the recognition result. The keyword storage unit 135 is a corpus (dictionary) that stores preset messages. The keyword storage unit 135 stores at least keywords indicating that a participant is requesting to speak. Using a voice recognition algorithm, the voice recognition detection unit 134 converts the audio signal into sound waves, identifies phonemes, and performs voice recognition by pattern matching. Referring to the keyword storage unit 135, the voice recognition detection unit 134 matches the obtained words against the keywords to determine whether the participant is requesting to speak. The voice recognition algorithm is not particularly limited.

Any of the detection methods of the level detection unit 132, the DTMF detection unit 133, and the voice recognition detection unit 134 can be designated. A single detection method or a plurality of detection methods may be designated. Various ways of selecting the detection method are possible. For example, the detection method may be set in advance, or a conference terminal 5 may designate the detection method through an operation by a participant or the chair.
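One way to picture the designation of one or more detection methods is a table of detector functions plus a list of enabled names. The text does not say how results are combined when several methods are designated; the sketch below assumes that any single method firing is sufficient. All names are hypothetical, and the detectors are stand-ins rather than real signal analysis.

```python
def detect_speech_request(signal, detectors, enabled):
    """Return the names of the enabled methods that report a speech request.

    An empty list means no speech request state was detected
    (assumed OR-combination of the designated methods).
    """
    return [name for name in enabled if detectors[name](signal)]


# Stand-in detectors operating on a simple dict that describes the signal.
detectors = {
    "level": lambda sig: sig["level"] > 500,
    "dtmf": lambda sig: sig["dtmf_key"] is not None,
    "voice": lambda sig: "chair" in sig["text"].lower(),
}

signal = {"level": 800, "dtmf_key": None, "text": "hello"}
fired = detect_speech_request(signal, detectors, enabled=["level", "dtmf"])
```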

Having the level detection unit 132, the DTMF detection unit 133, or the voice recognition detection unit 134 monitor the audio signals (decoded signals) of participants not permitted to speak and detect the speech request state from those signals provides the following advantages.

上述したように、復号部１２（１２−１〜１２−３）は、会議端末５（５−１〜５−３）から受信した音声信号を常時復号している。つまり、復号部１２は、発言が許可されている参加者（会議端末５）の音声だけでなく、発言が許可されていない参加者（会議端末５）からの音声も常時復号している。そして、従来、発言が許可されていない参加者の復号音声については廃棄されている。これは、発言が許可されていない参加者の復号音声は、合成対象とする音声信号ではないためである。 As described above, the decoding unit 12 (12-1 to 12-3) constantly decodes the audio signal received from the conference terminal 5 (5-1 to 5-3). That is, the decoding unit 12 always decodes not only the voice of the participant (conference terminal 5) to which the speech is permitted, but also the voice of the participant (the conference terminal 5) to which the speech is not permitted. Conventionally, the decoded voice of a participant whose speech is not permitted is discarded. This is because the decoded voice of the participant whose speech is not permitted is not a voice signal to be synthesized.

しかし、この実施形態では、従来、合成対象としないために廃棄していた、発言が許可されていない参加者の復号音声を、レベル検出部１３２、ＤＴＭＦ検出部１３３及び音声認識検出部１３４が監視して、当該参加者の発言要求状態を検出するようにしている。 However, in this embodiment, the level detection unit 132, the DTMF detection unit 133, and the speech recognition detection unit 134 monitor the decoded voice of the participant whose speech is not permitted, which has conventionally been discarded because it is not targeted for synthesis. Then, the participant's comment request state is detected.

From another point of view, the decoding units 12 have conventionally decoded at all times even audio that is not used for synthesis, which incurs processing load while leaving the decoded audio unused. According to this embodiment, the continuously decoded audio is put to effective use, and implementing this scheme does not increase the processing load of the decoding units 12.

When the level detection unit 132, the DTMF detection unit 133, or the voice recognition detection unit 134 detects the speech request state, the speech request detection unit 136 notifies the speech request receiving unit 41 of the speech control operation unit 4 of a speech request signal. A speech request signal can thus be delivered to the speech control operation unit 4 even when the connection method, the type of conference terminal 5, and so on differ.

As shown in FIG. 1, the speech control operation unit 4 according to the embodiment includes a speech request receiving unit 41, a speech request display unit 42, and a speech designation unit 43.

The speech request receiving unit 41 receives the speech request signal from the MCU 3. As described above, when the speech request detection unit 13 of the MCU 3 detects the speech request state, the speech request receiving unit 41 receives the speech request signal from the speech request detection unit 13.

The speech request display unit 42 displays, on the display (display unit) of the terminal 5 on which the speech control operation unit 4 is installed (that is, the chair terminal), an indication that a speech request has been made. The chair is thereby informed that another participant wants to speak.

When the chair permits a participant who is requesting to speak to do so, the speech designation unit 43 notifies the MCU 3 of speech designation information indicating that the participant is permitted to speak. The speech designation information includes, for example, the identification information of that participant's conference terminal 5. Based on the speech designation information received from the control unit 30 of the MCU 3, the synthesis target voice selection unit 14 can then add the audio signal of the previously non-permitted participant to the synthesis targets.

(A-2) Operation of the Embodiment
[Overall Operation of the Conference Server (Audio Signal Processing Unit)]
First, the audio signal processing operation in the conference server 1 is described below with reference to the drawings.

図４は、この実施形態に係る音声信号処理部１０の全体の動作について示した説明図である。 FIG. 4 is an explanatory diagram showing the overall operation of the audio signal processing unit 10 according to this embodiment.

When calls are established between the conference server 1 and the conference terminals 5-1 to 5-3, the MCU 3 notifies each of the conference terminals 5-1 to 5-3 of the destination information for audio signals (S11). The destination information includes the server address and communication port number of the audio receiving units 11-1 to 11-3 of the audio signal processing unit 10; these are the values that the control unit 30 assigned to the audio signal processing unit 10 when the call was established. After the calls are established, each of the conference terminals 5-1 to 5-3 transmits its audio signal to the audio signal processing unit 10.

In the audio signal processing unit 10, audio receiving units 11-1 to 11-3, decoding units 12-1 to 12-3, and audio transmission units 16-1 to 16-3 are created for the respective conference terminals 5-1 to 5-3 (S12). Here, since the conference server 1 establishes calls with the three conference terminals 5-1 to 5-3, the audio receiving units 11-1 to 11-3, the decoding units 12-1 to 12-3, and the audio transmission units 16-1 to 16-3 corresponding to the respective conference terminals are created.

The audio signals from the conference terminals 5-1 to 5-3 are received by the audio receiving units 11-1 to 11-3, and the received audio signals are decoded by the decoding units 12-1 to 12-3 to obtain decoded audio (S13).

The decoded audio signals are sent to the synthesizing unit 17, which synthesizes them (S14). At this time, the control unit 30 supplies the synthesizing unit 17 with information on the synthesis targets, and the synthesizing unit 17 synthesizes the audio signals so designated. That is, speech permission information containing the identification information of the conference terminals permitted to speak is sent from the control unit 30 to the synthesizing unit 17, and the synthesizing unit 17 synthesizes the audio signals from the conference terminals 5 that are permitted to speak.

そして、合成された音声信号は符号化部１５に与えられ、符号化部１５により音声信号が符号化される（Ｓ１５）。 Then, the synthesized audio signal is provided to the encoding unit 15, and the audio signal is encoded by the encoding unit 15 (S15).

The audio signal encoded by the encoding unit 15 is sent to the corresponding audio transmission units 16-1 to 16-3 in accordance with an instruction from the control unit 30 and transmitted to the conference terminals 5-1 to 5-3 (S16). Until the conference ends, the processing repeats from step S13 described above.
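The synthesis step S14 within this loop can be sketched as mixing only the decoded signals of the permitted terminals. The embodiment does not specify the mixing operation; summing samples and clipping to the 16-bit range is an assumption, and the function name `mix_permitted` is hypothetical. Decoding (S13) and encoding (S15) are codec-specific and are left out.

```python
def mix_permitted(decoded, permitted_ids, limit=32767):
    """Mix the decoded samples of the permitted terminals (sketch of S14).

    Samples are summed position by position and clipped to the signed
    16-bit range; signals of non-permitted terminals are ignored here.
    """
    selected = [decoded[tid] for tid in permitted_ids if tid in decoded]
    if not selected:
        return []
    length = min(len(pcm) for pcm in selected)
    mixed = []
    for i in range(length):
        total = sum(pcm[i] for pcm in selected)
        mixed.append(max(-limit - 1, min(limit, total)))
    return mixed


decoded = {
    "5-1": [1000, -2000, 30000],
    "5-2": [500, 500, 10000],
    "5-3": [9, 9, 9],          # not permitted, so not mixed
}
mixed = mix_permitted(decoded, permitted_ids=["5-1", "5-2"])
```

The signal of terminal 5-3 is still decoded in this picture; it is simply excluded from the mix, which is the signal that the speech request detection later examines.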

[Speech Request Detection Processing]
Next, the speech request detection processing in the MCU 3 according to the embodiment is described in detail with reference to the drawings.

FIG. 5 is a flowchart showing the speech request detection processing in the MCU 3 according to the embodiment.

As described for S14 of FIG. 4, the synthesizing unit 17 is notified of the speech permission information by the control unit 30, and the synthesis target voice selection unit 14 of the synthesizing unit 17 synthesizes the synthesis-target audio signals based on the speech permission information.

このような合成対象音声選択部１４による音声合成処理と共に、発言要求検出部１３は、発言許可されていない参加者の音声信号を監視している。 Along with such speech synthesis processing by the synthesis target speech selection unit 14, the speech request detection unit 13 monitors the speech signal of a participant whose speech is not permitted.

なお、発言要求状態の検出方法は、音声レベル方式、ＤＴＭＦ検出方式、音声認識方式のいずれか又はこれらを組み合わせたものが指定される。 As the method of detecting the speech request state, any one of a voice level method, a DTMF detection method, and a voice recognition method or a combination thereof is designated.

図５において、発言要求対象選択部１３１は、発言許可情報に基づいて、発言許可されていない参加者（会議端末５）の音声信号を選択する（Ｓ１０１）。 In FIG. 5, the statement request target selection unit 131 selects a voice signal of a participant (conference terminal 5) not permitted to speak, based on the statement permission information (S101).

[Audio Level Method]
The operation when the audio level method is designated as the method of detecting the speech request state is described.

The level detection unit 132 monitors the audio signal selected by the speech request target selection unit 131. The level detection unit 132 compares the audio level value of the input audio signal with a threshold and determines whether the audio level value is greater than the threshold (S102).

そして、音声レベル値が閾値より大きい場合、レベル検出部１３２は、発言要求状態を検出し、Ｓ１０６に移行する。音声レベル値が閾値以下の場合、Ｓ１０１に戻り、レベル検出部１３２は音声信号の監視を続ける。 Then, when the voice level value is larger than the threshold, the level detection unit 132 detects the speech request state, and proceeds to S106. If the audio level value is equal to or less than the threshold, the process returns to S101, and the level detection unit 132 continues monitoring the audio signal.

For example, when a participant who is not permitted to speak says something while the conference is in progress, or says something like "Chair, please let me speak," the audio level of the audio signal from the conference terminal 5 at that site rises. Because this participant's audio is not a synthesis target, it is not included in the synthesized audio at this point, and the other participants cannot hear it. Even so, the audio level value exceeds the threshold, so the speech request state is detected and a speech request is notified to the speech control operation unit 4.

Although the case of comparing the audio level value of the audio signal with a threshold has been illustrated here, the time-series change of the audio level may instead be monitored and the slope of that change compared with a threshold, with the speech request state being detected when the slope is greater than the threshold. Alternatively, the speech request state may be detected when, in the time series of audio levels, the audio level exceeds the average audio level over a fixed period, which then serves as the threshold.
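The last variant, in which the average level over a fixed period acts as the threshold, can be sketched with a sliding window. The window length, the multiplicative margin, and the class name `AverageLevelDetector` are assumptions chosen for illustration; the text itself only says that the level is compared with the average.

```python
from collections import deque


class AverageLevelDetector:
    """Flags a level that exceeds the running average of recent levels."""

    def __init__(self, window=50, margin=2.0):
        self.history = deque(maxlen=window)  # most recent `window` levels
        self.margin = margin                 # assumed safety factor over the average

    def update(self, level):
        """Feed one level value; return True if it exceeds margin x average."""
        triggered = False
        if len(self.history) == self.history.maxlen:
            average = sum(self.history) / len(self.history)
            triggered = level > average * self.margin
        self.history.append(level)
        return triggered


detector = AverageLevelDetector(window=3, margin=2.0)
results = [detector.update(level) for level in (10, 10, 10, 12, 50)]
```

No decision is made until the window is full, so a terminal is not flagged merely because its first frames are louder than silence.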

[DTMF Detection Method]
The operation when the DTMF detection method is designated as the method of detecting the speech request state is described.

ＤＴＭＦ検出部１３３は、発言要求対象選択部１３１により選択された音声信号を監視する。ＤＴＭＦ検出部１３３は、入力された音声信号に、周波数帯の合成信号音（ＤＴＭＦ信号）が含まれているか否かを判断する（Ｓ１０３）。 The DTMF detection unit 133 monitors the audio signal selected by the speech request target selection unit 131. The DTMF detection unit 133 determines whether or not the input audio signal includes a synthesized signal sound (DTMF signal) in a frequency band (S103).

そして、ＤＴＭＦ信号が含まれているとき、ＤＴＭＦ検出部１３３は、発言要求状態を検出し、Ｓ１０６に移行する。ＤＴＭＦ信号が含まれていないとき、Ｓ１０１に戻り、ＤＴＭＦ検出部１３３は音声信号の監視を続ける。 Then, when the DTMF signal is included, the DTMF detection unit 133 detects the speech request state, and proceeds to S106. When the DTMF signal is not included, the process returns to S101, and the DTMF detection unit 133 continues monitoring the audio signal.

[Voice Recognition Method]
The operation when the voice recognition method is designated as the method of detecting the speech request state is described.

音声認識検出部１３４は、発言要求対象選択部１３１により選択された音声信号を監視する。音声認識検出部１３４は、音声認識アルゴリズムにより、音声信号を音波に変換し、音素を特定して、パターンマッチングにより音声認識を行なう（Ｓ１０４）。音声認識検出部１３４は、キーワード記憶部１３５を参照して、得られた単語とキーワードとをマッチングする（Ｓ１０５）。 The speech recognition detection unit 134 monitors the speech signal selected by the speech request target selection unit 131. The voice recognition detection unit 134 converts a voice signal into a sound wave by a voice recognition algorithm, specifies a phoneme, and performs voice recognition by pattern matching (S104). The speech recognition detection unit 134 matches the obtained word with the keyword by referring to the keyword storage unit 135 (S105).

そして、キーワードが検出されると、音声認識検出部１３４は、発言要求状態を検出し、Ｓ１０６に移行する。キーワードが検出されないとき、Ｓ１０１に戻り、音声認識検出部１３４は音声信号の監視を続ける。 Then, when the keyword is detected, the voice recognition detection unit 134 detects the state of the utterance request, and proceeds to S106. When no keyword is detected, the process returns to S101, and the voice recognition detection unit 134 continues monitoring the voice signal.

For example, keywords requesting to speak, such as "Chair" and "please let me speak," are stored in the keyword storage unit 135. Then, when a participant who is not permitted to speak says something like "Chair, please let me speak," the speech request state is detected by voice recognition, and a speech request is notified to the speech control operation unit 4.
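The keyword matching step (S105) can be illustrated separately from the recognition step. The sketch below assumes that the recognizer has already produced a text transcript, and uses case-insensitive substring matching, which is a simplification; the function name `contains_request_keyword` is hypothetical, and the keywords are the English renderings of the examples given above.

```python
REQUEST_KEYWORDS = ("chair", "please let me speak")  # contents of the keyword store


def contains_request_keyword(recognized_text, keywords=REQUEST_KEYWORDS):
    """Return True if the recognized text contains any stored keyword."""
    text = recognized_text.casefold()
    return any(keyword.casefold() in text for keyword in keywords)


requested = contains_request_keyword("Chair, please let me speak.")
```

A production matcher would likely work on recognized words rather than raw substrings, so that unrelated words containing a keyword are not matched.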

Next, the speech request detection unit 136 is notified of the detection result by the level detection unit 132, the DTMF detection unit 133, or the voice recognition detection unit 134.

When the speech request state is detected, the speech request detection unit 136 notifies the speech control operation unit 4 of a speech request signal containing the identification information of the conference terminal 5 for which the speech request state was detected (S106).

In the speech control operation unit 4, the chair decides whether to permit the participant to speak. When the chair permits the participant to speak, speech designation information containing the identification information of that participant's conference terminal 5 is transmitted to the MCU 3. When the synthesis target voice selection unit 14 receives the speech designation information via the control unit 30 of the MCU 3 (S107), it sets the audio signal of the conference terminal 5 designated by the speech designation information as a synthesis target (S108). Synthesized audio that includes the participant's voice is thereby produced.
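The flow from S101 through S108 can be summarized as a small piece of state: the set of permitted terminals and the requests awaiting the chair's decision. The following sketch is illustrative only. The class name `FloorControl`, the message layout, and the suppression of duplicate pending requests are assumptions that do not appear in the embodiment.

```python
class FloorControl:
    """Tracks permitted terminals and pending speech requests."""

    def __init__(self, terminal_ids, permitted_ids):
        self.terminals = set(terminal_ids)
        self.permitted = set(permitted_ids)
        self.pending = []

    def detection_targets(self):
        """S101: terminals not permitted to speak are the ones monitored."""
        return sorted(self.terminals - self.permitted)

    def report_detection(self, terminal_id):
        """S106: build a speech request signal for the chair terminal, or None."""
        if terminal_id not in self.terminals or terminal_id in self.permitted:
            return None
        if terminal_id in self.pending:
            return None  # assumed: do not notify the chair twice
        self.pending.append(terminal_id)
        return {"type": "speech_request", "terminal_id": terminal_id}

    def apply_designation(self, terminal_id):
        """S107 to S108: add the designated terminal to the synthesis targets."""
        if terminal_id not in self.terminals:
            raise KeyError(terminal_id)
        if terminal_id in self.pending:
            self.pending.remove(terminal_id)
        self.permitted.add(terminal_id)


floor = FloorControl(["5-1", "5-2", "5-3"], permitted_ids=["5-1"])
request = floor.report_detection("5-3")  # a detector fired for terminal 5-3
floor.apply_designation("5-3")           # the chair grants the request
```

After the designation is applied, terminal 5-3 is no longer a detection target, so its audio is mixed instead of monitored.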

(A-3) Effects of the Embodiment
As described above, according to this embodiment, in a conference room that uses speech control, monitoring the decoded audio signals makes it possible to detect, without any special speech request mechanism, that a participant has spoken or wants to speak, and lets the chair decide whether to permit that participant to speak.

Further, because this embodiment detects the speech request state from the decoded audio signals, speech control can be performed regardless of differences in the communication protocol (connection method) used with the conference server, connections made over the Internet, or differences in the type of conference terminal. As a result, a conference system can be realized in which a wide variety of terminals interoperate.

Furthermore, whereas audio from every site has conventionally been decoded at all times and the audio not used for synthesis discarded, this embodiment makes effective use of the audio signals not used for synthesis, so no additional processing for audio decoding arises.

(B) Other Embodiments
Although various modifications have been mentioned in the embodiment described above, the present invention can also be applied to the following modified embodiments.

(B-1) The embodiment described above illustrates the case of detecting, based on the audio signal of a participant (conference terminal) not permitted to speak, that the participant has spoken or wants to speak. However, the invention is not limited to this, and the detection may instead be based on the image data of a participant (conference terminal) not permitted to speak.

For example, as illustrated in FIG. 6, the synthesizing unit 24 of the video signal processing unit 20 may include a speech request detection unit 13-1 in place of the speech request detection unit 13 of the embodiment described above, and the speech request detection unit 13-1 may analyze the decoded image data to detect the speech request state. In that case, as illustrated in FIG. 6, a synthesis target voice selection unit 14-1, in place of the synthesis target voice selection unit 14, passes the audio signals selected as synthesis targets to the encoding unit 15.

On receiving the decoded video signals from the decoding units 22 (22-1 to 22-3), the speech request detection unit 13-1 detects, based on the video signal from each conference terminal 5, whether a participant is requesting to speak. The speech request detection unit 13-1 is described in detail later. The speech request detection unit 13-1 can also be used together with the speech request detection unit 13 of the embodiment described above.

Based on the detection result of the speech request detection unit 13-1 received via the control unit 30, the synthesis target voice selection unit 14-1 selects the audio signals to be synthesized from among the audio signals decoded by the decoding units 12 (12-1 to 12-3) and passes the selected audio signals to the encoding unit 15. The synthesis target voice selection unit 14-1 can also be used together with the synthesis target voice selection unit 14 of the embodiment described above.

Here, as an example, the speech request detection unit 13-1 detects that a participant is signaling a wish to speak by waving a hand at the camera.

In such a case, the speech request detection unit 13-1 may detect the movement of a participant not permitted to speak by applying a motion detection (or moving object detection) algorithm to the decoded video stream. Various algorithms can be used for motion detection. For example, the speech request detection unit 13-1 may divide the video stream into frames, detect a target object by registering detectable objects such as a "hand" in advance, and detect the speech request state based on the result of comparing the object's amount or speed of movement with a preset threshold. Of course, the method is not limited to this; various methods can be applied as long as the movement (gesture) of a participant requesting to speak can be detected from the video stream.
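As a much simpler stand-in for the object-based approach described above, motion can be estimated by frame differencing: counting how many pixels change between consecutive decoded frames. This sketch does not detect a hand; it only illustrates the comparison of an amount of movement with a preset threshold. The function names, the grayscale frame layout (lists of rows), and all threshold values are assumptions.

```python
def motion_score(prev_frame, cur_frame, pixel_delta=30):
    """Fraction of pixels whose grayscale value changed by more than `pixel_delta`."""
    changed = 0
    total = 0
    for row_a, row_b in zip(prev_frame, cur_frame):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def detect_motion_request(frames, score_threshold=0.2, min_active_pairs=2):
    """Return True if enough consecutive frame pairs show substantial motion."""
    active = sum(
        1 for prev, cur in zip(frames, frames[1:])
        if motion_score(prev, cur) > score_threshold
    )
    return active >= min_active_pairs


still = [[0, 0], [0, 0]]
moved = [[255, 255], [0, 0]]
waving = detect_motion_request([still, moved, still])
```

Requiring several active frame pairs is intended to separate a sustained gesture such as waving from a single abrupt change, for example a lighting flicker.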

7: conference system, 1: conference server, 2: gatekeeper (GK), 3: MCU, 10: audio signal processing unit, 20: video signal processing unit, 30: control unit,
11 (11-1 to 11-3): audio receiving unit, 12 (12-1 to 12-3): decoding unit, 17: synthesizing unit, 15: encoding unit, 16 (16-1 to 16-3): audio transmission unit, 13: speech request detection unit, 14: synthesis target voice selection unit,
131: speech request target selection unit, 132: level detection unit, 133: DTMF detection unit, 134: voice recognition detection unit, 135: keyword storage unit, 136: speech request detection unit,
5 (5-1 to 5-3): conference terminal, 4: speech control operation unit, 41: speech request receiving unit, 42: speech request display unit, 43: speech designation unit.

Claims

An information processing apparatus that decodes a media signal received from each of a plurality of terminals, combines one or more media signals selected from the plurality of media signals, and transmits an encoded combined media signal to the plurality of terminals, the apparatus comprising:
speech request detection means for analyzing each of the decoded media signals and detecting a predetermined speech request state based on each of the media signals; and
notification means for notifying, when the speech request detection means detects the speech request state, a terminal having a preset speech permission authority of speech request information indicating that speech has been requested by the transmission source of the media signal in which the speech request state was detected.

2. The information processing apparatus according to claim 1, wherein the speech request detection means monitors the decoded media signals other than the synthesis targets, based on synthesis target information relating to generation of the combined media signal.

3. The information processing apparatus according to claim 1, wherein the media signal is an audio signal, and
the speech request detection means detects the speech request state based on a time-series change of each audio signal, among the decoded audio signals, that is not used for the synthesized speech.

4. The information processing apparatus according to claim 3, wherein the speech request detection means detects the speech request state based on a change in the speech level value of each of the audio signals not used for the synthesized speech.

5. The information processing apparatus according to claim 1, wherein the speech request detection means performs speech recognition on each of the audio signals not used for the synthesized speech and detects the speech request state based on the speech recognition result.

6. The information processing apparatus according to claim 3, wherein the speech request detection means detects the speech request state by detecting a frequency synthesis signal contained in each of the audio signals not used for the synthesized speech.

7. The information processing apparatus according to claim 1, wherein the media signal is a video signal, and
the speech request detection means detects the speech request state based on a time-series change of each video signal, among the decoded video signals, that is not used for the composite video.

8. An information processing method for decoding a media signal received from each of a plurality of terminals, combining one or more media signals selected from the plurality of media signals, and transmitting an encoded combined media signal to the plurality of terminals, the method comprising:
analyzing, by speech request detection means, each of the decoded media signals and detecting a predetermined speech request state based on each of the media signals; and
notifying, by notification means, when the speech request detection means detects the speech request state, a preset speech permission authority terminal of speech request information indicating that speech has been requested by the transmission source of the media signal in which the speech request state was detected.

9. An information processing program for decoding a media signal received from each of a plurality of terminals, combining one or more media signals selected from the plurality of media signals, and transmitting an encoded combined media signal to the plurality of terminals, the program causing a computer to function as:
speech request detection means for analyzing each of the decoded media signals and detecting a predetermined speech request state based on each of the media signals; and
notification means for notifying, when the speech request detection means detects the speech request state, a preset speech permission authority terminal of speech request information indicating that speech has been requested by the transmission source of the media signal in which the speech request state was detected.