JP2015070515A

JP2015070515A - Information processing apparatus, information processing system, control method of information processing apparatus, control method of information processing system, and program

Info

Publication number: JP2015070515A
Application number: JP2013204509A
Authority: JP
Inventors: 久士矢島; Hisashi Yajima
Original assignee: Canon Marketing Japan Inc; Canon MJ IT Group Holdings Inc; Canon Software Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc; Canon MJ IT Group Holdings Inc
Priority date: 2013-09-30
Filing date: 2013-09-30
Publication date: 2015-04-13
Anticipated expiration: 2033-09-30
Also published as: JP6417652B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for smoothly recovering from voice delay when voice data input at a plurality of client terminals is transmitted to a server, mixed in the server, and then transmitted to the other client terminals as mixed voice data.

SOLUTION: In a server that mixes voice data received from a plurality of client terminals and distributes the mixed voice data to the respective client terminals, voice delay arises when the voice data in a queue that stores the voice data received from each client terminal exceeds a predetermined capacity. In this case, the distribution delay is recovered by skipping silent data, that is, voice data to be mixed that is determined not to include voice, without using it for mixing.

Description

The present invention relates to a technique for recovering from audio delay caused by network delay or the like in an application that transmits and receives audio over a network.

In an application that transmits and receives audio over a network, a queue for accumulating the uplink audio from each client may be provided, for example on the server, to allow for network jitter and similar fluctuations. Having a queue, however, can also cause delay to grow as data accumulates in it.

A client can recover from delay by discarding silence from the audio of each of the other clients that it receives. When the server performs mixing, however, the downlink audio that the client receives is a single piece of audio data mixed by the server. In that case, if any one client is speaking, the mixed audio is not silent, so it is not discarded and the delay cannot be recovered.

Consequently, when the server mixes the downlink audio and transmits it to a client, a fluctuation in the downlink communication between that client and the server can cause audio that has already been transmitted to the other clients to reach only that client late, and all audio that the client receives afterward remains delayed.

Increased delay causes concrete problems such as the following. In a remote conference held across multiple sites, audio delay accumulates over time and can hinder the progress of the conference. For example, a participant who believes that the other party is not speaking may start to speak, so that the voices overlap and cannot be understood. Conversation may also break down, for example when a participant repeats a question in the belief that no answer has come.

The communication system in Patent Document 1 has a configuration in which audio data is transmitted from a transmission device to a reception device. The transmission-side device generates silent-state information based on the volume level of the audio data. The transmission device includes a jitter absorption buffer that accumulates audio data and adjusts the delay time, and a jitter buffer adjustment unit. When the amount of audio data accumulated in the jitter absorption buffer exceeds the set allowable accumulation amount, the jitter buffer adjustment unit increases the allowable accumulation amount; when the accumulated amount stays within the set allowable accumulation amount for a certain period, it discards audio data indicating a silent state and returns the allowable accumulation amount to its default value.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-124689 (JP2012-124689)

However, the technique described in Patent Document 1 applies only where the transmitting terminal and the receiving terminal are in a one-to-one relationship. It cannot handle a one-to-many relationship such as a conference system, in which audio data is not sent directly from one client to another but is, for example, mixed with the audio data of the other client terminals in a conference server before being transmitted to a client terminal.

In view of the above problems, an object of the present invention is to provide a technique for smoothly recovering from audio delay when audio data input at a plurality of client terminals is transmitted to a server, mixed in the server, and then transmitted to the other client terminals as mixed audio data.

The present invention is an information processing apparatus connectable via a network to client terminals that each transmit input audio data to the information processing apparatus in predetermined units, the apparatus comprising: mixing audio storage means for storing, for every client terminal that exchanges audio data with the information processing apparatus, a queue corresponding to that client terminal, the queue holding the audio data received from the client terminal together with transmitted-terminal information that identifiably indicates to which of the other client terminals the audio data has already been transmitted; mixing means for generating, from the audio data in those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which audio data is to be transmitted, mixed audio data to be transmitted to that client terminal; transmitted-terminal information recording means for recording, in the transmitted-terminal information, that the mixing means has used the audio data in a queue to generate the mixed audio data to be transmitted to the client terminal; downlink delay recovery determination means for determining, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when the mixing means performs mixing; and silent data determination means for determining, based on a predetermined condition, whether audio data is silent data regarded as containing no voice. When the mixing means generates the mixed audio data to be transmitted to the client terminal and the downlink delay recovery determination means has determined that the downlink audio delay recovery processing is to be performed on the queues in the mixing audio storage means, the mixing means does not process any untransmitted audio data that it has taken from a queue corresponding to another client terminal and that the silent data determination means has determined to be silent data, and instead advances to the next audio data.

According to the present invention, audio delay can be recovered smoothly when audio data input at a plurality of client terminals is transmitted to a server, mixed in the server, and then transmitted to the other client terminals as mixed audio data.

An example of a diagram showing a system configuration according to an embodiment of the present invention.
An example of a diagram showing a hardware configuration according to the embodiment of the present invention.
An example of a diagram showing a functional configuration of software according to the embodiment of the present invention.
An example of an image showing the occurrence of playback delay and the recovery processing in the client terminal according to the embodiment of the present invention.
An example of an image showing how audio information is stored in queues in the conference server according to the embodiment of the present invention.
An example of an image showing the occurrence of a delay state based on the reception state from a client terminal, and the recovery processing, in the conference server according to the embodiment of the present invention.
An example of an image showing the occurrence of a delay state based on the transmission state to a client terminal, and the recovery processing, in the conference server according to the embodiment of the present invention.
An example of a flowchart of the playback delay recovery processing in the client terminal according to the embodiment of the present invention.
An example of a flowchart of the delay recovery processing based on the reception state from a client terminal in the conference server according to the embodiment of the present invention.
An example of a flowchart of the delay recovery processing based on the transmission state to a client terminal according to the embodiment of the present invention.
An example of a diagram that uses a queue image to explain the delay recovery method when delay has occurred in FIG. 7.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is an example of a diagram showing a system configuration according to an embodiment of the present invention. A plurality of client terminals 101 and a conference server 102 are connectable via a network.

In the conference system (FIG. 1), one of the users taking part in a conference (called the organizer for convenience) accesses the conference server 102 from a client terminal 101 and reserves a conference room. A conference room is a virtual conference space, and entry can be restricted to the "invited participants" described below. Alternatively, it may be a free space in which unspecified users can take part, or in which they cannot speak but can only watch and listen.

The organizer decides a conference ID that identifies the conference (or a conference room ID, room number, or the like), the time at which the conference room is to be used, and so on. To invite specific participants, the organizer may call for participation using the participants' notification destinations (for example, e-mail addresses) registered in the conference server. To admit only specific users, the notification can include a password for entering the conference room.

Here, audio data from the microphone (not shown) and image data from the camera (not shown) of the user (the actual conference participant) of each of the client terminals 101a to 101c are first transmitted to the conference server 102, as illustrated in FIGS. 4 to 7. Audio, for example, is transmitted from each client terminal 101 to the conference server 102, so that there are as many uplink audio streams as there are client terminals 101.

The conference server 102 mixes the uplink audio received from the plurality of client terminals 101 and distributes the result to the other client terminals 101 as downlink audio data. In mixing, however, the downlink audio data returned to a client terminal 101 that transmitted uplink audio need not include that terminal's own audio.

Although the client terminal 101 and the conference server 102 are described here as separate housings, one of the client terminals 101 may be configured to include the functions of the conference server 102 in the same housing.

The embodiment of the present invention is described using a conference system as an example, but it is not limited to conference systems. The technique is usable wherever audio is exchanged over a network and packets (data including audio data) are delayed by network delay or the like.

FIG. 2 is an example of a diagram showing a hardware configuration according to the embodiment of the present invention. As shown in FIG. 2, each of the client terminal 101 and the conference server 102 has a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204. The CPU 201 performs overall control of the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores a BIOS (Basic Input/Output System) and an OS (Operating System), which are control programs for the CPU 201, as well as the various programs described later that are needed to implement the functions executed by each server or PC. It also stores the information needed to carry out the present invention. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and the like of the CPU 201. The CPU 201 implements various operations by loading the programs and other data needed for processing from the ROM 203 or the external memory 211 into the RAM 202 and executing the loaded programs.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be, for example, a liquid crystal display. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211, such as an external storage device (hard disk (HD)) that stores a boot program, various applications, font data, user files, edit files, various data, and the like, a flexible disk (FD), or a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via the network and executes communication control processing on the network. For example, it can communicate using TCP/IP (Transmission Control Protocol/Internet Protocol).

The CPU 201 enables display on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions through a mouse cursor (not shown) or the like on the display 210.

The various programs described later for implementing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed. The definition files, information tables, and the like used when the programs run are also stored in the external memory 211; these too are described in detail later.

FIG. 3 is an example of a diagram showing a functional configuration of software according to the embodiment of the present invention. It shows the software components and storage units of the client terminal 101 and the conference server 102, and the data passed between them (apart from the dotted arrows that relate terminal 3 to the audio data 328 and so on).

First, the flow of audio data is outlined. When a client terminal 101 accepts audio input from its user, the input is converted into audio data and transmitted to the conference server 102. The audio data that the conference server receives from the plurality of client terminals 101 is stored in queues, one provided for each client terminal 101. To produce the audio data to be transmitted to a given client terminal 101, the server takes one piece of audio data from each of the queues corresponding to the other client terminals 101, excluding the queue for that terminal's own audio data, mixes them, and transmits the result to the client terminal 101. The client terminal 101 that receives the mixed audio data plays back this mix of the other client terminals' audio (its own audio data is not included), and the user thereby hears the audio.
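The flow outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the specification: the names `mix_frames`, `mix_for`, and the terminal identifiers are hypothetical, audio units are modeled as lists of 16-bit PCM samples, mixing is modeled as a clamped sample-wise sum, and the head of each queue is read without being removed (the bookkeeping of which terminals have already been served is left out).

```python
from collections import deque


def mix_frames(frames):
    """Sum equal-length frames sample by sample, clamped to the 16-bit range.

    Summation with clamping is an assumed mixing method for illustration.
    """
    if not frames:
        return []
    mixed = [sum(samples) for samples in zip(*frames)]
    return [max(-32768, min(32767, s)) for s in mixed]


def mix_for(destination, queues):
    """Mix the head unit of every non-empty queue except the destination's own."""
    frames = [
        queue[0]  # peek only: other destinations may still need this unit
        for terminal, queue in queues.items()
        if terminal != destination and queue
    ]
    return mix_frames(frames)


# One queue per client terminal (hypothetical identifiers and sample values).
queues = {
    "terminal1": deque([[100, 200]]),
    "terminal2": deque([[10, 20]]),
    "terminal3": deque([[1, 2]]),
}

# The mix sent to terminal1 contains only terminal2 and terminal3.
mix_to_terminal1 = mix_for("terminal1", queues)
```

Reading the head without removing it reflects that the same unit has to be mixed for several destinations before it can be dropped from its queue.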

In the client terminal 101, the audio input unit 311 accepts audio data input from a connected device such as a microphone used by the user. The accepted audio data is transmitted from the audio transmission unit 312 to the conference server 102. The audio data is divided into units of a fixed size (for example, 10 milliseconds of input audio, using time as the basis). In the following description, "audio data" means input audio data divided into units of this fixed size. The size given is only an example; the unit follows whatever basis the conference system uses, such as a number of data bits.
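As a rough illustration of dividing input audio into fixed-size units, the sketch below splits a list of samples into 10-millisecond units. The function name `split_into_units` and the 16 kHz sampling rate are assumptions made for the example; the specification only gives 10 milliseconds as one possible unit size.

```python
def split_into_units(samples, sample_rate_hz=16000, unit_ms=10):
    """Split a sample sequence into units covering unit_ms of audio each.

    sample_rate_hz is an assumed value; the last unit may be shorter.
    """
    unit_len = sample_rate_hz * unit_ms // 1000  # samples per unit
    return [samples[i:i + unit_len] for i in range(0, len(samples), unit_len)]


# 400 samples at 16 kHz are 25 ms of audio: two full units and one partial unit.
units = split_into_units(list(range(400)))
```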

The audio data transmitted from the client terminal 101 (uplink audio data) is received by the audio reception unit 321 of the conference server 102 and stored in the mixing audio storage unit 326. The mixing audio storage unit 326 provides a queue for each client terminal 101 taking part in the conference and holds a certain number of pieces of audio data. In FIG. 3, the elements of the queue for each client terminal 101 in the mixing audio storage unit 326 are drawn vertically (seven rectangles in the example) for each of terminals 1 to 3.

When audio is received, delay recovery processing is performed if the uplink check unit 322 determines that a processing delay has arisen because of the reception (acquisition) of uplink audio data. Specifically, this processing is performed for each client terminal 101 from which data is received: when the number of pieces of audio data stored in the corresponding queue is determined to exceed a predetermined number based on the uplink delay threshold in the server threshold storage unit 327, silent data is deleted.
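The uplink check can be sketched as follows. This is a hedged illustration, not the flowchart of the specification: `Unit`, `store_uplink`, and the threshold value are hypothetical, the silence decision is assumed to have been made already and stored in a flag, and checking the queue length after storing the new unit is an assumed ordering.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Unit:
    label: str
    silent: bool  # result of a silence determination made elsewhere (assumed)


def store_uplink(queue, unit, uplink_delay_threshold):
    """Store a received unit, then purge silent units if the queue is too long."""
    queue.append(unit)
    if len(queue) > uplink_delay_threshold:
        voiced = [u for u in queue if not u.silent]
        queue.clear()
        queue.extend(voiced)
    return queue


# Queue for one client terminal, with a hypothetical threshold of 3 units.
queue = deque()
for label, silent in [("A1", False), ("A2", True), ("A3", False), ("A4", True)]:
    store_uplink(queue, Unit(label, silent), uplink_delay_threshold=3)

remaining = [u.label for u in queue]
```

The fourth unit pushes the queue past the threshold, so the silent units are removed and only the voiced units remain queued for mixing.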

One unit of audio data stored in each queue corresponds to one rectangle; more precisely, each unit is a pair of audio data 328 and transmitted-terminal information 329. The audio data 328 is the audio data received from the client terminal 101 as described above. The received audio data is mixed with audio data received from other terminals and transmitted from the conference server 102 to each client terminal 101 (by the mixed-audio transmission unit 323). At that point, some condition (for example, differences in network line speed) results in client terminals 101 to which the data has been transmitted and client terminals 101 to which it has not. So that the delay recovery processing described later can be performed for each client terminal 101, the transmitted-terminal information 329 records whether each client terminal 101 has been sent the data. Any data structure may be used as long as the transmitted and untransmitted client terminals 101 can be distinguished.
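Since the text leaves the data structure open, one possible shape for a queue element is sketched below. The class `QueueEntry`, its field names, and the helper `next_unsent` are hypothetical; a set of terminal identifiers is just one way to make transmitted and untransmitted terminals distinguishable.

```python
from dataclasses import dataclass, field


@dataclass
class QueueEntry:
    """One queue element: audio data paired with transmitted-terminal information."""
    audio: bytes
    sent_to: set = field(default_factory=set)  # terminals already served (assumed form)

    def mark_sent(self, terminal_id):
        self.sent_to.add(terminal_id)

    def is_sent_to(self, terminal_id):
        return terminal_id in self.sent_to


def next_unsent(queue, terminal_id):
    """Return the oldest entry not yet used for the given terminal, or None."""
    for entry in queue:
        if not entry.is_sent_to(terminal_id):
            return entry
    return None


# Two units from one source terminal; the first was already mixed for terminal2.
queue = [QueueEntry(b"\x01\x02"), QueueEntry(b"\x03\x04")]
queue[0].mark_sent("terminal2")
```

With this layout, `next_unsent(queue, "terminal2")` returns the second entry while `next_unsent(queue, "terminal3")` still returns the first, which is the per-terminal distinction the paragraph asks for.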

When mixing the audio data to be transmitted to a given client terminal 101, the audio data originally received from that client terminal 101 is excluded from the mix. That is, only audio data sent from client terminals 101 other than the destination itself is mixed.

When the mixed-audio transmission unit 323 transmits audio data to a client terminal 101, the downlink check unit 324 performs delay recovery processing on the client terminals 101 to be mixed (that is, the client terminals 101 other than the destination client terminal 101). Specifically, if the downlink delay flag in the server threshold storage unit 327 (one flag exists for each client terminal 101) is "on" at the time of mixing, silent data in each queue is skipped and the next audio data is mixed instead.
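The skip-on-silence behavior can be sketched as follows. The names `Entry` and `pick_for_mixing` are hypothetical, and recording a skipped silent unit as already consumed for the destination is an assumption made so that the unit is not revisited on the next pass; the specification only states that the silent unit is skipped and the next unit is mixed.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    label: str
    silent: bool
    sent_to: set = field(default_factory=set)


def pick_for_mixing(queue, destination, downlink_delay_flag):
    """Pick the next unit of one source queue to mix for the destination.

    While the destination's downlink delay flag is on, silent units are
    skipped (and marked as consumed, which is an assumption of this sketch).
    """
    for entry in queue:
        if destination in entry.sent_to:
            continue  # already used for this destination
        entry.sent_to.add(destination)
        if downlink_delay_flag and entry.silent:
            continue  # silent data: do not mix it, move on to the next unit
        return entry
    return None


# Source queue of another terminal: B1 was already sent, (B2) is silent.
queue = [Entry("B1", False, {"terminal1"}), Entry("B2", True), Entry("B3", False)]
picked = pick_for_mixing(queue, "terminal1", downlink_delay_flag=True)
```

Each skipped silent unit shortens the backlog for the delayed destination by one unit, while destinations whose flag is off continue to receive every unit in order.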

After the delay recovery processing by the downlink check unit 324 is complete (including the case where delay recovery is determined to be unnecessary), the mixed data is transmitted to the relevant client terminal 101. In the processing of the conference server 102, the audio reception unit 321 and the mixed-audio transmission unit 323 need not operate in synchronization.

The mixed-audio reception unit 313 of the client terminal 101 receives the data in which the conference server 102 has mixed the audio data of the client terminals other than this client terminal 101, and stores it in the received-audio storage unit 316 (a queue). At that time, if the number of stored pieces is determined to exceed a predetermined number based on the playback delay threshold in the terminal threshold storage unit 317, silent data is deleted.

The audio playback unit 315 takes the mixed audio out of the received-audio storage unit 316 (queue) and plays it back. In the processing of the client terminal 101, the mixed-audio reception unit 313 and the audio playback unit 315 need not operate in synchronization.

Audio delay recovery processing is executed in both the client terminal 101 and the conference server 102. An outline of each is given using images of the queues that store audio data: FIG. 4 for the client terminal 101 side and FIGS. 5 to 7 for the conference server 102 side.

FIG. 4 is an example of an image showing the occurrence of playback delay and the recovery processing in the client terminal according to the embodiment of the present invention. FIG. 4 is used to describe 1) the state in which no delay has occurred, 2) the state in which delay has occurred, and 3) the delay recovery method. All queues in the embodiment described with FIGS. 4 to 7 are treated as FIFO queues. First, the state in which no delay has occurred is described using 1).

(1) First, the client terminal 101 receives audio data (mixed audio data) from the conference server 102.
(2) The audio data is stored in the received-audio storage unit 316 as data waiting to be played back.

(3) In the example of FIG. 4, only the received "A1" is stored, but any number of pieces within the desired range based on the playback delay threshold need not be regarded as a delay. The queue is FIFO: audio data stored first is taken out and played back first.

(4) The audio data that reaches the head of the queue is taken out and played back by the audio playback unit 315, so that the user of the client terminal 101 can hear it.

In the rest of the description of FIG. 4, the flow of (1) to (4) is the same whether or not delay occurs.

Next, the state in which delay has occurred is described using 2). Suppose, for example, that because of network delay or the like the client terminal 101 receives a plurality of packets almost simultaneously (for convenience they are denoted by the same symbols as the audio data; in the example, the seven packets A1 to A7). Seven pieces of audio data then accumulate in the queue of the client terminal 101 (the received-audio storage unit 316), and playback of the audio data is delayed.

In subsequent processing, audio data is in principle removed from the queue for playback at the same rate at which newly received audio data is stored in it, so the amount of delay stays constant (that is, the delay never recovers on its own).
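The reason the backlog does not shrink by itself can be checked with a small simulation. This is only an illustrative model: the function `backlog_over_time` and its parameters are hypothetical, and it assumes exactly one unit arrives and one unit is played per tick.

```python
def backlog_over_time(initial_backlog, ticks, arrivals_per_tick=1, plays_per_tick=1):
    """Track the queue length when units arrive and are played at fixed rates."""
    backlog = initial_backlog
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick           # newly received units are stored
        backlog -= min(plays_per_tick, backlog)  # units taken out for playback
        history.append(backlog)
    return history


# A burst left 7 units in the queue; with equal rates the backlog never shrinks.
after_burst = backlog_over_time(initial_backlog=7, ticks=5)
```

With 10-millisecond units, a standing backlog of 7 units corresponds to roughly 70 milliseconds of extra playback delay, and each further burst adds to it.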

Consequently, each time a further delay occurs for some reason, the total amount of delayed audio data grows, and it eventually reaches a level at which users find the conference call unnatural. For example, when images captured by a camera are transmitted and received separately, the image of a speaker and the audio may drift markedly out of sync, or the order of utterances relative to those of other client terminals 101 may become confused.

Therefore, as described above, a queue is not regarded as delayed while the number of stored items is within the desired range based on the reproduction delay threshold, but once the predetermined range is exceeded, delay recovery processing must be performed at reproduction time.

Finally for FIG. 4, the “delay recovery method” is described using 3). Delay recovery processing is performed when the number of stored items exceeds the desired range based on the reproduction delay threshold. The queue holds the audio data A1 to A7, of which the items marked with “()” (A2, A4, A5, and A6, written as, for example, (A2)) are “silent data”. Here, silent data is audio data that the system regards as containing no voice. How to determine whether audio data is silent data is well known from, for example, Japanese Patent Laid-Open No. 2000-312223, so its description is omitted.

To reproduce audio data, items are acquired one at a time from the head of the queue and reproduced. At each acquisition, the number of audio data items stored in the queue is counted, and if that number exceeds the desired range based on the reproduction delay threshold, delay recovery processing is started. In that case, all audio data stored in the queue is checked and the silent data is deleted. The delay recovery processing is carried out in one go, without being interrupted by other processing.

FIG. 5 is a diagram showing an example of how audio information is stored in queues in the conference server according to the embodiment of the present invention. FIG. 5 shows the state in which no delay has occurred.
(1) First, the conference server 102 receives audio data from each client terminal 101.

(2) Next, the received audio data is stored in the queue prepared for each client terminal 101 in the mixing audio storage unit 326. As described with reference to FIG. 3, what is stored is not only the audio data 328 but also the transmitted terminal information 329, which records, for each of the other client terminals 101, whether the audio data has already been mixed and transmitted to that terminal. For convenience, the two (328 and 329) are together called audio information. In the initial state immediately after enqueueing, the audio data has not been transmitted to any other client terminal 101. The transmitted/untransmitted information may take any data structure: an array of flags covering all the other client terminals 101, a list of terminals already transmitted to, a list of terminals not yet transmitted to, and so on.

(3) Next, the audio data of each queue (the audio data contained in the audio information at the head of the queue) is taken out and mixed. As described above, the audio data of all queues is mixed except for the queue corresponding to the client terminal 101 to which the result is to be transmitted.
(4) The mixed audio data is transmitted to the client terminal 101.
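Steps (1) to (4) can be sketched as follows. The embodiment does not say how samples are combined, so this sketch assumes 16-bit PCM frames represented as lists of integers that are summed and clipped; the names `mix`, `mix_for`, and the terminal labels are illustrative only.

```python
from collections import deque


def mix(frames: list[list[int]]) -> list[int]:
    """Sum equal-length frames sample by sample and clip to the 16-bit range.

    Assumption: additive mixing with clipping; the source does not specify
    the mixing arithmetic.
    """
    if not frames:
        return []
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*frames)]


def mix_for(destination: str, queues: dict[str, deque]) -> list[int]:
    """Mix the head frame of every non-empty queue except the destination's own."""
    heads = [q[0] for source, q in queues.items() if source != destination and q]
    return mix(heads)


# One queue per client terminal, each holding one frame (no delay).
queues = {
    "A": deque([[100, 200, 300]]),
    "B": deque([[10, 20, 30]]),
    "C": deque([[1, 2, 3]]),
}
mixed_for_c = mix_for("C", queues)   # audio of A and B only
mixed_for_a = mix_for("A", queues)   # audio of B and C only
```

Excluding the destination's own queue is what keeps a participant from hearing an echo of their own voice in the mixed stream.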

FIG. 6 is a diagram showing an example of how a delay state arises from the reception state from the client terminals, and how it is recovered, in the conference server according to the embodiment of the present invention. For FIG. 6, “1) a situation in which a delay occurs because of uplink audio data acquisition (acquisition of audio data transmitted from the client terminal 101 to the conference server 102)” is described first.

(1) First, assume that audio data is being received from the client terminal 101A without delay, whereas seven pieces of audio data (B1 to B7) are received from the client terminal 101B because of a network problem or the like.

(2) Accordingly, one piece of audio information is stored in the queue corresponding to the client terminal 101A, while seven pieces of audio information are stored in the queue corresponding to the client terminal 101B.

(3) Mixing for transmission to the client terminal 101C uses the audio data taken out of the queues for the client terminals 101A and 101B. From the queue for the client terminal 101A, the audio data “A7” (audio from the same time as “B7”) is acquired and used, but in the queue for the client terminal 101B, the audio data “B1” (audio from six unit times before “B7”) still remains as delayed data. As a result, the mixed audio data “A7+B1” is generated from components that are offset in time.

As with reproduction in the client terminal 101 described above, the number of pieces of audio information removed from the queue for mixing is, in principle, the same as the number received and stored in the queue, so the delay stays constant (that is, it never recovers on its own). Consequently, if for some reason delays occur only in the audio data received from the client terminal 101B, the total amount of delayed audio data keeps accumulating.

(4) Eventually the (mixed) audio data transmitted to a client terminal 101 reaches a level that users perceive as a problem. The mixed audio data combines the audio data of the client terminals 101 other than the one receiving it, but those components are offset in time, so conversation may become impossible.

Next, “2) the delay recovery method” for the delay described above is explained. As an example, suppose that, as in 1), more audio information than the predetermined range allows has accumulated in the queue for audio received from the client terminal 101B (as determined from the uplink delay threshold). When this is determined, delay recovery processing is started. In the queue on the left of the drawing for “2)”, four of B1 to B7 are audio information corresponding to silent data (those marked with “()”), so these are deleted. Whether data is silent is determined by the client terminal 101, and the result is attached to the audio data it transmits. The conference server 102 does not analyze the actual audio data; it decides whether the data is silent from the additional information attached to the audio data sent by the client terminal 101. However, the conference server 102 may make the silence determination itself. The conference server may also need to determine whether audio data is silent in downlink processing (transmission of audio data to the client terminal 101), and the same applies there. As the queue on the right shows, three pieces of audio information (B1, B3, and B7) remain. This processing is executed for every queue (one is prepared for each client terminal 101). The delay recovery processing is carried out in one go, without being interrupted by other processing.
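The division of labor described above, in which the client attaches a silence indication and the server merely reads it, might look like the following sketch. The actual silence criterion is left to the cited prior art, so the peak-amplitude rule, the default threshold of 500, and the names `AudioPacket`, `tag_packet`, and `server_sees_silence` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AudioPacket:
    samples: list[int]
    is_silent: bool   # additional information attached by the client


def tag_packet(samples: list[int], peak_threshold: int = 500) -> AudioPacket:
    """Client side: attach a silence flag.

    Assumed rule: a frame whose peak absolute amplitude is below
    `peak_threshold` is treated as silent.
    """
    peak = max((abs(s) for s in samples), default=0)
    return AudioPacket(samples=samples, is_silent=peak < peak_threshold)


def server_sees_silence(packet: AudioPacket) -> bool:
    """Server side: rely on the attached flag instead of analyzing samples."""
    return packet.is_silent


quiet = tag_packet([3, -7, 12, 0])
speech = tag_packet([1200, -3400, 2500, -900])
```

Reading a flag keeps the server's per-packet cost constant, which matters when recovery has to scan an entire queue without interruption.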

FIG. 7 is a diagram showing an example of how a delay state arises from the transmission state to the client terminals, and how it is recovered, in the conference server according to the embodiment of the present invention. For FIG. 7, “3) a situation in which a delay occurs because of downlink audio data acquisition (transmission of mixed audio data from the conference server 102 to the client terminal 101)” is described first.
(1) The conference server 102 receives audio data from both client terminals 101A and 101B.

(2) Audio information (audio data associated with transmitted terminal information) is stored in each corresponding queue. In 3) of FIG. 7, seven pieces of audio information are stored in each queue.

(3) Audio data is taken out of the queue for each client terminal 101. To mix the audio data to be sent to each client terminal 101, the audio data of all client terminals 101 other than that terminal itself is taken out.

(4) Next, the mixed audio data is transmitted to the client terminal 101C. At this point a delay may occur, depending on the state of the network and the like. Meanwhile, assume that the audio for the client terminal 101B (the audio data of the client terminals 101A and 101C mixed together) was transmitted without delay. The queue then holds audio of the client terminal 101A that has been transmitted to the client terminal 101B but not yet to the client terminal 101C. As long as the transmitted terminal information 329 of a piece of audio information shows even one untransmitted client terminal 101, that audio information cannot be deleted from the queue. Even though it is not deleted, it is simply ignored in mixing and transmission for the client terminal 101B; it cannot be ignored for the client terminal 101C, however, so the time gap between the audio reproduced at the client terminal 101A and the audio reproduced at the client terminal 101C accumulates, and the temporal order of the conversation may gradually become unnatural.
Next, “4) the delay recovery method” for the delay described above is explained with reference to FIG. 11.

FIG. 11 is an example of a diagram that uses queue images to explain the delay recovery method for the case where the delay of FIG. 7 has occurred.

As an example, when the audio to be transmitted to the client terminal 101C is delayed (as determined from the downlink delay flag), the delay is recovered by skipping silent data and mixing only non-silent audio data when the audio of the client terminals 101A and 101B is taken out of the queues and mixed. Skipped silent data is deleted from the queue if no untransmitted terminal remains for it. In the queue in the drawing for “4)”, four of B1 to B7 are audio information corresponding to silent data (those marked with “()”), so these are skipped. This processing is executed for every queue (one is prepared for each client terminal 101). The delay recovery processing is carried out in one go, without being interrupted by other processing.
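The skip-and-delete behavior for a delayed destination can be sketched as below. This is an illustration only: it assumes that the transmitted terminal information is kept as a set of terminal identifiers and that a skipped silent entry is recorded as handled for that destination, which is one way to satisfy the rule that skipped silent data is deleted once no untransmitted terminal remains. The names `Entry` and `next_for` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    label: str
    is_silent: bool
    sent_to: set[str] = field(default_factory=set)  # transmitted terminal info


def next_for(dest: str, source: str, queue: list[Entry],
             terminals: set[str], delayed: bool) -> Optional[Entry]:
    """Pick the next entry from `source`'s queue to mix for `dest`."""
    chosen = None
    for entry in queue:
        if dest in entry.sent_to:
            continue                     # already handled for this destination
        entry.sent_to.add(dest)          # mixed now, or skipped as silence
        if delayed and entry.is_silent:
            continue                     # skip silence to catch up
        chosen = entry
        break
    # Delete entries for which no other terminal remains untransmitted.
    others = terminals - {source}
    queue[:] = [e for e in queue if not others <= e.sent_to]
    return chosen


terminals = {"A", "B", "C"}
pattern = [("B1", False), ("B2", True), ("B3", False), ("B4", True),
           ("B5", True), ("B6", True), ("B7", False)]
# Terminal A is up to date; terminal C is delayed.
queue_b = [Entry(label, silent, {"A"}) for label, silent in pattern]

mixed_labels = []
while (entry := next_for("C", "B", queue_b, terminals, delayed=True)) is not None:
    mixed_labels.append(entry.label)
```

Here terminal C consumes three entries instead of seven, and every entry leaves the queue as soon as both A and C have been served.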

The flowcharts of FIGS. 8 to 10 explain the delay recovery processing for delays in the client terminal 101 or the conference server 102.

FIG. 8 is an example of a flowchart of the reproduction delay recovery processing in the client terminal according to the embodiment of the present invention. Steps S801 to S813 of the flowchart of FIG. 8 are executed by the CPU 201 of the client terminal 101, and steps S814 and S815 by the CPU 201 of the conference server 102. The flowchart of FIG. 8 corresponds to the queue images of FIG. 4.
In S801, mixed audio data transmitted from the conference server is received.

In S802, the received mixed audio data is stored (enqueued) in the queue of the received audio storage unit 316 of the client terminal 101.
In S803, the number of audio data items stored in the queue is counted.

In S804, it is determined whether more audio data than the predetermined range allows has accumulated in the queue (the determination is based on the reproduction delay threshold).

In S805, the processing branches on the result of that determination. If the number of stored audio data items is within the predetermined range, the processing takes the NO branch and proceeds to S810. If it is outside the range, the processing takes the YES branch and proceeds to S806 to execute the “silence discard” routine.

The processing of S806 to S809 is carried out in one go, without being interrupted by other processing. Every audio data item in the queue of the client terminal 101 is checked.
In S806, the audio data item at the head of the queue is dequeued (taken out).
In S807, it is checked whether the dequeued audio data is silent data.

In S808, if the checked audio data is determined to be silent, the processing returns directly to S806 to check the next audio data item, without putting the audio data back in the queue. That is, audio data determined to be silent is discarded. If the checked audio data is determined not to be silent, then in S809 the audio data (dequeued in S806) is enqueued (stored) in the queue again.

As described above, the queue is a FIFO, and the processing of S806 to S809 handles every audio data item exactly once without being interrupted by other processing. All silent data is therefore deleted, and the temporal order of the remaining audio data is preserved even after the queue changes from the left-hand queue to the right-hand queue shown in 3) of FIG. 4.
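The dequeue and re-enqueue loop of S806 to S809 can be sketched as follows, using the A1 to A7 example of FIG. 4. Representing each queue entry as a `(label, is_silent)` tuple and the function name `discard_silence` are assumptions made for the illustration.

```python
from collections import deque


def discard_silence(queue: deque) -> None:
    """Run one uninterrupted pass of S806 to S809 over every entry."""
    for _ in range(len(queue)):               # each entry is examined exactly once
        label, is_silent = queue.popleft()    # S806: dequeue the head
        if is_silent:                         # S807/S808: silent data is discarded
            continue
        queue.append((label, is_silent))      # S809: re-enqueue at the tail


playback_queue = deque([
    ("A1", False), ("A2", True), ("A3", False), ("A4", True),
    ("A5", True), ("A6", True), ("A7", False),
])
discard_silence(playback_queue)
remaining = [label for label, _ in playback_queue]
```

Because the pass length is fixed before the loop starts and survivors are appended in the order they were removed, the FIFO order of the remaining items is unchanged.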

With the above, delay recovery processing has been performed for the case where a delay occurs in the queue of the received audio storage unit 316 of the client terminal 101 that received the mixed audio data.

Next, in S810, the number of audio data items accumulated in the queue is counted again.

In S811, it is determined whether more audio data than the predetermined range allows has accumulated (the threshold may be the same value as the reproduction delay threshold or a different one). That is, it is determined whether the completed delay recovery processing was sufficiently effective.

In S812, if more audio data than the predetermined range allows has accumulated (YES), the processing proceeds to S813. Otherwise (NO), it returns to S801 (audio reception). Audio reproduction is executed asynchronously and is independent of the delay recovery processing, so it is not shown in the flowchart.

In S813, the conference server 102 is notified of delay recovery processing. With the approach of deleting silent data, the client terminal 101 can do nothing further, so in this case it asks the conference server side to assist with delay recovery. After notifying the conference server 102, the processing returns to S801. Meanwhile, in S814, the conference server 102 receives the notification sent in S813 by the client terminal 101.

In S815, the “downlink delay flag” is turned on to indicate that the client terminal 101 has requested recovery support. This may trigger the “downlink audio delay recovery” processing running on the conference server 102. However, “downlink audio delay recovery” may also be started by a decision made in its own routine, regardless of whether the client terminal 101 has made a request.
This completes the description of the flowchart for the delay recovery processing in the client terminal 101.
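The hand-off from the client to the server in S810 to S815 can be sketched as follows. The embodiment does not state whether the downlink delay flag is kept per terminal or how the notification is transported, so the per-terminal dictionary and the direct method call here are assumptions; `ConferenceServer` and `after_recovery` are hypothetical names.

```python
class ConferenceServer:
    def __init__(self) -> None:
        # Assumption: one downlink delay flag per client terminal.
        self.downlink_delay_flag: dict[str, bool] = {}

    def on_delay_notification(self, terminal_id: str) -> None:
        """S814/S815: receive the notification and turn the flag on."""
        self.downlink_delay_flag[terminal_id] = True


def after_recovery(terminal_id: str, queue_length: int,
                   recheck_threshold: int, server: ConferenceServer) -> bool:
    """S810 to S813: ask the server for help if the queue is still too long.

    Returns True when a notification was sent.
    """
    if queue_length > recheck_threshold:            # S811/S812
        server.on_delay_notification(terminal_id)   # S813
        return True
    return False


server = ConferenceServer()
first = after_recovery("C", queue_length=3, recheck_threshold=5, server=server)
second = after_recovery("C", queue_length=6, recheck_threshold=5, server=server)
```

The second call represents a queue that remains over the threshold even after silent data was discarded, which is the only case in which the server is involved.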

FIG. 9 is an example of a flowchart of delay recovery processing based on the reception state from the client terminals in the conference server according to the embodiment of the present invention. Each step of the flowchart of FIG. 9 is executed by the CPU 201 of the conference server 102. The processing below is performed in the conference server 102 separately for each queue prepared for each client terminal 101. Because the processing described with the flowchart of FIG. 9 concerns audio data that the conference server 102 receives from the client terminal 101, it is the audio delay recovery processing for uplink audio data. The flowchart of FIG. 9 corresponds to the images of FIGS. 5 and 6.

In S901, audio data (more precisely, a communication packet containing audio data) is received from the client terminal 101. Although only one flowchart is shown, a single receiving unit may accept the data and determine from it which of the plurality of client terminals 101 sent it; alternatively, when the connection between the conference server 102 and a client terminal 101 is established, a receiving unit may be created per client terminal 101 as a separate thread and serve as the receiving unit for that specific client terminal 101. In either case, from S902 onward the client terminal 101 is assumed to have been identified by the receiving unit, and the processing is that for the one identified client terminal 101.

In S902, it is checked how many pieces of audio information (the audio data in the packets described above together with the transmitted terminal information 329 described above) are already stored in the queue, in the mixing audio storage unit 326, that corresponds to the one identified client terminal 101 from which the audio data of S901 was received.

In S903, it is determined whether the number of pieces of audio information exceeds the predetermined number given by the uplink delay threshold in the server threshold storage unit 327.

In S904, if the number of pieces of audio information in the queue is determined to exceed the predetermined number given by the uplink delay threshold (YES), the processing proceeds to S905. If it is determined not to exceed it (NO), the processing proceeds to S909.

In S909, the number of pieces of audio information stored in the queue does not exceed the predetermined number given by the uplink delay threshold, that is, it is judged that audio information can still be stored without delay recovery processing; the audio information is therefore stored in the queue (enqueued), and the processing returns to S901 for the next audio reception.

When the processing proceeds to S905, the processing of S905 to S908 is repeated for all audio information stored in the queue. The loop of S905 to S908 is not exited until then.

If the number is determined to exceed the predetermined number given by the uplink delay threshold, one piece of audio information is extracted from the queue in S905 (dequeued, that is, taken out of the queue).

In S906, it is determined whether the audio data contained in the audio information extracted in S905 is “silent data”. In S907, if that audio data is “silent data” (YES), the processing returns to S905. That is, because the dequeued audio information is silent data, it is not returned to the queue (it is deleted).

In S908, because the dequeued audio information is not silent data, it is enqueued (returned to the tail of the queue).

Through the processing of S905 to S908, the audio information in the queue is reduced by the number of silent data items, which has the effect of reducing the audio delay. This completes the description of the flowchart of FIG. 9.
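Putting S901 to S909 together for the per-terminal queues on the server gives roughly the following. The flowchart does not state what happens to the newly received item after a recovery pass, so this sketch assumes it is enqueued afterwards; the `(label, is_silent)` tuple layout, the function name `on_receive`, and the threshold value of 5 are likewise illustrative.

```python
from collections import deque


def on_receive(queues: dict[str, deque], terminal_id: str, label: str,
               is_silent: bool, uplink_threshold: int) -> None:
    """Handle one received item for the queue of `terminal_id`."""
    queue = queues.setdefault(terminal_id, deque())   # one queue per terminal
    if len(queue) > uplink_threshold:                 # S902 to S904
        for _ in range(len(queue)):                   # S905 to S908, one full pass
            item = queue.popleft()                    # S905: dequeue
            if not item[1]:                           # S906/S907: drop silent data
                queue.append(item)                    # S908: keep non-silent data
    # S909; assumed to apply after a recovery pass as well.
    queue.append((label, is_silent))


queues: dict[str, deque] = {}
incoming = [("B1", False), ("B2", True), ("B3", False), ("B4", True),
            ("B5", True), ("B6", True), ("B7", False)]
for label, silent in incoming:
    on_receive(queues, "B", label, silent, uplink_threshold=5)
labels = [label for label, _ in queues["B"]]
```

In this run the queue reaches six entries before B7 arrives, which triggers the recovery pass and leaves the same three entries as the right-hand queue of FIG. 6.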

FIG. 10 is an example of a flowchart of delay recovery processing based on the transmission state to the client terminals according to the embodiment of the present invention. Each step of the flowchart of FIG. 10 is executed by the CPU 201 of the conference server 102. The flowchart of FIG. 10 shows the processing that mixes the audio information (more precisely, the audio data it contains) in the queues of the plurality of client terminals 101 in the mixing audio storage unit 326 of the conference server 102 and transmits the result to a client terminal 101. Because the data is transmitted from the conference server 102 to the client terminal 101, it is downlink audio data. The flowchart corresponds to the queue images of FIG. 7.

The mixed audio data transmitted to a given client terminal 101 does not include that terminal's own audio data. Accordingly, the processing of the flowchart of FIG. 10 is applied, one queue at a time, to each queue in the mixing audio storage unit 326 other than the queue corresponding to the destination client terminal 101.
In S1001, one piece of audio information is extracted from the queue (dequeued).
In S1002, the downlink delay flag, which is turned on in S815 of FIG. 8, is checked.

In S1003, if the downlink delay flag is on (YES), the processing proceeds to S1004, because the client terminal 101 has requested delay recovery processing on the conference server 102 side. If it is off (NO), the processing proceeds to S1007.

In S1002, a threshold may be used instead of checking the downlink delay flag, in the same way as the uplink check (the uplink delay threshold in the server threshold storage unit 327). With this threshold, called the downlink delay threshold, the processing may proceed from S1003 to S1004 (the YES branch) when the number of pieces of audio information in the queue reaches the predetermined number given by the threshold, and to S1007 when it does not.

In S1004, it is checked whether the audio data contained in the audio information dequeued in S1001 is silent data.

In S1005, if the audio data checked in S1004 is “silent data” (YES), the processing proceeds to S1006, takes the next piece of audio information out of the queue, and returns to S1004 to repeat. Through this repetition, the processing advances while ignoring “silent data”.

If, on the other hand, the audio data checked in S1004 is not “silent data” (NO), the processing proceeds to S1007, where the audio data is mixed as described above, and the mixed audio is transmitted to the client terminal 101 in S1008. In S1007, the audio data used for mixing is treated as no longer needed and is not returned to the queue. In practice, however, other client terminals 101 may still need it, so if the audio data has not yet been transmitted to all client terminals (other than the client terminal that sent it), it must be enqueued again. This completes the description of the flowchart of FIG. 10.

The structures and contents of the various data described above are not limited to those shown; needless to say, they may take various structures and contents depending on the application and purpose.

Although one embodiment has been described above, the present invention can be embodied as, for example, a system, an apparatus, a method, a program, or a recording medium. Specifically, it may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

The program of the present invention is a program that enables a computer to execute the processing methods of the flowcharts of FIGS. 8 to 10, and the storage medium of the present invention stores a program that enables a computer to execute the processing methods of the flowcharts of FIGS. 8 to 10. The program of the present invention may be a separate program for the processing method of each apparatus in the flowcharts of FIGS. 8 to 10.

以上のように、前述した実施形態の機能を実現するプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。 As described above, a recording medium that records a program that implements the functions of the above-described embodiments is supplied to a system or apparatus, and a computer (or CPU or MPU) of the system or apparatus stores the program stored in the recording medium. It goes without saying that the object of the present invention can also be achieved by executing the reading.

この場合、記録媒体から読み出されたプログラム自体が本発明の新規な機能を実現することになり、そのプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the program itself read from the recording medium realizes the novel function of the present invention, and the recording medium storing the program constitutes the present invention.

Examples of recording media for supplying the computer program include a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a DVD-ROM, a magnetic tape, a nonvolatile memory card, a ROM, an EEPROM, a silicon disk, and a solid state drive.

It also goes without saying that the present invention covers not only the case where the functions of the above-described embodiment are realized by the computer executing the read program, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program, and the functions of the above-described embodiment are realized by that processing.

It further goes without saying that the present invention covers the case where the program read from the recording medium is written to a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided in the function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiment are realized by that processing.

The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. Needless to say, the present invention is also applicable where it is achieved by supplying a program to a system or an apparatus. In this case, the system or apparatus can enjoy the effects of the present invention by reading a recording medium that stores a program for achieving the present invention. Furthermore, the system or apparatus can also enjoy the effects of the present invention by downloading and reading the program for achieving the present invention from a server, database, or the like on a network by means of a communication program.

Note that all configurations obtained by combining the above-described embodiments and their modifications are also included in the present invention.

101 Client terminal
102 Conference server
103 Network
311 Audio input unit
312 Audio transmission unit
313 Mixing audio reception unit
314 Received audio check unit
315 Audio reproduction unit
316 Received audio storage unit
317 Terminal threshold storage unit
321 Audio reception unit
322 Uplink check unit
323 Mixing audio transmission unit
324 Downlink check unit
325 Mixing unit
326 Mixing audio storage unit
327 Server threshold storage unit
328 Audio data
329 Transmitted terminal information

Claims

An information processing apparatus connectable via a network to client terminals that each transmit input audio data to the information processing apparatus in predetermined units, the information processing apparatus comprising:
mixing audio storage means for storing queues for all client terminals that transmit and receive audio data, each queue corresponding to one client terminal and holding the audio data received from that client terminal together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to;
mixing means for generating, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
transmitted terminal information recording means for recording, in the transmitted terminal information, that the mixing means has used the audio data in the queue to generate the mixing audio data to be transmitted to the client terminal;
downlink delay recovery processing determination means for determining, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when the mixing means performs mixing; and
silent data determination means for determining, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
wherein, when generating the mixing audio data to be transmitted to the client terminal, if the downlink delay recovery processing determination means has determined that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and the silent data determination means determines that unsent audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is silent data, the mixing means does not process that audio data and proceeds to the next audio data.

The information processing apparatus according to claim 1, wherein the predetermined condition in the downlink delay recovery processing determination means is that a downlink delay flag, which indicates whether a notification requesting the downlink audio delay recovery processing on the queues in the mixing audio storage means of the information processing apparatus has been received from the client terminal, is set to a value indicating that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed.

The information processing apparatus according to claim 1, wherein the predetermined condition in the downlink delay recovery processing determination means is that the number of items of audio data in a queue used for generating the mixing audio data exceeds a predetermined range based on a downlink delay threshold, in which case the downlink delay recovery processing determination means determines that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed.

The information processing apparatus according to claim 1, further comprising:
mixing audio registration means for storing the audio data received from the client terminal in the queue of the mixing audio storage means corresponding to that client terminal, in such a manner that the transmitted terminal information indicating which of the other client terminals the audio data has been transmitted to is identifiable; and
server silent data deletion means for deleting from the queue, when the number of items of audio data stored in the queue corresponding to the client terminal is determined to exceed a predetermined range based on an uplink delay threshold stored in server threshold storage means, the silent data in the queue together with the transmitted terminal information corresponding to that silent data.

The information processing apparatus according to any one of claims 1 to 4, wherein the predetermined condition in the silent data determination means is that a result of determining, at the client terminal, whether the audio data is silent data is attached to the audio data as additional information, and the silent data determination means determines whether the audio data is silent data based on the additional information.

An information processing system in which client terminals that each transmit input audio data to an information processing apparatus in predetermined units and the information processing apparatus are connectable via a network, wherein
the information processing apparatus comprises:
mixing audio storage means for storing queues for all client terminals that transmit and receive audio data, each queue corresponding to one client terminal and holding the audio data received from that client terminal together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to;
mixing means for generating, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
transmitted terminal information recording means for recording, in the transmitted terminal information, that the mixing means has used the audio data in the queue to generate the mixing audio data to be transmitted to the client terminal;
mixing audio data transmitting means for transmitting the mixing audio data to the client terminal;
downlink delay recovery processing determination means for determining, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when the mixing means performs mixing; and
silent data determination means for determining, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
the client terminal comprises:
mixing audio data receiving means for receiving the mixing audio data transmitted by the mixing audio data transmitting means;
received audio registration means for storing the mixing audio data received by the mixing audio data receiving means, as reproduction audio data, in a queue of received audio storage means;
reproduction delay recovery processing means for determining, when the number of items of reproduction audio data stored in the queue of the received audio storage means exceeds a predetermined range based on a first reproduction delay threshold stored in terminal threshold storage means, whether the reproduction audio data stored in the queue of the received audio storage means is silent data, and deleting it from the queue if it is silent data; and
downlink audio delay recovery processing request means for notifying the information processing apparatus of a request to perform the downlink audio delay recovery processing on the queues in the mixing audio storage means when, after the reproduction delay recovery processing means has deleted the silent data from the queue, the number of items of reproduction audio data still exceeds a predetermined range based on a second reproduction delay threshold stored in the terminal threshold storage means, and
when generating the mixing audio data to be transmitted to the client terminal, if the downlink delay recovery processing determination means has determined that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and the silent data determination means determines that audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is silent data, the mixing means does not process that audio data and proceeds to the next audio data.

A control method for an information processing apparatus connectable via a network to client terminals that each transmit input audio data to the information processing apparatus in predetermined units, the control method comprising:
a mixing audio registration step in which mixing audio registration means registers the audio data received from a client terminal, together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to, in the queue corresponding to that client terminal in mixing audio storage means that stores queues for all client terminals that transmit and receive audio data;
a mixing step in which mixing means generates, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
a transmitted terminal information recording step in which transmitted terminal information recording means records, in the transmitted terminal information, that the audio data in the queue has been used in the mixing step to generate the mixing audio data to be transmitted to the client terminal;
a downlink delay recovery processing determination step in which downlink delay recovery processing determination means determines, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when mixing is performed in the mixing step; and
a silent data determination step in which silent data determination means determines, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
wherein, in the mixing step, when generating the mixing audio data to be transmitted to the client terminal, if it has been determined in the downlink delay recovery processing determination step that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and unsent audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is determined in the silent data determination step to be silent data, that audio data is not processed and processing proceeds to the next audio data.

A program executable by an information processing apparatus connectable via a network to client terminals that each transmit input audio data to the information processing apparatus in predetermined units, the program causing the information processing apparatus to function as:
mixing audio registration means for registering the audio data received from a client terminal, together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to, in the queue corresponding to that client terminal in mixing audio storage means that stores queues for all client terminals that transmit and receive audio data;
mixing means for generating, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
transmitted terminal information recording means for recording, in the transmitted terminal information, that the mixing means has used the audio data in the queue to generate the mixing audio data to be transmitted to the client terminal;
downlink delay recovery processing determination means for determining, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when the mixing means performs mixing; and
silent data determination means for determining, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
wherein, when generating the mixing audio data to be transmitted to the client terminal, if the downlink delay recovery processing determination means has determined that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and the silent data determination means determines that unsent audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is silent data, the mixing means does not process that audio data and proceeds to the next audio data.

A control method for an information processing system in which client terminals that each transmit input audio data to an information processing apparatus in predetermined units and the information processing apparatus are connectable via a network, wherein
the control method comprises, in the information processing apparatus:
a mixing audio registration step in which mixing audio registration means registers the audio data received from a client terminal, together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to, in the queue corresponding to that client terminal in mixing audio storage means that stores queues for all client terminals that transmit and receive audio data;
a mixing step in which mixing means generates, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
a transmitted terminal information recording step in which transmitted terminal information recording means records, in the transmitted terminal information, that the audio data in the queue has been used in the mixing step to generate the mixing audio data to be transmitted to the client terminal;
a mixing audio data transmitting step in which mixing audio data transmitting means transmits the mixing audio data to the client terminal;
a downlink delay recovery processing determination step in which downlink delay recovery processing determination means determines, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when mixing is performed in the mixing step; and
a silent data determination step in which silent data determination means determines, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
the control method comprises, in the client terminal:
a mixing audio data receiving step in which mixing audio data receiving means receives the mixing audio data transmitted in the mixing audio data transmitting step;
a received audio registration step in which received audio registration means stores the mixing audio data received in the mixing audio data receiving step, as reproduction audio data, in a queue of received audio storage means;
a reproduction delay recovery processing step in which reproduction delay recovery processing means determines, when the number of items of reproduction audio data stored in the queue of the received audio storage means exceeds a predetermined range based on a first reproduction delay threshold stored in terminal threshold storage means, whether the reproduction audio data stored in the queue of the received audio storage means is silent data, and deletes it from the queue if it is silent data; and
a downlink audio delay recovery processing request step in which downlink audio delay recovery processing request means notifies the information processing apparatus of a request to perform the downlink audio delay recovery processing on the queues in the mixing audio storage means when, after the silent data has been deleted from the queue in the reproduction delay recovery processing step, the number of items of reproduction audio data still exceeds a predetermined range based on a second reproduction delay threshold stored in the terminal threshold storage means, and
in the mixing step, when generating the mixing audio data to be transmitted to the client terminal, if it has been determined in the downlink delay recovery processing determination step that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and unsent audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is determined in the silent data determination step to be silent data, that audio data is not processed and processing proceeds to the next audio data.

A program executable in an information processing system in which client terminals that each transmit input audio data to an information processing apparatus in predetermined units and the information processing apparatus are connectable via a network, wherein
the program causes the information processing apparatus to function as:
mixing audio registration means for registering the audio data received from a client terminal, together with transmitted terminal information indicating which of the other client terminals the audio data has already been transmitted to, in the queue corresponding to that client terminal in mixing audio storage means that stores queues for all client terminals that transmit and receive audio data;
mixing means for generating, as mixing audio data to be transmitted to a client terminal, audio data from those queues stored in the mixing audio storage means that correspond to client terminals other than the client terminal to which the audio data is to be transmitted;
transmitted terminal information recording means for recording, in the transmitted terminal information, that the mixing means has used the audio data in the queue to generate the mixing audio data to be transmitted to the client terminal;
mixing audio data transmitting means for transmitting the mixing audio data to the client terminal;
downlink delay recovery processing determination means for determining, based on a predetermined condition, whether to perform downlink audio delay recovery processing on the queues in the mixing audio storage means when the mixing means performs mixing; and
silent data determination means for determining, based on a predetermined condition, whether audio data is silent data regarded as containing no voice,
the program causes the client terminal to function as:
mixing audio data receiving means for receiving the mixing audio data transmitted by the mixing audio data transmitting means;
received audio registration means for storing the mixing audio data received by the mixing audio data receiving means, as reproduction audio data, in a queue of received audio storage means;
reproduction delay recovery processing means for determining, when the number of items of reproduction audio data stored in the queue of the received audio storage means exceeds a predetermined range based on a first reproduction delay threshold stored in terminal threshold storage means, whether the reproduction audio data stored in the queue of the received audio storage means is silent data, and deleting it from the queue if it is silent data; and
downlink audio delay recovery processing request means for notifying the information processing apparatus of a request to perform the downlink audio delay recovery processing on the queues in the mixing audio storage means when, after the reproduction delay recovery processing means has deleted the silent data from the queue, the number of items of reproduction audio data still exceeds a predetermined range based on a second reproduction delay threshold stored in the terminal threshold storage means, and
when generating the mixing audio data to be transmitted to the client terminal, if the downlink delay recovery processing determination means has determined that the downlink audio delay recovery processing on the queues in the mixing audio storage means is to be performed, and the silent data determination means determines that audio data taken out of a queue corresponding to another client terminal in order to generate the mixing audio data is silent data, the mixing means does not process that audio data and proceeds to the next audio data.