JP2008141348A

JP2008141348A - Communication apparatus

Info

Publication number: JP2008141348A
Application number: JP2006323926A
Authority: JP
Inventors: Tadao Furukawa; 忠雄古川
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-11-30
Filing date: 2006-11-30
Publication date: 2008-06-19

Abstract

PROBLEM TO BE SOLVED: To appropriately inform participants in remote communication, such as a remote conference, whose speech they are hearing from the loudspeaker. SOLUTION: A conference participant presses a speech button when he or she speaks. The audio conference terminal on which the speech button was pressed transmits speaker identification information, including the IP address of that terminal, to a server device 20. When the server device 20 receives the speaker identification information, a speaker identification unit 23 reads from a management table the notification voice data associated with the IP address and the mode in which that data is to be added. A notification voice adding unit 24 mixes the notification voice associated with the IP address contained in the speaker identification information with the participant's speech, and transmits the result to each audio conference terminal. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a technology for supporting remote communication performed via a network.

There is a technology that enables a plurality of persons at mutually remote locations to hold a teleconference by voice using audio devices connected to a communication line. In this technology, the voice signals of speakers picked up by the audio devices placed at the different locations are transmitted to a central device, mixed there, and then transmitted to each audio device. Patent Document 1, for example, discloses such a technology.
Patent Document 1: JP 2005-269347 A

In a teleconference with many participants, however, it is difficult to tell immediately and accurately who is speaking from the vocal characteristics (timbre, pitch, and so on) of the voice heard from the loudspeaker alone.
The present invention has been made in view of this problem. Its object is to appropriately notify participants in remote communication, such as a teleconference, whose speech they are hearing from the loudspeaker.

A first embodiment of the communication apparatus according to the present invention comprises: receiving means for receiving, from each of a plurality of communication terminals connected via a communication network, an audio signal and identification information that uniquely identifies that communication terminal; storage means for storing notification signals in association with the identification information; mixing output means for mixing the audio signals from the communication terminals and outputting the result to each communication terminal, the mixing being performed as a plurality of mixing channels, one for each communication terminal as output destination, with each channel excluding from its mix the audio signal that was output by that channel's own terminal; and notification signal adding means for reading from the storage means the notification signal associated with the identification information received by the receiving means, and adding the read notification signal so that it is included in the signal mixed by the mixing output means.

A second embodiment of the communication apparatus according to the present invention is the communication apparatus according to claim 1, wherein the receiving means further receives a notification signal addition instruction from the communication terminal, and the notification signal adding means adds the notification voice to the audio signal only when the receiving means has received the notification signal addition instruction.

A third embodiment of the communication apparatus according to the present invention is the communication apparatus according to claim 1 or 2, wherein the notification signal adding means does not add a notification signal to the mixed signal of the channel whose output destination is the communication terminal that is the source corresponding to that notification signal.

A fourth embodiment of the communication apparatus according to the present invention is the communication apparatus according to any one of claims 1 to 3, wherein the notification signal adding means detects the start and the end of output of the audio signal to which the notification signal is to be added, and adds the notification signal before the start of output or after the end of output.

A fifth embodiment of the communication apparatus according to the present invention is the communication apparatus according to claim 4, further comprising addition position storage means for storing, in association with each piece of identification information, addition position designation information that specifies whether the notification signal is to be added before the start of output or after the end of output, wherein the notification signal adding means reads the addition position designation information stored in the addition position storage means according to the identification information and, according to the information read, controls for each communication terminal the position at which the notification signal is added.

A sixth embodiment of the communication apparatus according to the present invention is the communication apparatus according to any one of claims 1 to 5, wherein the storage means further stores, in association with the identification information, a flag indicating whether the notification signal is to be added, and the notification signal adding means decides whether to add the notification signal by referring to the flag associated with the identification information received by the receiving means.

A seventh embodiment of the communication apparatus according to the present invention is the communication apparatus according to any one of claims 1 to 6, wherein the audio signal is a packet signal and the identification information is included in the packet.

An eighth embodiment of the communication apparatus according to the present invention is the communication apparatus according to any one of claims 1 to 7, wherein the notification signal is a voice signal.

According to the present invention, participants in remote communication, such as a teleconference, can be appropriately notified whose speech they are hearing from the loudspeaker.

The best mode for carrying out the present invention is described below with reference to the drawings.
(A. Configuration)
FIG. 1 is a diagram showing the overall configuration of an audio conference system 1 according to the present invention. The audio conference system 1 comprises audio conference terminals 10A, 10B, and 10C, a server device 20, and a network 30 that connects these communication devices.

The audio conference terminals 10A, 10B, and 10C have the same configuration and functions. In the following description, they are collectively referred to as the “audio conference terminal 10” when there is no need to distinguish among them.
Although three audio conference terminals 10 (10A, 10B, and 10C) are connected to the network 30 here, two, or four or more, audio conference terminals 10 may be connected instead. In short, it suffices that a plurality of audio conference terminals 10 are connected to the network 30.

The network 30 is, for example, the Internet, and mediates data communication performed between the audio conference terminals 10 according to predetermined communication protocols. In this embodiment, Real-time Transport Protocol (hereinafter, “RTP”) is used as the application-layer protocol, UDP as the transport-layer protocol, and IP as the network-layer protocol.
The audio conference terminals 10A, 10B, and 10C are assigned the IP addresses “199.199.1.1”, “199.199.1.2”, and “199.199.1.3”, respectively, by which they are uniquely identified on the network.
Because UDP and IP are widely used communication protocols, their description is omitted; RTP is described below.

RTP is a communication protocol for providing a communication service that transmits and receives audio and video data end-to-end in real time; its details are specified in RFC 1889. In RTP, communication terminals exchange data by generating, transmitting, and receiving RTP packets.

As shown in FIG. 2, an RTP packet consists of a header part and a payload part, like a packet, which is the data transfer unit in IP, or a segment, which is the data transfer unit in TCP. Three kinds of data are written in the header part: a time stamp, a source identifier, and a destination identifier. The time stamp is data indicating a time (the time elapsed since the start of voice communication was instructed). The payload part contains audio data for a predetermined duration (20 milliseconds in this embodiment).
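The packet layout described above can be sketched as follows. This is a minimal illustration only: the field widths, the byte order, and the names `RtpLikePacket`, `pack`, and `unpack` are assumptions made for the example, and the layout is not the actual RFC 1889 wire format.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-size header: three unsigned 32-bit integers in network
# byte order. This is NOT the RFC 1889 wire format.
_HEADER = struct.Struct("!III")


@dataclass
class RtpLikePacket:
    timestamp_ms: int  # time elapsed since voice communication was started
    source_id: int     # identifier of the sending terminal (assumed 32-bit)
    dest_id: int       # identifier of the receiving terminal (assumed 32-bit)
    payload: bytes     # 20 ms worth of audio data

    def pack(self) -> bytes:
        """Serialize the three header fields followed by the payload."""
        header = _HEADER.pack(self.timestamp_ms, self.source_id, self.dest_id)
        return header + self.payload

    @classmethod
    def unpack(cls, data: bytes) -> "RtpLikePacket":
        """Split a byte string back into header fields and payload."""
        timestamp_ms, source_id, dest_id = _HEADER.unpack_from(data)
        return cls(timestamp_ms, source_id, dest_id, data[_HEADER.size:])
```

A packet built this way survives a round trip through `pack` and `unpack`, which is all that the header-plus-payload description requires.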

The hardware configuration of the audio conference terminal 10 is described next. FIG. 3 is a block diagram showing the hardware configuration of the audio conference terminal 10.
The control unit 101 shown in the figure is, for example, a CPU (Central Processing Unit). By executing the various programs stored in a ROM 103a, described later, it performs the operations characteristic of the present invention and controls the operation of each part of the audio conference terminal 10.
The communication IF unit 102 is, for example, a NIC (Network Interface Card) and is connected to the network 30 by wire. The communication IF unit 102 sends to the network 30 the IP packets obtained by successively encapsulating the RTP packets delivered from the control unit 101 according to the lower-layer communication protocols. Encapsulation here means generating a UDP segment whose payload part contains the RTP packet, and then generating an IP packet whose payload part contains that UDP segment. The communication IF unit 102 also receives data via the network 30, reverses the encapsulation on each received IP packet to extract the RTP packet it carries, and outputs the RTP packet to the control unit 101.

The storage unit 103 has a ROM (Read Only Memory) 103a and a RAM (Random Access Memory) 103b. The ROM 103a stores control programs that cause the control unit 101 to execute the functions characteristic of the present invention. One example of such a control program is a codec (registered trademark), which is software that compresses and decompresses data. The codec compresses audio data at a predetermined compression rate and transmits and receives data at a predetermined bit rate. The RAM 103b stores audio data received from a microphone 106b, described later, and is used as a work area by the control unit 101 operating according to the various programs.

The operation unit 104 has a keyboard, a mouse, and a speech button 104a. When a conference participant makes an input with the keyboard or the mouse, data representing the operation is conveyed to the control unit 101. When the speech button 104a is pressed by a participant, data indicating that it was pressed is conveyed to the control unit 101.
The display unit 105 is, for example, a monitor. Under the control of the control unit 101, it displays on its screen various data held by the audio conference terminal 10 or received via the network 30.

The audio input unit 106 has an analog/digital (hereinafter, “A/D”) converter 106a and a microphone 106b. The microphone 106b picks up a participant's voice, generates an analog signal representing it (hereinafter, “audio signal”), and outputs the signal to the A/D converter 106a. The A/D converter 106a applies A/D conversion to the audio signal received from the microphone 106b and outputs the resulting digital data (hereinafter, “audio data”) to the control unit 101.

The audio output unit 107 reproduces sound according to the audio data received from the control unit 101, and has a D/A converter 107a and a speaker 107b. The D/A converter 107a applies D/A conversion to the audio data received from the control unit 101, thereby converting it into an analog audio signal, and outputs the signal to the speaker 107b. The speaker 107b emits the sound represented by the audio signal received from the D/A converter 107a. This completes the description of the hardware configuration of the audio conference terminal 10.

As described above, the hardware configuration of the audio conference terminal 10 according to this embodiment is the same as that of a general computer device; the functions characteristic of the audio conference terminal 10 according to the present invention are realized by the software modules described below.

The audio conference terminal 10 has a function of detecting the volume level of the audio data representing the sound picked up by the microphone 106b, selecting only the portions that contain speech, and outputting them to the network 30. FIG. 4 is a block diagram showing the flow of audio data generation and output performed by the audio conference terminal 10.

The audio input unit 106 picks up sound continuously during the conference and generates audio data. The generated audio data is first written to an audio data buffer area in the RAM 103b.
Although the audio data is buffered in this way before the processing described below is applied, the signal delay introduced by that processing is very small and does not affect the real-time nature of the conversation.

The volume level detection unit 110 reads the audio data written in the RAM 103b in frames of a predetermined size (20 milliseconds of audio in this embodiment) and performs volume level detection on each frame. A frame containing any period in which the volume level exceeds a predetermined threshold is treated as a voiced frame, and a frame in which the volume level never exceeds the threshold is treated as a silent frame. The periods corresponding to these frames are hereinafter called voiced periods and silent periods, respectively.
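The frame classification just described might look like the following sketch. It assumes that the audio data is a sequence of signed integer PCM samples and that the "volume level" is the absolute sample value; the source specifies neither, and the names `split_frames` and `is_voiced` are hypothetical.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[int], sample_rate_hz: int,
                 frame_ms: int = 20) -> List[Sequence[int]]:
    """Cut a sample sequence into consecutive frames of frame_ms milliseconds."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


def is_voiced(frame: Sequence[int], threshold: int) -> bool:
    """A frame is voiced if the level exceeds the threshold at any point in it.

    Assumption: the level of a sample is its absolute value.
    """
    return any(abs(sample) > threshold for sample in frame)
```

For example, at an assumed sampling rate of 8000 Hz each 20 ms frame holds 160 samples, and a single sample above the threshold is enough to mark its whole frame as voiced.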

Voiced frames undergo voiced compression processing. That is, the frame selection unit 111 passes each voiced frame from the RAM 103b to the audio data compression unit 112, which compresses the frame's audio data with the codec at a predetermined compression rate. An RTP header is attached to the compressed audio data as shown in FIG. 2 to form an RTP packet, which is output to the communication IF unit 102.

Silent frames, on the other hand, undergo silence compression processing. That is, the frame selection unit 111 does not pass silent frames to the audio data compression unit 112, so no RTP packets are generated during silent periods.
A silent frame contains, for example, the murmur of the conference room (background noise), and it is generally known that the amount of data in such a frame is not small even though its volume level is very low. By “thinning out” silent frames, which contain no speech that the participants need, silence compression processing reduces the amount of transmitted data without losing necessary information.
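As a rough sketch of the "thinning out" performed by the frame selection unit 111, the function below keeps only voiced frames and pairs each with the time offset of its position in the stream. The function name, the absolute-value level measure, and the choice to let the time stamp keep advancing across dropped frames are assumptions for illustration; codec compression is omitted.

```python
from typing import List, Sequence, Tuple


def select_voiced_frames(frames: Sequence[Sequence[int]], threshold: int,
                         frame_ms: int = 20) -> List[Tuple[int, Sequence[int]]]:
    """Return (timestamp_ms, frame) pairs for voiced frames only.

    Silent frames are skipped, so nothing would be packetized for them.
    The timestamp is derived from the frame index, so it still reflects
    the silent periods that were dropped (an assumption of this sketch).
    """
    selected = []
    for index, frame in enumerate(frames):
        if any(abs(sample) > threshold for sample in frame):
            selected.append((index * frame_ms, frame))
    return selected
```

With four input frames of which the second and fourth are voiced, only two entries come back, stamped 20 ms and 60 ms, so half of the frames never reach the network.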

The communication IF unit 102 receives the RTP packets generated as described above, generates IP packets by successively passing the received RTP packets to the lower-layer communication protocols, and sends the IP packets to the network 30.

When the audio conference terminal 10 receives, via the network 30, audio data representing the voice of another participant from another audio conference terminal 10, the D/A converter 107a converts the received audio data into an audio signal, and the speaker 107b emits the sound represented by that signal.

Next, the configuration of the server device 20 is described. FIG. 5 is a block diagram showing the hardware configuration of the server device 20. The control unit 201, communication IF unit 202, storage unit 203, operation unit 204, and display unit 205 have the same functions as the control unit 101, communication IF unit 102, storage unit 103, operation unit 104, and display unit 105 of the audio conference terminal 10, respectively, so their description is omitted. The ROM 203a stores an identification information management table, described later, a plurality of pieces of notification voice data, and a program that causes the server device 20 to perform its characteristic operations.

The identification information management table stored in the ROM 203a is described below. FIG. 6 shows an example of the identification information management table. In this table, the following items are written in association with the IP address assigned to each audio conference terminal 10: the name of the participant who uses that terminal, whether notification voice insertion is permitted, the notification voice insertion timing, and the notification voice data.

The server device 20 performs “mixing processing”, in which it receives audio data from the plurality of audio conference terminals 10, combines the audio data, and transmits the result back to each audio conference terminal 10. When performing the mixing processing, it also performs “speaker notification processing”, in which it adds to an utterance a notification voice indicating which audio conference terminal 10 the mixed audio was received from, that is, which participant is speaking. The identification information management table defines, for each participant, whether speaker notification processing is applied and, if so, how.

In the identification information management table, the notification voice insertion permission item records whether the speaker notification processing is performed. In this example, “permitted” is written for all participants. The notification voice insertion timing item records when the notification voice is added. For a participant with “before” written in this item, the notification voice is mixed in before the mixed audio data is reproduced; for a participant with “after”, it is mixed in after the mixed audio data has been reproduced. The notification voice data item records the file name of the audio data representing the voice to be added as the notification voice.
Notification voice data 1, 2, and 3 represent the narrations “That was Participant A speaking.”, “This is Participant B speaking.”, and “This is Participant C speaking.”, respectively.
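One way to picture the identification information management table is as a mapping keyed by IP address, as in the sketch below. The IP addresses, the "permitted" setting for everyone, and the "after" timing for participant A come from the text; the "before" timing for participants B and C is only inferred from the tense of their narrations, and the file names and the names `TableEntry`, `ID_TABLE`, and `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableEntry:
    participant: str        # name of the participant using the terminal
    insert_allowed: bool    # whether notification voice insertion is permitted
    timing: str             # "before" or "after" the utterance
    announcement_file: str  # file name of the notification voice data (made up)


# Keyed by the IP address assigned to each audio conference terminal.
ID_TABLE: Dict[str, TableEntry] = {
    "199.199.1.1": TableEntry("A", True, "after", "announce1.wav"),
    "199.199.1.2": TableEntry("B", True, "before", "announce2.wav"),  # timing assumed
    "199.199.1.3": TableEntry("C", True, "before", "announce3.wav"),  # timing assumed
}


def lookup(ip_address: str) -> Optional[TableEntry]:
    """Return the entry for a terminal, or None if the address is unknown."""
    return ID_TABLE.get(ip_address)
```

Keying on the IP address mirrors the way the server is described as locating a participant's settings from the address carried in the speaker identification information.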

(B. Operation)
Next, the operations performed by each device in a conference held using the audio conference system 1 are described. The ROM 203a of the server device 20 stores the identification information management table with the contents shown in FIG. 6 and the notification voice data 1 to 3 described above. The audio conference terminals 10A, 10B, and 10C are assumed to be used by participants A, B, and C, respectively.

In the following description, where it is necessary to indicate which of the audio conference terminals 10A, 10B, and 10C a component belongs to, the corresponding letter is appended to its reference numeral; for example, the control unit of the audio conference terminal 10A is written as control unit 101A. Likewise, the audio data output by the audio conference terminals 10A, 10B, and 10C is referred to as audio data A, B, and C, respectively.

The following describes the case in which participant A, who uses the audio conference terminal 10A, speaks. When several participants speak concurrently, the operations below are performed in parallel.
Participant A presses the speech button 104aA before speaking. The operation unit 104A outputs to the control unit 101A a signal indicating that the speech button 104aA has been pressed. On receiving this signal, the control unit 101A sends to the network 30 “speaker identification information”, which contains the IP address assigned to its own terminal and indicates that the speech button 104aA has been pressed.

FIG. 7 is a flowchart showing the flow of processing performed by the server device 20. FIG. 8 is a diagram showing the process by which the server device 20 mixes the audio data received from the audio conference terminals 10. When the server device 20 receives the speaker identification information from the audio conference terminal 10A (step SA100; “Yes”), the control unit 201 writes the IP address of the audio conference terminal 10A contained in that information to the RAM 203b. The speaker identification unit 23 reads the IP address written in the RAM 203b (step SA110), reads from the identification information management table in the ROM 203a the data of each item associated with that IP address, and outputs the data to the notification voice adding unit 24. If the determination result in step SA110 is “No”, step SA100 is repeated.

The notification voice adding unit 24 determines from the notification voice insertion permission item whether to insert a notification voice for participant A's utterance (step SA120). If the item is “permitted” (step SA120; “Yes”), the notification voice adding unit 24 performs the processing from step SA130 onward. If the determination result in step SA120 is “No”, the notification voice adding unit 24 ends the processing. In this example, insertion of the notification voice is “permitted”, so the determination in step SA120 is “Yes”.

In the following step SA130, the notification voice adding unit 24 determines the notification voice insertion timing. If the timing is “before”, the control unit 101A performs the processing from step SA180 onward. If the timing is “after”, the control unit 101A performs the processing from step SA140 onward. In this operation example, the notification voice insertion timing item for participant A is “after”, so the processing from step SA140 onward is performed.
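The branching in steps SA110 to SA130 can be summarized in a small decision function. This is a sketch under assumptions: the table is reduced to a mapping from IP address to a (permitted, timing) pair, and the function name `next_step` and the return label "END" are invented for the example; only the step labels SA140 and SA180 come from the text.

```python
from typing import Mapping, Tuple


def next_step(table: Mapping[str, Tuple[bool, str]], ip_address: str) -> str:
    """Decide which step follows once speaker identification info has arrived.

    table maps an IP address to (insert_allowed, timing), a simplified
    stand-in for the identification information management table.
    """
    insert_allowed, timing = table[ip_address]  # SA110: read the table entry
    if not insert_allowed:                      # SA120 is "No": nothing to add
        return "END"
    if timing == "before":                      # SA130: announce before the speech
        return "SA180"
    if timing == "after":                       # SA130: announce after the speech
        return "SA140"
    raise ValueError(f"unknown insertion timing: {timing!r}")
```

For participant A in the operation example (permitted, "after"), the function returns "SA140", matching the path the text follows.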

Participants for whom “after” is written in the “notification voice insertion timing” item of the identification information management table have been told in advance to press the speech button 104a before speaking and then begin speaking straight away. Participant A therefore speaks immediately after pressing the speech button 104aA, and the server device 20 receives from the audio conference terminal 10A, following the speaker identification information, audio data representing the content of participant A's utterance.

In step SA140, the control unit 201 mixes the voice data A received from the audio conference terminal 10A. In FIG. 8, the voice data synthesis units 22A, 22B, and 22C have the same configuration. In the following description, they are collectively referred to as the voice data synthesis unit 22 when there is no need to distinguish between them.

First, the communication IF unit 202 receives the voice data output from the audio conference terminals 10. The received voice data is temporarily written to the voice data buffer area of the RAM 203b.
The circuit designating unit 21 sequentially reads the voice data written in the RAM 203b and assigns and outputs it to the voice data synthesis units 22 as follows. Voice data A, received from the audio conference terminal 10A, is output to the voice data synthesis units 22B and 22C; voice data B is output to the voice data synthesis units 22A and 22C; and voice data C is output to the voice data synthesis units 22A and 22B.

The voice data synthesis unit 22A combines the received voice data B and C and outputs the resulting voice data to the communication IF unit 202. Likewise, the voice data synthesis units 22B and 22C each combine the voice data they receive and output the result to the communication IF unit 202. The communication IF unit 202 then transmits the voice data received from the voice data synthesis units 22A, 22B, and 22C to the audio conference terminals 10A, 10B, and 10C, respectively, via the network 30.
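The routing performed by the circuit designating unit 21 and the voice data synthesis units 22 is what is often called a mix-minus: each terminal receives the sum of every signal except its own. The following Python sketch shows the idea under stated assumptions: the function name `mix_minus`, the representation of a frame as a list of integer samples, and the clipping to a 16-bit range are illustrative choices that the embodiment does not specify.

```python
def mix_minus(frames):
    """Build one output frame per terminal, excluding that terminal's own audio.

    frames maps a terminal name to a list of integer samples; all lists are
    assumed to have the same length (one frame per terminal).
    """
    outputs = {}
    for destination in frames:
        length = len(frames[destination])
        mixed = [0] * length
        for source, samples in frames.items():
            if source == destination:
                continue  # do not send a terminal its own voice data
            for i, sample in enumerate(samples):
                mixed[i] += sample
        # Assumption: samples are 16-bit PCM, so clip the sum to that range.
        outputs[destination] = [max(-32768, min(32767, v)) for v in mixed]
    return outputs


frames = {"10A": [1, 2], "10B": [10, 20], "10C": [100, 200]}
mixed = mix_minus(frames)
```

Here the frame for terminal 10A contains only the contributions of 10B and 10C, which corresponds to the voice data synthesis unit 22A combining voice data B and C.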

Through the above processing, voice data representing participant A's speech, for example, is output to the audio conference terminals 10B and 10C and played back to participants B and C. Each participant can thus make his or her speech heard by the other participants and can hear the other participants' speech.
In step SA150, the control unit 201 determines whether the speaker's speech is continuing. While the speech continues, the result of step SA150 is “No” and step SA150 is repeated; once the speech ends, the result becomes “Yes” and the control unit 201 performs the processing from step SA160 onward.

In step SA160, the notification voice adding unit 24 reads notification voice data 1, specified in the identification information management table, from the ROM 203a. In the following step SA170, the notification voice adding unit 24 assigns and outputs the notification voice data to the voice data synthesis units 22A, 22B, and 22C.
The voice data synthesis units 22 output the voice data received from the notification voice adding unit 24 to the communication IF unit 202. The communication IF unit 202 outputs the voice data received from the voice data synthesis units 22A, 22B, and 22C to the audio conference terminals 10A, 10B, and 10C, respectively.

When the audio conference terminals 10B and 10C sequentially receive the notification voice data and the voice data via the network 30, the D/A converter 107a converts both to analog audio signals, and the speaker 107b emits the sound that those signals represent.
As a result, at the audio conference terminals 10B and 10C, participant A's speech is played back and then the notification voice “That was participant A speaking.” is played. At the audio conference terminal 10A, the notification voice is played after participant A finishes speaking. The other participants can thus learn who spoke after hearing the speech.

Next, the case where participant B speaks is described. Steps SA100 and SA110 are the same as for participant A, so the operation from step SA120 onward is described.
In step SA130, the notification voice adding unit 24 determines the notification voice insertion timing. Here, the insertion timing item is “before”, so the processing from step SA180 onward is performed.

In the identification information management table described above, participants whose “notification voice insertion timing” item is set to “before” have been told in advance to press the speech button 104a and to start speaking only after the notification voice has finished playing. Participant B therefore does not speak immediately after pressing the speech button 104aB, but waits for playback of the notification voice (that is, steps SA180 and SA190) to finish.

In step SA180, the notification voice adding unit 24 reads the notification voice data specified in the identification information management table from the ROM 203a. In the following step SA190, the notification voice adding unit 24 outputs the notification voice data to the voice data synthesis units 22A, 22B, and 22C. Like the voice data described above, the notification voice data is output to each audio conference terminal 10 via the communication IF unit 202, and the notification voice “This is participant B speaking.” is emitted to the participants.

Participant B starts speaking after hearing the notification voice to the end. The voice data representing the speech is mixed in the same way as participant A's voice data and played back at each audio conference terminal 10. If the server device 20 receives voice data from an audio conference terminal 10 while it is performing steps SA180 and SA190, it discards that voice data and applies the mixing of step SA200 only to voice data received after step SA190 has finished.
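The discard rule for the “before” timing can be pictured as a gate that stays closed while the notification voice is being sent. The Python sketch below is only an illustration: the class name `NotificationGate`, the use of numeric timestamps in seconds, and the assumption that the duration of the notification voice is known in advance are all hypothetical and not specified by the embodiment.

```python
class NotificationGate:
    """Decides whether incoming voice data is mixed or discarded."""

    def __init__(self, notification_duration):
        self.notification_duration = notification_duration
        self.ends_at = None  # no notification has been started yet

    def start_notification(self, now):
        # Corresponds to entering steps SA180 and SA190.
        self.ends_at = now + self.notification_duration

    def accept(self, arrival_time):
        """Return True if voice data arriving at arrival_time should be mixed."""
        if self.ends_at is None:
            return True
        # Voice data arriving before the notification has finished is discarded.
        return arrival_time >= self.ends_at


gate = NotificationGate(notification_duration=2.0)
gate.start_notification(now=10.0)
early = gate.accept(10.5)   # arrives during the notification
on_time = gate.accept(12.0) # arrives once the notification has finished
```

In this sketch `early` is False and `on_time` is True, so only speech that begins after the notification voice has ended reaches the mixing of step SA200.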

Through the above processing, at the audio conference terminals 10A and 10C, participant B's speech can be heard after the notification voice “This is participant B speaking.” has been played. Each participant can therefore listen to the speech knowing who the speaker is. Moreover, even when a participant suddenly speaks after the conversation has paused for a while, the notification voice first draws the other participants' attention to the speech that is about to be made.

(C. Modifications)
An embodiment of the present invention has been described above, but the present invention can also be implemented in the various forms described below.

(1) In the above embodiment the network 30 is the Internet, but it may instead be a LAN (Local Area Network) or the like. In short, it need only mediate data communication performed between communication devices according to a predetermined communication protocol. Also, although the audio conference terminals 10 and the server device 20 are described as being connected to the network 30 by wire, the network 30 may of course be a wireless packet communication network such as a wireless LAN, with the audio conference terminals 10 and the server device 20 connected to it.

(2) In the above embodiment, the functions characteristic of the conference terminal 10 and the server device 20 are implemented as software modules, but the communication apparatus according to the present invention may instead be built by combining hardware modules that perform those functions.

(3) The above embodiment describes a conference conducted with voice data, but the data is not limited to voice data; other kinds of data, such as video data, may be transmitted together with it.

(4) In the above embodiment, the program that implements the functions characteristic of the communication apparatus according to the present invention is written in advance to the ROM 103a or 203a. The control program may instead be recorded and distributed on a computer-readable recording medium such as a CD-ROM or DVD, or of course distributed by download over a telecommunication line such as the Internet.

(5) The above embodiment uses RTP as the application-layer communication protocol for transmitting and receiving voice data, but other communication protocols may of course be used. In short, any communication protocol will do as long as it writes voice data, a predetermined duration at a time, into the payload portion of a data block that has a predetermined header portion and a payload portion, and transmits that block.

(6) In the above embodiment, speaker notification is triggered by pressing the speech button, but the trigger need not be a button press. Other input means, such as a touch panel, may be used.

(7) The above embodiment notifies participants of the speaker by voice, but the notification means is not limited to voice. For example, portrait data showing each speaker's portrait may be written in advance to the storage unit 203 and associated with the IP address in the identification information management table; then, instead of emitting a notification voice, the portrait data may be displayed on the display unit 105 to indicate the speaker.
In another form, each audio conference terminal 10 may be provided in advance with an indicator such as an LED, and the indicator may be driven (the LED lit) at the same timing at which the notification voice described above would be emitted. The notification voice and such an indicator may also be controlled in parallel.
The portrait or LED may be shown only before or after the speech, as in the above embodiment, or shown continuously while the participant is speaking.

(8) In the above embodiment, whether to insert the notification voice and the insertion timing are set for each participant, but the same settings may be applied to all participants.

(9) The above embodiment uses spoken phrases such as those given as examples for the notification voice, but the notification voice is not limited to them. Any sound may be used, such as a beep or a particular piece of music, as long as it is assigned to each participant and allows the speaker to be identified.

(10) In the above embodiment, the notification voice is added either before or after the speech, but it may be added both before and after the speech.

(11) In the above embodiment, a notification voice is associated with each participant, but a common notification voice may instead be associated with, for example, each group to which participants belong.

(12) In the above embodiment, the server device 20 adds the notification voice based on the speaker identification information generated when the speech button 104a is pressed. Alternatively, the server device 20 may automatically detect that a participant has spoken and add the notification voice in response to that detection (hereinafter, the automatic mode). The operation of the server device 20 in that case is described below. Only step SA100 in FIG. 7 differs from the above embodiment, so only step SA100 is described. It is assumed that each participant has been told in advance that the automatic mode is selected.
In the automatic mode, a participant starts speaking without pressing the speech button 104a. When participant A speaks, the audio conference terminal 10A transmits voice data representing the speech to the server device 20. The server device 20 receives the voice data, and the control unit 201 writes the voice data carried in the packets to the RAM 203b. As described above, the audio conference terminal 10 applies silence compression to the voice data, so the server device 20 receives no packets until just before the participant starts speaking. When the control unit 201 has received no packet from a given audio conference terminal 10 for longer than a predetermined time and then receives voice data from that terminal, it writes the source IP address carried in the voice data packet to the RAM 203b. The control unit 201 uses the IP address written to the RAM 203b in the same way as the speaker identification information and adds the notification voice to the speech.
Through the above processing, the server device 20 can announce the speaker without requiring the speaker to perform a bothersome operation such as pressing a button.

(13) In the above embodiment, the notification voice is added for all participants, including the speaker. However, when the notification voice is added after the speech, the speaker does not need to wait for the notification sound to finish before speaking, so the notification voice may be omitted from the mixed signal output to the speaker's terminal.

(14) In the above embodiment, when the notification voice is added before the speech and the speaker talks while it is playing, the portion spoken during playback is not mixed, and mixing is applied only to speech from the point at which playback of the notification voice ends. Alternatively, when the speaker talks while the notification voice is playing, the notification voice and the speech may be mixed and played back simultaneously.

(15) In the above embodiment, mixing is performed so that voice data representing a speaker's speech is not played back to the speaker. However, that voice data may also be played back to the speaker.

(16) The above embodiment assumes that the delay between generation of voice data by the voice input unit 106 and completion of the various audio processing in the voice data buffer area of the RAM 103b is negligibly small. This delay may instead be exploited to add a short notification sound. For example, if the mixing of all speech is delayed by a predetermined period (for example, 0.2 seconds) and the notification sound is added during that delay, the speaker can speak without having to wait for the notification to finish, even when the notification is added before the speech. In this case, a short electronic tone, just long enough for the speaker to be identified, is suitable as the notification sound.

(17) In the above embodiment, the notification voice is added to the audio signal after mixing. However, it may be added to the audio signal before mixing. In that case, a device that adds the notification sound may be provided in each audio conference terminal 10, or upstream of the voice data synthesis units 22 in the server device 20. In short, it suffices that the notification voice is included in the audio signal finally output to each audio conference terminal 10.
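The automatic mode of modification (12) relies on silence compression: a packet that arrives after a long enough gap marks the start of a new utterance, and its source IP address plays the role of the speaker identification information. The following Python sketch illustrates that detection. The class name `TalkStartDetector`, the threshold value, the numeric timestamps, and the example IP addresses are assumptions made for illustration; treating the very first packet from a terminal as the start of speech is also an assumption.

```python
class TalkStartDetector:
    """Flags the first packet that follows a silent gap, per source address."""

    def __init__(self, silence_threshold):
        self.silence_threshold = silence_threshold  # seconds without packets
        self.last_seen = {}  # source IP address -> time of the last packet

    def on_packet(self, source_ip, now):
        """Return True if this packet marks the start of a new utterance."""
        previous = self.last_seen.get(source_ip)
        self.last_seen[source_ip] = now
        if previous is None:
            return True  # assumption: first packet ever counts as a start
        return (now - previous) > self.silence_threshold


detector = TalkStartDetector(silence_threshold=1.0)
events = [
    detector.on_packet("192.0.2.1", 0.00),  # utterance begins
    detector.on_packet("192.0.2.1", 0.02),  # same utterance continues
    detector.on_packet("192.0.2.1", 2.00),  # new utterance after a gap
]
```

Whenever `on_packet` returns True, the server would look up the source address in the identification information management table exactly as it does for speaker identification information sent by a button press.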

FIG. 1 is a block diagram showing the overall configuration of the audio conference system 1 including the server device according to the present invention.
FIG. 2 is a diagram showing an RTP packet.
FIG. 3 is a block diagram showing the configuration of the audio conference terminal 10.
FIG. 4 is a diagram showing the method by which the audio conference terminal 10 compresses voice data.
FIG. 5 is a block diagram showing the configuration of the server device 20.
FIG. 6 is an example of the identification information management table.
FIG. 7 is a diagram showing the flow of the speaker notification processing performed by the server device 20.
FIG. 8 is a block diagram showing the flow of data in the server device 20.

Explanation of symbols

1: audio conference system; 10, 10A, 10B, 10C: audio conference terminal; 20: server device; 21: circuit designating unit; 22, 22A, 22B, 22C: voice data synthesis unit; 23: speaker identification unit; 24: notification voice adding unit; 30: network; 101, 201: control unit; 102, 202: communication IF unit; 103, 203: storage unit; 103a, 203a: ROM; 103b, 203b: RAM; 104, 204: operation unit; 104a: speech button; 105, 205: display unit; 106: voice input unit; 106a: A/D converter; 106b: microphone; 107: voice output unit; 107a: D/A converter; 107b: speaker; 108, 206: bus; 110: volume level detection unit; 111: frame selection unit; 112: voice data compression unit

Claims

A communication apparatus comprising:
receiving means for receiving, from each of a plurality of communication terminals connected via a communication network, an audio signal and identification information that uniquely identifies the communication terminal;
storage means for storing a notification signal in association with the identification information;
mixing output means for mixing the audio signals from the communication terminals and outputting the result to each communication terminal, the mixing being performed for a plurality of channels, one for each communication terminal as output destination, and each channel excluding from the mix the audio signal output from the communication terminal that is its own output destination; and
notification signal adding means for reading from the storage means the notification signal associated with the identification information received by the receiving means and adding the read notification signal so that it is included in the signal mixed by the mixing output means.

The receiving means further receives a notification signal addition instruction from the communication terminal,
The communication apparatus according to claim 1, wherein the notification signal adding unit adds the notification sound to the audio signal only when the receiving unit receives the notification signal addition instruction.

The communication apparatus, wherein the notification signal adding means does not add the notification signal to the mixed signal of the channel whose output destination is the communication terminal corresponding to that notification signal.

The communication apparatus according to any one of claims 1 to 3, wherein the notification signal adding means detects the output start and the output end of the audio signal to which the notification signal is to be added, and adds the notification signal before the output start or after the output end.

The communication apparatus according to claim 4, further comprising additional position storage means for storing, for each piece of identification information, additional position designation information designating whether the notification signal is to be added before the output start or after the output end,
wherein the notification signal adding means reads the additional position designation information stored in the additional position storage means according to the identification information and controls, for each communication terminal, the position at which the notification signal is added according to the read additional position designation information.

The communication apparatus, wherein the storage means further stores, in association with the identification information, a flag indicating whether the notification signal is to be added,
and the notification signal adding means determines whether to add the notification signal by referring to the flag associated with the identification information received by the receiving means.

The communication apparatus according to claim 1, wherein the audio signal is a packet signal, and the identification information is included in the packet.

The communication apparatus according to claim 1, wherein the notification signal is an audio signal.