JP2005045740A

JP2005045740A - Device, method and system for voice communication

Info

Publication number: JP2005045740A
Application number: JP2003280433A
Authority: JP
Inventors: Tadayuki Hattori; 忠幸服部; Yoshiyuki Kunito; 義之國頭; Akihiro Hokimoto; 晃弘保木本; Satoru Kawabata; 哲川畑
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-07-25
Filing date: 2003-07-25
Publication date: 2005-02-17

Abstract

PROBLEM TO BE SOLVED: To provide a voice communication device, method, and system that reduce the amount of real-time voice information in high-quality communication over the Internet with BGM and sound-effect functions.

SOLUTION: A VoIP system has a real-time transmission mode, in which real-time voice data (the user's voice data, or that voice data combined with BGM and the like) is transmitted between VoIP clients in RTP packets, and a batch transfer mode, in which only the BGM is transferred in RTP packets ahead of the real-time voice data. An RTCP packet can notify the other party of the reception buffer size for RTP packets. The VoIP client checks the level of the captured user voice and, when it detects no voice and judges that the user is not speaking, sets the batch transfer mode and transmits in advance, up to the reception buffer size, only the BGM that would otherwise be combined with the voice and transmitted.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a call device, a call method, and a call system for so-called Internet telephony using, for example, VoIP (Voice over Internet Protocol), and in particular to a call device, a call method, and a call system suited to exchanging real-time acoustic data consisting of high-quality voice and sound.

VoIP is a technology that enables voice calls over an IP network by encapsulating voice in IP packets. To place a VoIP call, a series of information exchanges is required: obtaining information about the party to be called, ringing that party, and receiving the answer. A call control protocol such as SIP (Session Initiation Protocol) is used for these purposes.

Systems that transmit real-time data, such as real-time video data and real-time acoustic data such as voice, over communication networks such as the Internet are increasing, VoIP among them. In public networks such as the Internet, multiple users share the network bandwidth, so congestion control, that is, avoiding congestion and calming it once it occurs, is a major issue. As communication forms in which real-time behavior matters become more common, congestion control for real-time data communication is becoming increasingly important (see Patent Document 1 below).

For example, Patent Document 2 below discloses a technique in which a server equipped with a speech encoder suppresses the total number of bits of voice transmitted to a client terminal. The encoder assigns a relatively large number of bits to voiced sound, which carries important meaning within a speech segment, and progressively fewer bits to unvoiced sound and then to background noise.

Conventionally, in real-time voice communication using VoIP, the main information in human speech lies in the frequency band around 4 to 5 kHz, so it has been standard practice to use one channel and a sampling frequency of 8 kHz to 16 kHz. With such a narrow band, however, it is difficult to convey background noise, atmosphere, and the like.

Setting the number of channels to two or more and raising the sampling frequency to 44.1 kHz, on a par with CDDA (Compact Disc Digital Audio), makes it possible to transmit background noise and atmosphere as well. It also becomes possible to add to voice communication features such as a high-quality BGM function that reuses existing music content.

If the number of channels is set to two or more, the sampling frequency is raised to 44.1 kHz as in CDDA, and uncompressed 16-bit linear PCM (Pulse Code Modulation) data is transmitted, the transmission bit rate on the network becomes 1411.2 kbps or more, which is difficult to use in a real Internet environment. At present, therefore, a high-efficiency coding scheme is needed. Thanks to recent improvements in CPU performance, it has become possible to adopt high-efficiency coding even in real-time voice communication that exchanges high-quality real-time acoustic data, such as that with an added BGM function.
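The 1411.2 kbps figure follows directly from channels × sampling frequency × bits per sample. The short Python sketch below reproduces the arithmetic; the function name is illustrative only, and the 16-bit sample width used for the narrowband comparison is an assumption, since the text gives a bit depth only for the CDDA-grade case.

```python
def pcm_bitrate_kbps(channels: int, sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bit rate of uncompressed linear PCM in kbps (1 kbps = 1000 bit/s)."""
    return channels * sample_rate_hz * bits_per_sample / 1000


# CDDA-grade stereo as described in the text: 2 channels, 44.1 kHz, 16 bits.
stereo_cd = pcm_bitrate_kbps(2, 44_100, 16)

# Conventional narrowband voice: 1 channel, 8 kHz.
# The 16-bit width here is an assumption made for comparison only.
mono_voice = pcm_bitrate_kbps(1, 8_000, 16)

print(f"stereo 44.1 kHz: {stereo_cd} kbps, mono 8 kHz: {mono_voice} kbps")
```

Each additional channel beyond two adds another 705.6 kbps, which is why the text says "1411.2 kbps or more".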

There are several high-efficiency coding schemes, but most first perform frequency decomposition to split the signal into multiple subbands, then adaptively vary the coding precision of each block using psychoacoustic properties, and entropy-code the given frequency components with the minimum necessary number of bits. With such high-efficiency coding, the bit rate varies with the frequency components being encoded.

Patent Document 1: Japanese Patent Application Laid-Open No. 2001-3200440
Patent Document 2: Japanese Patent Application Laid-Open No. 2001-5474

However, even if one tries to suppress the transmission rate with the method described in Patent Document 2, high-quality real-time voice communication has a high-quality BGM function in addition to the conversation. A fixed amount of data is therefore sent and received regardless of whether anyone is speaking, and network bandwidth is occupied.

The present invention has been proposed in view of these circumstances. Its object is to provide a call device, a call method, and a call system that reduce the amount of real-time voice information during high-quality voice communication over the Internet that has BGM and sound-effect functions.

To achieve the above object, a call device according to the present invention is a call device that transmits and receives at least voice data via the Internet, comprising: voice conversion means for converting collected voice into an electric signal; synthesis means for combining additional data with the voice data converted into the electric signal; data packet generation means for generating data packets storing the voice data and/or the additional data to be combined with the voice data; control packet generation means for generating control packets storing at least management information for managing transmission and reception of the data packets; transmission means for transmitting the data packets and the control packets to one or more other call devices via the Internet; reception means for receiving data packets and control packets from the one or more other call devices; and control means for controlling transmission and reception of the data packets and control packets, wherein the control means switches between a first mode, in which first data packets storing the voice data or combined data obtained by combining the additional data with the voice data are transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch.

The present invention has a first mode, in which first data packets storing only voice data are transmitted in real time (real-time transfer), and a second mode, in which second data packets storing only additional data to be combined with the voice, such as background music and/or sound effects, are transmitted in a batch (batch transfer). By switching between these modes, the additional data alone can be sent ahead as needed.

In addition, the control means can switch to the second mode, which performs batch transmission, when the collected voice is below a predetermined volume level, so that the additional data is sent ahead at times when the user is not speaking.

Furthermore, the control means can perform batch transmission in the second mode before starting real-time transmission with the other call device in the first mode, so that the additional data is sent ahead using the time immediately before the user begins the call.

Furthermore, after batch-transmitting the second data packets in the second mode, the control means can, for a predetermined time, transmit in the first mode first data packets storing only the voice data. During that predetermined period, that is, the period covered by the additional data sent ahead, only voice data is sent, and the transmission rate is lower than for data packets containing both voice data and additional data.

In addition, after batch-transmitting the second data packets in the second mode, the control means can stop transmitting data packets for a predetermined time if the collected voice is below a predetermined volume level. Because the additional data has been sent ahead, no data needs to be transmitted while the user is not speaking.

Furthermore, the reception means has a reception buffer that buffers received data packets. The control means stores in the control packet buffer information indicating the capacity of the reception buffer and buffer occupancy information indicating the amount of data buffered in it, and controls the transmission rate of the second data packets based on the buffer information and buffer occupancy information of the other call device, as stored in the control packet that the reception means received from that device. The amount of additional data sent ahead can thus be varied according to the buffer size and occupancy of the other call device.
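One simple way to act on the reported buffer capacity and occupancy is to cap each batch at the free space remaining in the peer's reception buffer. The Python sketch below illustrates that policy under stated assumptions: the function and parameter names are hypothetical, all quantities are in bytes, and the text itself does not prescribe this particular rule.

```python
def batch_bytes_to_send(buffer_capacity: int, buffer_occupied: int,
                        pending_bgm_bytes: int) -> int:
    """Return how many bytes of additional data (e.g. BGM) to send ahead.

    buffer_capacity and buffer_occupied are the values the peer reported in
    its control packet; pending_bgm_bytes is what the sender still has queued.
    Hypothetical policy: never send more than the peer can currently hold.
    """
    free_space = max(0, buffer_capacity - buffer_occupied)
    return min(pending_bgm_bytes, free_space)


# Peer reports a 64000-byte buffer with 16000 bytes already in use.
print(batch_bytes_to_send(64_000, 16_000, 100_000))
```

A real sender would repeat this calculation whenever a new control packet arrives, since occupancy changes as the receiver plays back buffered data.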

A call method according to the present invention is a call method for transmitting and receiving at least voice data via the Internet, comprising: a voice conversion step of converting collected voice into an electric signal; a synthesis step of combining additional data with the voice data converted into the electric signal; a data packet generation step of generating data packets storing the voice data and/or the additional data to be combined with the voice data; a control packet generation step of generating control packets storing at least management information for managing transmission and reception of the data packets; a transmission step of transmitting the data packets and the control packets to one or more other call devices via the Internet; a reception step of receiving data packets and control packets from the one or more other call devices; and a control step of controlling transmission and reception of the data packets and control packets, wherein the control step switches between a first mode, in which first data packets storing the voice data or combined data obtained by combining the additional data with the voice data are transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch.

A call system according to the present invention is a call system for transmitting and receiving at least voice data via the Internet, in which each call device comprises: voice conversion means for converting collected voice into an electric signal; synthesis means for combining additional data with the voice data converted into the electric signal; data packet generation means for generating data packets storing the voice data and/or the additional data to be combined with the voice data; control packet generation means for generating control packets storing at least management information for managing transmission and reception of the data packets; transmission means for transmitting the data packets and the control packets to one or more other call devices via the Internet; reception means for receiving data packets and control packets from the one or more other call devices; and control means for controlling transmission and reception of the data packets and control packets. The reception means of one of at least two call devices has a reception buffer that buffers received data packets, and the control means of that device stores buffer information indicating the capacity of the reception buffer in the control packet. The control means of the other call device switches between a first mode, in which first data packets storing the voice data or combined data obtained by combining the additional data with the voice data are transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch, and controls the transmission rate of the second data packets based on the buffer information of the one call device, as stored in the control packet that its reception means received from that device.

In the present invention, one call device notifies the other call device of the size of its reception buffer by means of a control packet. In the second mode, the other call device can control the transmission rate at which only the additional data is batch-transmitted, based on the reception buffer size of the first device.

According to the call device of the present invention, a call device that transmits and receives at least voice data via the Internet has: voice conversion means for converting collected voice into an electric signal; synthesis means for combining additional data with the voice data converted into the electric signal; data packet generation means for generating data packets; control packet generation means for generating control packets storing at least management information for managing transmission and reception of the data packets; transmission means for transmitting the data packets and the control packets to one or more other call devices via the Internet; reception means for receiving data packets and control packets from the one or more other call devices; and control means for controlling transmission and reception of the data packets and control packets. The control means switches between a first mode, in which first data packets storing the voice data or combined data obtained by combining the additional data with the voice data are transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch. Therefore, if second data packets consisting only of additional data to be combined with the voice, such as background music and/or sound effects, are batch-transmitted at a desired time ahead of the voice, only the voice data needs to be transmitted in real time while the additional data sent ahead is being played back on the receiving side. The network bandwidth that is constantly occupied during high-quality, high-volume real-time communication can thus be narrowed.

According to the call system of the present invention, one call device notifies the other of its reception buffer capacity, and the other call device can then determine the transmission rate of the additional data sent ahead by batch transmission to match the reception buffer capacity of the receiving device. When high-quality, high-volume real-time acoustic data consisting of voice and additional data is exchanged over a network, the network bandwidth used can be narrowed efficiently.

Specific embodiments of the present invention are described in detail below with reference to the drawings. In this embodiment, the invention is applied to a VoIP client, a call device with which two or more such devices communicate by VoIP over the Internet, and to a VoIP system, a call system comprising such clients. The VoIP client of this embodiment can reduce the amount of real-time voice information and deliberately narrow the bandwidth used.

First, an outline is given of the VoIP system of this embodiment, which performs network communication using VoIP. FIG. 1 is a schematic diagram showing an example of the VoIP system of this embodiment. The VoIP system of this embodiment realizes high-quality voice communication, for example with two or more channels, in which users can share not only conversation but also various sound effects, BGM, and the like. Although the VoIP system here is described as operating between two call devices (hereinafter, VoIP clients), the number of VoIP clients in the system is not limited to two, so two or more participants may take part in network communication through VoIP clients.

As shown in FIG. 1, the VoIP system 100 has a VoIP client 111, for example a PC (Personal Computer), and a VoIP client 121 connected to it via the Internet 130.

In this VoIP system 100, the user 110 of VoIP client 111 and the user 120 of VoIP client 121 communicate with each other via the Internet 130 using a VoIP application (softphone), described later, installed on their own VoIP clients 111 and 121, together with, for example, a headset consisting of a microphone and headphones or a handset consisting of a microphone and a receiver.

The Internet 130 is a network environment that has spread worldwide by interconnecting communication lines, such as general public lines, and multiple information communication networks. Today the spread of wideband, high-speed communication lines has made broadband transmission possible, with networks built on lines of 500 kbps or more using optical fiber, asymmetric digital subscriber lines, wireless links, and the like.

Connected to the Internet 130 are a VoIP server 131 that controls VoIP communication and a web server 134 that manages data such as sound source data 132 and download user information 133. Each VoIP client 111, 121 stores sound source data 113, 123 that was downloaded from the web server 134 using the web browser 112, 122 on that client, or that the user purchased or edited.

The sound source data 132 stored in the database of web server 134, and the sound source data 113, 123 that users hold after downloading or otherwise obtaining it, consist of, for example, music used as BGM (background music) and various sound effects such as waves, applause, thunder, and bells. This sound source data can be used in network communication between users.

Next, the VoIP client is described. FIG. 2 is a block diagram showing the functions of a VoIP client in the VoIP system. As shown in FIG. 2, the VoIP client 20 for such Internet communication has transmission means 21, which sends data to the participants in the communication, that is, to the other party, and reception means 41, which receives data from the other party.

The transmission means 21 has: a microphone capture (MIC capture) 22, to which a microphone is connected and which captures external (user) voice; a sound effect file storage unit 23 storing sound source files of various sound effects (SE) compressed with, for example, MP3 (MPEG (Moving Picture Experts Group)-1 Audio Layer 3) or MPEG-4; a BGM file storage unit 24 storing similarly compressed BGM sound source files; decoders 25 and 26, which read and decode the sound effect file and the BGM file respectively; gain adjustment units 27 to 29, which adjust volume by controlling the gain of the microphone-captured sound, the sound effect, and the BGM; a synthesis unit 30, which combines these three sounds; an encoder 31, which compression-encodes the combined sound; a packetizing unit (packetize) 32a, serving as the data packet generation means, which stores the compression-encoded data in RTP (Real-time Transport Protocol) packets as data packets; a control unit 34, which generates control information described later and controls transmission and reception of RTP packets and RTCP (RTP Control Protocol) packets; an RTCP packetizing unit 32b, serving as the control packet generation means, which stores the control information, management information for controlling the RTP packets, and the like in RTCP packets; and a transmission unit 33, which transmits RTP packets, RTCP packets, and the like. RTP packets sent from the transmission unit 33 are delivered via the Internet 130 to the other VoIP client that is the communication target.

Meanwhile, the reception means 41, which receives data sent via the Internet 130 from the transmission means 21 of the other party, has: a reception unit 42, which receives RTP packets, RTCP packets, and the like; an RTP depacketizing unit (depacketize) 43a, which depacketizes received RTP packets; an RTCP depacketizing unit (depacketize) 43b, which depacketizes RTCP packets; a de-jitter unit 44, which corrects for RTP packet arrival times; a packet loss compensator 45, which compensates for missing portions of the received RTP packets, such as portions with errors or portions lost in transit; a decoder 46, which decodes and decompresses the data from the packet loss compensator 45; a ring tone file storage unit 47, which stores ring tone sound source files compressed with, for example, MP3 or MPEG-4; a decoder 48, which reads and decodes the ring tone file; gain adjustment units 49 and 50, which adjust the gain of the outputs of decoders 46 and 48 respectively; a gain adjustment unit 52, which adjusts the gain of the PCM data for transmission combined by the synthesis unit 30 of the transmission means 21, that is, the sender's own sound; a synthesis unit 53, which combines the gain-adjusted transmission data, the data received from the other party, and the ring tone data; an output unit 54, which outputs the data combined by the synthesis unit 53 to headphones (HP); a gain adjustment unit 51, which adjusts the ring tone gain separately from the ring tone output to the synthesis unit 53; and a speaker (SP) 55, which outputs the gain-adjusted ring tone externally.

Next, how this VoIP client sends and receives data is described. First, on the transmitting side, the microphone capture 22 captures the user's voice input through the microphone and sends it to the gain adjustment unit 27. The gain adjustment unit 27 multiplies the voice data by a gain coefficient k1, either at the user's instruction or automatically, to adjust it to the desired level.

The sound effect file storage unit 23 and the BGM file storage unit 24 consist of, for example, a hard disk drive (HDD), ROM (read-only memory), or magneto-optical disk that stores sound source files compressed in advance with a compression technique such as MP3 or MPEG-4. Decoders 25 and 26 read this compression-encoded data and convert it into PCM data. Decoders 25 and 26 then send the converted PCM data to gain adjustment units 28 and 29 respectively, which multiply it by gain coefficients k2 and k3, at the user's instruction or automatically, to adjust it to the desired level.

The voice, BGM, and sound effects of the real-time acoustic data, gain-adjusted by the gain adjusting units 27 to 29, are supplied to the synthesis unit 30. The synthesis unit 30 adds the outputs of the gain adjusting units 27 to 29 while performing saturation processing and outputs the sum to the encoder 31. The encoder 31 encodes the sum into, for example, MPEG4 for transmission over the network. The encoded data is supplied to an RTP packetizing unit 32 that packetizes data according to the Real-time Transport Protocol (RTP).
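
As an illustration only, the gain adjustment and saturating addition performed ahead of the encoder 31 might be sketched as follows. The 16-bit signed sample width, the function names, and the sample values are assumptions made for this sketch; the description states only that the outputs of the gain adjusting units 27 to 29 are added with saturation processing.

```python
# Hypothetical sketch: per-source gain (k1, k2, k3) followed by a saturating sum.
# The 16-bit sample range is an assumption; the description does not give a width.
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(sample: int) -> int:
    """Clamp a sample to the 16-bit range instead of letting it wrap around."""
    return max(INT16_MIN, min(INT16_MAX, sample))


def mix_frame(voice, effect, bgm, k1=1.0, k2=1.0, k3=1.0):
    """Scale voice, sound effect, and BGM by their gains and add with saturation."""
    return [
        saturate(int(round(k1 * v + k2 * e + k3 * b)))
        for v, e, b in zip(voice, effect, bgm)
    ]


# Loud voice plus effect overflows the range and is clamped rather than wrapped.
frame = mix_frame([30000, -20000, 100], [5000, -20000, 0], [1000, 0, 50], k3=0.5)
```

Clamping keeps an overload from turning into the loud click that integer wrap-around would produce, which is presumably why the synthesis unit saturates rather than simply adding.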

ＲＴＰパケット化部３２ａは、エンコードデータをＲＴＰパケットにパケット化し、パケット化データは、送信部３３からインターネット１３０を介して通信相手のＶｏＩＰクライアントへ送信される。 The RTP packetization unit 32a packetizes the encoded data into RTP packets, and the packetized data is transmitted from the transmission unit 33 to the VoIP client of the communication partner via the Internet 130.

The control unit 34 switches between two modes for the encoded data transmitted in RTP packets. In the first mode (real-time transmission mode), RTP packets serving as first data packets, which store either voice data alone or synthesized data in which BGM and sound effects are combined with the voice data, are transmitted in real time. In the second mode (batch transmission mode), RTP packets serving as second data packets, which store only, for example, the BGM to be combined with the voice data, are transferred in a batch.

具体的には、マイクキャプチャ２２によりキャプチャされる音声が所定の音量レベル未満と判断した場合、即ちユーザが発話していないようなタイミングで、ＢＧＭのみを一括して転送する。又は呼制御の後、実時間音響データをやり取りする前にＢＧＭのみ一括転送する。このようにＢＧＭのみを一括転送した場合は、所定期間ＢＧＭを送信する必要がなくなり、実時間伝送としては、音声データのみを送信する。 Specifically, when it is determined that the sound captured by the microphone capture 22 is less than a predetermined volume level, that is, at a timing when the user is not speaking, only BGM is transferred at once. Alternatively, after the call control, only BGM is batch transferred before exchanging real-time acoustic data. Thus, when only BGM is transferred at once, it is not necessary to transmit BGM for a predetermined period, and only audio data is transmitted as real-time transmission.
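
The switching condition described above could be sketched, under stated assumptions, as follows. The use of an RMS level as the "volume level", the threshold value, and all names are hypothetical; the description says only that batch transfer is chosen when the captured voice is below a predetermined volume level.

```python
import math
from enum import Enum


class Mode(Enum):
    REAL_TIME = "real-time transmission"
    BATCH = "batch transmission"


def rms_level(frame):
    """Root-mean-square level of one captured PCM frame (an assumed loudness measure)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def select_mode(mic_frame, threshold, bgm_pending):
    """Choose batch mode only when the microphone is quiet and BGM remains to be sent ahead."""
    if bgm_pending and rms_level(mic_frame) < threshold:
        return Mode.BATCH
    return Mode.REAL_TIME


quiet = select_mode([0, 3, -2, 1], threshold=500.0, bgm_pending=True)
talking = select_mode([8000, -9000, 7000], threshold=500.0, bgm_pending=True)
```

A practical client would probably smooth the level over several frames so that a brief pause between words does not trigger a mode change, but that refinement is outside what the description specifies.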

When only the BGM is batch transferred in this way, the voice from the microphone capture 22 is muted and the output of the decoder 25, which decodes the sound effects, is also muted, so that the output of the encoder 31 contains only the BGM. Alternatively, for example, the RTP packetizing unit 32a may read the required amount of the compressed sound source file stored in the BGM file storage unit 24 and store it directly in RTP packets.

The control unit 34 also manages the various management information stored in the report blocks and other parts of the RTCP packet, described later, and generates the control information stored in the extension portion of the RTCP packet. The control information notifies the partner of, for example, the capacity of the fluctuation absorbing buffer, which is provided to cancel the jitter of packets received by the RTCP receiving unit 42, and the capacity of the PCM buffer, which holds uncompressed PCM data before it is reproduced through the headphones 54. The extension portion of the RTCP packet transmitted and received by the VoIP client of the present embodiment can also carry information that specifies the compression rate of the real-time acoustic data stored in the RTP packets sent by the other VoIP client serving as the communication partner, such as the encoding band and the number of subband divisions, as well as information requesting retransmission of packets that were corrupted or lost in transit.

そして、ＲＴＣＰパケット化部３２ｂは、この制御情報及び管理情報をＲＴＣＰパケットにパケット化し、このパケット化データも送信部３３からインターネット１３０を介して通信相手のＶｏＩＰクライアントへ送信される。 The RTCP packetizing unit 32b packetizes the control information and management information into RTCP packets, and the packetized data is also transmitted from the transmitting unit 33 to the VoIP client of the communication partner via the Internet 130.

In a VoIP call, before any RTP packet is transmitted, the control unit 34 performs call signaling to the other VoIP client to be communicated with, via the transmission unit 33, using a call control protocol such as SIP. Once an RTP/RTCP session has been established with the other VoIP client, the transmission unit 33 can transmit RTP packets and RTCP packets. Details of the RTP/RTCP packets will be described later.

On the receiving side, the receiving unit 42 receives the RTP packets, RTCP packets, and the like sent via the Internet 130. The RTCP depacketizing unit 43b disassembles the received RTCP packets. In the batch transfer mode, in which only the BGM is batch transferred, the control unit 34 controls the transmission rate of the BGM data to be batch transferred based on the sizes of the partner's fluctuation absorbing buffer and PCM buffer described above, which are part of the control information stored in the extension portion of the RTCP packet.

After the RTP depacketizing unit 43 depacketizes the data, the dejitter unit 44 corrects data whose arrival times have become uneven, for example because reception was delayed by network conditions, and restores even spacing. This correction is based on the RTP timestamp and sequence number obtained by depacketizing and stripping the IP and UDP (User Datagram Protocol) layers. The packet compensation unit 45 then compensates for packets that were lost or could not be received over the network. Specifically, it either substitutes the packet immediately before or after a lost packet for the lost one, or issues a retransmission request and receives the lost packet again. The resulting compressed data, such as MPEG4, is decoded by the decoder 46, and the gain adjustment unit 49 multiplies the decoded data by the gain coefficient k5, according to a user instruction or automatically, to adjust the gain.
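
The reordering by sequence number and the substitution of a neighbouring packet for a lost one might look like the following minimal sketch. The function name and the tuple representation of a packet are hypothetical, retransmission requests are not modelled, and wrap-around of the 16-bit RTP sequence number is ignored for brevity.

```python
def reorder_and_conceal(packets, first_seq, count):
    """Return payloads for sequence numbers first_seq .. first_seq + count - 1.

    packets is an iterable of (sequence_number, payload) pairs in arrival order.
    A missing packet is replaced by the previous payload, or by the next
    available payload when nothing precedes it.
    """
    by_seq = {seq: payload for seq, payload in packets}
    last_seq = first_seq + count
    out = []
    for seq in range(first_seq, last_seq):
        if seq in by_seq:
            out.append(by_seq[seq])
        elif out:
            out.append(out[-1])  # repeat the packet before the gap
        else:
            # No earlier packet exists, so borrow the next one that did arrive.
            following = (by_seq[s] for s in range(seq + 1, last_seq) if s in by_seq)
            out.append(next(following, None))
    return out


# Packets arrive out of order and sequence number 11 never arrives.
played = reorder_and_conceal([(12, b"c"), (10, b"a"), (13, b"d")], first_seq=10, count=4)
```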

When call signaling arrives from another VoIP client, that is, when a call comes in, the decoder 48 reads from the ringtone file storage unit 47 a ringtone sound source file, consisting of a ringing tone or of music used for ringing, to notify the user. The ringtone data in the file storage unit 47 is selected in advance by the user and is decoded by the decoder 48 while being read into a RAM (not shown) in step with the timing of the incoming call. Like the BGM and other sound source files, the ringtone sound source files are compressed in MP3, MPEG4, or a similar format, and a plurality of them are stored in the file storage unit 47 as ringtone data in file units. The decoder 48 decodes the sound source file it has read, and the gain adjusting units 50 and 51 multiply the decoded sound source data by the gain coefficients k6 and k7, respectively, according to a user instruction or automatically, to adjust the gain.

The synthesis unit 53 adds together the ringtone, the reception data received from the call partner and gain-adjusted by the gain adjustment unit 49, and the client's own transmission data output from the gain adjustment unit 52. So that the user hears the same transmission data that is sent to the call partner, the gain adjusting unit 52 multiplies the transmission data by the gain coefficient k4, which is the loopback volume level set by the user.

そして、合成部５３にて合成されたこれらのデータは、ヘッドフォン５４を介して出力されユーザに伝えられる。また、着信音ファイル記憶部４７から読み出された着信音データは、ヘッドフォン５４とは別にスピーカ５５からも出力されるよう構成されている。 The data synthesized by the synthesis unit 53 is output via the headphones 54 and transmitted to the user. In addition, the ring tone data read from the ring tone file storage unit 47 is configured to be output from the speaker 55 separately from the headphones 54.

ここで、デコーダ４６の前段若しくは後段、又はゲイン調整部４９の後段に、一括送信されたＢＧＭデータを格納しておくバッファを用意し、一括送信後に送られる音声又は音声及び効果音からなる実時間音響データに順次合成して出力されるようバッファからＢＧＭデータが読み出される。また、デコーダ４６の後段又はゲイン調整部４９の後段には上述のＰＣＭバッファが設けられるが、これに一括転送されたＢＧＭデータをバッファリングしておいてもよい。 Here, a buffer for storing batch-transmitted BGM data is prepared before or after the decoder 46, or after the gain adjusting unit 49, and the real time composed of voice or voice and sound effect sent after the batch transmission. BGM data is read from the buffer so as to be sequentially synthesized and output to the acoustic data. Further, although the above-described PCM buffer is provided in the subsequent stage of the decoder 46 or the subsequent stage of the gain adjusting unit 49, the BGM data collectively transferred thereto may be buffered.

In such a VoIP client, the control unit 34 switches, as necessary, between the real-time transmission mode, in which real-time acoustic data is transferred, and the batch transmission mode, in which only the BGM is batch transferred. If only the BGM is batch transferred at a time when, for example, the user is not speaking, then only voice needs to be sent as real-time acoustic data while the batch-transferred BGM is being played back. This reduces the amount of data transmitted as real-time acoustic data and narrows the network bandwidth occupied by the VoIP client.

Furthermore, by making the dynamic range wider than that of an ordinary telephone, the client can send and receive synthesized sound in which high-quality voice, such as stereo, is combined with sound consisting of BGM, sound effects, and the like, so that adding BGM does not mask the conversation (voice). In addition, the sound read from sound source files, such as sound effects and BGM, and the voice spoken by the communicating party, that is, the microphone sound, can be gain-adjusted separately. The volume of the sound effects and BGM can therefore be set to a desired level, allowing the user to convey, for example, his or her mood to the communication partner.

In addition, because the ringtone is output separately from the headphones 54 and the speaker 55, the ringtone is still output externally from the speaker 55 even if, for example, the user temporarily steps away from the VoIP client 20 during VoIP communication using the headphones 54, or leaves the headphones 54 connected after a call has ended.

Next, the software used for the VoIP client will be described. FIG. 3 is a diagram showing software modules including a typical VoIP client application for a PC. The VoIP client achieves the functions shown in FIG. 2 above by executing the software module 1, which is organized by protocol layer according to the Open System Interconnection (OSI) architecture shown in FIG. 3.

図３において下位層から上位層に向かって各階層を説明する。先ず、物理層２としての機能にはユニバーサル・シリアル・バス（Universal Serial Bus：ＵＳＢ）カメラドライバ２ａ、ＵＳＢオーディオドライバ２ｂ及び各種ドライバ２ｃがある。カメラドライバ２ａからのビデオデータやオーディオドライバ２ｂからのオーディオデータの伝送条件の物理的条件を合わせるレイヤである。 In FIG. 3, each layer will be described from the lower layer to the upper layer. First, the functions as the physical layer 2 include a universal serial bus (USB) camera driver 2a, a USB audio driver 2b, and various drivers 2c. This is a layer that matches the physical conditions of the transmission conditions of video data from the camera driver 2a and audio data from the audio driver 2b.

次に、データリンク層としての機能には、オペレーティングシステム（Operating System：ＯＳ）３がある。隣接ノード間の誤りのないデータ伝送を実行するためのものである。 Next, the function as the data link layer includes an operating system (OS) 3. This is for executing data transmission without error between adjacent nodes.

そして、ネットワーク層としての機能には、インターネットプロトコル（Internet Protocol：ＩＰ）制御部４がある。ネットワーク層は、データ送受信に使用する通信経路を選択し、フロー制御・品質制御などの通信制御を行うところである。信頼性を追求しないコネクションレス（Connectionless)パケット伝送プロトコルであるＩＰは、信頼性保証機能、フロー制御機能、エラー回復機能を上位階層（トランスポート層とアプリケーション層）に任せている。 The network layer function includes an Internet Protocol (IP) control unit 4. The network layer selects a communication path used for data transmission / reception and performs communication control such as flow control and quality control. IP, which is a connectionless packet transmission protocol that does not pursue reliability, leaves the reliability assurance function, flow control function, and error recovery function to the upper layers (transport layer and application layer).

The functions of the transport layer include a Transmission Control Protocol (TCP) / User Datagram Protocol (UDP) control unit 5.

The transport layer performs end-to-end transmission using IP addresses. Independent of the type of network, it performs flow control and sequence control according to the required quality class. TCP provides a reliability guarantee: it assigns a sequence number to each byte of transmitted data and retransmits the data if no acknowledgment (ACK) arrives from the receiving side. UDP provides a function for sending datagrams between applications. When audio or video is streamed over an IP network, a transport protocol such as TCP, which retransmits on error, generally cannot be used. TCP is also a protocol for one-to-one communication and cannot send information to multiple recipients. UDP is therefore used for such applications. That is, for communication requiring high reliability, for example when retransmission control is needed, TCP communication is used and TCP packets are generated; for communication requiring high real-time performance, such as transmitting audio and video data and the SIP messages described later, UDP communication is used and UDP packets are generated.

UDP is designed so that an application process can send data to another application process on a remote machine with minimal overhead. The UDP header therefore contains only the source port number, destination port number, data length, and checksum. Unlike TCP, it has no field for a number indicating packet order, so if packets arrive out of order, for example because they travelled over different routes through the network, UDP cannot restore the correct order. Moreover, neither TCP nor UDP has a field for time information such as a transmission timestamp.
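
To make the four header fields concrete, the following sketch packs an 8-byte UDP header with Python's standard `struct` module. The port numbers and payload size are arbitrary example values, and the checksum is left at zero (not computed) purely to keep the sketch short.

```python
import struct

# Four 16-bit fields in network byte order:
# source port, destination port, length (header + payload), checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_header(src_port, dst_port, payload, checksum=0):
    """Pack a UDP header for the given payload; the checksum is supplied by the caller."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)


header = build_udp_header(5004, 5004, b"\x00" * 160)
fields = UDP_HEADER.unpack(header)
```

Nothing in these eight bytes records packet order or sending time, which is the gap that RTP fills on top of UDP.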

セッション層６としての機能には、セッション・イニシエーション・プロトコル（Session Initiation Protocol：ＳＩＰ）制御部１１、ＲＴＰ／ＲＴＣＰ制御部１２、コーデック（codec）１３、通話音とＢＧＭ又は効果音の合成処理ソフトウェアを構成する保留音制御部１４、ＢＧＭ合成部１５、及び着信音制御部１６がある。セッション層６は、情報の伝送制御を行う。アプリケーション間における対話モードを管理して会話単位の制御を行う。 The functions as the session layer 6 include a session initiation protocol (SIP) control unit 11, an RTP / RTCP control unit 12, a codec (codec) 13, speech sound and BGM or sound effect synthesis processing software. There are a holding tone control unit 14, a BGM synthesis unit 15, and a ringing tone control unit 16 which are configured. The session layer 6 controls information transmission. Manage conversation modes between applications and control conversation units.

The SIP control unit 11 performs call control using SIP, an application-layer signaling protocol (standardized in RFC 3261) for establishing, modifying, and terminating multimedia sessions on an IP network.

In the RTP/RTCP control unit 12, RTP is the protocol for sending transmission data in which BGM, sound effects, and the like have been combined with voice data and compression-encoded; it attaches time information and order information to the transmission data and sends and receives the audio data over the network. RTCP is a control protocol for RTP and provides functions such as RTP flow control, clock synchronization, recognition of the data reproduction time, and identification of the information source.
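
The time information and order information that RTP attaches correspond to the timestamp and sequence number fields of the 12-byte fixed RTP header. A minimal sketch of packing and parsing that header is given below; the helper names are hypothetical, header extensions and CSRC entries are omitted, and payload type 96 is simply an arbitrary value from the dynamic range.

```python
import struct

# Fixed RTP header: V/P/X/CC byte, M/PT byte, sequence number, timestamp, SSRC.
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_header(seq, timestamp, ssrc, payload_type, marker=False):
    """Pack a fixed RTP header (version 2, no padding, no extension, no CSRCs)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(data):
    """Unpack the fixed header fields from the start of an RTP packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


rtp = build_rtp_header(seq=1, timestamp=160, ssrc=0x1234, payload_type=96)
parsed = parse_rtp_header(rtp + b"payload")
```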

The codec 13 has a function of applying the high-efficiency audio compression and decoding described later to the transmitted sound (transmission data).

The functions of the presentation layer include a VoIP call control unit 7. The presentation layer manages the representation format of the information sent and received by the application and performs data conversion and encryption. The VoIP call control unit 7 controls the VoIP call function as a whole.

最上層のアプリケーション層としての機能には、コンピュータを視覚的に操作するためのグラフィカルユーザインターフェース（Graphical User Interface：ＧＵＩ）８がある。アプリケーション層は、ユーザプログラムで使用する通信機能の外部仕様を管理して、それに基づく情報のやり取りを行う層であり、ＧＵＩ８は、ユーザ操作用のインターフェイスを提供し、ユーザの手入力情報をハンドリングする。 As a function of the uppermost application layer, there is a graphical user interface (GUI) 8 for visually operating a computer. The application layer is a layer for managing external specifications of communication functions used in the user program and exchanging information based on the external specifications. The GUI 8 provides an interface for user operation and handles user input information. .

図４は、ＧＵＩの一例を示す模式図である。ＧＵＩ８は、図４に示すように、ＶｏＩＰクライアントアプリケーション１の終了処理を行うアプリケーション制御部６１と、情報表示部６２と、ダイヤル部６３と、ヘッドセットボリューム部６４と、スピーカボリューム部６５と、サウンドエフェクト選択表示部６６と、サウンドエフェクト制御部６７と、ＢＧＭ選択表示部６８と、ＢＧＭ制御部６９とを有する。 FIG. 4 is a schematic diagram illustrating an example of a GUI. As shown in FIG. 4, the GUI 8 includes an application control unit 61 that performs termination processing of the VoIP client application 1, an information display unit 62, a dial unit 63, a headset volume unit 64, a speaker volume unit 65, and a sound. An effect selection display section 66, a sound effect control section 67, a BGM selection display section 68, and a BGM control section 69 are provided.

The information display unit 62 displays, for example, the number dialed when the user dials the other party, and the partner's status, such as whether the partner is busy. The dial unit 63 consists of a numeric keypad used, for example, to dial the VoIP destination. The headset volume unit 64 adjusts the volume output from the headset, and the speaker volume unit 65 adjusts the volume output from the speaker. The sound effect selection display unit 66 displays the sound effect source data files available to the user, from which, for example, a gunshot, thunder, applause, or cheering can be selected, and the sound effect control unit 67 plays and stops the sound effect and adjusts its volume. This lets the user express feelings toward the call partner through various sound effects.

また、ＢＧＭ選択表示部６８は、ユーザが使用可能な例えばタヒチの波の音等のＢＧＭ音源データファイルを表示するものであり、更にＢＧＭ制御部６９により、ＢＧＭの再生及び停止、並びにＢＧＭの音量調節を行うことで、サウンドエフェクトと同様、ユーザ自身が選択し、調節した音量により、ユーザの気分やその場の雰囲気を通信相手へ伝えること等ができる。 The BGM selection display unit 68 displays a BGM sound source data file such as a sound of a Tahiti wave that can be used by the user. Further, the BGM control unit 69 further plays and stops the BGM and the volume of the BGM. By performing the adjustment, the user's mood and the atmosphere of the place can be communicated to the communication partner by the volume selected and adjusted by the user himself, like the sound effect.

Next, the hardware configuration of the VoIP client that executes the software module 1 will be described. FIG. 5 is a block diagram showing the hardware configuration of the VoIP client. As shown in FIG. 5, in the VoIP client 20, a CPU (Central Processing Unit) 221 executes various processes according to the programs constituting the software module 1 that are stored in a ROM (Read Only Memory) 222 or loaded from the storage unit 228 into a RAM (Random Access Memory) 223. A timer (not shown) keeps time and supplies time information to the CPU 221. The RAM 223 also stores, as appropriate, data that the CPU 221 needs to execute the various processes.

ＣＰＵ２２１、ＲＯＭ２２２及びＲＡＭ２２３は、バス２２４を介して相互に接続されている。このバス２２４にはまた、入出力インターフェイス２２５も接続されている。 The CPU 221, ROM 222, and RAM 223 are connected to each other via a bus 224. An input / output interface 225 is also connected to the bus 224.

Connected to the input/output interface 225 are an input unit 226 composed of a keyboard, a mouse, and the like; an output unit 227 composed of a display such as a CRT or LCD as well as headphones, speakers, and the like; a storage unit 228 composed of a hard disk or the like; and a communication unit 229 composed of a modem, a terminal adapter, or the like.

通信部２２９は、図示せぬインターネットを介しての通信処理を行う。ＣＰＵ２２１から提供されたデータを送信する。また通信部２２９は通信相手から受信したデータをＣＰＵ２２１、ＲＡＭ２２３、記憶部２２８に出力する。記憶部２２８はＣＰＵ２２１との間でやり取りし、情報の保存・消去を行う。通信部２２９はまた、他のクライアントとの間で、アナログ信号またはディジタル信号の通信処理を行う。 The communication unit 229 performs communication processing via the Internet (not shown). Data provided from the CPU 221 is transmitted. The communication unit 229 outputs data received from the communication partner to the CPU 221, RAM 223, and storage unit 228. The storage unit 228 communicates with the CPU 221 to store and delete information. The communication unit 229 also performs analog signal or digital signal communication processing with other clients.

入出力インターフェイス２２５にはまた、必要に応じてドライブ２３０が接続され、磁気ディスク２４１、光ディスク２４２、光磁気ディスク２４３、或いは半導体メモリ２４４等が適宜装着され、それらから読み出されたコンピュータプログラムが、必要に応じて記憶部２２８にインストールされる。 A drive 230 is connected to the input / output interface 225 as necessary, and a magnetic disk 241, an optical disk 242, a magneto-optical disk 243, a semiconductor memory 244, or the like is appropriately mounted, and a computer program read from these is loaded. It is installed in the storage unit 228 as necessary.

Next, a method by which the VoIP client having this hardware configuration carries out a VoIP call by executing the programs constituting the software module 1 shown in FIG. 3 above will be described.

FIG. 6 is a diagram explaining the RTP/RTCP packet transmission and reception functions controlled by the VoIP call control unit 7 in the general PC software module shown in FIG. 3. As shown in FIG. 6, the control performed by the VoIP call control unit 7 can be divided into threads 1 to 5. Thread 1 covers the processing up to the point where the sender transmits an RTP packet, that is, the processing of transmitting data packets that store data including the voice uttered by the sender. Threads 2 and 3 cover the processing from receiving a transmitted RTP packet to reproducing it, that is, the processing up to the point where the sender hears data including the voice of the communication partner.

また、スレッド４は、オペレーティングシステム（ＯＳ）３に付随するスレッドライブラリにより自動生成される。スレッドライブラリは、プライオリティに応じてスレッド１、スレッド２及びスレッド３のメインプロセッサ（図５に示すＣＰＵ２２１）上での計算資源配分、即ちスケジューリングを行う。 The thread 4 is automatically generated by a thread library attached to the operating system (OS) 3. The thread library performs calculation resource allocation, that is, scheduling, on the main processor (CPU 221 shown in FIG. 5) of the thread 1, thread 2, and thread 3 according to priority.

Thread 5 is the main thread of the GUI 8, which is the application. It creates and destroys threads 1, 2, and 3, and controls threads 1, 2, and 3 according to a programmed algorithm or user operation.

In thread 1, which is the transmission processing for RTP/RTCP packets, the user's voice is captured from the microphone and received as PCM data (Capture). As necessary, the BGM synthesis unit 15 and the like combine the sound effects and BGM with the voice for each captured sample frame (Effect or Mixing), and the codec 13 compression-encodes the result (encode). The RTP/RTCP control unit 12 then packetizes the compression-encoded data into RTP packets (packet) and sends them (send).

また、ＲＴＰパケットの送信処理とは別に、通信対象の通話装置におけるコーデックの圧縮符号化方式を制御するための制御情報を格納したＲＴＣＰパケットの送信処理が行われる。 In addition to the RTP packet transmission process, an RTCP packet transmission process that stores control information for controlling the codec compression encoding method in the communication target communication device is performed.

In the RTP packet reception processing, on the other hand, buffers provided before and after the decoding process on the receiving side of the VoIP client are used: a fluctuation absorbing buffer (Forward Jitter buffer) BF1, which stores the encoded byte stream, and a PCM buffer (Backward Jitter buffer) BF2, which stores the decoded, uncompressed linear PCM data.

In thread 2 of the RTP/RTCP packet reception function, the RTP/RTCP control unit 12 receives an RTP packet (Receive), compensates for missing packets caused by data loss on the network, data errors in transmitted packets, and the like (parse), and stores the data in the fluctuation absorbing buffer BF1 provided before the decoding process (push & pop). The codec 13 decodes and decompresses this data (decode) and stores it as uncompressed data in the fluctuation absorbing buffer (PCM buffer) BF2 provided after the decoding process (push).

In thread 3, the USB camera driver 2a, the USB audio driver 2b, and the various drivers 2c read the uncompressed data from the fluctuation absorbing buffer BF2 (pop) and reproduce it (sound device).
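
The two-buffer arrangement used by threads 2 and 3 can be pictured with the following single-threaded sketch. The class and method names are hypothetical, the decoder is passed in as a plain function, and the locking or thread-safe queues that a real multi-threaded client would need are deliberately left out.

```python
from collections import deque


class ReceivePipeline:
    """Sketch of BF1 (encoded data) and BF2 (uncompressed PCM) on the receiving side."""

    def __init__(self, decode):
        self.bf1 = deque()    # fluctuation absorbing buffer before decoding
        self.bf2 = deque()    # PCM buffer after decoding
        self.decode = decode  # stand-in for the codec 13

    def on_packet(self, payload):
        """Thread 2: store a received (and already repaired) payload in BF1."""
        self.bf1.append(payload)

    def decode_step(self):
        """Thread 2: pop one payload from BF1, decode it, and push the PCM into BF2."""
        if not self.bf1:
            return False
        self.bf2.append(self.decode(self.bf1.popleft()))
        return True

    def play_step(self):
        """Thread 3: pop one PCM frame from BF2 for the sound device, or None if empty."""
        return self.bf2.popleft() if self.bf2 else None


pipeline = ReceivePipeline(decode=lambda payload: list(payload))
pipeline.on_packet(b"\x01\x02")
pipeline.decode_step()
pcm = pipeline.play_step()
```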

The effect processing and the mixing of BGM with voice (Effect or Mixing) performed in thread 1 may instead be performed in thread 3, after the data is read from the fluctuation absorbing buffer BF2 and before it is reproduced. It is also possible to omit the PCM buffer for uncompressed data and to reproduce the data read from the fluctuation absorbing buffer and decoded without storing it in the fluctuation absorbing buffer BF2.

The VoIP system of the present embodiment realizes high-quality real-time communication, that is, it provides, for example, a function for adding high-quality BGM to a conversation, while preventing the network bandwidth from being occupied by the transmission and reception of a constant amount of data regardless of whether the user is speaking.

For this purpose, the VoIP call control means 7 shown in FIG. 3 checks the amount of the user's voice information captured from the microphone. When the voice is below a predetermined volume level and the amount of voice information is small, for example when the conversation has paused, it controls the RTP/RTCP control unit 12 to temporarily stop transmitting voice data and to prefetch and batch-transmit only the BGM data. That is, as described above, the device normally operates in a real-time transmission mode, in which real-time acoustic data obtained by synthesizing voice with BGM and sound effects is transmitted in RTP packets, and it switches to a batch transmission mode, in which only BGM is transmitted in a batch, at the timing when it determines that the user is not speaking.
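The volume check and mode switch can be illustrated with a short sketch. The text does not say how the volume level is measured, so the RMS measure, the threshold value of 500, and the names `Mode`, `rms_level`, and `select_mode` are all assumptions chosen for the example.

```python
import math
from enum import Enum


class Mode(Enum):
    REAL_TIME = "real_time"  # voice (mixed with BGM and effects) sent in real time
    BATCH = "batch"          # only BGM is sent ahead in a batch


def rms_level(samples):
    """Root-mean-square level of a block of PCM samples (assumed measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_mode(samples, threshold=500.0):
    # Below the (hypothetical) threshold the user is treated as not speaking.
    if rms_level(samples) < threshold:
        return Mode.BATCH
    return Mode.REAL_TIME
```

A practical detector would probably add hysteresis or a hold time so that short pauses between words do not trigger a mode switch.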

As described later, the RTCP packet transmitted from the RTP/RTCP control unit 12 can be extended to carry the sizes of the sender's own fluctuation absorbing buffers BF1 and BF2, so each VoIP client can learn the fluctuation absorbing buffer size of the other VoIP client with which it communicates. Accordingly, when the user's voice information is small and the VoIP client can determine that the user is not speaking, it can batch-transmit BGM data alone, up to the fluctuation absorbing buffer size of the other VoIP client.
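As a minimal sketch of sizing one batch, the following assumes the BGM is held as a byte string and that the peer has reported both its buffer size and, optionally, how much of it is already occupied. The function name `bgm_batch` and its parameters are hypothetical.

```python
def bgm_batch(bgm, offset, peer_buffer_size, peer_queued=0):
    """Return the next BGM chunk that fits the peer's buffer, and the new offset.

    bgm              -- the BGM data to send ahead (bytes)
    offset           -- position of the first byte not yet sent
    peer_buffer_size -- buffer size reported by the peer, in bytes
    peer_queued      -- bytes the peer reports as already queued (assumed optional)
    """
    free = max(peer_buffer_size - peer_queued, 0)
    chunk = bgm[offset:offset + free]
    return chunk, offset + len(chunk)
```

For example, with a 6-byte peer buffer of which 2 bytes are occupied, the call returns the next 4 bytes of BGM and advances the offset by 4.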

By sending additional data such as BGM ahead of the voice with which it is to be played, the additional data no longer needs to be transmitted in real time. The information transmitted in real time is then only the voice data, or real-time acoustic data obtained by synthesizing voice and sound effects, so the network bandwidth required can be reduced.

Next, the RTP/RTCP packets used in the VoIP system of the present embodiment will be described more specifically.

Under the call control of the SIP control unit 11 shown in FIG. 3, transmission and reception of RTP/RTCP packets start after an RTP/RTCP session has been established. FIGS. 7(a) and 7(b) show the structure of an RTP packet and the format of its header, respectively. FIGS. 8(a) to 8(c) show the structure of an RTCP APP packet, which can be extended by an application, its header (RTCP APP Application-defined Header), and an example format of a sender report.

RTP is a transport protocol for transmitting and receiving voice and moving images in real time over an IP network such as the Internet, and is specified in RFC 1889. RTP is positioned in the transport layer and is generally used over UDP together with RTCP.

As shown in FIG. 7(a), an RTP packet consists of an IP header, a UDP header, an RTP header, and RTP data. As shown in FIG. 7(b), the RTP header contains, from the top, a version information storage unit (V: version, for example V = 2) F1, a padding storage unit (P: padding) F2, an extension bit storage unit (X: extension) F3, a CSRC (contributing source) count storage unit (CC) F4, a marker bit storage unit (M: marker) F5, a payload type information storage unit (PT: payload type) F6, a sequence number information storage unit (sequence number) F7, a time stamp storage unit (time stamp) F8, an SSRC identifier storage unit (synchronization source identifier) F9, and a CSRC identifier storage unit F10. The real-time acoustic data follows the CSRC identifier storage unit F10.

The version information storage unit F1 stores information indicating the RTP version; for RTP version 2, for example, it stores version information to that effect.

The CSRC count storage unit F4 indicates the number of CSRC (contributing source) identifiers contained in the header.

The payload type information storage unit F6 stores information indicating the type of the real-time acoustic data, for example information indicating video or audio.

The sequence number information storage unit F7 stores a sequence number that is incremented each time an RTP packet is transmitted or received in an RTP session and that is used to recognize the order of the RTP packets.

The time stamp storage unit F8 stores time stamp information on the time at which the real-time acoustic data was created or updated.

The SSRC identifier storage unit F9 and the CSRC identifier storage unit F10 store information for identifying the source on the data transmission side in an RTP session. The SSRC identifier is a synchronization source identifier, assigned so that multiple streams that the same user should handle together share the same value. The CSRC identifier is a contributing source identifier and indicates the origin of a stream; it is used, for example, when multiple streams are mixed and provided as a single stream.

Here, the real-time voice is identified by the SSRC, and the sound source data (additional data) of the mixed portion is treated as one CSRC: the CSRC count storage unit (CC) F4 is incremented by one and a CSRC is assigned.
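The header layout and the use of one CSRC for the mixed-in BGM can be illustrated with Python's `struct` module. The field widths below follow the fixed RTP header layout (version 2, no padding, no extension); the payload type 96 in the usage note and the function names `build_rtp_header` and `parse_rtp_header` are assumptions made for the example.

```python
import struct


def build_rtp_header(payload_type, seq, timestamp, ssrc, csrcs=(), marker=False):
    """Pack a version-2 RTP header; each CSRC adds one 32-bit word."""
    if len(csrcs) > 15:
        raise ValueError("CC is a 4-bit field, so at most 15 CSRCs fit")
    byte0 = (2 << 6) | len(csrcs)                       # V=2, P=0, X=0, CC
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M, PT
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def parse_rtp_header(data):
    """Unpack the fixed header and the CSRC list that follows it."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    cc = byte0 & 0x0F
    csrcs = list(struct.unpack("!%dI" % cc, data[12:12 + 4 * cc]))
    return {
        "version": byte0 >> 6,
        "cc": cc,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }
```

With the example identifiers used later in the description, `build_rtp_header(96, 1, 160, 1234, [5678])` yields a 16-byte header whose CC field is 1: the voice is the SSRC and the BGM is the single CSRC.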

The RTP/RTCP control unit 8 shown in FIG. 3 stores the various types of information in the storage units described above when transmitting real-time acoustic data according to RTP, and interprets the information stored in each storage unit to extract the real-time acoustic data.

Whereas RTP is the protocol that transmits and receives the audio and video data itself, RTCP is a protocol that periodically evaluates line quality, such as packet loss, delay jitter, and round-trip time, and exchanges the information needed to realize real-time communication matched to the available bandwidth.

With RTCP, a sender can perform dynamic processing, such as estimating the network state from information fed back by the other party and changing the transmission rate accordingly. Because RTCP packets also carry information indicating who is currently sending data and who is receiving it, each party can learn who the current participants are.

As shown in FIG. 8(a), an RTCP packet consists of an IP header, a UDP header, an RTCP header, and RTCP data. As shown in FIG. 8(b), the header of an RTCP APP packet, which is an extensible RTCP packet, contains, from the top, a version information storage unit (V: version, for example V = 2) F11, a padding storage unit (P: padding) F12, a subtype storage unit F13, a packet type (PT: packet type) storage unit F14, a report length storage unit (length) F16, an SSRC/CSRC identifier storage unit F17, and a Name storage unit F18 written in ASCII (American Standard Code for Information Interchange). This is followed by a data storage unit (Application-Dependent Data) F19 that stores application-specific data.

The packet type PT storage unit F14 records the type of the RTCP packet. In the present embodiment it is set to PT = APP (Application: application-specific information) = 204 (packet type value). APP indicates that the packet conveys application-specific control information not defined by RTCP itself.
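A rough sketch of building such an APP packet follows. The 12-byte header layout (subtype in the low 5 bits of the first byte, PT = 204, a length field counting 32-bit words minus one, then the SSRC and a 4-character name) follows the standard RTCP APP format; the name `"JBUF"` used in the usage note and the function name `build_rtcp_app` are hypothetical.

```python
import struct

RTCP_PT_APP = 204  # packet type value for application-defined RTCP packets


def build_rtcp_app(ssrc, name, data=b"", subtype=0):
    """Pack an RTCP APP packet: 12-byte header followed by application data."""
    name_bytes = name.encode("ascii")
    if len(name_bytes) != 4:
        raise ValueError("name must be exactly 4 ASCII characters")
    if len(data) % 4:
        raise ValueError("application data must be a multiple of 32 bits")
    if not 0 <= subtype <= 31:
        raise ValueError("subtype is a 5-bit field")
    # The length field holds the packet size in 32-bit words, minus one.
    length_words = (12 + len(data)) // 4 - 1
    header = struct.pack("!BBHI4s", (2 << 6) | subtype, RTCP_PT_APP,
                         length_words, ssrc, name_bytes)
    return header + data
```

For instance, `build_rtcp_app(1234, "JBUF", struct.pack("!II", 48000, 96000))` produces a 20-byte packet whose second byte is 204 and whose length field is 4.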

As described above, the RTP/RTCP control unit 8 shown in FIG. 3 can write, into the data storage unit F19 of an RTCP APP packet, a retransmission request for missing packets and the size of the fluctuation absorbing buffer for RTP packet reception, which is used to absorb jitter when RTP packets are received, and transmit the packet. At the same time, on instruction from the VoIP call control unit 3, it can write and transmit information on the frequency components to be compressed in the data that the communication partner is to transmit.

RTCP packets include the RTCP SR (Sender Report) packet, which is sent by a sender of RTP data, and the RTCP RR (Receiver Report) packet, which is sent by a receiver of RTP data.

An SR packet is sent from a terminal that is transmitting a stream to the other terminals. It contains sender information (sender info) about the stream transmitted by the terminal itself, and report blocks (reception report blocks) that report to the transmitting device the reception state (packet discard rate, jitter, and so on) of each stream the terminal has received. An RR packet conveys information about streams received from other call devices and likewise contains, for each received stream, a report block that reports its reception state to the transmitting device.

As shown in FIG. 8(c), each report block contains the synchronization source (SSRC) identifier of the packet sender, the RTP loss rate, the number of lost RTP packets, the received sequence number, the average jitter of the arrival time interval, the transmission time of the last received SR (LSR: Last SR timestamp), and the time from receiving the last SR until sending this RR (DLSR: Delay since Last SR).

Accordingly, the transmitting side keeps track of the number of RTP packets and RTP bytes sent when transmitting RTP data, and the receiving side keeps track of management information such as the number of RTP packets received, the number of RTP packets lost, and the arrival-time jitter when receiving RTP data.
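The report block and the receiver-side bookkeeping can be sketched as follows. The 24-byte layout (an 8-bit loss fraction sharing a word with a 24-bit cumulative loss count) follows the standard RTCP report block; treating the cumulative count as unsigned is a simplification, and the function names are assumptions made for the example.

```python
import struct


def loss_fraction(expected, received):
    """Loss rate as an 8-bit fixed-point fraction (lost / expected * 256)."""
    if expected <= 0:
        return 0
    lost = max(expected - received, 0)
    return min((lost * 256) // expected, 255)


def pack_report_block(ssrc, fraction_lost, cumulative_lost,
                      highest_seq, jitter, lsr, dlsr):
    # Fraction lost (8 bits) and cumulative lost (24 bits) share one word.
    lost_word = ((fraction_lost & 0xFF) << 24) | (cumulative_lost & 0xFFFFFF)
    return struct.pack("!IIIIII", ssrc, lost_word, highest_seq, jitter, lsr, dlsr)


def unpack_report_block(block):
    ssrc, lost_word, highest_seq, jitter, lsr, dlsr = struct.unpack("!IIIIII", block)
    return {
        "ssrc": ssrc,
        "fraction_lost": lost_word >> 24,
        "cumulative_lost": lost_word & 0xFFFFFF,
        "highest_seq": highest_seq,
        "jitter": jitter,
        "lsr": lsr,
        "dlsr": dlsr,
    }
```

A receiver that expected 100 packets and received 90 would report `loss_fraction(100, 90)`, which is 25 on the 0 to 255 scale.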

FIG. 9 shows the messages exchanged in an RTCP APP message exchange in the present embodiment. UA1 and UA2 are each both an RTP sender and an RTP receiver. As shown in FIG. 9, the RTCP packets containing an SR block and the RTCP packets containing an RR block that they exchange can carry the following as extension data: the encode bandwidth (Encode Bandwidth); the number of subband division blocks (sub band Numbers); the sizes of the fluctuation absorbing buffers before and after the decoding processing unit, which can be converted to time (Forward and Backward Jitter Buffer size); the size of the data queued in those buffers, which can also be converted to time (Buffer queued size); and retransmission request information for RTP packets (Re-Request: sequence number). A retransmission request identifies the RTP packet by its sequence number.
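The extension fields listed above could be serialized into the application-dependent data area along the following lines. The text names the fields but gives no byte layout, so the field order, the widths, the units, and the class name `AppExtension` are purely hypothetical.

```python
import struct
from dataclasses import dataclass, astuple

# Hypothetical layout: 32-bit bandwidth, two 16-bit fields, three 32-bit sizes
# (20 bytes in total, a multiple of 32 bits as an RTCP APP payload requires).
_LAYOUT = struct.Struct("!IHHIII")


@dataclass
class AppExtension:
    encode_bandwidth_hz: int    # Encode Bandwidth
    subband_count: int          # sub band Numbers
    rerequest_seq: int          # Re-Request: sequence number
    forward_buffer_bytes: int   # jitter buffer before the decoder
    backward_buffer_bytes: int  # jitter buffer after the decoder
    queued_bytes: int           # Buffer queued size

    def pack(self):
        return _LAYOUT.pack(*astuple(self))

    @classmethod
    def unpack(cls, data):
        return cls(*_LAYOUT.unpack(data))
```

Packing and then unpacking an instance returns an equal object, which is a quick way to check that both ends agree on the layout.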

This application-specific RTCP APP packet can carry, in addition to the sizes of the fluctuation absorbing buffers BF1 and BF2 before and after the decoding process and the amount of data occupying them, information with which the receiving VoIP client specifies the encode bandwidth, the number of subband division blocks, and so on of the transmitting VoIP client's data. The receiving client can thereby specify the compression coding scheme of its communication partner and control the transmission rate of the stream it receives. It can also request retransmission of missing packets according to its own fluctuation absorbing buffer size and the RTT (Round Trip Time), that is, the time from transmitting an RTCP APP packet containing an SR (Sender Report) block until receiving an RTCP APP packet containing an RR (Receiver Report) block.
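One plausible reading of requesting retransmission "according to the buffer size and the RTT" is that a retransmission is only worth requesting when the buffered playout time exceeds the round trip. The sketch below takes that reading as an assumption; the function names, the optional `dlsr` correction, and the `margin` parameter are hypothetical.

```python
def round_trip_time(sr_sent_at, rr_received_at, dlsr=0.0):
    """RTT in seconds from SR transmission to RR reception.

    dlsr is the peer's reported delay between receiving the SR and sending
    the RR; subtracting it removes the peer's holding time (assumed optional).
    """
    return rr_received_at - sr_sent_at - dlsr


def should_request_retransmit(buffered_seconds, rtt_seconds, margin=0.0):
    # A resent packet arrives about one RTT later, so it is only useful if
    # the jitter buffer still holds more than that much audio.
    return buffered_seconds > rtt_seconds + margin


def missing_sequences(received, first, last):
    """Sequence numbers in [first, last] that have not been received."""
    seen = set(received)
    return [s for s in range(first, last + 1) if s not in seen]
```

The sketch ignores 16-bit sequence number wrap-around, which a real implementation would need to handle.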

Next, a method of using such RTCP APP packets to reduce the amount of real-time data will be described in more detail. FIG. 10 shows the processing sequence for batch-transmitting only BGM data while the voice is interrupted.

As shown in FIG. 10, when the VoIP clients UA1 and UA2 constituting the VoIP system exchange high-quality data in which BGM is synthesized with voice, they first exchange the sizes of the fluctuation absorbing buffers BF1 and BF2 before and after the decoding processing unit by means of RTCP APP packets, as described above (D1). These buffer sizes can be converted to transmission time.
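The conversion from buffer size to time is simple arithmetic, sketched below. The default PCM format (44.1 kHz, stereo, 16-bit) and the function names are assumptions for illustration; the embodiment does not state the audio format at this point.

```python
def pcm_buffer_seconds(size_bytes, sample_rate=44100, channels=2, bytes_per_sample=2):
    """Playout time held by an uncompressed (PCM) buffer such as BF2."""
    return size_bytes / (sample_rate * channels * bytes_per_sample)


def encoded_buffer_seconds(size_bytes, bitrate_bps):
    """Playout time held by a buffer of encoded data such as BF1."""
    return size_bytes * 8 / bitrate_bps
```

Under these assumed defaults, a PCM buffer of 882,000 bytes holds 5 seconds of audio, and 16,000 bytes of data encoded at 128 kbit/s holds 1 second.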

Then, in the real-time transmission mode, the clients exchange voice and acoustic data in RTP packets (D2). Here the real-time voice is sent with, for example, SSRC = 1234, the CC field F4 described above is incremented, and the BGM audio is identified by, for example, CSRC = 5678.

Subsequently, suppose that one VoIP client, UA1, determines that the conversation has been interrupted and the voice picked up by the microphone has stopped, for example because the user has left the communication device for some reason. If there is spare network bandwidth, UA1 switches to the batch transmission mode and batch-transmits BGM data amounting to the size (for example, n seconds' worth) of the other party's fluctuation absorbing buffer BF2 after the decoding process (D3). For this transmission, the CSRC (= 5678) that was used while the real-time voice and BGM were being mixed is used as the SSRC; that is, the data is transmitted with SSRC = 5678.

As a result, for the n seconds covered by the batch-transmitted BGM data, the transmitting VoIP client UA1 needs to transmit only voice.

If no voice is detected during those n seconds, the VoIP client UA1 does not need to transmit anything. When UA1 resumes transmitting voice, it again mixes the voice with BGM, increments the sequence number by the amount corresponding to n, compresses the audio, and resumes transmission to the other party, VoIP client UA2, in RTP packets.
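The sender behavior in steps D2 and D3 can be summarized as a small planning function. This is a simplified model under stated assumptions: time is divided into equal ticks, the peer's buffer size is given in ticks, spare network bandwidth is always available, and the name `plan_transmission` and the action labels are hypothetical.

```python
def plan_transmission(voice_active, peer_buffer_ticks):
    """Return what the sender transmits on each tick.

    voice_active      -- one boolean per tick, True while the user is speaking
    peer_buffer_ticks -- peer's post-decode buffer size expressed in ticks (n)
    """
    plan = []
    credit = 0  # ticks of BGM already delivered to the peer ahead of time
    for active in voice_active:
        if credit == 0 and not active:
            # Silence with no BGM sent ahead: batch-send n ticks of BGM.
            plan.append("batch_bgm")
            credit = peer_buffer_ticks
        elif credit > 0:
            # The peer already holds BGM: send voice alone, or nothing at all.
            plan.append("voice_only" if active else "nothing")
            credit -= 1
        else:
            # Normal real-time mode: voice mixed with BGM.
            plan.append("voice+bgm")
    return plan
```

For the input `[True, False, True, False, False, True]` with a 3-tick peer buffer, the plan is real-time mixing, one batch, one tick of voice only, two silent ticks, and then real-time mixing again.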

In the present embodiment, when little voice information is captured from the microphone, for example when the conversation has paused, transmission of voice data is temporarily stopped and only the BGM data is prefetched and batch-transmitted. This allows the network bandwidth to be used more efficiently while the user is not speaking.

In the embodiment described above, the other acoustic data to be mixed is batch-transmitted when the conversation pauses. It is also possible, however, to prefetch and batch-transmit the BGM data at the time of call control.

For example, the user of a call device normally knows whom they are calling before the call starts. The user can therefore first select the BGM to be mixed and then place the call, whereupon the SIP control unit 11 shown in FIG. 3 performs call control. Once the RTP/RTCP media session has been established and the two sides have exchanged, by RTCP APP packets, the sizes of each other's fluctuation absorbing buffers BF2 after the decoding process, the client enters the batch transmission mode and batch-transmits only the BGM before any RTP packets of voice data are exchanged. For the n seconds covered by the batch-transmitted BGM data, the transmitting VoIP client then needs to transmit only voice, which reduces the amount of information transmitted in real time and provides the same effect as described above.

Furthermore, additional data such as BGM may of course be batch-transmitted after call control and before voice data is exchanged, and then, in response to a user request such as a change of BGM, batch-transmitted again at a timing when the voice is interrupted as described above. That is, transmission may switch between the real-time transmission mode and the batch transmission mode as needed.

FIG. 1 is a schematic diagram showing an example of the VoIP system according to the embodiment of the present invention.
FIG. 2 is a block diagram showing the main part of the VoIP system according to the embodiment, consisting of the VoIP client 111 and the Internet 130.
FIG. 3 is a diagram showing the VoIP client application in the case where a general PC is used as a VoIP client of the VoIP system according to the embodiment.
FIG. 4 is a schematic diagram showing an example of the GUI of the VoIP client application.
FIG. 5 is a block diagram showing the hardware configuration of a VoIP client in the VoIP system according to the embodiment.
FIG. 6 is a functional block diagram showing the processing, within the VoIP client application shown in FIG. 3, that transmits and receives RTP/RTCP packets.
FIGS. 7(a) and 7(b) are diagrams showing the structure of an RTP packet and the format of its header, respectively.
FIGS. 8(a) to 8(c) are diagrams showing the structure of an RTCP APP packet that can be extended by an application, its header (RTCP APP Application-defined Header), and an example format of a sender report.
FIG. 9 is a diagram showing the messages exchanged in an RTCP APP message exchange in the VoIP system of the present embodiment.
FIG. 10 is a diagram explaining the call method of the present embodiment, showing the processing sequence for batch-transmitting only BGM data while the voice is interrupted.

Explanation of symbols

1 VoIP client application, 21 transmission means, 22 microphone capture, 23 sound effect file reading unit, 24 BGM file reading unit, 25, 26, 31 decoder, 27, 28, 29 gain adjustment unit, 30 synthesis unit, 31 encoder, 32 packetizing unit, 33 transmitting unit, 41 receiving means, 42 receiving unit, 43 depacketizing unit, 44 dejittering unit, 45 packet compensation unit, 46 decoder, 47 reading unit, 48 decoding unit, 49, 50, 51, 52 gain adjustment unit, 53 synthesis unit, 54 output unit, 55 speaker, 100 VoIP communication system, 111, 121 VoIP client, 110, 120 user, 112, 122 web browser, 134 web server, 113, 123 sound source data, 130 Internet

Claims

A call device that transmits and receives at least voice data via the Internet, comprising:
voice conversion means for converting the collected sound into an electrical signal;
synthesizing means for synthesizing additional data with the voice data converted into the electrical signal;
data packet generating means for generating a data packet storing the voice data and/or additional data to be synthesized with the voice data;
control packet generating means for generating a control packet storing management information for managing at least transmission and reception of the data packet;
transmitting means for transmitting the data packet and the control packet to one or more other call devices via the Internet;
receiving means for receiving data packets and control packets from the one or more other call devices; and
control means for controlling transmission and reception of the data packet and the control packet,
wherein the control means performs switching control between a first mode, in which a first data packet storing the voice data or synthesized data obtained by synthesizing the additional data with the voice data is transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch.

The call device according to claim 1, wherein the control means performs switching control to the second mode in which the collective transmission is performed when the collected sound is lower than a predetermined volume level.

The call device according to claim 1, wherein the control means controls to perform batch transmission in the second mode before performing real-time transmission in the first mode with the other call device.

The call device according to claim 1, wherein, for a predetermined time after the second data packets have been batch-transmitted in the second mode, the control means performs control so that first data packets storing only the voice data are transmitted in the first mode.

The call device according to claim 1, wherein, when the collected sound is below a predetermined volume level after the second data packets have been batch-transmitted in the second mode, the control means performs control so that transmission of data packets is stopped for a predetermined time.

The call device according to claim 1, wherein the additional data is background music and / or sound effects.

The call device according to claim 1, wherein the receiving means has a reception buffer for buffering received data packets, and
the control means stores buffer information indicating the capacity of the reception buffer in the control packet, and controls the transmission rate of the second data packets on the basis of the buffer information of the other call device stored in the control packet that the receiving means has received from the other call device.

The call device according to claim 7, wherein the control means stores buffer occupancy information indicating the amount of data buffered in the reception buffer in the control packet, and controls the transmission rate of the second data packets on the basis of the buffer information and the buffer occupancy information of the other call device stored in the control packet that the receiving means has received from the other call device.

A call method for transmitting and receiving at least voice data via the Internet, comprising:
a voice conversion step of converting the collected sound into an electrical signal;
a synthesis step of synthesizing additional data with the voice data converted into the electrical signal;
a data packet generation step of generating a data packet storing the voice data and/or additional data to be synthesized with the voice data;
a control packet generation step of generating a control packet storing management information for managing at least transmission and reception of the data packet;
a transmission step of transmitting the data packet and the control packet to one or more other call devices via the Internet;
a reception step of receiving data packets and control packets from the one or more other call devices; and
a control step of controlling transmission and reception of the data packet and the control packet,
wherein, in the control step, switching control is performed between a first mode, in which a first data packet storing the voice data or synthesized data obtained by synthesizing the additional data with the voice data is transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch.

The call method according to claim 9, wherein, in the control step, when the collected voice is less than a predetermined volume level, the control is switched to the second mode in which the batch transmission is performed.

The call method according to claim 9, wherein, in the control step, control is performed so as to perform batch transmission in the second mode before performing real-time transmission in the first mode with the other call device.

The call method according to claim 9, wherein, in the control step, for a predetermined time after the second data packets have been batch-transmitted in the second mode, first data packets storing only the voice data are transmitted in the first mode.

The call method according to claim 9, wherein, in the control step, when the collected sound is below a predetermined volume level after the second data packets have been batch-transmitted in the second mode, control is performed so that transmission of data packets is stopped for a predetermined time.

The call method according to claim 9, wherein the additional data is background music and / or sound effects.

The call method according to claim 9, wherein, in the control step, buffer information indicating the capacity of a reception buffer for buffering received data packets is stored in the control packet, and the transmission rate of the second data packets is controlled on the basis of the buffer information of the other call device stored in the control packet received from the other call device in the reception step.

The call method according to claim 15, wherein, in the control step, buffer occupancy information indicating the amount of data buffered in the reception buffer is stored in the control packet, and the transmission rate of the second data packets is controlled on the basis of the buffer information and the buffer occupancy information of the other call device stored in the control packet received from the other call device in the reception step.

A call system in which at least voice data is transmitted and received between at least two call devices via the Internet, wherein
each of the call devices comprises:
voice conversion means for converting the collected sound into an electrical signal;
synthesizing means for synthesizing additional data with the voice data converted into the electrical signal;
data packet generating means for generating a data packet storing the voice data and/or additional data to be synthesized with the voice data;
control packet generating means for generating a control packet storing management information for managing at least transmission and reception of the data packet;
transmitting means for transmitting the data packet and the control packet to one or more other call devices via the Internet;
receiving means for receiving data packets and control packets from the one or more other call devices; and
control means for controlling transmission and reception of the data packet and the control packet,
the receiving means of one of the at least two call devices has a reception buffer for buffering received data packets, and the control means of that call device stores buffer information indicating the capacity of the reception buffer in the control packet, and
the control means of the other call device performs switching control between a first mode, in which a first data packet storing the voice data or synthesized data obtained by synthesizing the additional data with the voice data is transmitted in real time, and a second mode, in which second data packets storing only the additional data are transmitted in a batch, and controls the transmission rate of the second data packets on the basis of the buffer information of the one call device stored in the control packet that its receiving means has received from the one call device.