JP2018101826A

JP2018101826A - Voice call system, voice call method, and program

Info

Publication number: JP2018101826A
Application number: JP2016245058A
Authority: JP
Inventors: Susumu Sugimoto; Akira Gohara; Masao Oshimi
Original assignee: CRI Middleware Co Ltd
Current assignee: CRI Middleware Co Ltd
Priority date: 2016-12-19
Filing date: 2016-12-19
Publication date: 2018-06-28

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of computation on the device that mixes voices.

SOLUTION: A voice call system includes a plurality of client devices and a server device 11. Each client device includes voice acquisition means 30 for acquiring voice data as time-domain data; first conversion means 31 for sampling the acquired time-domain data, dividing it into blocks with a constant number of samples, and converting it into frequency-domain data block by block; transmission means 32 for transmitting the converted frequency-domain data to the server device 11; reception means 33 for receiving, from the server device 11, mixed data generated by mixing a plurality of pieces of frequency-domain data; and second conversion means 31 for converting the mixed data received by the reception means 33 into time-domain data. The server device 11 can therefore mix a plurality of pieces of frequency-domain data without converting them into time-domain data.

SELECTED DRAWING: Figure 6

Description

The present invention relates to a voice call system in which voice data is exchanged to carry out a voice call, to a voice call method, and to a program for causing a computer to execute the method.

In VoIP systems, video conferencing systems, voice chat systems, and similar systems in which a plurality of users converse over the Internet, peer-to-peer (hereinafter, P2P) is widely used as the network topology (see Patent Document 1). In P2P, however, the clients connect directly to one another, so the total traffic of the system grows enormously as the number of participants in a voice call increases.

One way to avoid this growth in P2P traffic is a system in which a relay device (server) placed on the communication path mixes, for each user (client), the voices uttered by the other users and transmits the result to the terminal used by that user, which outputs the audio. A network system that adopts this communication scheme is called a client-server model. In the client-server model, the terminals used by the other users encode their voice data and transmit it; the relay device decodes it into time-domain data (see, for example, Patent Document 2), mixes the decoded time-domain data, encodes the result, and transmits it to the terminal used by that user.

In systems in which a plurality of users converse over the Internet, voice codecs based on differential pulse code modulation (DPCM), such as ITU-T G.711, are widely used to compress call audio. However, DPCM-based voice codecs have a low compression ratio, so call quality cannot be improved at low bit rates. To improve call quality while keeping traffic down, it is desirable to use a voice codec based on transform coding, in which time-domain data is converted into frequency-domain data and then encoded.

Patent Document 1: JP 2005-328178 A
Patent Document 2: JP 2010-044175 A

However, the relay device described above must decode the voice of each user in the conversation into time-domain data and then encode it again, so the amount of computation on the device that mixes the voices grows enormously as the number of participants increases.

There has therefore been a demand for a system and method that can reduce the amount of computation on the device that mixes the voices.

In view of the above problem, the present invention provides a voice call system including a plurality of client devices and a server device, wherein each client device comprises: voice acquisition means for acquiring voice data as time-domain data representing the change in sound pressure over time; first conversion means for sampling the time-domain data, dividing it into blocks with a constant number of samples, and converting it, block by block, into frequency-domain data representing the strength of each frequency component; transmission means for transmitting the converted frequency-domain data to the server device; reception means for receiving, from the server device, mixed data generated by mixing a plurality of pieces of frequency-domain data; and second conversion means for converting the mixed data received by the reception means into time-domain data.

Providing the system and related aspects of the present invention makes it possible to reduce the amount of computation on the device that mixes the voices.

FIG. 1 is a diagram showing a configuration example of the voice call system.
FIG. 2 is a diagram for explaining an overview of the voice call system of the present invention.
FIG. 3 is a diagram showing processing in which a client device converts data in units of blocks with a constant number of samples.
FIG. 4 is a diagram showing the processing performed by the server device.
FIG. 5 is a diagram illustrating the hardware configuration of a client device.
FIG. 6 is a functional block diagram of a client device.
FIG. 7 is a flowchart showing the flow of the voice data transmission processing executed in a client device.
FIG. 8 is a flowchart showing the flow of the voice data reception processing executed in a client device.
FIG. 9 is a functional block diagram showing a first embodiment of the server device.
FIG. 10 is a diagram illustrating a correspondence table.
FIG. 11 is a flowchart showing the flow of the voice data mixing processing executed in the server device.
FIG. 12 is a functional block diagram showing a second embodiment of the server device.

FIG. 1 is a diagram showing a configuration example of the voice call system. In a voice call among a plurality of users, each user must be sent the voices of all the other users. When two users talk, they need only send voice data to each other, but as the number of participants increases the traffic grows enormously, so it is desirable to mix the voice data before distributing it. The voice call system therefore includes a plurality of client devices 10a to 10n, each of which acquires the voice data of the voice uttered by its user and outputs the voices of all the other users, and a server device 11 that mixes the voice data.

The client devices 10a to 10n and the server device 11 are connected to a network 12 and can communicate with one another over it. Each of the client devices 10a to 10n accepts the voice uttered by its user as voice data and transmits the voice data to the server device 11 via the network 12.

For the client device 10a, the server device 11 mixes the voice data received from the other client devices 10b to 10n and delivers the mix to the client device 10a. Likewise, for the client device 10b, it mixes the voice data received from the client devices 10a and 10c to 10n and delivers the mix to the client device 10b. The server device 11 mixes and delivers voice data to each of the client devices 10c to 10n in the same way.
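The per-client mixing described above can be sketched as follows. This is a minimal illustration rather than the implementation in the specification: the function name `mix_for_each_client`, the dictionary layout, and the device labels used as keys are assumptions, and each client's data is represented as a plain list of numbers of equal length.

```python
def mix_for_each_client(frames):
    """Return, for every client, the sum of the frames sent by all OTHER clients.

    frames: dict mapping a client label (hypothetical, e.g. "10a") to a list
    of numbers; all lists are assumed to have the same length.
    """
    mixes = {}
    for receiver in frames:
        others = [data for sender, data in frames.items() if sender != receiver]
        # Element-wise sum over the other clients; zip(*others) walks the lists in parallel.
        mixes[receiver] = [sum(values) for values in zip(*others)]
    return mixes


frames = {
    "10a": [1.0, 2.0],
    "10b": [10.0, 20.0],
    "10c": [100.0, 200.0],
}
mixes = mix_for_each_client(frames)
print(mixes["10a"])  # the mix of 10b and 10c, without 10a's own voice
```

Excluding the receiver's own data means that no user hears an echo of his or her own voice in the delivered mix.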

Each user inputs his or her own voice at the client device he or she uses and transmits it, and receives, outputs, and listens to the voices of all the other users. This makes a simultaneous voice call among a plurality of users possible even when they are in different locations.

The client devices 10a to 10n may be any devices capable of executing the above processing, for example home game consoles, amusement machines, personal computers, mobile phones, or embedded devices capable of audio input and output.

The network 12 may be a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet, and may be wired or wireless. Wireless communication can use wireless LAN, Bluetooth (registered trademark), or infrared communication; when wireless LAN is used, communication can go through an access point connected to the network 12. The network 12 is not limited to a single network and may consist of two or more networks, which can be connected by relay devices such as routers or proxy servers.

The client devices 10a to 10n that make up the voice call system can be client devices that have been authenticated, for example by a user of the voice call service logging in. User authentication may be by entry of a user ID and password, by biometrics, or by holding an IC card, mobile phone, smartphone, or the like over a reader. These are only examples; any known authentication method can be used.

When one of the client devices 10a to 10n acquires the voice data of the voice uttered by its user and transmits it to the server device 11, it can attach user identification information or device identification information that identifies the user or the client device. This makes it possible to identify the user or device that sent each piece of voice data. Both kinds of identification information may be attached, but because that increases the data size it is preferable to attach only one of them.

A user name or user ID can be used as the user identification information, and a device name, device ID, IP address, MAC (Media Access Control) address, or the like can be used as the device identification information. The user identification information can be obtained from what the user entered at login, and the device identification information can be obtained from information registered in advance in the client device.
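One way to picture voice data that carries a single identifier is the sketch below. The specification does not define a wire format, so everything here is hypothetical: the class name `FrequencyPacket`, the 16-byte identifier field, the block index, and the use of 32-bit floats for the payload are assumptions chosen only to make the example concrete.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical header layout: 16-byte identifier, 32-bit block index, 16-bit value count.
HEADER = struct.Struct("<16sIH")


@dataclass
class FrequencyPacket:
    sender_id: str        # a user ID OR a device ID; only one is sent to keep the packet small
    block_index: int      # position of this block in the sender's stream (assumption)
    values: List[float]   # the voice data of one block

    def to_bytes(self) -> bytes:
        ident = self.sender_id.encode("utf-8")
        if len(ident) > 16:
            raise ValueError("identifier does not fit in the 16-byte field")
        header = HEADER.pack(ident, self.block_index, len(self.values))
        return header + struct.pack(f"<{len(self.values)}f", *self.values)

    @classmethod
    def from_bytes(cls, data: bytes) -> "FrequencyPacket":
        ident, block_index, count = HEADER.unpack_from(data, 0)
        values = list(struct.unpack_from(f"<{count}f", data, HEADER.size))
        # struct pads short identifiers with NUL bytes, so strip them again.
        return cls(ident.rstrip(b"\x00").decode("utf-8"), block_index, values)


packet = FrequencyPacket("user-42", 7, [0.5, -1.25, 3.0])
restored_packet = FrequencyPacket.from_bytes(packet.to_bytes())
print(restored_packet.sender_id, restored_packet.block_index)
```

With this layout a receiver can group incoming blocks by `sender_id` before mixing, which is the role the identification information plays in the description.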

For the client device 10a, the server device 11 can use the user identification information or device identification information to identify the voice data received from the other client devices 10b to 10n, mix the identified voice data, and deliver the mix. It can identify, mix, and deliver voice data for the other client devices 10b to 10n in the same way.

Next, an overview of the present invention is given. FIG. 2 is a diagram for explaining the overview of the voice call system of the present invention: FIG. 2(A) shows a conventional voice call system and FIG. 2(B) shows the voice call system of the present invention. FIG. 3 is a diagram showing the processing in which a client device converts data in units of blocks with a constant number of samples. FIG. 4 is a diagram showing the processing performed by the server device: FIG. 4(A) shows the processing of the server device in the conventional voice call system, and FIG. 4(B) shows that in the voice call system of the present invention.

As shown in FIGS. 2 and 3, on the client device side a microphone 100 picks up the voice uttered by the user and acquires voice data (see FIGS. 2(A) and 2(B) and FIG. 3(1)). The microphone 100 corresponds to the audio input device 26 in FIG. 5 and the voice acquisition means 30 in FIG. 6, both described later. The client device converts the analog voice data into digital voice data (see the A/D conversion in FIG. 3(2)). In the A/D conversion, the client device turns the continuous analog voice data into discrete digital voice data by averaging the value over each predetermined sampling interval and extracting the values in sequence. As shown in FIG. 3(3), when the sampling frequency is 44.1 kHz, for example, the voice data after A/D conversion contains 44,100 samples per second. The sampling frequency is not limited to 44.1 kHz and may be, for example, 48 kHz.
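The relationship between sampling frequency and sample count can be illustrated with a short sketch. It makes several assumptions that are not in the specification: a pure sine tone stands in for the microphone signal, the value is taken at each sampling instant instead of being averaged over the interval, and 16-bit quantization is used; the function names are hypothetical.

```python
import math


def sample_tone(freq_hz, duration_s, sample_rate=44100):
    """Evaluate a sine tone at each sampling instant (stand-in for an analog input)."""
    count = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


def quantize_16bit(samples):
    """Map values in [-1.0, 1.0] to signed 16-bit integers (bit depth is an assumption)."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


one_second = sample_tone(440.0, 1.0)                  # 44.1 kHz by default
one_second_48k = sample_tone(440.0, 1.0, sample_rate=48000)
pcm = quantize_16bit(one_second)
print(len(one_second), len(one_second_48k))
```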

Next, the client device divides the voice data after A/D conversion into blocks with a constant number of samples (see FIGS. 3(3) and 3(4)). Block sizes of, for example, 128 or 1024 samples are assumed. The voice data (sampled data) shown in FIG. 3(3) is time-domain data representing the change in sound pressure over time. In FIG. 2, time-domain data is labeled "wave".
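The division into constant-size blocks can be sketched as below. The function name `split_into_blocks` is hypothetical, and zero-padding the final partial block is an assumption; the specification only states that every block has the same number of samples.

```python
def split_into_blocks(samples, block_size=1024):
    """Split a sample sequence into blocks that all hold exactly block_size samples."""
    blocks = []
    for start in range(0, len(samples), block_size):
        block = list(samples[start:start + block_size])
        # Zero-pad the last block so that every block has the same size (assumption).
        block += [0.0] * (block_size - len(block))
        blocks.append(block)
    return blocks


samples = [float(i) for i in range(2500)]
blocks = split_into_blocks(samples, block_size=1024)
print(len(blocks), [len(b) for b in blocks])
```

At 44.1 kHz, one second of audio (44,100 samples) occupies 44 blocks of 1024 samples, the last of which is only partly filled.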

Next, the client device converts the time-domain data, block by block, into frequency-domain data representing the strength of each frequency component (see FIGS. 2(A) and 2(B) and FIG. 3(5)). In the example shown in FIGS. 2 and 3, the modified discrete cosine transform (MDCT) is used to convert the time-domain data into frequency-domain data. However, the Fourier transform, the discrete Fourier transform (DFT), the fast Fourier transform (FFT), the discrete cosine transform (DCT), or the like may be used instead. In FIG. 2, frequency-domain data is labeled "frequency". The client device then transmits the frequency-domain data obtained by the MDCT to the server device via the network 12.
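For readers unfamiliar with the MDCT, a direct (unoptimized) evaluation of its textbook definition is sketched below. This is not taken from the specification: the function name `mdct` is hypothetical, no window function is applied, and a production codec would use a fast algorithm instead of this O(N²) loop. The MDCT takes a frame of 2N samples and yields N coefficients, so consecutive frames are normally overlapped by N samples.

```python
import math


def mdct(frame):
    """Textbook MDCT: 2N time-domain samples in, N frequency coefficients out."""
    n = len(frame) // 2
    return [
        sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t in range(2 * n))
        for k in range(n)
    ]


frame = [math.sin(0.3 * t) for t in range(16)]   # 2N = 16 samples
coefficients = mdct(frame)                       # N = 8 coefficients
print(len(frame), len(coefficients))
```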

In the conventional voice call system shown in FIGS. 2(A) and 4(A), when the server device receives the frequency-domain data transmitted from a client device, it applies the inverse modified discrete cosine transform (IMDCT) to the received data to return it to time-domain data.

The conventional voice call system performs the IMDCT to return the frequency-domain data to time-domain data for the following reason. Typical audio coding switches the block size during processing in order to raise the compression ratio. When the block size is variable, a plurality of pieces of frequency-domain data cannot be mixed: unless the blocks on which the MDCT was performed are the same size, the pieces of frequency-domain data cannot be added together directly.
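The constraint described above can be made concrete with a small guard. The sketch assumes that each piece of frequency-domain data is a list of coefficients; the function name `add_blocks` is hypothetical. Coefficient k of a 128-sample block and coefficient k of a 1024-sample block describe different frequency bands, so element-wise addition is only meaningful when the block sizes match.

```python
def add_blocks(blocks):
    """Add coefficient blocks element-wise, refusing blocks of differing size."""
    sizes = {len(block) for block in blocks}
    if len(sizes) != 1:
        raise ValueError(f"cannot mix blocks of different sizes: {sorted(sizes)}")
    return [sum(values) for values in zip(*blocks)]


print(add_blocks([[1.0, 2.0], [3.0, 4.0]]))      # same size: addition is defined

try:
    add_blocks([[1.0, 2.0], [3.0, 4.0, 5.0]])    # different sizes: rejected
except ValueError as error:
    print(error)
```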

In the conventional voice call system, therefore, the server device first applies the IMDCT to the frequency-domain data to convert it into time-domain data (waveforms), and adds the time-domain waveforms together to generate mixed data. The server device then applies the MDCT again to convert the generated mixed data (time-domain data) into frequency-domain data and transmits it to the client device of the other party.

In contrast, in the voice call system of the present invention shown in FIGS. 2(B) and 4(B), the blocks on which the MDCT is performed all have the same size, as described above, so the server device can add a plurality of pieces of frequency-domain data together directly to mix them. The server device therefore generates mixed data by directly adding the pieces of frequency-domain data, without any transform, and transmits the generated mixed data (frequency-domain data) to the client device of the other party.
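Direct addition works because the MDCT is a linear transform: the sum of the coefficients of two signals equals the coefficients of the summed signal, provided both were transformed with the same block size. The following sketch checks this numerically. The `mdct` function is the unwindowed textbook definition, and the two test signals and the helper name `mix` are assumptions made for illustration.

```python
import math


def mdct(frame):
    """Textbook MDCT: 2N time-domain samples in, N frequency coefficients out."""
    n = len(frame) // 2
    return [
        sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t in range(2 * n))
        for k in range(n)
    ]


def mix(blocks):
    """Server-side mixing: element-wise addition of same-size coefficient blocks."""
    return [sum(values) for values in zip(*blocks)]


n = 8
voice_a = [math.sin(0.3 * t) for t in range(2 * n)]
voice_b = [math.cos(0.7 * t) for t in range(2 * n)]

# Path of the invention: each client transforms, the server only adds.
server_mix = mix([mdct(voice_a), mdct(voice_b)])

# Reference: add the waveforms first, then transform once.
reference = mdct([a + b for a, b in zip(voice_a, voice_b)])

max_error = max(abs(x - y) for x, y in zip(server_mix, reference))
print(max_error)  # only floating-point rounding noise remains
```

For each output stream the server performs only N additions per contributing sender and block, instead of an inverse transform per sender followed by a forward transform.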

As shown in FIGS. 2(A) and 2(B), when a client device receives the mixed data transmitted from the server device, it applies the IMDCT to the received mixed data (frequency-domain data) to convert it into time-domain data, and outputs the time-domain data as sound from a speaker 200. Although FIGS. 2 and 4 use the IMDCT, the processing is not limited to this, and the inverse Fourier transform, the inverse discrete Fourier transform, or the like may be used. In the voice call system shown in FIGS. 2(B) and 4(B), the server device thus adds and mixes the frequency-domain data directly without performing the IMDCT, which greatly reduces its processing load.

FIG. 5 is a diagram illustrating the hardware configuration of the client devices 10a to 10n for implementing the above processing. Hereinafter, the client devices 10a to 10n are referred to as the client device 10 except when a specific client device is meant. As hardware, the client device 10 includes a CPU 20, a ROM 21, a RAM 22, an HDD 23, a communication I/F 24, an input/output I/F 25, an audio input device 26, and an audio output device 27. The CPU 20, ROM 21, RAM 22, HDD 23, communication I/F 24, and input/output I/F 25 are connected to a bus 28 and exchange data and the like over it. The client device 10 may include a display device in place of the audio output device 27 or in addition to it, and may further include an input device for entering information.

The CPU 20 controls the client device 10 as a whole. The ROM 21 stores a boot program for starting the client device 10, firmware for controlling the HDD 23 and other components, and the like. The RAM 22 provides a work area for the CPU 20. The HDD 23 stores the program that acquires voice data and converts the acquired voice data, the OS and other programs, various setting data, and the like.

The communication I/F 24 connects the client device 10 to the network 12 and controls communication with the server device 11 and other client devices. The input/output I/F 25 controls the input of voice data from the audio input device 26 and the output of voice data to the audio output device 27. The audio input device 26 is a device for inputting sound, such as a microphone, and the audio output device 27 is a device for outputting sound, such as a speaker.

The server device 11 is not described with reference to a drawing, but because it does not need the input/output I/F 25, the audio input device 26, or the audio output device 27, its hardware can consist of the CPU 20, ROM 21, RAM 22, HDD 23, and communication I/F 24. Although these devices are illustrated with the HDD 23, the storage is not limited to this and may be an SSD (Solid State Drive) or the like. When the voice call system is used in a video conferencing system or the like, an imaging device such as a camera can also be provided to capture and distribute video.

FIG. 6 is a functional block diagram for explaining the functions of the client device 10. The client device 10 has a function of acquiring the voice uttered by the user as voice data, a function of transmitting the voice data to the server device 11 via the network 12, a function of receiving mixed voice data from the server device 11 via the network 12, and a function of outputting the mixed voice data. The client device 10 can therefore be a device that provides each of these functions as functional means. These functions can be implemented by the CPU 20 reading and executing a program stored in the HDD 23.

As its functional means, the client device 10 includes voice acquisition means 30, conversion means 31, transmission means 32, reception means 33, and audio output means 34.

The voice acquisition means 30 acquires the voice data of the voice uttered by the user as time-domain data representing the change in sound pressure over time. When a microphone is used, the voice data is acquired as time-domain data. The conversion means 31 converts the time-domain data acquired by the voice acquisition means 30 into frequency-domain data representing the strength of each frequency component. Frequency-domain data indicates which frequency components are present and in what amounts.

A voice codec can be used to convert time-domain data into frequency-domain data. A codec is a program that can both encode data and decode the encoded data. For this conversion, the conversion means 31 can use the modified discrete cosine transform (MDCT) or the like; the Fourier transform, discrete Fourier transform, fast Fourier transform, and so on can also be used. To convert frequency-domain data into time-domain data, the inverse modified discrete cosine transform (IMDCT) or the like can be used, as can the inverse Fourier transform, the inverse discrete Fourier transform, and so on.

The conversion from time-domain data to frequency-domain data used for audio compression can also be performed with a subband filter, which converts to the frequency domain using a plurality of band-pass filters, or with a lapped orthogonal transform (LOT) other than the MDCT and IMDCT mentioned above. A band-pass filter is a filter circuit that passes only specific frequencies and blocks all others. An orthogonal transform converts a time-domain signal into frequency components.

Any codec can be used as long as the number of samples on which it performs the conversion from the frequency domain to the time domain is constant (that is, the block size is the same). One example is a codec that performs a fixed-size transform, such as MPEG-1 Audio Layer-2. Codecs such as MP3 and AAC, in which the processing unit can change from frame to frame, cannot be used unless the frame size is restricted.

In particular, using a lapped orthogonal transform such as the MDCT/IMDCT makes it possible to interpolate, with little computation, over the delayed packet arrivals and packet losses that occur frequently in Internet communication, and so to suppress degradation of voice call quality.

Frequency-domain data sent and received over the network 12 is divided into transmission units generally called packets. When a voice call is carried over the Internet used as the network 12, packets may be lost in transit (packet loss) or arrive out of order, so the voice data may become discontinuous. Discontinuous voice data is heard as sudden, jarring noise.

To reduce this noise, redundant voice data can be added to the voice data exchanged over the network 12, or processing can be added that smooths the discontinuities after the data has been decoded into time-domain data.

However, adding redundant voice data increases the data size, and smoothing adds processing steps. A lapped orthogonal transform can therefore be used when converting time-domain data into frequency-domain data, and the MDCT described above is one example. Within each processing unit the MDCT holds data overlapping the preceding and following units together with a crossfade component in which the audio switches over while overlapping, so even if part of the voice data is lost and the signal becomes discontinuous, the discontinuity is joined smoothly. This gives the secondary benefit of suppressing noise caused by packet loss. To obtain the same effect, a lapped orthogonal transform can also be used when converting frequency-domain data into time-domain data, and the IMDCT described above is one example.
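The overlap and crossfade behaviour can be demonstrated with a small windowed MDCT/IMDCT round trip. This is a sketch under stated assumptions rather than the codec of the specification: it uses the textbook transforms, a sine window (which satisfies the usual condition for exact overlap-add reconstruction), frames of 2N samples advanced by N samples, and a lost packet modelled as `None`. All function names are hypothetical.

```python
import math


def mdct(frame):
    n = len(frame) // 2
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n)) for k in range(n)]


def imdct(coeffs):
    # The 2/N scaling matches a window applied at both analysis and synthesis.
    n = len(coeffs)
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                          for k in range(n)) for t in range(2 * n)]


def sine_window(length):
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]


def analyze(signal, n):
    """Window and transform overlapping frames of 2N samples, advancing by N."""
    w = sine_window(2 * n)
    return [mdct([signal[s + t] * w[t] for t in range(2 * n)])
            for s in range(0, len(signal) - 2 * n + 1, n)]


def synthesize(blocks, n):
    """Inverse-transform, window again, and overlap-add; None marks a lost block."""
    w = sine_window(2 * n)
    out = [0.0] * (n * (len(blocks) + 1))
    for i, coeffs in enumerate(blocks):
        if coeffs is None:
            continue  # lost packet: the neighbouring frames still fade across the gap
        frame = imdct(coeffs)
        for t in range(2 * n):
            out[i * n + t] += frame[t] * w[t]
    return out


n = 8
signal = [math.sin(0.3 * t) for t in range(8 * n)]
blocks = analyze(signal, n)
restored = synthesize(blocks, n)

lost = list(blocks)
lost[3] = None                      # simulate one lost packet
degraded = synthesize(lost, n)
```

In the region covered by two frames, `restored` matches `signal` up to rounding error. In `degraded`, only the 2N samples spanned by the lost frame change, and there the output is the windowed tail of the previous frame followed by the windowed head of the next one, so the level tapers down and back up instead of cutting off abruptly.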

The transmission means 32 transmits the frequency-domain data converted by the conversion means 31 to the server device 11 via the network 12. Before transmitting, the transmission means 32 can attach to the frequency-domain data either user identification information entered by the user at login or the like, or device identification information registered in the client device 10.

The transmission means 32 may transmit the frequency-domain data as is, or it may transmit data encoded by the conversion means 31. Encoding here means data compression, which the conversion means 31 can perform at the same time as the conversion into frequency-domain data by using the codec described above.

The reception means 33 receives from the server device 11 audio data obtained by mixing frequency-domain data, that is, mixed data. The reception means 33 may receive the mixed data as is, or it may receive mixed data that has been encoded in the server device 11. The conversion means 31 converts the mixed data received by the reception means 33 into time-domain data using the codec described above. When the reception means 33 receives encoded mixed data, the conversion means 31 can decode it and then convert the decoded mixed data into time-domain data. The voice output means 34 outputs the time-domain data converted by the conversion means 31 as sound.

The audio data transmission process performed by each client device 10 is described with reference to FIG. 7. The process starts at step 400 when the user logs in to the voice call service using the client device 10. In step 405, the voice acquisition means 30 determines whether there is voice input. If there is voice input, the process proceeds to step 410; if not, it proceeds to step 425.

In step 410, the voice acquisition means 30 acquires the audio data as time-domain data. In step 415, the conversion means 31 converts the time-domain data into frequency-domain data. In step 420, the transmission means 32 transmits the frequency-domain data to the server device 11. In step 425, it is determined whether the user has ended the voice call service and logged off. If the user has not logged off, the process returns to step 405, and steps 405 through 425 are repeated. If the user has logged off, the process proceeds to step 430 and ends.
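The flow of steps 405 through 430 might be sketched as the following loop. The five callables are hypothetical placeholders for the voice acquisition means 30, the conversion means 31, the transmission means 32 and the log-off check; their names and signatures are assumptions made for illustration only.

```python
def run_send_loop(has_input, acquire, to_frequency_domain, send, logged_off):
    """Client-side transmission loop following steps 405 to 430 of FIG. 7."""
    while True:
        if has_input():                            # step 405: is there voice input?
            samples = acquire()                    # step 410: time-domain data
            frame = to_frequency_domain(samples)   # step 415: frequency-domain data
            send(frame)                            # step 420: send to the server
        if logged_off():                           # step 425: has the user logged off?
            break                                  # step 430: end

# Stand-ins so the loop can be exercised without audio hardware or a network.
pending = [[0.1, 0.2], [0.3, 0.4]]
sent = []
run_send_loop(
    has_input=lambda: bool(pending),
    acquire=lambda: pending.pop(0),
    to_frequency_domain=lambda samples: list(samples),  # placeholder, not a real transform
    send=sent.append,
    logged_off=lambda: not pending,
)
```

Passing the stages in as callables keeps the control flow of the flowchart separate from any particular codec or transport.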

Next, the audio data reception process performed by each client device 10 is described with reference to FIG. 8. The process starts at step 500 when the user logs in to the voice call service using the client device 10. In step 505, the reception means 33 receives mixed data from the server device 11. In step 510, the conversion means 31 converts the mixed data into time-domain data.

In step 515, the voice output means 34 outputs the time-domain data as sound. In step 520, it is determined whether the user has ended the voice call service and logged off. If the user has not logged off, the process returns to step 505, and steps 505 through 520 are repeated. If the user has logged off, the process proceeds to step 525 and ends.

The client device 10 can carry out the process shown in FIG. 7 and the process shown in FIG. 8 in parallel. A client device 10 that is used only to listen to a conversation between other users needs to carry out only the process shown in FIG. 8. In that case, the client device 10 need not include the voice acquisition means 30.

FIG. 9 is a functional block diagram illustrating the functions of the server device 11. The server device 11 has a function of receiving audio data from each client device, a function of mixing audio data, and a function of distributing audio data to each client device. The server device 11 can therefore be a device that includes each of these functions as functional means. As with the client device 10, these functions can be realized by the CPU reading and executing a program stored on the HDD.

The server device 11 includes, as functional means, reception means 40, mixing means 41, and distribution means 42. The reception means 40 receives the frequency-domain data transmitted from each client device. The reception means 40 extracts the user identification information or device identification information attached to the frequency-domain data, and passes the frequency-domain data together with that identification information to the mixing means 41. When the reception means 40 receives encoded frequency-domain data from a client device 10, it can decode that data.

For each client device, the mixing means 41 mixes the frequency-domain data received from all the other client devices to generate mixed data. For example, for the client device 10a, it identifies the frequency-domain data received from the client devices 10b to 10n by means of the user identification information or the like, and mixes the identified frequency-domain data to generate the mixed data.

Frequency-domain data is numerical data that indicates, for each frequency component, how much of that component is present. It can therefore be mixed by adding the numerical values together for each frequency component.
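As a minimal sketch of this mixing, assuming each frame is a list holding one coefficient per frequency component and that all frames have the same length, the per-client mix could look as follows. The function names and the sample values are illustrative and do not appear in this description.

```python
def mix(frames):
    """Add equally sized frequency-domain frames coefficient by coefficient."""
    return [sum(values) for values in zip(*frames)]

def mix_for_each_client(frames_by_client):
    """For every client, mix the frames received from all *other* clients."""
    return {
        listener: mix([frame for sender, frame in frames_by_client.items()
                       if sender != listener])
        for listener in frames_by_client
    }

# One frame per client, keyed by the identification information sent with it.
received = {
    "client1": [1.0, 0.0, 2.0, 0.0],
    "client2": [0.0, 3.0, 0.0, 0.0],
    "client3": [0.5, 0.5, 0.5, 0.5],
}
mixed = mix_for_each_client(received)
```

No inverse transform is involved: the server only performs additions on the coefficients it received.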

The distribution means 42 distributes the mixed data that the mixing means 41 generated for each client device to the corresponding client device. For example, the distribution means 42 can refer to a correspondence table that associates device IDs with IP addresses, as shown in FIG. 10, obtain the IP address from the table, and use it to deliver the mixed frequency-domain data.

In addition to the IP address, the correspondence table can include device state information used to identify the devices that are logged in to the service. The state of a device can be set to either logged in or logged off. In the example of FIG. 10, the device IDs "client1" to "client4" are logged in and "client5" is logged off, which indicates that a voice call is in progress among the devices "client1" to "client4".

By referring to the correspondence table, the mixing means 41 and the distribution means 42 recognize the logged-in devices as the client devices 10 that make up the voice call system, and can determine which frequency-domain data to mix into each piece of mixed data and to which client device 10 it should be delivered. The state information can be set by the reception means 40 when it is notified that a client device 10 has logged in or logged off.
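One way to picture the correspondence table is as a mapping from device ID to an IP address and a login state. The sketch below is an assumption about its shape, not the layout of FIG. 10; the field names are hypothetical and the addresses are taken from the 192.0.2.0/24 documentation range.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    ip_address: str   # where mixed data for this device is delivered
    logged_in: bool   # True while the device is logged in to the service

table = {
    "client1": Entry("192.0.2.1", True),
    "client2": Entry("192.0.2.2", True),
    "client3": Entry("192.0.2.3", True),
    "client4": Entry("192.0.2.4", True),
    "client5": Entry("192.0.2.5", False),
}

def set_state(table, device_id, logged_in):
    # Called when a login or log-off notification is received.
    table[device_id].logged_in = logged_in

def active_devices(table):
    # Devices currently taking part in the voice call.
    return [device_id for device_id, entry in table.items() if entry.logged_in]

def delivery_address(table, device_id):
    # Address to deliver mixed data to, or None for a logged-off device.
    entry = table[device_id]
    return entry.ip_address if entry.logged_in else None
```

Here `active_devices(table)` yields client1 through client4, and `delivery_address(table, "client5")` yields `None` until `set_state` marks that device as logged in.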

For one client device, the server device 11 mixes the audio data received from all the other client devices to generate mixed data, and delivers it to that client device. The distribution means 42 can also encode the frequency-domain data before delivering it. However, not all of the other client devices are necessarily transmitting audio data. In that case, the server device 11 can generate the mixed data by mixing only the audio data received from those other client devices that are actually transmitting.

When the number of simultaneous speakers is smaller than the number of simultaneously connected users, the server device 11 can also send the mixed data generated for one client device as the mixed data for another client device, further reducing the computation required for mixing.
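One possible reading of this reuse is sketched below, under the assumption that two listeners can share mixed data whenever the set of speakers they should hear is identical. The cache keyed by that set, and the function name, are illustrative choices rather than something this description specifies.

```python
def mixes_with_sharing(frames_by_speaker, listeners):
    """Return (mixed data per listener, number of mixes actually computed)."""
    cache = {}
    result = {}
    for listener in listeners:
        # A listener hears every current speaker except itself.
        sources = frozenset(s for s in frames_by_speaker if s != listener)
        if sources not in cache:
            cache[sources] = [
                sum(values)
                for values in zip(*(frames_by_speaker[s] for s in sorted(sources)))
            ]
        result[listener] = cache[sources]  # shared with listeners hearing the same set
    return result, len(cache)

# Four connected clients, of which only two are speaking.
speaking = {"client1": [1.0, 2.0], "client2": [0.5, 0.5]}
connected = ["client1", "client2", "client3", "client4"]
mixed, computed = mixes_with_sharing(speaking, connected)
```

In this example three mixes are computed for four listeners, because client3 and client4 both hear client1 and client2 and so receive the same mixed data.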

When the mixed data for one client device is simply a mix of the audio data received from all the other client devices, the server device 11 can further reduce computation by mixing the audio data received from all client devices only once, producing common mixed data that is sent to every client device. A client device that receives this common mixed data inverts the phase of the audio data it transmitted and adds it to the mixed data, thereby obtaining mixed data of the audio data transmitted by all the other client devices. For this purpose, each client device can further include adding means that inverts the phase of that audio data, that is, its frequency-domain data, and adds it to the mixed data.
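A minimal sketch of this scheme follows, again treating a frame as a list of per-frequency coefficients. The function names are hypothetical. The sketch also presumes that the client keeps the exact frame it sent for the same block, and that the server has not altered it, for example by lossy re-encoding; otherwise the cancellation would only be approximate.

```python
def mix_all(frames_by_client):
    """Server side: one common mix of every client's frame."""
    return [sum(values) for values in zip(*frames_by_client.values())]

def remove_own_voice(common_mix, own_frame):
    """Client side: invert the phase of the frame it sent and add it to the mix."""
    inverted = [-coefficient for coefficient in own_frame]
    return [m + i for m, i in zip(common_mix, inverted)]

received = {
    "client1": [1.0, 0.0, 2.0, 0.0],
    "client2": [0.0, 3.0, 0.0, 0.0],
    "client3": [0.5, 0.5, 0.5, 0.5],
}
common = mix_all(received)                       # computed once on the server
heard_by_client1 = remove_own_voice(common, received["client1"])
```

The server performs a single mix regardless of the number of clients, and each client pays for one extra addition per coefficient.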

The audio data mixing process performed by the server device 11 is described in detail with reference to FIG. 11. The server device 11 starts the process at step 800 when a plurality of users have logged in to the service using their client devices. In step 805, the reception means 40 receives frequency-domain data from the plurality of client devices 10.

In step 810, the mixing means 41 generates mixed data for each client device 10: for any one client device, it mixes the frequency-domain data received from all the other client devices. Because each piece of frequency-domain data is identified by the attached user identification information or device identification information, the mixing means 41 can determine which frequency-domain data should be mixed for a given client device 10 and mix that data.

In step 815, the distribution means 42 distributes each piece of mixed data that the mixing means 41 generated for a client device 10 to that client device 10. The distribution means 42 can refer to the correspondence table shown in FIG. 10 to deliver the mixed data to the destination client device 10. In step 820, it is determined whether all users have logged off. If they have not, the voice call is still in progress, so the process returns to step 805, and steps 805 through 820 are repeated. If all users have logged off, the process proceeds to step 825 and ends.

In this way, conversion is performed in the client devices 10 instead of decoding and encoding being performed in the server device 11, so the amount of computation in the server device 11, which mixes the audio, can be reduced. This lightens the load on the server device 11 and makes it possible to support simultaneous calls among a larger number of users.

The server device 11 can apply audio effects to the received frequency-domain data. An audio effect is a form of audio processing; examples include noise removal, voice changing, volume change, and panning. These are only examples, and other effects can also be applied. By changing volume or panning in the server device 11, the audio data of one user can be given the effect of being placed at a different position for each user who listens to it. In other words, a single voice can be made to sound different to each user.

For this purpose, as shown in FIG. 12, the server device 11 can further include processing means 43 for processing the frequency-domain data. For the noise removal mentioned above, a low-pass filter that removes frequency components above a predetermined frequency and a high-pass filter that removes frequency components below a predetermined frequency can be used. For the voice change mentioned above, the timbre can be altered by raising or lowering specific frequencies of the sound. An equalizer that adjusts the level of each frequency band can also be used to change the frequency characteristics.
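Because the data is already in the frequency domain, filters of this kind reduce to operations on individual coefficients. The sketch below is one simple possibility, assuming N MDCT coefficients per frame whose k-th coefficient is centred at (k + 0.5) × sample rate / (2N) Hz. The hard cutoff and the function names are illustrative assumptions; a practical filter might taper the coefficients near the cutoff instead of zeroing them.

```python
def bin_frequency(k, n_bins, sample_rate):
    # Assumed centre frequency of MDCT coefficient k, in Hz.
    return (k + 0.5) * sample_rate / (2 * n_bins)

def low_pass(frame, cutoff_hz, sample_rate):
    """Remove components above cutoff_hz by zeroing their coefficients."""
    n = len(frame)
    return [c if bin_frequency(k, n, sample_rate) <= cutoff_hz else 0.0
            for k, c in enumerate(frame)]

def high_pass(frame, cutoff_hz, sample_rate):
    """Remove components below cutoff_hz by zeroing their coefficients."""
    n = len(frame)
    return [c if bin_frequency(k, n, sample_rate) >= cutoff_hz else 0.0
            for k, c in enumerate(frame)]

def change_volume(frame, gain):
    """Scale every coefficient by the same gain."""
    return [gain * c for c in frame]

frame = [1.0] * 8                       # 8 coefficients at 16 kHz: 500, 1500, ... 7500 Hz
voice_band = high_pass(low_pass(frame, 3000, 16000), 1000, 16000)
quieter = change_volume(voice_band, 0.5)
```

Applying a different gain per listener before mixing is one way the per-user volume differences described above could be produced.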

Many audio effects operate on frequency-domain data. Conventionally, therefore, time-domain data was first converted into frequency-domain data, the effect was applied, and the result was converted back into time-domain data before being sent to the device that mixes the audio. In the present technique, by contrast, the client device 10 converts the audio into frequency-domain data and transmits it to the server device 11, so an effect can be applied directly to the frequency-domain data before transmission, and the data can be sent to the server device 11 without being converted back into time-domain data. Audio effects that manipulate frequency can therefore be applied with a small amount of computation.

The system, method, and program of the present invention have been described in detail above. The present invention is not, however, limited to the embodiments described; it can be modified within the range that a person skilled in the art could conceive, including other embodiments, additions, changes, and deletions. Any such aspect falls within the scope of the present invention as long as it provides the operation and effects of the present invention.

Accordingly, the present invention can also provide a recording medium such as a CD-ROM or SD card on which the above program is recorded, and a program providing server that holds the program and provides it in response to a download request.

DESCRIPTION OF SYMBOLS
10, 10a to 10n: client device; 11: server device; 12: network; 20: CPU; 21: ROM; 22: RAM; 23: HDD; 24: communication I/F; 25: input/output I/F; 26: voice input device; 27: voice output device; 28: bus; 30: voice acquisition means; 31: conversion means (first conversion means, second conversion means); 32: transmission means; 33: reception means; 34: voice output means; 40: reception means; 41: mixing means; 42: distribution means; 43: processing means

Claims

1. A voice call system including a plurality of client devices and a server device, wherein
each client device comprises:
voice acquisition means for acquiring audio data as time-domain data representing the change in sound pressure over time;
first conversion means for sampling the time-domain data acquired by the voice acquisition means, dividing it into blocks of a fixed number of samples, and converting it, block by block, into frequency-domain data representing the strength of each frequency component;
transmission means for transmitting the frequency-domain data converted by the first conversion means to the server device;
reception means for receiving, from the server device, mixed data generated by mixing a plurality of pieces of frequency-domain data; and
second conversion means for converting the mixed data received by the reception means into time-domain data.

2. The voice call system according to claim 1, wherein the transmission means attaches, to the frequency-domain data, user identification information for identifying a user or device identification information for identifying the client device, and transmits the frequency-domain data.

3. The voice call system according to claim 1, further comprising voice output means for outputting, as sound, the time-domain data converted by the conversion means.

4. The voice call system according to any one of claims 1 to 3, wherein the conversion means converts the time-domain data or the frequency-domain data using a lapped orthogonal transform.

5. The voice call system according to claim 1, wherein the conversion means encodes the converted frequency-domain data and decodes the mixed data encoded in the server device.

6. The voice call system according to claim 1, wherein
the server device generates the mixed data by mixing the frequency-domain data received from all of the plurality of client devices and transmits the mixed data to the client devices, and
each client device further comprises adding means for inverting the phase of the frequency-domain data transmitted by that client device and adding it to the received mixed data.

7. The voice call system according to claim 1, wherein the server device comprises: reception means for receiving the frequency-domain data from the client devices; mixing means for mixing the plurality of pieces of frequency-domain data received by the reception means to generate mixed data; and distribution means for distributing the mixed data generated by the mixing means to the client devices.

8. The voice call system according to claim 7, wherein the mixing means generates mixed data for each client device by mixing, for one client device, the frequency-domain data received from all the other client devices, and the distribution means distributes each piece of mixed data generated by the mixing means to the corresponding client device.

9. The voice call system according to claim 7 or 8, further comprising processing means for processing the frequency-domain data.

10. The voice call system according to claim 7, wherein the reception means decodes the frequency-domain data encoded in the client device, and the distribution means encodes and distributes the mixed data.

11. A voice call method carried out in a voice call system including a plurality of client devices and a server device, the method comprising the steps of:
the plurality of client devices acquiring audio data as time-domain data representing the change in sound pressure over time;
the plurality of client devices sampling the time-domain data, dividing it into blocks of a fixed number of samples, and converting it, block by block, into frequency-domain data representing the strength of each frequency component;
the plurality of client devices transmitting the frequency-domain data to the server device;
the server device receiving the plurality of pieces of frequency-domain data;
the server device generating mixed data for each client device by mixing, for one client device, the frequency-domain data received from all the other client devices;
the server device distributing the mixed data to the client devices;
each client device receiving the mixed data;
each client device converting the mixed data into time-domain data; and
each client device outputting the time-domain data as sound.

12. A program for causing voice call processing to be executed in a voice call system including a plurality of client devices and a server device, the program causing
the plurality of client devices to execute the steps of:
acquiring audio data as time-domain data representing the change in sound pressure over time;
sampling the time-domain data, dividing it into blocks of a fixed number of samples, and converting it, block by block, into frequency-domain data representing the strength of each frequency component; and
transmitting the frequency-domain data to the server device;
the server device to execute the steps of:
receiving the plurality of pieces of frequency-domain data;
generating mixed data for each client device by mixing, for one client device, the frequency-domain data received from all the other client devices; and
distributing the mixed data to the client devices; and
each client device to execute the steps of:
receiving the mixed data;
converting the mixed data into time-domain data; and
outputting the time-domain data as sound.