JP2004336292A - System, device and method for processing speech

Info

Publication number: JP2004336292A
Application number: JP2003128069A
Authority: JP
Inventors: Shinichi Araki (荒木 真一); Naoki Mikuni (三國 直樹)
Original assignee: Namco Ltd
Current assignee: Namco Ltd
Priority date: 2003-05-06
Filing date: 2003-05-06
Publication date: 2004-11-25

Abstract

PROBLEM TO BE SOLVED: To provide a system, a device and a method that prevent the amount of communication data from increasing with the number of users and that eliminate the limit on the number of users.
SOLUTION: The speech processing system includes a voice chat server 10 and n voice chat clients 20-1 to 20-n. Each voice chat client digitizes the sound collected by a microphone 22, compresses it, and transmits it to the voice chat server 10. The voice chat server 10 creates three-dimensional voice data that takes the users' positions into account from the voice data sent from the voice chat clients, and transmits it to each voice chat client.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の端末を用いて複数の利用者間で会話を行う際に音声データの送受信を行う音声処理システム、装置および方法に関する。
【０００２】
【従来の技術】
従来から、ネットワークを介して接続された複数の端末を用いて、各端末の利用者間で会話を行う音声チャットシステムが知られている。例えば、インターネットのサービスプロバイダはこの音声チャットシステムを備えており、複数の利用者間で会話を楽しむことができるようになっている（例えば、特許文献１参照。）。
【０００３】
図８は、従来の音声チャットシステムの構成を示す図である。図８に示す音声チャットシステムは、音声チャットサーバ３００とこれに接続されたｎ台の音声チャットクライアント３１０−１〜３１０−ｎとを含んで構成されている。各音声チャットクライアント３１０−１等のそれぞれは、マイクロホン３２２で集音した利用者の音声データ（上り音声）を音声チャットサーバ３００に向けて送信し、音声チャットサーバ３００は、各音声チャットクライアント３１０−１等に対して他の音声チャットクライアント３１０−２等から送られてきた音声データを転送する処理を行っている。各音声チャットクライアント３１０−１等では、音声チャットサーバ３００から送られてくる自装置以外の（ｎ−１）人分の音声データ（下り音声）を受信し、各利用者の位置に基づいて３次元音声処理を行い、各利用者の発声位置を異ならせることにより臨場感のあるステレオ音声をスピーカ３２４、３２６から出力する。
【０００４】
【特許文献１】
特開２００３−６１３２号公報（第３−５頁、図１−７）
【０００５】
【発明が解決しようとする課題】
ところで、上述した従来の音声チャットシステムに含まれる各音声チャットクライアント３１０−１等は、自装置以外から送られてくる（ｎ−１）人分の下り音声を受信しているため、自装置から送信する１人分の上り音声を加えると、回線を介してｎ人分の音声データを同時に送受信することになり、音声チャットに参加する利用者の人数が増えれば増えるほど通信のデータ量が増加するという問題があった。また、一般には、電話回線やＩＳＤＮ回線等を介して各音声チャットクライアント３１０−１等と音声チャットサーバ３００とが接続されるネットワークの形態が多いが、このように限られた通信容量の回線を用いる場合には、同時に会話することができる利用者の数がこの通信容量によって制限されてしまい、会話に参加することができる利用者数を増やすことができないという問題があった。さらに、この種の問題は、音声チャットシステムと同様の手法を採用して音声データの送受信を行う音声会議システム等についても存在する。
【０００６】
本発明は、このような点に鑑みて創作されたものであり、その目的は、利用者の発声位置を特定可能な三次元音声データにより臨場感のあるステレオ音声を得ることができるとともに、利用者の数に応じて通信のデータ量が増加することを防止することができ、利用者数の制限をなくすことができる音声処理システム、装置および方法を提供することにある。
【０００７】
【課題を解決するための手段】
上述した課題を解決するために、本発明の音声処理システムは、複数の端末装置と音声処理装置とが通信回線を介して接続されている。端末装置は、利用者の音声に対応する音声データを音声処理装置に向けて送信する第１の送信手段と、複数の端末装置のそれぞれの利用者の発声位置を特定可能な三次元音声データが音声処理装置から送られてきたときにこれを受信する第１の受信手段と、第１の受信手段によって受信された三次元音声データに対応する音声を複数のスピーカから出力する音声出力手段とを備えている。音声処理装置は、複数の端末装置のそれぞれから送られてくる音声データを受信する第２の受信手段と、第２の受信手段によって受信された音声データに基づいて、三次元音声データを生成する三次元音声データ生成手段と、三次元音声データ生成手段によって生成された三次元音声データを端末装置に向けて送信する第２の送信手段とを備えている。
【０００８】
また、本発明の音声処理方法は、音声処理システムに適用されるものであり、利用者の音声に対応する音声データを端末装置に備わった第１の送信手段から音声処理装置に備わった第２の受信手段に向けて送信するステップと、第２の受信手段によって受信した複数の端末装置から送られてきた音声データに基づいて、複数の端末装置のそれぞれの利用者の発声位置を特定可能な三次元音声データを三次元音声データ生成手段によって生成するステップと、三次元音声データ生成手段によって生成された三次元音声データを音声処理装置に備わった第２の送信手段から複数の端末装置のそれぞれに備わった第１の受信手段に向けて送信するステップと、第１の受信手段によって受信した三次元音声データに対応する音声を音声出力手段によって複数のスピーカから出力するステップとを有している。
【０００９】
本発明の音声処理システムや音声処理装置を用いることにより、あるいは、本発明の音声処理方法を実行することにより、音声処理装置において各端末装置から送られてくる音声データに基づいて三次元音声データが生成されて各端末装置に送信されるため、各端末装置では他の端末装置から送信される別々の音声データを用いることなく、三次元音声データに基づいて各端末装置に利用者の発声位置が特定可能な臨場感のあるステレオ音声の再生を行うことができる。このため、各端末装置と音声処理装置とを接続する通信回線は、端末装置から音声処理装置に向けて送信される音声データと、反対に音声処理装置からこの端末装置に向けて送信される三次元音声データとに対応した通信容量を確保するだけでよく、しかも三次元音声データのデータ量は端末装置の数に関係しないため、利用者の数に応じて通信のデータ量が増加することを防止することができる。また、利用者数に関係なく通信のデータ量がほぼ一定となることから、利用者数の制限をなくすことができる。
【００１０】
なお、通信のデータ量を削減する別の手法として、各端末装置から送られてくる音声データを音声処理装置において単純にミキシングする方法もあるが、この場合には、本発明の目的の一つである臨場感のあるステレオ音声を得るという目的は達成できない。
【００１１】
また、上述した第１および第２の送信手段は、音声データを圧縮して送信し、第１および第２の受信手段は、受信した音声データを伸長することが望ましい。端末装置と音声処理装置との間で圧縮された音声データを送受信することにより、通信のデータ量をさらに削減することが可能になる。
【００１２】
また、上述した三次元音声データ生成手段は、一の端末装置に送信する三次元音声データを、この端末装置以外の他の端末装置に対応する音声データに基づいて生成することが望ましい。音声処理装置において一の端末装置に送信する三次元音声データをその他の端末装置から送られてきた音声データに基づいて生成し、この一の端末装置に送信することにより、各端末装置では、自装置の利用者以外の各利用者の音声を出力することが可能になる。また、このように各端末装置に個別の三次元音声データを送信することにより、各端末装置に対して必要最小限のデータを送信することが可能になり、通信のデータ量を削減することができる。
【００１３】
また、上述した端末装置から音声処理装置に送られる音声データは、モノラル再生が可能な音声データであり、音声処理装置から端末装置に送られる三次元音声データは、ステレオ再生が可能な右用および左用の音声データであることが望ましい。三次元音声データとして、右音声と左音声とからなるステレオ音声を採用することにより、各端末装置において各利用者の発声位置の特定が可能になる。また、音声処理装置から各端末装置に送信する三次元音声データとして、利用者数に関係なく右用と左用の２種類の音声データを用意すればよいため、通信のデータ量を少なく、しかも利用者数に関係なくほぼ一定にすることができる。
【００１４】
また、ゲームの仮想空間内に登場する複数のキャラクタのそれぞれが複数の端末装置のそれぞれに対応しており、上述した三次元音声データ生成手段は、一の端末装置に送信する三次元音声データにおける各音声の発声位置を、この端末装置以外の他の端末装置に対応するキャラクタの仮想空間内での位置に基づいて設定することが望ましい。これにより、複数の端末装置を用いてゲームを行う場合に、各端末装置に対応したキャラクタ毎に異なる発声位置を設定することが可能になり、しかも、ゲームの進行に応じて仮想空間内において各キャラクタが移動すると各発声位置もその移動に伴って変化するゲームを実現することができるため、各プレーヤが他の登場キャラクタと実際に会話を行っているような擬似的な体験を行うことができる臨場感のあるゲームを実現することが可能になる。
【００１５】
【発明の実施の形態】
以下、本発明を適用した一実施形態の音声処理システムについて、図面を参照しながら詳細に説明する。
図１は、一実施形態の音声処理システムの全体構成を示す図である。図１に示すように、本実施形態の音声処理システムは、音声チャットサーバ１０とｎ台の音声チャットクライアント２０−１〜２０−ｎとを含んで構成されている。
【００１６】
音声チャットクライアント２０−１〜２０−ｎのそれぞれは、利用者の音声を集音するマイクロホン２２と、ステレオ音声を出力する２個のスピーカ２４、２６とが接続されており、マイクロホン２２によって集音したモノラルの音声をデジタル化した後圧縮して通信回線３０−１等を介して音声チャットサーバ１０に向けて送信する機能と、通信回線３０−１等を介して音声チャットサーバ１０から送られてくる右音声および左音声からなるステレオ音声の圧縮データを伸長する機能を有している。
【００１７】
図２は、音声チャットクライアント２０−１の構成を示す図である。なお、他の音声チャットクライアント２０−２〜２０−ｎも同じ構成を有している。
図２に示すように、音声チャットクライアント２０−１は、Ａ／Ｄ（アナログ−デジタル）変換部２００、クライアント処理部２１０、通信装置２２０、Ｄ／Ａ（デジタル−アナログ）変換部２３０、２３２、アンプ２４０、２４２を含んで構成されている。また、音声チャットクライアント２０−１には、利用者の音声を集音するマイクロホン２２と、左音声と右音声からなるステレオ音声を出力する２つのスピーカ２４、２６が接続されている。
【００１８】
Ａ／Ｄ変換部２００は、マイクロホン２２から出力されるアナログ音声信号をデジタルの音声データに変換する。例えば、ＰＣＭ形式の音声データに変換されて、クライアント処理部２１０に入力される。
クライアント処理部２１０は、入力される音声データを音声チャットサーバ１０に向けて送信するとともに、音声チャットサーバ１０から送られてくる音声データを受信する制御を行う。このために、クライアント処理部２１０は、音声圧縮処理部２１２、送信処理部２１４、受信処理部２１６、音声伸長処理部２１８を備える。
【００１９】
音声圧縮処理部２１２は、Ａ／Ｄ変換部２００から入力される音声データに対して所定の圧縮処理を行って圧縮音声データを生成する。送信処理部２１４は、音声圧縮処理部２１２によって生成された圧縮音声データを音声チャットサーバ１０に向けて送信する処理を行う。受信処理部２１６は、音声チャットサーバ１０から送られてくる圧縮音声データを受信する処理を行う。音声伸長処理部２１８は、受信処理部２１６によって受信された圧縮音声データを伸長して非圧縮音声データを生成する。
【００２０】
通信装置２２０は、音声チャットクライアント２０−１と音声チャットサーバ１０とを接続する通信回線３０−１との間で物理的な電気信号の送受信処理を行う。Ｄ／Ａ変換部２３０は、クライアント処理部２１０から出力される右音声用の非圧縮音声データをアナログの音声信号に変換する。この音声信号は、アンプ２４０によって増幅されて、スピーカ２４から出力される。同様に、Ｄ／Ａ変換部２３２は、クライアント処理部２１０から出力される左音声用の非圧縮音声データをアナログの音声信号に変換する。この音声信号は、アンプ２４２によって増幅されて、スピーカ２６から出力される。
【００２１】
図３は、音声チャットサーバ１０の構成を示す図である。図３に示すように、音声チャットサーバ１０は、通信装置１００とサーバ処理部１１０を含んで構成されている。通信装置１００は、音声チャットサーバ１０と音声チャットクライアント２０−１等とを接続する通信回線３０−１等との間で物理的な電気信号の送受信処理を行う。サーバ処理部１１０は、音声チャットクライアント２０−１等から送られてくる各音声データを合成して各音声チャットクライアント２０−１等に送り返す処理を行う。このために、サーバ処理部１１０は、受信処理部１１２、音声伸長処理部１１４、三次元音声処理部１１６、音声圧縮処理部１１８、送信処理部１２０を備える。
【００２２】
受信処理部１１２は、各音声チャットクライアント２０−１等から送られてくるモノラル音声に対応する圧縮音声データを受信する処理を行う。音声伸長処理部１１４は、受信処理部１１２によって受信処理された各圧縮音声データを伸張して非圧縮音声データに変換する処理を行う。
【００２３】
三次元音声処理部１１６は、音声チャットクライアント２０−１〜２０−ｎのそれぞれに対応する非圧縮音声データに基づいて、それぞれの音声チャットクライアント２０−１等において他の音声チャットクライアントの利用者の発声位置を特定可能な右音声と左音声からなる三次元音声を生成する。これにより、音声チャットクライアント２０−２〜２０−ｎのそれぞれから送られてくる音声データに基づいて、音声チャットクライアント２０−１に送信する左右別々の音声データが生成される。また、音声チャットクライアント２０−１、２０−３〜２０−ｎのそれぞれから送られてくる音声データに基づいて、音声チャットクライアント２０−２に送信する左右別々の音声データが生成される。このように、一の音声チャットクライアントを除く他の音声チャットクライアントから送られてくる各音声データに基づいて、この一の音声チャットクライアントに送信する音声データが生成される。それぞれの音声チャットクライアント２０−１〜２０−ｎに対応して生成する各音声の発声位置の設定方法については後述する。
【００２４】
音声圧縮処理部１１８は、三次元音声処理部１１６によって生成された各音声チャットクライアント毎の音声データ（非圧縮音声データ）をデータ送信に適した形式で圧縮する処理を行う。送信処理部１２０は、音声圧縮処理部１１８によって生成された三次元音声データとしての各圧縮音声データを、それぞれのデータ送信先となる各音声チャットクライアント２０−１等に向けて送信する処理を行う。
【００２５】
上述した音声チャットクライアント２０−１〜２０−ｎが端末装置に、送信処理部２１４、通信装置２２０が第１の送信手段に、受信処理部２１６、通信装置２２０が第１の受信手段に、Ｄ／Ａ変換部２３０、２３２、アンプ２４０、２４２が音声出力手段にそれぞれ対応する。また、受信処理部１１２、通信装置１００が第２の受信手段に、送信処理部１２０、通信装置１００が第２の送信手段に、三次元音声処理部１１６が三次元音声データ生成手段にそれぞれ対応する。
【００２６】
本実施形態の音声処理システムはこのような構成を有しており、次にその動作を説明する。
図４は、音声チャットクライアント２０−１等による音声データ送信の動作手順を示す流れ図である。音声チャットクライアント２０−１〜２０−ｎのそれぞれにおける音声データ送信の動作手順は同じであるため、以下では、音声チャットクライアント２０−１の動作に着目して説明を行うものとする。
【００２７】
音声チャットクライアント２０−１を使用する利用者が発声して、マイクロホン２２から出力されるモノラルの音声信号が音声チャットクライアント２０−１に入力されると（ステップ１００）、Ａ／Ｄ変換部２００は、この入力されたアナログの音声信号をデジタルの非圧縮音声データに変換する（ステップ１０１）。
【００２８】
次に、クライアント処理部２１０内の音声圧縮処理部２１２は、Ａ／Ｄ変換部２００から出力される非圧縮音声データを圧縮して送信用の圧縮音声データを生成する（ステップ１０２）。送信処理部２１４は、この圧縮音声データを通信装置２２０から音声チャットサーバ１０に向けて送信する（ステップ１０３）。
【００２９】
図５は、各音声チャットクライアント２０−１等から音声データが送られてきた音声チャットサーバ１０による三次元音声データ生成の動作手順を示す流れ図である。
各音声チャットクライアント２０−１等から圧縮音声データが送られてくると、サーバ処理部１１０内の受信処理部１１２は、これらの圧縮音声データを受信する処理を行う（ステップ２００）。音声伸張処理部１１４は、この受信した各音声チャットクライアント２０−１等毎の圧縮音声データを伸張して、元の非圧縮音声データを生成する（ステップ２０１）。
【００３０】
次に、三次元音声処理部１１６は、音声伸張処理部１１４によって生成された各音声チャットクライアント毎の非圧縮音声データを用いて、ｎ台の音声チャットクライアント２０−１〜２０−ｎのそれぞれに送信するｎ組の三次元音声データを、それぞれの音声の発声位置を考慮して生成する処理を行う（ステップ２０２）。例えば、音声チャットクライアント２０−１に送信する一組の三次元音声データとして、音声チャットクライアント２０−２〜２０−ｎのそれぞれのモノラルの非圧縮音声データを用いて、これらの音声チャットクライアントに対応する利用者の発声位置が特定可能なステレオ再生を行うために必要な右用の非圧縮音声データと左用の非圧縮音声データが生成される。同様にして、他の音声チャットクライアント２０−２〜２０−ｎのそれぞれに対応する（ｎ−１）組の三次元音声データが別々に生成される。
【００３１】
それぞれの音声チャットクライアント２０−１〜２０−ｎに対応して生成する各音声の発声位置の設定方法については、いくつかの具体例が考えられる。
（１）各音声チャットクライアント２０−１〜２０−ｎを所定の順番に適当に三次元空間上に配置し、この配置状態に基づいて各音声の発声位置を設定する。
【００３２】
（２）各音声チャットクライアント２０−１〜２０−ｎの地理的な配置に基づいて各音声の発声位置を設定する。例えば、各音声チャットクライアント２０−１〜２０−ｎの設置位置の経度、緯度、高度等があらかじめわかっており、あるいは接続の都度これらの情報が各音声チャットクライアントから送られてきて、これらの情報に基づいて地理的な配置を認識する場合や、各音声チャットクライアント２０−１〜２０−ｎに対応する電話番号に基づいて地理的な配置を認識する場合などが考えられる。
【００３３】
（３）会話を行っている各音声チャットクライアント２０−１〜２０−ｎの仮想的な状況に従って各音声の発声位置を設定する。例えば、本実施形態の音声処理システムを用いてロールプレイングゲームを行う場合を考えるものとする。各音声チャットクライアント２０−１〜２０−ｎでは、ゲームの仮想空間内で自己のキャラクタ（この音声チャットクライアントを操作するプレーヤに対応するキャラクタ）の位置を中心として、他のキャラクタ（他の音声チャットクライアントを操作するプレーヤに対応するキャラクタ）の仮想空間内での相対的な位置が決まるため、このようにして決定された各キャラクタの仮想空間内での位置に基づいて各音声の発声位置が設定される。各音声の発声位置は、ゲームの進行に伴って各キャラクタの相対的な位置が移動したときに、この移動内容に応じて変化する。
【００３４】
図６は、各音声チャットクライアント２０−１〜２０−ｎに送られる三次元音声データにおける各音声の発声位置を示す説明図である。図６において、各音声チャットクライアントに対応する三次元空間内での発声位置が丸印で、その発声の向きが矢印で示されている。また、具体的な発声位置の設定は、上述した（１）〜（３）のいずれかの手法を用いて行われるものとする。
【００３５】
このように各音声の相対的な発声位置が設定されている場合に、着目している一の音声チャットクライアントを中心として他の音声チャットクライアントの位置に各発声位置が設定された三次元音声データが生成される。例えば、音声チャットクライアント２０−１に送信される三次元音声データでは、音声チャットクライアント２０−１の位置を中心にして、その右側に音声チャットクライアント２０−２〜２０−ｎのそれぞれに対応した発声位置が設定される。また、音声チャットクライアント２０−２に送信される三次元音声データでは、音声チャットクライアント２０−２を中心にして、左前方に音声チャットクライアント２０−１に対応した発声位置が設定され、右側に音声チャットクライアント２０−３〜２０−ｎのそれぞれに対応した発声位置が設定される。他の音声チャットクライアント２０−２〜２０−ｎのそれぞれに送信される三次元音声データについても同様であり、送信先となる音声チャットクライアントを中心にして、他の音声チャットクライアントに対応した発声位置が設定される。
【００３６】
次に、音声圧縮処理部１１８は、三次元音声処理部１１６によって生成されたｎ組の三次元音声データを圧縮する（ステップ２０３）。これらｎ組の三次元音声データのそれぞれは、ｎ台の音声チャットクライアント２０−１〜２０−ｎのそれぞれに別々に送信されるため、ステップ２０３における圧縮処理は、ｎ組の三次元音声データのそれぞれについて別々に行われる。その後、送信処理部１２０は、圧縮された三次元音声データを通信装置１００から各音声チャットクライアントに向けて送信する処理を行う（ステップ２０４）。
【００３７】
図７は、音声チャットサーバ１０から三次元音声データが送られてきた音声チャットクライアント２０−１等による三次元音声データ受信の動作手順を示す流れ図である。音声チャットクライアント２０−１〜２０−ｎのそれぞれにおける三次元音声データ受信の動作手順は同じであるため、以下では、音声チャットクライアント２０−１の動作に着目して説明を行うものとする。
【００３８】
音声チャットサーバ１０から圧縮された三次元音声データが送られてくると、クライアント処理部２１０内の受信処理部２１６は、この圧縮された三次元音声データを受信する処理を行う（ステップ３００）。音声伸張処理部２１８は、この受信した三次元音声データを伸張して、右用の非圧縮音声データと左用の非圧縮音声データを生成する（ステップ３０１）。Ｄ／Ａ変換部２３０、２３２は、これら左右の非圧縮音声データを別々にアナログの音声信号に変換する（ステップ３０２）。その後、これらの音声信号は、アンプ２４０、２４２によって増幅されて、スピーカ２４、２６から出力される（ステップ３０３）。
【００３９】
このように、本実施形態の音声処理システムでは、各音声チャットクライアント２０−１等から送られてきた音声データに基づいて、音声チャットサーバ１０によって各音声チャットクライアント毎に対応する三次元音声データを生成し、各音声チャットクライアントに送信しているため、各音声チャットクライアント２０−１等では利用者の発声位置を特定可能な三次元音声データにより臨場感のあるステレオ音声を得ることができる。また、各音声チャットクライアント２０−１等と音声チャットサーバ１０との間を接続する通信回線３０−１等は、一の音声チャットクライアントから音声チャットサーバ１０に向けて送信される音声データと、反対に音声チャットサーバ１０からこの音声チャットクライアントに向けて送信される左右用の音声データとを考慮した容量を確保すれば十分であり、音声チャットクライアントおよびその利用者の数に応じて通信のデータ量が増加することを防止することができる。例えば、図１に示すように、モノラル音声やステレオの左音声、右音声のそれぞれの送受信に必要な帯域を１とすると、本実施形態の音声処理システムでは、各音声チャットクライアント２０−１等と音声チャットサーバ１０とを接続する通信回線３０−１等の通信帯域として３を確保すれば十分であり、しかもこの通信帯域の値は、接続される音声チャットクライアント２０−１等の台数によらず一定となる。
【００４０】
また、本実施形態の音声処理システムでは、一定の通信容量を確保しておくだけでよいため、本実施形態の音声処理システムを用いた音声チャットに参加する利用者数の制限をなくすことができる。
また、ゲームの仮想空間内に登場する複数のキャラクタのそれぞれが複数の音声チャットクライアント２０−１〜２０−ｎのそれぞれに対応したロールプレイングゲーム等を考えた場合には、音声チャットサーバ１０から各音声チャットクライアントに送信する三次元音声データにおける各音声の発声位置を、このゲームに登場するキャラクタの仮想空間内での位置に基づいて設定することが望ましい。これにより、複数の音声チャットクライアント２０−１〜２０−ｎを用いてゲームを行う場合に、各キャラクタ毎に異なる発声位置を設定することが可能になり、しかも、ゲームの進行に応じて仮想空間内において各キャラクタが移動すると各発声位置もその移動に伴って変化するゲームを実現することができるため、各プレーヤが他の登場キャラクタと実際に会話を行っているような擬似的な体験を行うことができる臨場感のあるゲームを実現することが可能になる。
【００４１】
なお、本発明は上記実施形態に限定されるものではなく、本発明の要旨の範囲内において種々の変形実施が可能である。例えば、上述した実施形態では、各音声チャットクライアント２０−１等において右音声と左音声からなる２チャンネルのステレオ再生を行うようにしたが、３チャンネル以上のステレオ再生を行いたい場合には、音声チャットサーバ１０において生成する三次元音声を構成する音声のチャンネル数を増やせばよい。例えば、５チャンネルの三次元音声に対応する音声データを音声チャットサーバ１０から各音声チャットクライアント２０−１等に送信することにより、各音声チャットクライアント２０−１等では、空間的な広がりを有する臨場感ある音場を再生することができるようになる。また、三次元音声を実現するためにチャンネル数を増加させた場合には、その分だけ通信回線３０−１等の通信容量を増加させる必要があるが、利用者数の増加に関係なく常に一定の通信容量を確保するだけでよい。
【００４２】
また、上述した実施形態では、音声チャットシステムに本発明を適用したが、その他の音声処理システム、例えば複数の会議室を通信回線で接続した電子会議システムに本発明を適用することができる。
また、上述した実施形態では、マイクロホン２２を用いて利用者の音声を集音したが、マイクロホン２２を用いる代わりに音声合成技術等を用いて利用者の音声に対応する音声データを生成するようにしてもよい。例えば、利用者としてのプレーヤが操作可能な複数の端末装置と、ゲームサーバとしての機能を有する音声処理装置とが通信回線を介して接続されている場合に、プレーヤが端末装置に備わったキーボードやその他の操作部を操作したときに、この操作内容に応じた音声データを端末装置によって生成して音声処理装置に向けて送信するようにしてもよい。
【００４３】
また、上述した実施形態では、端末装置としての音声チャットクライアントと音声処理装置としての音声チャットサーバとを別々に構成したが、音声処理装置に端末装置の機能を持たせたり、一の端末装置に音声処理装置の機能を持たせるようにしてもよい。
【００４４】
また、上述した実施形態では、音声チャットサーバ１０から音声チャットクライアント２０−１〜２０−ｎのそれぞれに、ステレオ音声の再生が可能な三次元音声データを送信したが、各音声チャットクライアント２０−１等から送られてきた個別音声データを単に合成して生成したモノラル音声用の合成音声データを送信するようにしてもよい。この場合には、音声チャットクライアント２０−１等に含まれる一方のＤ／Ａ変換器２３２やアンプ２４２および外付けされたスピーカ２６を省略することができる。しかも、この場合であっても、音声チャットクライアントおよびその利用者の数に応じて通信のデータ量が増加することを防止するとともに、音声チャットに参加する利用者数の制限をなくすことができる。
【００４５】
【発明の効果】
上述したように、本発明によれば、音声処理装置において各端末装置から送られてくる音声データに基づいて三次元音声データが生成されて各端末装置に送信されるため、各端末装置では他の端末装置から送信される別々の音声データを用いることなく、三次元音声データに基づいて各端末装置に利用者の発声位置が特定可能な音声の再生を行うことができる。このため、各端末装置と音声処理装置とを接続する通信回線は、端末装置から音声処理装置に向けて送信される音声データと、反対に音声処理装置からこの端末装置に向けて送信される三次元音声データとに対応した通信容量を確保するだけでよく、しかも三次元音声データのデータ量は端末装置の数に関係しないため、利用者の数に応じて通信のデータ量が増加することを防止することができる。また、利用者数に関係なく通信のデータ量がほぼ一定となることから、利用者数の制限をなくすことができる。
【図面の簡単な説明】
【図１】一実施形態の音声処理システムの全体構成を示す図である。
【図２】音声チャットクライアントの構成を示す図である。
【図３】音声チャットサーバの構成を示す図である。
【図４】音声チャットクライアントによる音声データ送信の動作手順を示す流れ図である。
【図５】各音声チャットクライアントから音声データが送られてきた音声チャットサーバによる三次元音声生成の動作手順を示す流れ図である。
【図６】各音声チャットクライアントに送られる三次元音声データにおける各音声の発声位置を示す説明図である。
【図７】音声チャットサーバから三次元音声データが送られてきた音声チャットクライアントによる三次元音声データ受信の動作手順を示す流れ図である。
【図８】従来の音声チャットシステムの構成を示す図である。
【符号の説明】
１０音声チャットサーバ
２０−１〜２０−ｎ音声チャットクライアント
２２マイクロホン
２４、２６スピーカ
１００、２２０通信装置
１１０サーバ処理部
１１２、２１６受信処理部
１１４、２１８音声伸張処理部
１１６三次元音声処理部
１１８、２１２音声圧縮処理部
１２０、２１４送信処理部
２１０クライアント処理部
[0001]
[Technical Field of the Invention]
The present invention relates to an audio processing system, apparatus, and method for transmitting and receiving audio data when a conversation is performed between a plurality of users using a plurality of terminals.
[0002]
[Prior art]
Voice chat systems have long been known in which the users of a plurality of terminals connected via a network converse with one another. For example, Internet service providers offer such voice chat systems so that a plurality of users can enjoy conversation (see, for example, Patent Document 1).
[0003]
FIG. 8 is a diagram showing the configuration of a conventional voice chat system. The voice chat system shown in FIG. 8 includes a voice chat server 300 and n voice chat clients 310-1 to 310-n connected to it. Each of the voice chat clients 310-1 and the like transmits the user's voice data collected by a microphone 322 (upstream voice) to the voice chat server 300, and the voice chat server 300 forwards to each voice chat client 310-1 or the like the voice data sent from the other voice chat clients 310-2 and the like. Each voice chat client 310-1 or the like receives from the voice chat server 300 the voice data of the (n-1) users other than its own (downstream voice), performs three-dimensional audio processing based on the position of each user, and, by giving each user a different utterance position, outputs realistic stereo sound from speakers 324 and 326.
[0004]
[Patent Document 1]
JP-A-2003-6132 (pages 3-5, FIGS. 1-7)
[0005]
[Problems to be solved by the invention]
Each of the voice chat clients 310-1 and the like in the conventional voice chat system described above receives the downstream voice of the (n-1) users other than its own. Adding the upstream voice of the one user that it transmits itself, each client simultaneously sends and receives voice data for n users over its line, so the amount of communication data grows as more users join the voice chat. Moreover, the voice chat clients 310-1 and the like are commonly connected to the voice chat server 300 via a telephone line, an ISDN line, or the like. With a line of such limited capacity, the number of users who can converse at the same time is limited by that capacity, and the number of participants cannot be increased. The same problem arises in voice conference systems and the like that transmit and receive voice data in the same manner as a voice chat system.
[0006]
The present invention has been made in view of the above points. Its object is to provide an audio processing system, apparatus, and method that can produce realistic stereo sound from three-dimensional audio data in which each user's utterance position can be identified, that can prevent the amount of communication data from increasing with the number of users, and that can eliminate the limit on the number of users.
[0007]
[Means for Solving the Problems]
To solve the problem described above, in the audio processing system of the present invention a plurality of terminal devices and an audio processing device are connected via communication lines. Each terminal device includes: first transmitting means for transmitting audio data corresponding to the user's voice to the audio processing device; first receiving means for receiving three-dimensional audio data, in which the utterance position of the user of each of the plurality of terminal devices can be identified, when it is sent from the audio processing device; and audio output means for outputting sound corresponding to the three-dimensional audio data received by the first receiving means from a plurality of speakers. The audio processing device includes: second receiving means for receiving the audio data sent from each of the plurality of terminal devices; three-dimensional audio data generating means for generating the three-dimensional audio data based on the audio data received by the second receiving means; and second transmitting means for transmitting the three-dimensional audio data generated by the three-dimensional audio data generating means to the terminal devices.
[0008]
The audio processing method of the present invention is applied to the audio processing system and includes the steps of: transmitting audio data corresponding to a user's voice from the first transmitting means of a terminal device to the second receiving means of the audio processing device; generating, by the three-dimensional audio data generating means and based on the audio data received by the second receiving means from the plurality of terminal devices, three-dimensional audio data in which the utterance position of the user of each terminal device can be identified; transmitting the generated three-dimensional audio data from the second transmitting means of the audio processing device to the first receiving means of each of the plurality of terminal devices; and outputting, by the audio output means, sound corresponding to the three-dimensional audio data received by the first receiving means from a plurality of speakers.
[0009]
When the audio processing system or audio processing device of the present invention is used, or the audio processing method of the present invention is executed, the audio processing device generates three-dimensional audio data from the audio data sent from the terminal devices and transmits it to each terminal device. Each terminal device can therefore reproduce realistic stereo sound in which each user's utterance position can be identified from the three-dimensional audio data alone, without using the separate audio data transmitted by the other terminal devices. Consequently, the communication line connecting each terminal device to the audio processing device only needs enough capacity for the audio data sent from that terminal device to the audio processing device and for the three-dimensional audio data sent in the opposite direction. Because the amount of three-dimensional audio data does not depend on the number of terminal devices, the amount of communication data is prevented from increasing with the number of users. And because the amount of communication data is substantially constant regardless of the number of users, the limit on the number of users can be eliminated.
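The capacity argument in this paragraph can be checked with simple arithmetic. The following Python sketch is illustrative only: it assumes that every mono, left, or right channel costs one bandwidth unit, and the function names `conventional_link_units` and `server_mixed_link_units` are hypothetical, not part of the disclosure.

```python
def conventional_link_units(n_users: int) -> int:
    """Channels carried on one client's line in the conventional relay system.

    Assumption: one upstream mono stream plus (n - 1) relayed downstream
    mono streams, each costing one unit.
    """
    return 1 + (n_users - 1)


def server_mixed_link_units(n_users: int, output_channels: int = 2) -> int:
    """Channels carried on one client's line when the server sends pre-mixed audio.

    Assumption: one upstream mono stream plus a fixed number of downstream
    channels (two for left/right stereo). n_users is accepted only to make
    the independence from the number of users explicit.
    """
    return 1 + output_channels


if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        print(n, conventional_link_units(n), server_mixed_link_units(n))
```

Under these assumptions the conventional figure grows linearly with the number of users, while the server-mixed figure stays at three units for two-channel stereo.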
[0010]
Another way to reduce the amount of communication data is to simply mix, in the audio processing device, the audio data sent from the terminal devices. That approach, however, cannot achieve one of the objects of the present invention, namely obtaining realistic stereo sound.
[0011]
Further, it is preferable that the first and second transmitting means described above compress and transmit the audio data, and the first and second receiving means decompress the received audio data. By transmitting and receiving compressed audio data between the terminal device and the audio processing device, it is possible to further reduce the amount of communication data.
[0012]
It is also desirable that the three-dimensional audio data generating means generate the three-dimensional audio data to be transmitted to one terminal device based on the audio data corresponding to the terminal devices other than that terminal device. When the audio processing device generates the three-dimensional audio data for one terminal device from the audio data sent by the other terminal devices and transmits it to that terminal device, each terminal device can output the voice of every user other than its own. Transmitting individual three-dimensional audio data to each terminal device in this way also means that each terminal device receives only the minimum necessary data, which reduces the amount of communication data.
[0013]
It is also desirable that the audio data sent from a terminal device to the audio processing device be audio data that can be reproduced in monaural, and that the three-dimensional audio data sent from the audio processing device to a terminal device be right and left audio data that can be reproduced in stereo. Using stereo sound consisting of a right channel and a left channel as the three-dimensional audio data allows each terminal device to identify the utterance position of each user. In addition, because only two kinds of audio data, right and left, need to be prepared as the three-dimensional audio data sent to each terminal device regardless of the number of users, the amount of communication data is small and substantially constant regardless of the number of users.
[0014]
It is also desirable that each of a plurality of characters appearing in the virtual space of a game correspond to one of the plurality of terminal devices, and that the three-dimensional audio data generating means set the utterance position of each voice in the three-dimensional audio data sent to one terminal device based on the positions in the virtual space of the characters corresponding to the other terminal devices. When a game is played using a plurality of terminal devices, a different utterance position can then be set for the character corresponding to each terminal device. Furthermore, as the characters move in the virtual space while the game progresses, the utterance positions change with them. This makes it possible to realize a realistic game in which each player has the simulated experience of actually talking with the other characters.
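One way to picture how character positions could be turned into utterance positions is to express every other character's location relative to the listener's own character. The sketch below is a hypothetical illustration, not the method disclosed here: it assumes a two-dimensional virtual space, a facing direction per character measured counter-clockwise from the +x axis, and an invented helper name `utterance_positions`.

```python
import math


def utterance_positions(listener_id, positions, heading_rad):
    """Return {other_id: (azimuth_rad, distance)} as heard by listener_id.

    positions:   {client_id: (x, y)} character coordinates (assumed 2-D).
    heading_rad: listener's facing direction, counter-clockwise from +x.
    azimuth_rad: 0 is straight ahead, positive is to the listener's right,
                 negative is to the left, wrapped into (-pi, pi].
    """
    lx, ly = positions[listener_id]
    result = {}
    for client_id, (x, y) in positions.items():
        if client_id == listener_id:
            continue  # the listener's own character is not a sound source
        world_angle = math.atan2(y - ly, x - lx)
        relative = heading_rad - world_angle
        azimuth = math.atan2(math.sin(relative), math.cos(relative))
        result[client_id] = (azimuth, math.hypot(x - lx, y - ly))
    return result
```

Recomputing this mapping whenever a character moves would make the utterance positions follow the game's progress, as the paragraph describes.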
[0015]
[Embodiments of the Invention]
Hereinafter, an audio processing system according to an embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the overall configuration of the audio processing system according to the embodiment. As shown in FIG. 1, the audio processing system of this embodiment includes a voice chat server 10 and n voice chat clients 20-1 to 20-n.
[0016]
Each of the voice chat clients 20-1 to 20-n is connected to a microphone 22 that collects the user's voice and to two speakers 24 and 26 that output stereo sound. Each client has a function of digitizing and then compressing the monaural voice collected by the microphone 22 and transmitting it to the voice chat server 10 via the communication line 30-1 or the like, and a function of decompressing the compressed stereo audio data, consisting of right audio and left audio, sent from the voice chat server 10 via the communication line 30-1 or the like.
[0017]
FIG. 2 is a diagram showing a configuration of the voice chat client 20-1. The other voice chat clients 20-2 to 20-n also have the same configuration.
As shown in FIG. 2, the voice chat client 20-1 includes an A/D (analog-to-digital) converter 200, a client processing unit 210, a communication device 220, D/A (digital-to-analog) converters 230 and 232, and amplifiers 240 and 242. The voice chat client 20-1 is also connected to the microphone 22, which collects the user's voice, and to the two speakers 24 and 26, which output stereo sound consisting of left audio and right audio.
[0018]
The A/D converter 200 converts the analog audio signal output from the microphone 22 into digital audio data. For example, the signal is converted into PCM audio data and input to the client processing unit 210.
The client processing unit 210 controls the transmission of the input voice data to the voice chat server 10 and the reception of the voice data transmitted from the voice chat server 10. To this end, the client processing unit 210 includes an audio compression processing unit 212, a transmission processing unit 214, a reception processing unit 216, and an audio decompression processing unit 218.
[0019]
The audio compression processing unit 212 applies a predetermined compression process to the audio data input from the A/D converter 200 to generate compressed audio data. The transmission processing unit 214 transmits the compressed audio data generated by the audio compression processing unit 212 to the voice chat server 10. The reception processing unit 216 receives the compressed audio data sent from the voice chat server 10. The audio decompression processing unit 218 decompresses the compressed audio data received by the reception processing unit 216 to generate uncompressed audio data.
[0020]
The communication device 220 transmits and receives physical electrical signals over the communication line 30-1 that connects the voice chat client 20-1 to the voice chat server 10. The D/A converter 230 converts the uncompressed audio data for the right channel output from the client processing unit 210 into an analog audio signal, which is amplified by the amplifier 240 and output from the speaker 24. Similarly, the D/A converter 232 converts the uncompressed audio data for the left channel output from the client processing unit 210 into an analog audio signal, which is amplified by the amplifier 242 and output from the speaker 26.
[0021]
FIG. 3 is a diagram showing the configuration of the voice chat server 10. As shown in FIG. 3, the voice chat server 10 includes a communication device 100 and a server processing unit 110. The communication device 100 transmits and receives physical electrical signals over the communication lines 30-1 and the like that connect the voice chat server 10 to the voice chat clients 20-1 and the like. The server processing unit 110 combines the audio data sent from the voice chat clients 20-1 and the like and sends the result back to each voice chat client. To this end, the server processing unit 110 includes a reception processing unit 112, an audio decompression processing unit 114, a three-dimensional audio processing unit 116, an audio compression processing unit 118, and a transmission processing unit 120.
[0022]
The reception processing unit 112 receives the compressed audio data, corresponding to monaural audio, sent from each voice chat client 20-1 or the like. The audio decompression processing unit 114 decompresses each item of compressed audio data received by the reception processing unit 112 into uncompressed audio data.
[0023]
Based on the uncompressed audio data corresponding to each of the voice chat clients 20-1 to 20-n, the three-dimensional audio processing unit 116 generates, for each voice chat client 20-1 or the like, three-dimensional audio consisting of right audio and left audio in which the utterance positions of the users of the other voice chat clients can be identified. Thus, separate left and right audio data to be transmitted to the voice chat client 20-1 is generated from the audio data sent from each of the voice chat clients 20-2 to 20-n. Likewise, separate left and right audio data to be transmitted to the voice chat client 20-2 is generated from the audio data sent from each of the voice chat clients 20-1 and 20-3 to 20-n. In this way, the audio data to be transmitted to a given voice chat client is generated from the audio data sent from all the voice chat clients except that one. The method of setting the utterance position of each voice generated for each of the voice chat clients 20-1 to 20-n is described later.
[0024]
The audio compression processing unit 118 compresses the audio data (uncompressed audio data) generated for each voice chat client by the three-dimensional audio processing unit 116 into a format suitable for data transmission. The transmission processing unit 120 transmits each item of compressed audio data generated by the audio compression processing unit 118, as three-dimensional audio data, to the voice chat client 20-1 or the like that is its destination.
[0025]
The voice chat clients 20-1 to 20-n described above correspond to the terminal devices; the transmission processing unit 214 and the communication device 220 to the first transmitting means; the reception processing unit 216 and the communication device 220 to the first receiving means; and the D/A converters 230 and 232 and the amplifiers 240 and 242 to the audio output means. Likewise, the reception processing unit 112 and the communication device 100 correspond to the second receiving means; the transmission processing unit 120 and the communication device 100 to the second transmitting means; and the three-dimensional audio processing unit 116 to the three-dimensional audio data generating means.
[0026]
The audio processing system of the present embodiment has such a configuration, and the operation will be described next.
FIG. 4 is a flowchart showing an operation procedure of voice data transmission by the voice chat client 20-1 or the like. Since the operation procedure of voice data transmission in each of the voice chat clients 20-1 to 20-n is the same, the following description will focus on the operation of the voice chat client 20-1.
[0027]
When the user of the voice chat client 20-1 speaks and the monaural voice signal output from the microphone 22 is input to the voice chat client 20-1 (step 100), the A/D conversion unit 200 converts the input analog audio signal into digital uncompressed audio data (step 101).
[0028]
Next, the audio compression processing unit 212 in the client processing unit 210 compresses the uncompressed audio data output from the A/D conversion unit 200 to generate compressed audio data for transmission (step 102). The transmission processing unit 214 transmits the compressed voice data from the communication device 220 to the voice chat server 10 (step 103).
[0029]
FIG. 5 is a flowchart showing an operation procedure of generating three-dimensional voice data by the voice chat server 10 to which voice data has been sent from each voice chat client 20-1 or the like.
When the compressed voice data is sent from each voice chat client 20-1 or the like, the reception processing unit 112 in the server processing unit 110 performs a process of receiving the compressed voice data (Step 200). The voice decompression processing unit 114 decompresses the received compressed voice data for each voice chat client 20-1 and the like to generate original uncompressed voice data (Step 201).
[0030]
Next, using the uncompressed voice data generated for each voice chat client by the voice decompression processing unit 114, the three-dimensional voice processing unit 116 generates n sets of three-dimensional audio data, one for transmission to each of the n voice chat clients 20-1 to 20-n, taking the utterance position of each voice into account (step 202). For example, as the set of three-dimensional voice data to be transmitted to the voice chat client 20-1, the monaural uncompressed voice data of each of the voice chat clients 20-2 to 20-n is used to generate the right and left uncompressed audio data needed for stereo reproduction in which the utterance positions of the users of those voice chat clients can be identified. Similarly, (n-1) sets of three-dimensional voice data corresponding to the other voice chat clients 20-2 to 20-n are generated separately.
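The generation in step 202 can be sketched as follows. This is a minimal illustration rather than the embodiment's actual algorithm: the text does not specify how the left and right signals are computed, so constant-power panning is used here as a stand-in. The names `pan_gains` and `generate_3d_sets`, and the `azimuths` table (the direction of each speaker as heard by each listener, in degrees), are assumptions introduced for illustration.

```python
import math

def pan_gains(azimuth_deg):
    """Constant-power pan (assumed stand-in for 3D processing).
    -90 = fully left, 0 = center, +90 = fully right."""
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)  # (left gain, right gain)

def generate_3d_sets(mono_frames, azimuths):
    """mono_frames[i]: decoded monaural samples from client i.
    azimuths[j][i]: direction of speaker i as heard by listener j (degrees).
    Returns n (left, right) pairs, one per listener."""
    n = len(mono_frames)
    length = len(mono_frames[0])
    sets = []
    for listener in range(n):
        left = [0.0] * length
        right = [0.0] * length
        for speaker in range(n):
            if speaker == listener:
                continue  # a client's own voice is not part of its mix
            gain_l, gain_r = pan_gains(azimuths[listener][speaker])
            for k, sample in enumerate(mono_frames[speaker]):
                left[k] += gain_l * sample
                right[k] += gain_r * sample
        sets.append((left, right))
    return sets
```

Each listener receives exactly one left/right pair regardless of n, which is what keeps the downlink size independent of the number of participants.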
[0031]
Several specific examples are conceivable for the method of setting the utterance position of each voice generated corresponding to each voice chat client 20-1 to 20-n.
(1) The voice chat clients 20-1 to 20-n are arranged in a predetermined order in a three-dimensional space, and the utterance position of each voice is set based on that arrangement.
[0032]
(2) The utterance position of each voice is set based on the geographical arrangement of the voice chat clients 20-1 to 20-n. For example, the longitude, latitude, altitude, and the like of the installation position of each voice chat client 20-1 to 20-n may be known in advance, this information may be sent from each voice chat client each time a connection is made, or the geographical arrangement may be recognized from the telephone numbers corresponding to the voice chat clients 20-1 to 20-n.
[0033]
(3) The utterance position of each voice is set according to the virtual situation in which the voice chat clients 20-1 to 20-n are conversing. For example, assume that a role playing game is played using the audio processing system of the present embodiment. For each of the voice chat clients 20-1 to 20-n, the relative positions in the virtual space of the game of the other characters (the characters corresponding to the players operating the other voice chat clients) are determined with the position of its own character (the character corresponding to the player operating that voice chat client) as the center, so the utterance position of each voice is set based on these character positions in the virtual space. When the relative positions of the characters change as the game progresses, the utterance position of each voice changes accordingly.
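As a rough illustration of method (3), the relative direction of each other character can be derived from character coordinates in the virtual space. The sketch below assumes a flat two-dimensional map, a heading measured clockwise from the +y axis, and the hypothetical helper names `relative_azimuth` and `azimuth_table`; none of these conventions come from the embodiment itself.

```python
import math

def relative_azimuth(listener_pos, listener_heading_deg, speaker_pos):
    """Direction of the speaker as seen from the listener, in degrees.
    0 = straight ahead, positive = to the right, negative = to the left.
    Assumed convention: heading is measured clockwise from the +y axis."""
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    bearing = math.degrees(math.atan2(dx, dy))  # clockwise from +y
    # Wrap the difference into the range [-180, 180)
    return (bearing - listener_heading_deg + 180.0) % 360.0 - 180.0

def azimuth_table(positions, headings):
    """table[j][i] = direction of character i as heard by character j.
    The diagonal is unused and set to 0.0."""
    n = len(positions)
    return [
        [0.0 if i == j else relative_azimuth(positions[j], headings[j], positions[i])
         for i in range(n)]
        for j in range(n)
    ]
```

Recomputing the table whenever a character moves gives utterance positions that follow the progress of the game.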
[0034]
FIG. 6 is an explanatory diagram showing the utterance positions of each voice in the three-dimensional voice data sent to each voice chat client 20-1 to 20-n. In FIG. 6, the utterance position in the three-dimensional space corresponding to each voice chat client is indicated by a circle, and the direction of the utterance is indicated by an arrow. Further, it is assumed that the specific utterance position is set by using any one of the above-described methods (1) to (3).
[0035]
Once the relative utterance positions of the voices have been set in this way, three-dimensional voice data is generated in which, centered on the position of the one voice chat client of interest, the utterance positions of the other voice chat clients are set. For example, in the three-dimensional voice data transmitted to the voice chat client 20-1, the utterance positions corresponding to the voice chat clients 20-2 to 20-n are set on the right side, centered on the position of the voice chat client 20-1. In the three-dimensional voice data transmitted to the voice chat client 20-2, the utterance position corresponding to the voice chat client 20-1 is set to the front left, centered on the voice chat client 20-2, and the utterance positions corresponding to the voice chat clients 20-3 to 20-n are set on the right side. The same applies to the three-dimensional voice data transmitted to each of the remaining voice chat clients: the utterance positions corresponding to the other voice chat clients are set centered on the voice chat client that is the transmission destination.
[0036]
Next, the audio compression processing unit 118 compresses the n sets of three-dimensional audio data generated by the three-dimensional audio processing unit 116 (step 203). Because the n sets are transmitted separately to the n voice chat clients 20-1 to 20-n, the compression in step 203 is performed separately for each of the n sets. Thereafter, the transmission processing unit 120 transmits the compressed three-dimensional voice data from the communication device 100 to each voice chat client (step 204).
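The per-set compression of step 203 might look like the sketch below. The embodiment does not name a codec or a packet layout, so this example assumes interleaved 16-bit little-endian samples and uses zlib purely as a stand-in compressor; `pack_stereo`, `unpack_stereo`, and `compress_all` are hypothetical names.

```python
import struct
import zlib

def pack_stereo(left, right):
    """Interleave 16-bit left/right samples and compress them.
    zlib is only a stand-in for whatever voice codec is actually used."""
    interleaved = [s for pair in zip(left, right) for s in pair]
    raw = struct.pack("<%dh" % len(interleaved), *interleaved)
    return zlib.compress(raw)

def unpack_stereo(payload):
    """Inverse of pack_stereo, as a receiving client would apply it."""
    raw = zlib.decompress(payload)
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return list(samples[0::2]), list(samples[1::2])

def compress_all(sets):
    """sets[j] = (left, right) for client j.
    Each set is compressed on its own because each goes to a different client."""
    return [pack_stereo(left, right) for left, right in sets]
```

Compressing each set independently means a payload can be sent to its destination client without reference to any other client's data.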
[0037]
FIG. 7 is a flowchart showing an operation procedure of receiving the three-dimensional voice data by the voice chat client 20-1 or the like to which the three-dimensional voice data has been transmitted from the voice chat server 10. Since the operation procedure of receiving the three-dimensional audio data is the same in each of the voice chat clients 20-1 to 20-n, the following description focuses on the operation of the voice chat client 20-1.
[0038]
When the compressed three-dimensional voice data is transmitted from the voice chat server 10, the reception processing unit 216 in the client processing unit 210 performs a process of receiving the compressed three-dimensional voice data (step 300). The audio expansion processor 218 expands the received three-dimensional audio data to generate right uncompressed audio data and left uncompressed audio data (step 301). The D / A converters 230 and 232 separately convert these left and right uncompressed audio data into analog audio signals (step 302). Thereafter, these audio signals are amplified by the amplifiers 240 and 242 and output from the speakers 24 and 26 (step 303).
[0039]
As described above, in the voice processing system of the present embodiment, the voice chat server 10 generates three-dimensional voice data for each voice chat client based on the voice data transmitted from the voice chat clients 20-1 and the like, and transmits it to each voice chat client. Each voice chat client 20-1 or the like can therefore obtain realistic stereo sound from three-dimensional voice data in which the utterance positions of the users can be identified. Moreover, the communication line 30-1 or the like connecting the voice chat client 20-1 or the like to the voice chat server 10 only needs a capacity that accounts for the voice data transmitted from one voice chat client to the voice chat server 10 and the left and right voice data transmitted from the voice chat server 10 to that voice chat client, so the amount of communication data can be prevented from increasing with the number of voice chat clients and their users. For example, as shown in FIG. 1, if the band required to transmit or receive each of the monaural voice, the stereo left voice, and the stereo right voice is taken as 1, it is sufficient in the voice processing system of the present embodiment to secure a band of 3 on the communication line 30-1 or the like connecting each voice chat client 20-1 or the like to the voice chat server 10, and this value is constant regardless of the number of connected voice chat clients 20-1 and the like.
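The band comparison can be written out as simple arithmetic. In the sketch below, one unit is the band of a single voice channel as in the text; the function names and the `channels` parameter are illustrative assumptions.

```python
def conventional_band(n, unit=1):
    """Per-line band in the conventional system of FIG. 8:
    one uplink voice plus (n - 1) forwarded downlink voices."""
    return unit * (1 + (n - 1))

def proposed_band(channels=2, unit=1):
    """Per-line band in the present embodiment:
    one monaural uplink plus a fixed number of downlink channels
    (2 for stereo left and right)."""
    return unit * (1 + channels)

# The conventional band grows with the number of clients n,
# while the proposed band stays at 3 for stereo.
for n in (2, 10, 100):
    print(n, conventional_band(n), proposed_band())
```

With 100 clients, the conventional arrangement needs a band of 100 per line, whereas the stereo arrangement of the embodiment still needs 3.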
[0040]
Further, in the voice processing system of the present embodiment, since it is only necessary to secure a fixed communication capacity, the limit on the number of users participating in voice chat through the system can be eliminated.
Further, in a role playing game or the like in which each of a plurality of characters appearing in the virtual space of the game corresponds to one of the voice chat clients 20-1 to 20-n, it is desirable for the voice chat server 10 to set the utterance position of each voice in the three-dimensional voice data transmitted to each voice chat client based on the positions in the virtual space of the characters appearing in the game. When a game is played using the voice chat clients 20-1 to 20-n, a different utterance position can then be set for each character, and as the characters move within the virtual space with the progress of the game, each utterance position changes accordingly. Each player thus has the simulated experience of actually talking with the other characters, which makes for a game with a sense of realism.
[0041]
Note that the present invention is not limited to the above embodiment, and various modifications can be made within the scope of the present invention. For example, in the above-described embodiment, two-channel stereo playback consisting of right audio and left audio is performed in each voice chat client 20-1 or the like, but the number of audio channels constituting the three-dimensional audio generated in the voice chat server 10 may be increased. For example, by transmitting voice data corresponding to five-channel three-dimensional audio from the voice chat server 10 to each voice chat client 20-1 or the like, each voice chat client can reproduce a realistic sound field with spatial spread. Increasing the number of channels requires a correspondingly larger communication capacity on the communication line 30-1 and the like, but the required capacity is still constant regardless of the number of users, so it is only necessary to secure that fixed capacity.
[0042]
In the above-described embodiment, the present invention is applied to the voice chat system. However, the present invention can be applied to other voice processing systems, for example, an electronic conference system in which a plurality of conference rooms are connected by a communication line.
In the above-described embodiment, the user's voice is collected using the microphone 22. However, instead of using the microphone 22, voice data corresponding to the user's voice may be generated using voice synthesis technology or the like. For example, when a plurality of terminal devices operated by players as users are connected via communication lines to an audio processing device that functions as a game server, audio data corresponding to the operation may be generated by the terminal device and transmitted to the audio processing device when a player operates a keyboard or other operation unit provided on the terminal device.
[0043]
In the above-described embodiment, the voice chat client as the terminal device and the voice chat server as the voice processing device are configured separately. However, the voice processing device may also have the function of a terminal device, or one of the terminal devices may have the function of the voice processing device.
[0044]
Further, in the above-described embodiment, three-dimensional audio data capable of reproducing stereo audio is transmitted from the voice chat server 10 to each of the voice chat clients 20-1 to 20-n. However, synthesized monaural voice data generated by simply combining the individual voice data sent from the voice chat clients 20-1 and the like may be transmitted instead. In this case, the D/A converter 232, the amplifier 242, and the external speaker 26 included in the voice chat client 20-1 or the like can be omitted. Even in this case, the amount of communication data can be prevented from increasing with the number of voice chat clients and their users, and the limit on the number of users participating in the voice chat can be eliminated.
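A minimal sketch of this monaural variant follows. The text only says the individual voice data is simply combined; excluding the recipient's own voice is an assumption carried over from the stereo case, and `mono_mixes` is a hypothetical name.

```python
def mono_mixes(mono_frames):
    """mono_frames[i]: decoded monaural samples from client i.
    Returns one mixed monaural frame per client, each being the sum of
    all other clients' samples (own voice excluded, by assumption)."""
    n = len(mono_frames)
    length = len(mono_frames[0])
    # Sum every client once, then subtract the recipient's own frame,
    # which leaves the sum of the other (n - 1) clients.
    total = [sum(frame[k] for frame in mono_frames) for k in range(length)]
    return [
        [total[k] - mono_frames[j][k] for k in range(length)]
        for j in range(n)
    ]
```

A real mixer would also need to limit or scale the sum to avoid clipping, which is omitted here. The downlink is a single channel in this variant, so the per-line band stays fixed regardless of the number of clients.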
[0045]
[Effect of the invention]
As described above, according to the present invention, the audio processing device generates three-dimensional audio data based on the audio data transmitted from each terminal device and transmits it to each terminal device, so each terminal device can reproduce sound in which the utterance positions of the users can be identified from the three-dimensional audio data, without using the separate audio data transmitted from the other terminal devices. For this reason, the communication line connecting each terminal device and the audio processing device only needs a communication capacity corresponding to the audio data transmitted from the terminal device to the audio processing device and the three-dimensional audio data transmitted from the audio processing device to the terminal device. Moreover, since the data volume of the three-dimensional audio data does not depend on the number of terminal devices, the communication data volume can be prevented from increasing with the number of users. Further, since the amount of communication data is substantially constant irrespective of the number of users, the limit on the number of users can be eliminated.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an overall configuration of a voice processing system according to an embodiment.
FIG. 2 is a diagram showing a configuration of a voice chat client.
FIG. 3 is a diagram showing a configuration of a voice chat server.
FIG. 4 is a flowchart showing an operation procedure of voice data transmission by a voice chat client.
FIG. 5 is a flowchart showing an operation procedure of three-dimensional voice generation by a voice chat server to which voice data has been sent from each voice chat client.
FIG. 6 is an explanatory diagram showing utterance positions of each voice in three-dimensional voice data sent to each voice chat client.
FIG. 7 is a flowchart showing an operation procedure of receiving three-dimensional voice data by a voice chat client to which three-dimensional voice data has been sent from a voice chat server.
FIG. 8 is a diagram showing a configuration of a conventional voice chat system.
[Explanation of symbols]
10 Voice chat server
20-1 to 20-n Voice chat clients
22 Microphone
24, 26 Speakers
100, 220 Communication device
110 Server processing unit
112, 216 Reception processing unit
114, 218 Voice expansion processing unit
116 Three-dimensional voice processing unit
118, 212 Voice compression processing unit
120, 214 Transmission processing unit
210 Client processing unit

Claims

An audio processing system in which a plurality of terminal devices and an audio processing device are connected via a communication line, wherein
the terminal device comprises:
first transmission means for transmitting audio data corresponding to a user's voice to the audio processing device;
first receiving means for receiving three-dimensional audio data, from which the utterance position of each user of the plurality of terminal devices can be identified, when that data is sent from the audio processing device; and
audio output means for outputting, from a plurality of speakers, sound corresponding to the three-dimensional audio data received by the first receiving means, and
the audio processing device comprises:
second receiving means for receiving the audio data sent from each of the plurality of terminal devices;
three-dimensional audio data generating means for generating the three-dimensional audio data based on the audio data received by the second receiving means; and
second transmission means for transmitting the three-dimensional audio data generated by the three-dimensional audio data generating means to the terminal device.

The audio processing system according to claim 1, wherein
the first and second transmission means compress the audio data before transmitting it, and
the first and second receiving means decompress the received audio data.

The audio processing system according to claim 1 or 2, wherein
the three-dimensional audio data generating means generates the three-dimensional audio data to be transmitted to one terminal device based on the audio data corresponding to the terminal devices other than that terminal device.

The audio processing system according to any one of claims 1 to 3, wherein
the audio data sent from the terminal device to the audio processing device is audio data that can be reproduced in monaural, and
the three-dimensional audio data sent from the audio processing device to the terminal device is right and left audio data that can be reproduced in stereo.

The audio processing system according to any one of claims 1 to 4, wherein
each of a plurality of characters appearing in the virtual space of a game corresponds to one of the plurality of terminal devices, and
the three-dimensional audio data generating means sets the utterance position of each voice in the three-dimensional audio data to be transmitted to one terminal device based on the positions in the virtual space of the characters corresponding to the terminal devices other than that terminal device.

An audio processing device connected to a plurality of terminal devices via a communication line, comprising:
second receiving means for receiving audio data sent from each of the plurality of terminal devices;
three-dimensional audio data generating means for generating three-dimensional audio data based on the audio data received by the second receiving means; and
second transmission means for transmitting the three-dimensional audio data generated by the three-dimensional audio data generating means to the terminal devices.

The audio processing device according to claim 6, wherein
the three-dimensional audio data generating means generates the three-dimensional audio data to be transmitted to one terminal device based on the audio data corresponding to the terminal devices other than that terminal device.

An audio processing method in an audio processing system comprising: a plurality of terminal devices each having a plurality of speakers, first transmission means for transmitting audio data, first receiving means for receiving three-dimensional audio data, and audio output means for outputting, from the plurality of speakers, sound corresponding to the three-dimensional audio data received by the first receiving means; and an audio processing device having second transmission means and second receiving means for transmitting and receiving the audio data or the three-dimensional audio data to and from the plurality of terminal devices, and three-dimensional audio data generating means for generating, based on the audio data received by the second receiving means, the three-dimensional audio data from which the utterance position of each user of the plurality of terminal devices can be identified, the method comprising:
transmitting the audio data corresponding to the user's voice from the first transmission means provided in the terminal device to the second receiving means provided in the audio processing device;
generating, by the three-dimensional audio data generating means, the three-dimensional audio data from which the utterance position of each user of the plurality of terminal devices can be identified, based on the audio data received by the second receiving means from the plurality of terminal devices;
transmitting the three-dimensional audio data generated by the three-dimensional audio data generating means from the second transmission means provided in the audio processing device to the first receiving means provided in each of the plurality of terminal devices; and
outputting, by the audio output means, sound corresponding to the three-dimensional audio data received by the first receiving means from the plurality of speakers.

The audio processing method according to claim 8, wherein
the generation of the three-dimensional audio data by the three-dimensional audio data generating means is performed for each of the plurality of terminal devices, and
the three-dimensional audio data transmitted to one of the terminal devices is generated based on the audio data transmitted from the terminal devices other than that terminal device.