JP2006180251A

JP2006180251A - Voice signal processor for enabling callers to perform simultaneous utterance, and program

Info

Publication number: JP2006180251A
Application number: JP2004371876A
Authority: JP
Inventors: Noriyuki Hata; 紀行畑
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-12-22
Filing date: 2004-12-22
Publication date: 2006-07-06

Abstract

PROBLEM TO BE SOLVED: To provide, in a system that synthesizes and outputs the voices of a plurality of callers, a means that allows many callers to speak while keeping the voices of different callers distinguishable.

SOLUTION: A voice signal processing server 13 receives the voice signal of a participant 19 from each terminal apparatus 11. A cumulative time data generator 132 generates cumulative time data indicating the accumulated utterance time represented by each voice signal. A terminal apparatus selector 133 selects, based on the cumulative time data, a prescribed number of terminal apparatuses 11 belonging to the participants 19 with the longest utterance times. A sound image position allocating part 135 allocates a different sound image position to each of the selected terminal apparatuses 11. A voice signal processing part 136 processes each voice signal so that its sound image position coincides with the allocated sound image position. The voice signals with adjusted sound image positions are mixed and transmitted to the terminal apparatuses 11.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an audio signal processing technique that allows a plurality of speakers to speak simultaneously through acoustic equipment in settings such as conferences.

In audio conferences joined by speakers at multiple locations, there is a technique that gives each speaker's voice a different sound image localization so that the listener can tell the speakers apart. In this technique, the audio signal input from a microphone placed near each speaker is split into left and right channel signals, and the sound pressure of each channel is then adjusted to change the sound image position of the voice that the signal represents. Because a different sound pressure adjustment is applied for each speaker, each speaker's voice is placed at a different sound image position. The adjusted signals of all speakers are then summed and output to loudspeakers placed near the listener. As a result, even when the voices of several speakers are emitted from the loudspeakers at the same time, each voice has a different sound image position, so the listener can distinguish the speakers. Patent Document 1, for example, discloses such prior art.
Patent Document 1: Japanese Patent Laid-Open No. 11-127499

In the prior art, as many sound image positions are provided as there are speakers. As the number of speakers grows, the distance between adjacent sound image positions therefore shrinks, and it becomes difficult for the listener to distinguish the voices of different speakers placed at adjacent sound image positions.

In view of the above, an object of the present invention is to provide, in a system that synthesizes and outputs the voices of multiple speakers, a means that allows a large number of speakers to speak while still letting the listener distinguish the voices of different speakers.

To achieve the above object, the present invention provides an audio signal processing apparatus comprising: input means for receiving audio signals from a plurality of terminal devices; cumulative time data generating means for generating cumulative time data indicating the cumulative duration of the audio signals received by the input means from each of the plurality of terminal devices within a predetermined past period; selection means for selecting a predetermined number of terminal devices from among the plurality of terminal devices based on the cumulative time data; assigning means for assigning mutually different sound image positions to the terminal devices selected by the selection means; processing means for, when the input means receives an audio signal from a terminal device selected by the selection means, processing that audio signal into an audio signal having the sound image position assigned to that terminal device by the assigning means, and, when the input means receives an audio signal from a terminal device not selected by the selection means, processing that audio signal into an audio signal having a predetermined sound image position; and output means for outputting the audio signals processed by the processing means.

With an audio signal processing apparatus of this configuration, the audio signals of different speakers are dynamically placed at a predetermined number of different sound image positions, so the spacing between adjacent sound image positions stays at or above a fixed value even when there are many speakers and many terminal devices transmitting audio signals. This prevents the listener from confusing the voices of different speakers contained in the sound reproduced from the generated synthesized audio signal.

In a preferred aspect, the assigning means may be configured so that, when changing the sound image position assigned to a terminal device from one position to another, it assigns to that terminal device a sound image position that shifts over time from the one position to the other.

With this configuration, even when the sound image position assigned to a terminal device is changed, a synthesized audio signal is generated in which the sound image position moves smoothly, so a listener of the reproduced sound can easily keep track of the same speaker's voice.

In another preferred aspect, the audio signal processing apparatus may include frequency characteristic specifying means for generating, for each of the terminal devices selected by the selection means, frequency characteristic data indicating the frequency characteristics of the audio signal received from that terminal device, and the assigning means may assign sound image positions to the selected terminal devices so that, for any two sound image positions, the similarity between the frequency characteristic data of the audio signals received from the terminal devices to which those positions are assigned satisfies a predetermined condition.

With this configuration, when the frequency characteristics of different speakers' audio signals resemble each other, their sound image positions can, for example, be separated by at least a fixed distance, which more reliably keeps a listener of the reproduced synthesized audio signal from confusing the voices of different speakers.

The present invention further provides an audio signal processing apparatus comprising: input means for receiving audio signals from a plurality of terminal devices; cumulative time data generating means for generating cumulative time data indicating the cumulative duration of the audio signals received by the input means from each of the plurality of terminal devices within a predetermined past period; selection means for selecting a predetermined number of terminal devices from among the plurality of terminal devices based on the cumulative time data; assigning means for assigning mutually different frequency characteristics to the terminal devices selected by the selection means; processing means for, when the input means receives an audio signal from a terminal device selected by the selection means, processing that audio signal into an audio signal having the frequency characteristics assigned to that terminal device by the assigning means, and, when the input means receives an audio signal from a terminal device not selected by the selection means, processing that audio signal into an audio signal having predetermined frequency characteristics; and output means for outputting the audio signals processed by the processing means.

With this configuration, the audio signals of different speakers are processed to have different frequency characteristics, producing a synthesized audio signal in which the voices of different speakers can be distinguished. At the same time, the number of frequency characteristics is limited to a predetermined number, and they are dynamically assigned to the audio signals of different speakers, so a difference of at least a fixed amount is maintained between the frequency characteristics of the processed audio signals even when there are many speakers and many terminal devices transmitting audio signals. This prevents the listener from confusing the voices of different speakers contained in the sound reproduced from the generated synthesized audio signal.

In a preferred aspect, the assigning means may be configured so that, when changing the frequency characteristic assigned to a terminal device from one characteristic to another, it assigns to that terminal device a frequency characteristic that shifts over time from the one characteristic to the other.

With this configuration, even when the frequency characteristic assigned to a terminal device is changed, a synthesized audio signal is generated in which the frequency characteristic changes smoothly, so a listener of the reproduced sound can easily keep track of the same speaker's voice.

The present invention also provides a program that causes a computer to execute the processing performed by the audio signal processing apparatus described above.

[1. First Embodiment]
[1.1. Configuration of the audio conference system]
FIG. 1 is a block diagram showing the configuration of an audio conference system 1 according to the first embodiment of the present invention. The audio conference system 1 allows conference participants at different locations to hold a conference by voice. It includes a network 10 that interconnects a plurality of communication devices, a plurality of terminal devices 11 each connected to the network 10, a headset 12 connected to each terminal device 11, and an audio signal processing server 13 connected to the network 10.

Each of the terminal devices 11 and headsets 12 is used by one of the conference participants 19. The number of participants who can join a conference using the audio conference system 1, that is, the number of terminal devices 11 and headsets 12, can be changed freely, and the set of participants may also change while the conference is in progress.

As shown in FIG. 1, when different participants 19 and the terminal devices 11 and headsets 12 they use need to be distinguished from one another, a suffix "-n" is appended, as in participant 19-n, terminal device 11-n, and headset 12-n, where "n" is an arbitrary natural number. When no such distinction is needed, they are referred to simply as participant 19, terminal device 11, and headset 12.

The network 10 includes one or more relay devices interconnected by wire or wirelessly, and relays data between different communication devices. The network 10 may be an open network with unrestricted users, such as the Internet, or it may be an intranet or a LAN (Local Area Network) that uses a communication protocol other than the Internet Protocol.

The terminal device 11 transmits an audio signal representing the voice of its participant 19 to the audio signal processing server 13, and receives from the audio signal processing server 13 a synthesized audio signal in which audio signals representing the voices of the other participants 19 have been combined. It may be, for example, a general-purpose personal computer, a PDA (Personal Digital Assistant), or a dedicated terminal device. The terminal device 11 includes: an audio signal processing unit 111 that converts the analog audio signal input from the headset 12 into a digital audio signal and converts digital audio signals into analog audio signals for output to the headset 12; an audio signal transmitting unit 112 that receives audio signals from the audio signal processing unit 111 and sends them to the network 10; an audio signal receiving unit 113 that receives audio signals from the network 10 and writes them to a storage unit 114; and the storage unit 114, which stores the control program of the terminal device 11 and the like and serves as a work area for the other units.

The headset 12 includes a microphone that generates an analog audio signal representing the voice of the participant 19 and outputs it to the terminal device 11, and headphones that convert the analog audio signal input from the terminal device 11 into sound. The headphones of the headset 12 are stereo headphones with separate left and right inputs, capable of producing different sounds on the left and right.

The audio signal processing server 13 receives an audio signal from each of the terminal devices 11, applies sound image position adjustment and mixing to the received audio signals to generate a synthesized audio signal, and transmits the synthesized audio signal to each of the terminal devices 11. The audio signal processing server 13 includes: an audio signal receiving unit 131 that receives audio signals from the network 10; a cumulative time data generating unit 132 that generates cumulative time data indicating, for each source terminal device 11, the cumulative duration of the voice represented by the audio signals received within a predetermined past period; a terminal device selection unit 133 that selects, based on the cumulative time data and other information, a predetermined number of terminal devices whose audio signals are to be placed in the sound space; a frequency characteristic specifying unit 134 that determines the frequency characteristics of the audio signal received from each terminal device 11 and generates frequency characteristic data; a sound image position assignment unit 135 that assigns to each source terminal device 11 the sound image position at which its audio signal is to be placed in the sound space; an audio signal processing unit 136 that processes each audio signal so that its sound image position becomes the position assigned by the sound image position assignment unit 135, producing an adjusted audio signal; a mixing unit 137 that mixes the adjusted audio signals to generate the synthesized audio signal; an audio signal transmitting unit 138 that sends the synthesized audio signal to the network 10; and a storage unit 139 that stores the control program of the audio signal processing server 13 and the like and serves as a work area for the other units. The storage unit 139 also stores a condition database 1391 containing condition data that specifies the conditions under which the terminal device selection unit 133 selects the predetermined number of terminal devices and the conditions under which the sound image position assignment unit 135 assigns sound image positions to the terminal devices 11.

In a conference with many participants 19, if the sound image position assignment unit 135 assigned a fixed, distinct sound image position to the voice of every participant 19, the distance between adjacent sound image positions would become short. Similar voices could then be assigned neighboring sound image positions, and when those voices speak at the same time, the mixed sound would be hard for the listener to tell apart even after sound image position adjustment. The terminal device selection unit 133 therefore limits the number of sound image positions by selecting, at each point in time, only a predetermined number of voices to which the sound image position assignment unit 135 should assign positions. That is, the sound image position assignment unit 135 assigns sound image positions only to the terminal devices 11 selected by the terminal device selection unit 133. As a result, the distance between adjacent sound image positions never falls below a fixed value, and the problem described above does not arise.
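The selection step can be illustrated with a minimal Python sketch. It assumes that the cumulative time data is available as a mapping from terminal ID to seconds of speech; the function name `select_terminals`, the terminal IDs, and the tie-breaking rule are illustrative assumptions and are not taken from the specification.

```python
from typing import Dict, List


def select_terminals(cumulative_seconds: Dict[str, float], count: int) -> List[str]:
    """Return the `count` terminal IDs with the longest cumulative speaking time."""
    # Sort by descending speaking time. Ties are broken by terminal ID so the
    # result is deterministic (an assumption; the text does not define ties).
    ranked = sorted(cumulative_seconds, key=lambda t: (-cumulative_seconds[t], t))
    return ranked[:count]


# Hypothetical cumulative speaking times in seconds for five terminals.
totals = {"11-1": 310.0, "11-2": 45.5, "11-3": 620.0, "11-4": 0.0, "11-5": 120.0}
selected = select_terminals(totals, 3)
print(selected)
```

Only the terminals in `selected` would receive individual sound image positions; every other terminal falls back to a shared position.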

The selection of terminal devices 11 by the terminal device selection unit 133 and the assignment of sound image positions by the sound image position assignment unit 135 are performed according to the condition data in the condition database 1391. Both units use the cumulative time data generated by the cumulative time data generating unit 132 in their selection or assignment processing. The cumulative time data indicates the total speaking time of each participant 19 within a predetermined past period and changes continuously as time passes. Consequently, the terminal devices 11 selected by the terminal device selection unit 133, and the sound image positions assigned to them by the sound image position assignment unit 135, change dynamically.

When assigning sound image positions, the sound image position assignment unit 135 can also use the frequency characteristic data generated by the frequency characteristic specifying unit 134. The frequency characteristic data indicates the characteristics of each participant 19's voice.

An audio signal received by the audio signal receiving unit 131 is adjusted by the audio signal processing unit 136 so that it has the sound image position assigned by the sound image position assignment unit 135 to its source terminal device 11. It is then mixed by the mixing unit 137 and transmitted to each terminal device 11 by the audio signal transmitting unit 138. This concludes the overview of the operation of the audio signal processing server 13.
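The adjust-then-mix flow can be sketched in Python as follows. This passage does not say how the audio signal processing unit 136 realizes a sound image position, so the sketch substitutes a standard constant-power stereo pan as an assumption; `pan_gains`, `mix`, and the mapping of angles from -90 to +90 degrees onto the pan range are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pan_gains(angle_deg: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains for an angle in -90..+90 degrees."""
    clamped = max(-90.0, min(90.0, angle_deg))
    # Map -90..+90 degrees onto 0..pi/2 so that left^2 + right^2 == 1.
    theta = (clamped + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)


def mix(sources: Sequence[Tuple[Sequence[float], float]]) -> List[Tuple[float, float]]:
    """Pan each mono source to its angle and sum the results into stereo frames."""
    length = max((len(samples) for samples, _ in sources), default=0)
    frames = [[0.0, 0.0] for _ in range(length)]
    for samples, angle_deg in sources:
        left_gain, right_gain = pan_gains(angle_deg)
        for i, sample in enumerate(samples):
            frames[i][0] += left_gain * sample
            frames[i][1] += right_gain * sample
    return [(left, right) for left, right in frames]


# Two hypothetical mono signals: one placed hard left, one hard right.
stereo = mix([([1.0, 1.0], -90.0), ([1.0], 90.0)])
```

A real server would operate on buffered audio frames per terminal; the per-sample loop here only keeps the sketch free of dependencies.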

Next, the condition database 1391 is described. FIG. 2 illustrates the contents of the condition database 1391. The condition database 1391 contains a plurality of records (the data rows in FIG. 2). Each record contains condition data and a priority that indicates which condition takes precedence when the content of that condition data conflicts with the content of other condition data. The priority is an integer, and a smaller number indicates a higher priority. The condition data illustrated in FIG. 2 is described below, although the types of condition data are not limited to those described here. First, the condition data "number of positions: 6" instructs that the number of sound image positions assigned to voices by the sound image position assignment unit 135 be six. In the following, the sound image positions assigned to voices by the sound image position assignment unit 135 are called "position 1", "position 2", and so on, from the listener's left to right. Each sound image position is specified by its angle to the left (negative) or right (positive) of the direction directly in front of the listener, which serves as the reference (0 degrees). For example, positions 1 to 6 are specified as -50 degrees, -30 degrees, -10 degrees, +10 degrees, +30 degrees, and +50 degrees, respectively.
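The example angles for six positions are evenly spaced between -50 and +50 degrees. A small Python sketch can reproduce them; generalizing to other position counts by even spacing, and the names `position_angles` and `spread_deg`, are assumptions made for illustration.

```python
from typing import List


def position_angles(count: int, spread_deg: float = 50.0) -> List[float]:
    """Evenly spaced angles from -spread_deg to +spread_deg, left to right."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [0.0]  # A single position sits directly in front of the listener.
    step = 2.0 * spread_deg / (count - 1)
    return [-spread_deg + i * step for i in range(count)]


angles = position_angles(6)
# Position numbers are 1-based, so position k maps to angles[k - 1].
print(angles)
```

With `count=6` the list matches the example values given for positions 1 to 6.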

The condition data "monitoring period: 20 minutes" instructs the cumulative time data generating unit 132 and the frequency characteristic specifying unit 134 to use only the audio signals received by the audio signal receiving unit 131 during the past 20 minutes. In this case, the cumulative time data generating unit 132 calculates the duration of the audio signals received from each terminal device 11 during the past 20 minutes and thereby generates cumulative time data indicating the total speaking time of each participant 19. Likewise, the frequency characteristic specifying unit 134 performs frequency analysis on the audio signals received during the past 20 minutes and thereby generates frequency characteristic data indicating the frequency characteristics of each participant 19's audio signal.
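The windowed accumulation can be sketched in Python. The sketch assumes that speech has already been segmented into (terminal ID, start, end) intervals measured in seconds; how the cumulative time data generating unit 132 detects speech is not described here, and the names `Segment` and `cumulative_time` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Segment = Tuple[str, float, float]  # (terminal_id, start_s, end_s)


def cumulative_time(
    segments: Iterable[Segment], now_s: float, window_s: float = 20 * 60
) -> Dict[str, float]:
    """Sum each terminal's speaking time that falls inside the monitoring window."""
    window_start = now_s - window_s
    totals: Dict[str, float] = {}
    for terminal_id, start_s, end_s in segments:
        # Count only the part of the segment that overlaps the window.
        overlap = min(end_s, now_s) - max(start_s, window_start)
        if overlap > 0:
            totals[terminal_id] = totals.get(terminal_id, 0.0) + overlap
    return totals


# Hypothetical segments; the window covers 100 s to 1300 s.
segments = [("11-1", 50.0, 160.0), ("11-2", 900.0, 1000.0), ("11-1", 1250.0, 1300.0)]
window_totals = cumulative_time(segments, now_s=1300.0)
print(window_totals)
```

Because the window slides with `now_s`, the totals, and therefore any ranking derived from them, change as the conference proceeds.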

The condition data "movement speed: 20 seconds" instructs the sound image position assignment unit 135, when it changes the sound image position assigned to the audio signal received from a given terminal device 11, to change the position gradually so that the move completes in 20 seconds. Condition data such as "movement time: 1 degree/second" may be used to give a similar instruction. That condition data instructs the sound image position assignment unit 135, when changing a sound image position, to move it gradually from the previously assigned position toward the newly assigned position at an angular velocity of 1 degree per second.
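Both styles of gradual movement can be sketched in Python. The text only requires that the position change gradually; linear interpolation, and the function names `gliding_angle` and `gliding_angle_fixed_rate`, are assumptions made for illustration.

```python
def gliding_angle(
    start_deg: float, target_deg: float, elapsed_s: float, move_time_s: float = 20.0
) -> float:
    """Angle after `elapsed_s` when the whole move must finish in `move_time_s`."""
    if elapsed_s <= 0:
        return start_deg
    if elapsed_s >= move_time_s:
        return target_deg
    return start_deg + (target_deg - start_deg) * (elapsed_s / move_time_s)


def gliding_angle_fixed_rate(
    start_deg: float, target_deg: float, elapsed_s: float, rate_deg_per_s: float = 1.0
) -> float:
    """Angle after `elapsed_s` when moving at a fixed angular velocity."""
    distance = target_deg - start_deg
    # Never travel further than the remaining distance to the target.
    travelled = min(abs(distance), rate_deg_per_s * max(elapsed_s, 0.0))
    return start_deg + (travelled if distance >= 0 else -travelled)


halfway = gliding_angle(-50.0, 50.0, elapsed_s=10.0)
after_five = gliding_angle_fixed_rate(10.0, -10.0, elapsed_s=5.0)
```

The fixed-duration variant makes long moves faster than short ones, while the fixed-rate variant keeps the perceived speed constant at the cost of longer transitions.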

The condition data "permanent holding position: position 4" instructs that, when an audio signal received by the audio signal receiving unit 131 was transmitted from a terminal device 11 other than those currently selected by the terminal device selection unit 133, the voice represented by that audio signal be placed at position 4. Unlike the temporary holding position described next, the permanent holding position is never used for a voice represented by an audio signal transmitted from a terminal device 11 that is selected by the terminal device selection unit 133.

The condition data "temporary holding position: position 3" instructs that, when an audio signal received by the audio signal receiving unit 131 was transmitted from a terminal device 11 other than those currently selected by the terminal device selection unit 133, the voice represented by that audio signal be placed at position 3, separately from position 4 indicated by the condition data "permanent holding position: position 4". That is, whereas the permanent holding position is a position originally reserved for audio signals transmitted from the terminal devices 11 selected by the terminal device selection unit 133, the temporary holding position is a position that audio signals transmitted from other terminal devices 11 are temporarily permitted to use.

The condition data "cumulative time ranking 1: position 6" instructs that position 6 be assigned to the terminal device 11 of the participant 19 who, according to the cumulative time data generated by the cumulative time data generating unit 132, spoke longest during the predetermined past period. Here, the predetermined past period is the 20 minutes indicated by the condition data "monitoring period: 20 minutes".
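The way ranking conditions and the permanent holding position could combine is sketched below in Python. Only the mapping "rank 1 to position 6" and the holding position 4 come from the text; the other entries of `RANK_TO_POSITION`, the tie-breaking rule, and the function name `assign_positions` are hypothetical, and the temporary holding position and priority handling are left out.

```python
from typing import Dict

HOLD_POSITION = 4  # From "permanent holding position: position 4".

# Only rank 1 -> position 6 is stated in the text; ranks 2 to 4 are hypothetical.
RANK_TO_POSITION: Dict[int, int] = {1: 6, 2: 1, 3: 5, 4: 2}


def assign_positions(
    cumulative_seconds: Dict[str, float],
    rank_to_position: Dict[int, int] = RANK_TO_POSITION,
    hold_position: int = HOLD_POSITION,
) -> Dict[str, int]:
    """Map each terminal to a position number based on its speaking-time rank."""
    # Rank 1 is the longest speaker; ties are broken by terminal ID (assumption).
    ranked = sorted(cumulative_seconds, key=lambda t: (-cumulative_seconds[t], t))
    assignment: Dict[str, int] = {}
    for rank, terminal_id in enumerate(ranked, start=1):
        # Terminals without a ranking rule share the permanent holding position.
        assignment[terminal_id] = rank_to_position.get(rank, hold_position)
    return assignment


totals = {"11-1": 310.0, "11-2": 45.5, "11-3": 620.0, "11-4": 0.0, "11-5": 120.0}
assignment = assign_positions(totals)
print(assignment)
```

In this sketch the selected terminals are exactly those whose rank has an entry in the table, so the table size plays the role of the predetermined number.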

The condition data “frequency characteristic similarity > 0.90 → interval 2 or more” instructs that voices whose frequency characteristics, as indicated by the frequency characteristic data generated by the frequency characteristic specifying unit 134, are similar to each other be assigned sound image positions separated by an interval of 2 or more. Here, the interval is the number of other positions lying between two positions, plus 1. For example, the interval between position 1 and position 4 is 3. The frequency characteristic similarity is an index of how similar the frequency characteristics of two audio signals are, and is calculated by the sound image position assignment unit 135 from the frequency characteristic data. Various indices could serve this purpose; in the following description, as an example, the extent to which the graphs of the frequency characteristics of the two audio signals coincide is used. More specifically, the sound image position assignment unit 135 calculates the frequency characteristic similarity as follows.
(a) From the frequency characteristic data, a graph of the frequency characteristic of each audio signal is drawn, with the sound pressure level (dB) on the vertical axis and the logarithm of the frequency (Hz) on the horizontal axis.
(b) The sound pressure level is raised or lowered uniformly over the entire frequency band so that the area of the figure enclosed by the graph drawn in (a) and the horizontal axis equals a predetermined value S. That is, the graph is shifted up or down along the vertical axis. This corresponds to raising or lowering the volume of the voice represented by the audio signal so that the overall volumes of different voices match.
(c) For the two audio signals, the graphs adjusted in (b) are superimposed, and the area s of the overlap between the figures that each graph encloses with the horizontal axis is calculated.
(d) R = s/S is calculated from the S used in (b) and the s calculated in (c), and this R is taken as the frequency characteristic similarity of the two audio signals.
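Steps (a) to (d) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the sound image position assignment unit 135. It assumes that both spectra are sampled at the same frequencies, that the shifted curves stay above the horizontal axis, and that areas are approximated with the trapezoidal rule. The function names and the default value of S are hypothetical.

```python
import math

def trapezoid(xs, ys):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))

def normalize(freqs_hz, levels_db, target_area):
    """Steps (a)-(b): use a log-frequency axis, then shift the curve
    vertically so that the area under it equals target_area (S)."""
    xs = [math.log10(f) for f in freqs_hz]
    width = xs[-1] - xs[0]
    offset = (target_area - trapezoid(xs, levels_db)) / width
    return xs, [level + offset for level in levels_db]

def similarity(freqs_hz, levels_a, levels_b, target_area=100.0):
    """Steps (c)-(d): R = s / S, where s is the area under the
    pointwise minimum of the two normalized curves."""
    xs, a = normalize(freqs_hz, levels_a, target_area)
    _, b = normalize(freqs_hz, levels_b, target_area)
    overlap = trapezoid(xs, [min(p, q) for p, q in zip(a, b)])
    return overlap / target_area
```

Because of the normalization in step (b), two voices whose spectra differ only in overall volume yield R = 1, while spectra of different shape yield a smaller value.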

R calculated as described above takes a value from 0 to 1, and a larger value indicates that the frequency characteristics of the two audio signals are more similar. Under the condition data “frequency characteristic similarity > 0.90 → interval 2 or more”, when the frequency characteristic similarity of two audio signals exceeds 0.90, sound image positions are assigned to the terminal devices 11 so that the voices represented by those audio signals are placed at an interval of 2 or more.

The condition data “terminal ID ‘0123’: position 1” instructs that position 1 be assigned to the terminal device 11 whose terminal ID, the identifier of a terminal device 11 on the network 10, is “0123”. That is, position 1 is reserved for the participant 19 who uses that specific terminal device 11, and is never assigned to the voice of another participant 19, however long or short that participant's speaking time. Such condition data is used, for example, for a conference chairperson, whose speaking time is not necessarily long but whose voice is conveniently heard by all listeners at the same sound image position at all times.

The condition data “default position: terminal ID order” instructs the sound image position assignment unit 135 to assign sound image positions to the voices represented by the audio signals according to the numerical order of the terminal IDs of the transmitting terminal devices 11. The priority associated with this condition data is “9”, which is lower than that of the other condition data. That is, this condition data specifies the condition used when the sound image position is not uniquely determined by the other condition data. For example, immediately after a conference starts, no cumulative time data has been generated yet, so condition data such as “cumulative time ranking 1: position 6” cannot determine the position of the first speaker's voice. In that case, under the condition data “default position: terminal ID order”, the voices are placed at position 1, position 2, and so on, in ascending order of the terminal IDs of the terminal devices 11 used by the participants 19 who have spoken.

[1.2. Operation of the audio conference system]
Next, the operation of the audio conference system 1 when a plurality of participants 19 hold a conference using it is described. First, each participant 19 operates his or her own terminal device 11 to establish a communication connection between that terminal device 11 and the audio signal processing server 13. The audio signal processing server 13 establishes a communication connection with each of the terminal devices 11, and these connections are distinguished by connection IDs assigned by the audio signal processing server 13. The method of establishing a communication connection is the same as in the prior art and is not described here. Because each terminal device 11 can establish a communication connection with the audio signal processing server 13 at any time, a participant 19 can join or leave the conference at any time.

When a communication connection is established between a terminal device 11 and the audio signal processing server 13, a data buffer 1141 is allocated in the storage unit 114 of the terminal device 11. The data buffer 1141 temporarily stores, on a FIFO (first-in, first-out) basis, a predetermined duration (for example, 5 seconds) of the audio signal received from the audio signal processing server 13. As described later, the audio signal that the terminal device 11 receives from the audio signal processing server 13 has two channels, left and right, so the capacity of the data buffer 1141 is set so that it can hold the predetermined duration of a two-channel audio signal.

Similarly, for each established communication connection, a data buffer 1392 is allocated in the storage unit 139 of the audio signal processing server 13. Each data buffer 1392 temporarily stores, on a FIFO basis, the audio signal received from the terminal device 11 for the duration indicated by the condition data “monitoring period: 20 minutes” in the condition database 1391 (see FIG. 2), that is, 20 minutes. These data buffers are identified by connection ID, and the audio signal processing server 13 stores audio signals received from different terminal devices 11 sequentially in different data buffers 1392. Because the audio signal that the audio signal processing server 13 receives from a terminal device 11 is monaural, the capacity of the data buffer 1392, unlike that of the data buffer 1141, is set so that it can hold 20 minutes of a one-channel audio signal.
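The sizing of the two kinds of buffer can be illustrated with a short Python sketch. The sample rate is not given in the description, so the 8000 Hz used here is purely an assumption, as are the function name and the choice to store interleaved samples in a `collections.deque`.

```python
from collections import deque

SAMPLE_RATE_HZ = 8000  # hypothetical; the description does not state a rate

def make_fifo(seconds, channels, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return a FIFO holding at most `seconds` of interleaved samples.
    Once full, appending a new sample discards the oldest one."""
    return deque(maxlen=int(seconds * sample_rate_hz) * channels)

# Terminal side (data buffer 1141): 5 seconds, two channels.
terminal_buffer = make_fifo(5, channels=2)
# Server side (data buffer 1392): 20 minutes, one channel, per connection.
server_buffer = make_fifo(20 * 60, channels=1)
```

Under these assumptions the terminal buffer holds 80,000 samples and each server buffer holds 9,600,000 samples.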

In addition, when the audio signal processing server 13 assigns a connection ID to each terminal device 11, it also obtains the terminal ID from that terminal device 11 and creates a correspondence table between the assigned connection IDs and the terminal IDs. FIG. 3 shows an example of the correspondence table of connection IDs and terminal IDs created in the audio signal processing server 13. Using this table, the audio signal processing server 13 can identify, from a connection ID, the terminal device 11 at the other end of the communication connection identified by that connection ID. For example, when terminal ID “0123” is associated with connection ID “0023”, the data buffer 1392 identified by connection ID “0023” stores the audio signal transmitted from the terminal device 11 identified by terminal ID “0123”.

Once a communication connection has been established between the terminal device 11 and the audio signal processing server 13 as described above, the participant 19 can speak using the headset 12 connected to the terminal device 11. The voice uttered by the participant 19 is converted into an analog audio signal by the microphone of the headset 12 and input to the audio signal processing unit 111 of the terminal device 11. The audio signal processing unit 111 converts the analog audio signal received from the headset 12 into a digital audio signal and then generates data packets containing the converted audio signal.

FIG. 4 schematically shows how an audio signal is carried in a plurality of data packets. The audio signal processing unit 111 divides the audio signal, from its beginning, into audio signal blocks of a predetermined data length. It prepends to each audio signal block a block number indicating the order of the blocks. It then prepends, before the block number, a connection ID, a source ID, and a destination ID in sequence. The source ID is the terminal ID of the terminal device 11, and the destination ID is the ID of the audio signal processing server 13 on the network 10 (hereinafter, “server ID”). Finally, the audio signal processing unit 111 adds HOD (Head of Data) and EOD (End of Data), which mark the boundaries of a data sequence, before and after the audio signal block to which the block number and other fields have been attached. The resulting sequence of data, beginning with HOD and ending with EOD, is a data packet.
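The packet layout can be illustrated with the following Python sketch. The description fixes only the fields and that HOD and EOD enclose them; the field order shown (HOD, connection ID, source ID, destination ID, block number, audio block, EOD), the field widths, the marker byte values, and the server ID “9000” in the usage note are all assumptions made for illustration.

```python
import struct

HOD = b"\x01"  # hypothetical Head-of-Data marker
EOD = b"\x04"  # hypothetical End-of-Data marker
# connection ID, source ID, destination ID (4 ASCII digits each), block number
HEADER = struct.Struct(">4s4s4sI")

def make_packets(audio, block_len, connection_id, source_id, destination_id):
    """Split `audio` (bytes) into numbered blocks and wrap each in a packet."""
    packets = []
    for number, start in enumerate(range(0, len(audio), block_len)):
        block = audio[start:start + block_len]
        header = HEADER.pack(connection_id.encode(), source_id.encode(),
                             destination_id.encode(), number)
        packets.append(HOD + header + block + EOD)
    return packets

def parse_packet(packet):
    """Return (connection ID, source ID, destination ID, block number, block)."""
    if packet[:1] != HOD or packet[-1:] != EOD:
        raise ValueError("missing HOD/EOD marker")
    conn, src, dst, number = HEADER.unpack(packet[1:1 + HEADER.size])
    return conn.decode(), src.decode(), dst.decode(), number, packet[1 + HEADER.size:-1]
```

For example, `make_packets(b"abcdefghij", 4, "0023", "0123", "9000")` yields three packets, the last of which carries block number 2 and the two remaining bytes.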

The audio signal processing unit 111 passes the data packets generated in this way to the audio signal transmission unit 112 one after another, and the audio signal transmission unit 112 sends them out to the network 10 in turn. Each relay device in the network 10 stores a routing table indicating the communication paths by which the communication device identified on the network 10 by a destination ID can be reached. Based on the destination ID contained in a data packet sent from the terminal device 11, the relay device forwards the packet, according to the routing table, to the adjacent relay device on a path that reaches the communication device identified by that destination ID. As a result, the data packet is delivered to the audio signal processing server 13. The method of updating the routing table and related matters are the same as in the prior art and are not described here.

A data packet delivered to the audio signal processing server 13 via the network 10 as described above is received by the audio signal receiving unit 131 of the audio signal processing server 13. Using the connection ID contained in the received data packet, the audio signal receiving unit 131 stores the audio signal block contained in the packet in the data buffer 1392 identified by that connection ID, in the area corresponding to the block number contained in the packet. Because the data packets sent from a terminal device 11 may travel over different communication paths in the network 10, the audio signal processing server 13 does not necessarily receive them in the order in which they were sent. However, since the audio signal receiving unit 131 stores the audio signal blocks in the data buffer 1392 in block-number order, the sequence of audio signals stored in the data buffer 1392 reproduces the audio signal as it was before the terminal device 11 divided it into data packets. When some data packets fail to reach the audio signal processing server 13 for any reason, the server performs processing such as interpolating the audio signal contained in the missing packets from the preceding and following audio signals. Since such processing is the same as in the prior art, its description is omitted.
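Storing each block at a position derived from its block number can be sketched as follows in Python. This is a simplified illustration: it uses a growing linear `bytearray` rather than a FIFO, assumes every block except the last has the same length, and leaves gaps from missing packets as zero bytes instead of interpolating them. The function name is hypothetical.

```python
def store_block(buffer, block_number, block, block_len):
    """Write `block` into `buffer` (a bytearray) at the offset implied by
    its block number, so that arrival order does not matter."""
    start = block_number * block_len
    end = start + len(block)
    if len(buffer) < end:
        # Zero-fill up to the end of this block; blocks that have not
        # arrived yet remain zero until they are written.
        buffer.extend(bytes(end - len(buffer)))
    buffer[start:end] = block
```

Blocks that arrive in the order 2, 0, 1 therefore still produce the original byte sequence in the buffer.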

As described above, the audio signal receiving unit 131 stores the audio signals received from each of the terminal devices 11, covering a predetermined past period, in the data buffer 1392 identified by the corresponding connection ID. However, the audio signal stored in each data buffer 1392 also contains the sound from periods when the participant 19 is not speaking, that is, a signal representing near silence. The cumulative time data generation unit 132 therefore extracts, at predetermined time intervals, the portions of the audio signal stored in each data buffer 1392 in which the absolute value of the amplitude exceeds a predetermined threshold, and generates cumulative time data indicating the duration of the extracted portions. In the following, as an example, the cumulative time data generation unit 132 generates cumulative time data at 5-second intervals. Each item of cumulative time data is associated with the connection ID identifying the data buffer 1392 that held the audio signal from which it was generated. That is, the cumulative time data indicates the cumulative speaking time, within the predetermined past period, of the participant 19 at the terminal device 11 identified by the associated connection ID.

When the cumulative time data generation unit 132 extracts the portions representing speech by the participant 19 from the audio signal stored in the data buffer 1392, it may, for example, regard speech as continuing for a predetermined time (for example, 2 seconds) after the amplitude of the audio signal exceeds the predetermined threshold, even if the amplitude falls to or below the threshold during that time. In this way, even when the participant 19 pauses briefly between words, the whole run of speech is reflected in the cumulative time data.
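The threshold test and the hold time can be sketched in Python as follows. This is a minimal illustration assuming the buffer contents are available as a sequence of sample values; the function and parameter names are hypothetical, and the threshold and sample rate are left to the caller because the description does not specify them.

```python
def cumulative_speech_seconds(samples, sample_rate_hz, threshold, hold_seconds=2.0):
    """Total time during which the signal counts as speech.

    A sample counts as speech if its absolute amplitude exceeds `threshold`,
    or if it falls within `hold_seconds` after the most recent such sample.
    """
    hold_samples = int(round(hold_seconds * sample_rate_hz))
    remaining = 0  # samples left in the current hold period
    active = 0     # samples counted as speech so far
    for sample in samples:
        if abs(sample) > threshold:
            remaining = hold_samples
            active += 1
        elif remaining > 0:
            remaining -= 1
            active += 1
    return active / sample_rate_hz
```

With `hold_seconds=0` the function counts only the samples above the threshold, which corresponds to the basic extraction without the hold time.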

Alternatively, the terminal device 11 may refrain from transmitting to the audio signal processing server 13 the audio signal for periods in which the participant 19 is not speaking, and the cumulative time data generation unit 132 may simply accumulate the duration of the voice represented by the audio signals received from the terminal device 11. In that case, the audio signal transmission unit 112 of the transmitting terminal device 11 need only send to the network 10, as data packets, the portions of the audio signal whose amplitude exceeds the predetermined threshold, which reduces the amount of data traffic.

Having generated cumulative time data at 5-second intervals as described above, the cumulative time data generation unit 132 passes the generated cumulative time data to the terminal device selection unit 133 and the sound image position assignment unit 135.

On receiving cumulative time data from the cumulative time data generation unit 132, the terminal device selection unit 133 selects, based on the received data and in accordance with the condition data in the condition database 1391 stored in the storage unit 139, a predetermined number of the terminal devices 11 used by the participants 19, in descending order of the participants' cumulative speaking time over the predetermined past period. According to the condition database 1391 illustrated in FIG. 2, the total number of sound image positions is 6, but position 4 is held as the permanent holding position and position 1 is reserved for the terminal device 11 with terminal ID “0123”. The terminal device selection unit 133 therefore selects four terminal devices 11, starting with that of the participant 19 with the longest cumulative time indicated by the cumulative time data. However, if the four terminal devices 11 selected in this way include the one with terminal ID “0123”, position 1 is already reserved for that terminal device 11, so the terminal device selection unit 133 selects one more terminal device 11.

More specifically, the terminal device selection unit 133 selects terminal devices 11 by sorting the cumulative time data in descending order of the cumulative time indicated, taking the top entries, and selecting the connection IDs associated with those entries. The terminal device selection unit 133 arranges the selected connection IDs, for example in descending order of the corresponding cumulative time, and passes them as selection data to the frequency characteristic specifying unit 134 and the sound image position assignment unit 135. Because the terminal device selection unit 133 receives cumulative time data at 5-second intervals, it likewise passes selection data to the frequency characteristic specifying unit 134 and the sound image position assignment unit 135 at 5-second intervals.
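The selection can be sketched in Python as follows. The sketch skips any connection whose terminal already has a reserved position and then takes the top entries, which gives the same four non-reserved connections as selecting four and then one more. The function name, the cumulative times, and the connection ID “0050” with terminal ID “0555” are invented for illustration; the other IDs follow the examples in the description.

```python
def select_connections(cumulative_seconds, terminal_of, reserved_terminals, count):
    """Return up to `count` connection IDs in descending order of cumulative
    speaking time, skipping terminals that already have a reserved position."""
    ranked = sorted(cumulative_seconds, key=cumulative_seconds.get, reverse=True)
    eligible = [conn for conn in ranked if terminal_of[conn] not in reserved_terminals]
    return eligible[:count]

# Six positions, minus the permanent holding position, minus the reserved one.
count = 6 - 1 - 1
cumulative = {"0023": 300.0, "0034": 540.0, "0015": 410.0,
              "0004": 220.0, "0009": 150.0, "0050": 30.0}
terminal_of = {"0023": "0123", "0034": "0278", "0015": "0301",
               "0004": "0041", "0009": "0084", "0050": "0555"}
selection = select_connections(cumulative, terminal_of, {"0123"}, count)
```

With these illustrative figures, `selection` is the list of connection IDs “0034”, “0015”, “0004”, “0009”, in that order.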

On receiving selection data from the terminal device selection unit 133, the frequency characteristic specifying unit 134 performs, for each connection ID contained in the selection data, a frequency analysis of the audio signal stored in the associated data buffer 1392 and generates frequency characteristic data indicating the relationship between each frequency and the sound pressure level of that frequency component in the audio signal. The frequency characteristic specifying unit 134 passes the generated frequency characteristic data to the sound image position assignment unit 135. This handover also takes place at 5-second intervals.

On receiving the selection data from the terminal device selection unit 133 and, from the frequency characteristic specifying unit 134, the frequency characteristic data corresponding to each connection ID in the selection data, the sound image position assignment unit 135 assigns sound image positions to the terminal devices 11 as follows, based on the received data and in accordance with the condition data in the condition database 1391.

For example, suppose the selection data is “0034, 0015, 0004, 0009” and the terminal IDs associated with these connection IDs (see FIG. 3) are “0278, 0301, 0041, 0084”. In accordance with the condition data “cumulative time ranking 1: position 6”, the sound image position assignment unit 135 assigns position 6 to connection ID “0034”, which corresponds to terminal ID “0278”, ranked first in cumulative time. Similarly, position 2 is assigned to connection ID “0015”, position 5 to connection ID “0004”, and position 3 to connection ID “0009”.

FIG. 5 schematically shows which participant's voice is placed at which position as a result of the sound image position assignment unit 135 assigning positions to connection IDs as described above. In FIG. 5, the voice of the participant 19 using the terminal device 11 with terminal ID “0123” is fixed at position 1. The voices of the participants 19 using the terminal devices 11 with terminal IDs “0301”, “0084”, “0041”, and “0278” are placed at positions 2, 3, 5, and 6, respectively; these placements are determined provisionally from the cumulative speaking times of the participants 19 and change dynamically. Position 4 receives the voice of any participant 19 who speaks using a terminal device 11 that has not been assigned any of positions 1 to 3, 5, or 6 (there may be one or more such terminal devices 11). Furthermore, when two participants 19 using different terminal devices 11, neither of which has been assigned any of positions 1 to 3, 5, or 6, speak at the same time, position 3 temporarily receives, for example, the voice of whichever of the two started speaking later.

Next, in accordance with the condition data “frequency characteristic similarity > 0.90 → interval 2 or more”, the sound image position assignment unit 135 makes any necessary corrections to the positions assigned to the connection IDs as described above. Specifically, it takes the frequency characteristic data corresponding to the connection IDs assigned to the positions, two at a time, and calculates the frequency characteristic similarity of each pair by the method already described. If any position assignment violates the condition indicated by the condition data “frequency characteristic similarity > 0.90 → interval 2 or more”, the sound image position assignment unit 135 changes the assignment so that the condition is satisfied.

For example, suppose the frequency characteristic similarity calculated from the frequency characteristic data for connection ID “0015”, which is assigned position 2, and for connection ID “0009”, which is assigned position 3, is 0.95. The sound image position assignment unit 135 then, for example, assigns position 5 to connection ID “0009” and assigns position 3 to connection ID “0004”, which had been assigned position 5, so that the audio signals at adjacent positions are not similar beyond a certain degree. If, however, no assignment of positions to the connection IDs can satisfy the condition data “frequency characteristic similarity > 0.90 → interval 2 or more”, the sound image position assignment unit 135 assigns positions according to a predetermined rule, for example by choosing the arrangement that minimizes the frequency characteristic similarity between the audio signals of connection IDs at adjacent positions.
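One way to carry out such a correction is an exhaustive search over the possible rearrangements, sketched below in Python. The description does not say how the sound image position assignment unit 135 searches for a valid arrangement, so the brute-force search, the preference for moving as few connections as possible, and all function names are assumptions. The fallback rule for the case where no arrangement satisfies the condition is not implemented; the sketch returns `None` instead.

```python
from itertools import permutations

def interval(p, q):
    """Number of positions lying between p and q, plus 1 (for p != q)."""
    return abs(p - q)

def violates(assignment, similarity, threshold=0.90, min_interval=2):
    """True if any pair with similarity above the threshold is too close."""
    ids = list(assignment)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = similarity.get(frozenset((a, b)), 0.0)
            if sim > threshold and interval(assignment[a], assignment[b]) < min_interval:
                return True
    return False

def repair(assignment, similarity):
    """Return a valid rearrangement that moves the fewest connections,
    or None if no rearrangement satisfies the condition."""
    if not violates(assignment, similarity):
        return dict(assignment)
    ids = list(assignment)
    positions = [assignment[conn] for conn in ids]
    candidates = []
    for perm in permutations(positions):
        candidate = dict(zip(ids, perm))
        if not violates(candidate, similarity):
            moved = sum(candidate[conn] != assignment[conn] for conn in ids)
            candidates.append((moved, candidate))
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[0])[1]
```

For the example above, calling `repair` on the assignment 0015 → 2, 0009 → 3, 0004 → 5, 0034 → 6 with a similarity of 0.95 between “0015” and “0009” swaps the positions of “0009” and “0004”, as in the description.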

Having assigned positions to the connection IDs as described above, the sound image position assignment unit 135 generates sound image position assignment data indicating the connection ID assigned to each position. For example, for the position assignment illustrated in FIG. 5, the sound image position assignment data generated by the sound image position assignment unit 135 is [0023: −50 degrees, 0015: −30 degrees, 0009 (temporary holding): −10 degrees, (permanent holding): +10 degrees, 0004: +30 degrees, 0034: +50 degrees]. The sound image position assignment unit 135 passes the generated sound image position assignment data to the audio signal processing unit 136.
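The example data can be reproduced with the Python sketch below. The rule that position n maps to −50 + 20 × (n − 1) degrees is inferred from the six angles in the example and is not stated as a rule in the description; the dictionary layout and function names are likewise hypothetical.

```python
def position_angle(position, first_angle=-50, step=20):
    """Angle in degrees for a position number (inferred from the example)."""
    return first_angle + step * (position - 1)

def build_assignment_data(position_of, permanent_hold=4, total_positions=6):
    """Return {position: (connection ID or None, angle in degrees)}.
    The permanent holding position never carries a connection ID."""
    by_position = {pos: conn for conn, pos in position_of.items()}
    data = {}
    for pos in range(1, total_positions + 1):
        conn = None if pos == permanent_hold else by_position.get(pos)
        data[pos] = (conn, position_angle(pos))
    return data

data = build_assignment_data(
    {"0023": 1, "0015": 2, "0009": 3, "0004": 5, "0034": 6})
```

Here `data[2]` pairs connection ID “0015” with −30 degrees, and `data[4]` has no connection ID and an angle of +10 degrees.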

The sound image position assignment unit 135 normally passes sound image position assignment data to the audio signal processing unit 136 at 5-second intervals. However, as described later, when the assignment of sound image positions needs to be changed, the sound image position assignment unit 135 generates and passes the sound image position assignment data at shorter intervals so that the sound image position assigned to the affected connection ID moves gradually.

音声信号加工部１３６は記憶部１３９に確保されている全てのデータバッファ１３９２に関し、新たに書き込まれる音声信号の振幅を常時監視し、振幅の絶対値が所定の閾値を超えた場合、そのデータバッファ１３９２に対応する参加者１９の発言があったものと判断して、その音声信号に対し音像位置の調整処理を行う。音声信号加工部１３６は、音声信号に対する音像位置の調整処理を行う場合、音像位置割当部１３５から最後に受け取った音像位置割当データに従い、その音像位置を決定する。 The audio signal processing unit 136 constantly monitors the amplitude of the newly written audio signal for all the data buffers 1392 secured in the storage unit 139, and if the absolute value of the amplitude exceeds a predetermined threshold, the data buffer It is determined that there is a statement from the participant 19 corresponding to 1392, and the sound image position adjustment process is performed on the audio signal. The audio signal processing unit 136 determines the sound image position according to the sound image position allocation data received last from the sound image position allocation unit 135 when performing the adjustment process of the sound image position with respect to the audio signal.

For example, suppose the participant 19 with terminal ID “0301” speaks. As a result, the audio signal written to the data buffer 1392 identified by connection ID “0015” (see FIG. 3), which corresponds to terminal ID “0301”, exceeds the predetermined threshold. The audio signal processing unit 136 then searches the most recently received sound image position assignment data for connection ID “0015”. Assuming the sound image positions currently assigned by the sound image position assignment unit 135 are those illustrated in FIG. 5, the audio signal processing unit 136 finds connection ID “0015” in that data and finds that “−30 degrees” is associated with it. The audio signal processing unit 136 therefore processes the audio signal written to that data buffer 1392 so that its sound image is located at the position indicated by “−30 degrees”, that is, to the front left. Specifically, the audio signal processing unit 136 generates two copies of the audio signal, multiplies one by an amplification factor a (a > 1) to increase its amplitude, and multiplies the other by an amplification factor b (b < 1) to decrease its amplitude. The two amplitude-adjusted signals are used as the left-channel and right-channel audio signals, respectively, which places the sound image of the audio signal written to the data buffer 1392 to the front left.
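The amplitude adjustment described above can be illustrated with a minimal sketch. The specification only states that one copy is multiplied by a factor a > 1 and the other by a factor b < 1; the linear mapping from angle to gain used here, and the parameters `max_angle` and `depth`, are assumptions made for illustration.

```python
def pan_gains(angle_deg, max_angle=90.0, depth=0.5):
    """Return (left_gain, right_gain) for a sound image angle.

    Negative angles are to the left, positive angles to the right.
    The linear gain law is an assumed example, not the patented method.
    """
    x = max(-1.0, min(1.0, angle_deg / max_angle))
    left = 1.0 - depth * x   # greater than 1 when the image is on the left
    right = 1.0 + depth * x  # less than 1 when the image is on the left
    return left, right


def pan(samples, angle_deg):
    """Duplicate a mono signal and scale each copy to form left/right channels."""
    left_gain, right_gain = pan_gains(angle_deg)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    return left, right


# Connection ID "0015" at -30 degrees: left copy amplified, right copy attenuated.
a, b = pan_gains(-30.0)
print(a, b)
```

With this mapping an angle of 0 degrees leaves both channels unchanged, and the two gains always sum to 2, so the overall level stays roughly constant as the image moves.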

Similarly, suppose the participant 19 with terminal ID “0075” speaks. As a result, the audio signal written to the data buffer 1392 identified by connection ID “0021” (see FIG. 3), which corresponds to terminal ID “0075”, exceeds the predetermined threshold. The audio signal processing unit 136 then searches the most recently received sound image position assignment data for connection ID “0021”. If the sound image positions assigned by the sound image position assignment unit 135 are those illustrated in FIG. 5, the audio signal processing unit 136 cannot find connection ID “0021” in that data. In that case, it follows the “+10 degrees” associated with (permanent hold) in the sound image position assignment data and processes the audio signal written to the data buffer 1392 identified by connection ID “0021” so that its sound image is located at the position indicated by “+10 degrees”, that is, slightly to the right of front.
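The speech detection and lookup steps in the two examples above can be sketched as follows. The dictionary layout, the `PERMANENT_HOLD` key, and the function names are hypothetical; only the connection IDs and angles come from the FIG. 5 example.

```python
PERMANENT_HOLD = "permanent_hold"  # hypothetical key for the permanently reserved position

# Sound image position assignment data from the FIG. 5 example (degrees).
FIG5_ASSIGNMENT = {
    "0023": -50.0,
    "0015": -30.0,
    "0009": -10.0,  # temporary hold
    PERMANENT_HOLD: 10.0,
    "0004": 30.0,
    "0034": 50.0,
}


def exceeds_threshold(samples, threshold):
    """True if any newly written sample has an absolute amplitude above the threshold."""
    return any(abs(s) > threshold for s in samples)


def angle_for(conn_id, assignment):
    """Angle for a connection ID, falling back to the permanently reserved position."""
    return assignment.get(conn_id, assignment[PERMANENT_HOLD])


print(angle_for("0015", FIG5_ASSIGNMENT))  # listed connection ID
print(angle_for("0021", FIG5_ASSIGNMENT))  # unlisted connection ID uses the reserved position
```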

音声信号加工部１３６は、上記のように振幅が所定の閾値を超えた音声信号に対し音像位置の調整処理を行い、左右２チャンネル、すなわちステレオとなった音声信号をミキシング部１３７に引き渡す。ミキシング部１３７は、音声信号加工部１３６から１人の参加者１９の音声を示すステレオの音声信号を受け取ると、受け取ったステレオの音声信号をそのまま音声信号送信部１３８に引き渡す。また、ミキシング部１３７は、音声信号加工部１３６から２人以上の参加者１９の音声を示すステレオの音声信号を受け取ると、受け取った音声信号を左右各々のチャンネル毎にミキシングし、ミキシングにより生成したステレオの合成音声信号を音声信号送信部１３８に引き渡す。また、ミキシング部１３７は、音声信号加工部１３６から音声信号を受け取らない間は、振幅０を示すステレオの音声信号を生成し、生成した音声信号を音声信号送信部１３８に引き渡す。 The audio signal processing unit 136 performs sound image position adjustment processing on the audio signal whose amplitude exceeds a predetermined threshold as described above, and delivers the audio signal in the left and right channels, that is, in stereo, to the mixing unit 137. When the mixing unit 137 receives a stereo audio signal indicating the audio of one participant 19 from the audio signal processing unit 136, the mixing unit 137 delivers the received stereo audio signal as it is to the audio signal transmission unit 138. When the mixing unit 137 receives a stereo audio signal indicating the audio of two or more participants 19 from the audio signal processing unit 136, the mixing unit 137 mixes the received audio signal for each of the left and right channels, and generates the audio signal by mixing. The stereo synthesized audio signal is delivered to the audio signal transmission unit 138. In addition, the mixing unit 137 generates a stereo audio signal having an amplitude of 0 while receiving no audio signal from the audio signal processing unit 136, and delivers the generated audio signal to the audio signal transmission unit 138.
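The three cases handled by the mixing unit 137 (one speaker, several speakers, no speaker) reduce to a per-channel sum, as in the following minimal sketch. The function name `mix` and the list-based block representation are assumptions for illustration, and clipping or level normalization is deliberately left out.

```python
def mix(stereo_signals, block_len):
    """Sum stereo blocks channel by channel.

    `stereo_signals` is a list of (left, right) sample lists, each of length
    `block_len`. An empty list yields a silent (all-zero) stereo block.
    """
    left = [0.0] * block_len
    right = [0.0] * block_len
    for sig_left, sig_right in stereo_signals:
        for i in range(block_len):
            left[i] += sig_left[i]
            right[i] += sig_right[i]
    return left, right


silence = mix([], 4)                                  # nobody is speaking
single = mix([([0.5, 0.25], [0.125, 0.0])], 2)        # one speaker passes through unchanged
both = mix([([0.5, 0.25], [0.125, 0.0]),
            ([0.25, 0.5], [0.125, 0.5])], 2)          # two speakers are summed per channel
print(silence, single, both)
```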

On receiving the stereo audio signal from the mixing unit 137, the audio signal transmission unit 138 generates data packets containing it. FIG. 6 schematically shows how the audio signal transmission unit 138 generates data packets. Unlike the data packets generated by the audio signal transmission unit 112 (see FIG. 4), these data packets contain, in place of an audio signal block, a left audio signal block and a right audio signal block obtained by dividing the left-channel and right-channel audio signals, respectively, into pieces of a predetermined data length. In addition, because the audio signal processing server 13 must transmit the audio signal to the terminal devices 11 of all participants 19 taking part in the conference at that time, the audio signal transmission unit 138 simultaneously generates, for every communication connection established at that time, a data packet whose header contains the connection ID identifying that connection. The destination ID in each data packet is the terminal ID corresponding to the connection ID (see FIG. 3).

音声信号送信部１３８は、上記のように生成したデータパケットを順次ネットワーク１０に送出する。ネットワーク１０に送出されたデータパケットは各々に含まれる送信先ＩＤに基づき、各々の端末装置１１に送り届けられる。各端末装置１１の音声信号受信部１１３はデータパケットを受信すると、受信したデータパケットに含まれる左音声信号ブロックおよび右音声信号ブロックを、ブロック番号に応じたデータバッファ１１４１の領域に順次書き込んでゆく。その一方で、音声信号処理部１１１はデータバッファ１１４１に書き込まれた音声信号を左右の各々に関し順次、アナログ音声信号に変換し、ヘッドセット１２のヘッドフォンの左入力および右入力にそれぞれ出力する。ヘッドセット１２のヘッドフォンは、左入力および右入力に入力された音声信号を各々音に変換し発音する。ヘッドセット１２から発音される音は、異なる場所にいる参加者１９の各々の音声をミキシングしたものであるが、異なる参加者１９の発言による音声が異なる音像位置に配置された音であるため、同時に複数の参加者１９が発言を行った場合であっても、聞き手はそれらの発言を容易に判別することができる。 The audio signal transmission unit 138 sequentially transmits the data packets generated as described above to the network 10. The data packet sent to the network 10 is sent to each terminal device 11 based on the destination ID included in each data packet. When the audio signal receiving unit 113 of each terminal device 11 receives the data packet, the left audio signal block and the right audio signal block included in the received data packet are sequentially written in the area of the data buffer 1141 corresponding to the block number. . On the other hand, the audio signal processing unit 111 sequentially converts the audio signal written in the data buffer 1141 into an analog audio signal for each of the left and right, and outputs them to the left input and the right input of the headphones of the headset 12, respectively. The headphones of the headset 12 convert the sound signals input to the left input and the right input into sounds and generate sounds. The sound generated from the headset 12 is a sound obtained by mixing the voices of the participants 19 in different places, but the voices produced by the voices of the different participants 19 are arranged at different sound image positions. Even when a plurality of participants 19 speak at the same time, the listener can easily discriminate those comments.

ところで、上述したように、累積時間データ生成部１３２は５秒間隔で累積時間データを生成し、端末装置選択部１３３は５秒間隔で選択データを生成する。周波数特性特定部１３４は５秒間隔で、選択データに含まれるコネクションＩＤの各々に対応する音声信号に関する周波数特性データを生成する。従って、音像位置割当部１３５は５秒間隔で、端末装置選択部１３３から選択データを、また周波数特性特定部１３４から周波数特性データを受け取り、それらに基づき音像位置の割当処理を行う。 By the way, as described above, the accumulated time data generation unit 132 generates accumulated time data at intervals of 5 seconds, and the terminal device selection unit 133 generates selection data at intervals of 5 seconds. The frequency characteristic specifying unit 134 generates frequency characteristic data related to the audio signal corresponding to each connection ID included in the selection data at intervals of 5 seconds. Therefore, the sound image position assignment unit 135 receives selection data from the terminal device selection unit 133 and frequency characteristic data from the frequency characteristic specifying unit 134 at intervals of 5 seconds, and performs sound image position assignment processing based on them.

When the assignment of sound image positions has not changed, the sound image position assignment unit 135 generates sound image position assignment data with the same content at 5-second intervals and delivers it to the audio signal processing unit 136. When the assignment needs to be changed, the sound image position assignment unit 135 instead generates sound image position assignment data at an interval shorter than 5 seconds, for example 0.1 seconds, so that the sound image position of the affected connection ID moves gradually from its position before the change to its position after the change, and delivers that data to the audio signal processing unit 136.

図７は、図５に例示した音像位置の割当が行われている状態で、コネクションＩＤ「００２１」で特定される通信コネクションを用いた端末装置１１の参加者１９が頻繁に発言を行った結果、過去２０分間の累積発言時間のランキングに変動が生じた場合に、音像位置割当部１３５がコネクションＩＤ「００２１」に対し音像位置を割り当てる様子を模式的に示した図である。図７（ａ）は、累積時間データにより示される発言時間の累積値のランキングに変更が生じる前の状態を示している。 FIG. 7 shows the result of frequent remarks made by the participant 19 of the terminal device 11 using the communication connection specified by the connection ID “0021” in the state in which the sound image position exemplified in FIG. 5 is assigned. FIG. 11 is a diagram schematically showing how the sound image position assigning unit 135 assigns a sound image position to a connection ID “0021” when a change occurs in the ranking of the accumulated speech time for the past 20 minutes. FIG. 7A shows a state before a change occurs in the ranking of the accumulated value of the speech time indicated by the accumulated time data.

When, in the state of FIG. 7(a), the participant 19 corresponding to connection ID “0021” rises to fourth place in the ranking of cumulative speech time, position 3, which had been assigned to connection ID “0009”, is assigned to connection ID “0021”. In that case, in accordance with the condition data “movement time: 20 seconds” included in the condition database 1391 (see FIG. 2), the sound image position assignment unit 135 assigns to connection ID “0021” a sound image position that moves gradually from position 4 to position 3 at an angular velocity of 1 degree per second (20 degrees ÷ 20 seconds = 1 degree/second). More specifically, the sound image position assignment unit 135 delivers the following sound image position assignment data to the audio signal processing unit 136 in sequence at 0.1-second intervals.
[0023: -50 degrees, 0015: -30 degrees, 0021 (temporary hold): +9.9 degrees, (permanent hold): +10 degrees, 0004: +30 degrees, 0034: +50 degrees]
[0023: -50 degrees, 0015: -30 degrees, 0021 (temporary hold): +9.8 degrees, (permanent hold): +10 degrees, 0004: +30 degrees, 0034: +50 degrees]
[0023: -50 degrees, 0015: -30 degrees, 0021 (temporary hold): +9.7 degrees, (permanent hold): +10 degrees, 0004: +30 degrees, 0034: +50 degrees]
[0023: -50 degrees, 0015: -30 degrees, 0021 (temporary hold): +9.6 degrees, (permanent hold): +10 degrees, 0004: +30 degrees, 0034: +50 degrees]
(and so on)
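The sequence of angles delivered for connection ID “0021” follows from the movement time and the delivery interval: 20 seconds at 0.1-second intervals gives 200 steps of 0.1 degree each. The sketch below reproduces that sequence with linear interpolation; the function name `glide_angles` and the rounding to six decimal places are assumptions for illustration.

```python
def glide_angles(start_deg, end_deg, move_time_s=20.0, step_s=0.1):
    """Angles delivered at each step while a sound image glides from start to end."""
    steps = round(move_time_s / step_s)
    span = end_deg - start_deg
    return [round(start_deg + span * k / steps, 6) for k in range(1, steps + 1)]


# Connection ID "0021" moving from position 4 (+10 degrees) to position 3 (-10 degrees).
angles = glide_angles(10.0, -10.0)
print(angles[:4])   # first four deliveries
print(angles[99])   # 10 seconds after the move starts
print(angles[-1])   # move complete
```

Setting `move_time_s` equal to `step_s` yields a single step that lands directly on the new position, which corresponds to the immediate-change variant.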

As described above, the sound image position of connection ID “0021”, which has newly risen to fourth place in the ranking, is not immediately set to “−10 degrees”, the angle of position 3. Instead, an angle that changes at 0.1-second intervals is specified so that the sound image moves gradually to the left from “+10 degrees”, its previous position, toward “−10 degrees”. The audio signal processing unit 136 receives this sound image position assignment data in sequence at 0.1-second intervals and processes the audio signal so that the voice of the participant 19 corresponding to connection ID “0021” is placed at the sound image position given by the received data. As a result, in the audio signal mixed by the mixing unit 137, that participant's voice moves over time from slightly right of front to slightly left of front. FIG. 7(b) shows the state 10 seconds after the sound image position starts moving, and FIG. 7(c) shows the state after the movement is complete.

Because the sound image position changes smoothly in this way, the voice of a given speaker never jumps suddenly from one sound image position to another, which makes it easier for a listener to tell whether an utterance comes from the same speaker as before or from a different one. Whether the sound image position changes smoothly or quickly can nevertheless be set freely by changing the parameter of the condition data in the condition database 1391 that specifies the movement speed or movement time. For instance, the sound image position may be moved to its new position immediately, for example by omitting condition data that specifies a movement speed or movement time from the condition database 1391.

As described above, with the audio conference system 1, even when a large number of participants 19 take part in the conference, a predetermined number of sound image positions are assigned as appropriate to the voices of different participants 19. The spacing between adjacent sound image positions therefore never becomes too narrow, and listeners are prevented from confusing the voices of different participants 19 placed at adjacent sound image positions.

なお、一度ポジションを確定したらランキング外になるまで、そのままのポジションを維持するようにしてもよい。 Note that once the position is confirmed, the position may be maintained until it is out of the ranking.

ところで、音声信号処理サーバ１３および端末装置１１は、専用のハードウェアにより実現されてもよいし、音声信号の入出力が可能な汎用コンピュータにアプリケーションプログラムに従った処理を実行させることにより実現されてもよい。音声信号処理サーバ１３が汎用コンピュータにより実現される場合、累積時間データ生成部１３２、端末装置選択部１３３、周波数特性特定部１３４、音像位置割当部１３５、音声信号加工部１３６およびミキシング部１３７は、汎用コンピュータが備えるＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）およびＣＰＵの制御下で動作するＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）が、アプリケーションプログラムに含まれる各モジュールに従った処理を同時並行して行うことにより、汎用コンピュータの機能として実現される。また、音声信号処理サーバ１３の音声信号受信部１３１および音声信号送信部１３８は、汎用コンピュータがデータパケットをネットワーク１０との間で送受信するために備える入出力インタフェースと、アプリケーションプログラムに含まれる各モジュールに従ったデータパケットの生成および組み立てに関するＣＰＵの処理により、汎用コンピュータの機能として実現される。 By the way, the audio signal processing server 13 and the terminal device 11 may be realized by dedicated hardware or by causing a general-purpose computer capable of inputting and outputting audio signals to execute processing according to an application program. Also good. When the audio signal processing server 13 is realized by a general-purpose computer, the accumulated time data generation unit 132, the terminal device selection unit 133, the frequency characteristic identification unit 134, the sound image position allocation unit 135, the audio signal processing unit 136, and the mixing unit 137 include: Functions of the general-purpose computer are achieved by a CPU (Central Processing Unit) provided in the general-purpose computer and a DSP (Digital Signal Processor) operating under the control of the CPU concurrently performing processing according to each module included in the application program. As realized. The audio signal receiving unit 131 and the audio signal transmitting unit 138 of the audio signal processing server 13 include an input / output interface provided for a general-purpose computer to transmit and receive data packets to and from the network 10, and each module included in the application program. This is realized as a function of a general-purpose computer by the processing of the CPU related to generation and assembly of data packets according to the above.

The audio conference system 1 may also be configured with a function that notifies listeners, by display or by sound, so that they can identify the speaker whose voice is placed at each sound image position. For example, the audio signal processing server 13 transmits to the terminal devices 11, along with the audio signal, image data showing the current layout of the participants 19, such as that shown in FIG. 5. In that case, each terminal device 11 has a display unit and displays the layout of the participants 19 on it in accordance with the image data transmitted from the audio signal processing server 13. As a result, every participant 19, as a listener, can easily check which voice in the sound produced by the headphones of the headset 12 belongs to which participant 19. Alternatively, for example, the audio signal processing unit 136 may synthesize, at predetermined time intervals, an audio signal of a voice reading out the name, nickname, or the like of the participant 19 placed at each sound image position, adjust the sound image position of the synthesized audio signal, and deliver it to the mixing unit 137. In that case, the mixing unit 137 mixes the audio signal of a participant's speech and the synthesized audio signal reading out that speaker's name or the like at the same sound image position, so that a listening participant 19 can easily tell which participant 19 is speaking at each sound image position.

また、音声会議システム１において、端末装置１１と音声信号処理サーバ１３との間で送受信されるデータパケットに含まれる音声信号を暗号化するように構成してもよい。その場合、音声信号送信部１１２および音声信号送信部１３８において、送信される音声信号が暗号化された後、データブロックに分割され、データパケットに含められる。また、音声信号受信部１３１および音声信号受信部１１３は、データパケットに含まれるデータブロックを組み立てた後、それを復号化して音声信号を復元する。このように音声信号を暗号化すると、会議の内容が第三者に漏洩することが防止される。 Further, the audio conference system 1 may be configured to encrypt an audio signal included in a data packet transmitted / received between the terminal device 11 and the audio signal processing server 13. In that case, the audio signal transmission unit 112 and the audio signal transmission unit 138 encrypt the audio signal to be transmitted, then divide the data into data blocks and include them in the data packet. The audio signal receiving unit 131 and the audio signal receiving unit 113 assemble a data block included in the data packet, and then decode the data block to restore the audio signal. Encrypting the audio signal in this way prevents the contents of the conference from leaking to a third party.

[1.3. Modifications]
In the audio conference system 1 described above, the terminal devices 11 are located apart from one another and exchange data with the audio signal processing server 13 via the network 10, thereby realizing a multipoint conference. However, the first embodiment of the present invention can also be modified so that each terminal device 11 connects directly to the audio signal processing server 13 without going through the network 10. An audio conference system modified in this way is convenient when, for example, several dozen people gather in one place to hold a conference.

また、上述したサーバにおける音像位置割当処理や音声信号処理を各端末装置で行うようにしてもよい。その場合、例えば各端末装置１１は、条件データベース１３９１、端末装置選択部１３３、周波数特性特定部１３４、音像位置割当部１３５、音声信号加工部１３６およびミキシング部１３７を備え、一方で音声信号処理サーバ１３はそれらの構成部を備えず、各端末装置１１から受信した音声信号をミキシングせずに複数トラックの音声信号として、累積時間データとともに各端末装置１１に送信する。各端末装置１１は音声信号処理サーバ１３から受信した複数トラックの音声信号および累積時間データを用いて、所定数の音声信号の選択および選択した音声信号に対する音像位置の割当等を行う。 Further, the above-described sound image position assignment processing and audio signal processing in the server may be performed by each terminal device. In this case, for example, each terminal device 11 includes a condition database 1391, a terminal device selection unit 133, a frequency characteristic specifying unit 134, a sound image position assignment unit 135, an audio signal processing unit 136, and a mixing unit 137, while an audio signal processing server 13 does not include these components, and transmits the audio signal received from each terminal apparatus 11 to each terminal apparatus 11 together with the accumulated time data as an audio signal of a plurality of tracks without mixing. Each terminal apparatus 11 uses a plurality of tracks of audio signals and accumulated time data received from the audio signal processing server 13 to select a predetermined number of audio signals and assign sound image positions to the selected audio signals.

さらに、上述した音声信号処理サーバ１３と同様の構成を備えた複数の端末装置がピアツーピアで相互接続され、音声会議システムを構成するようにしてもよい。その場合、例えば、各端末装置１１は参加者１９の音声信号を他の全ての端末装置１１に送信するとともに、他の全ての端末装置１１から各々の参加者１９の音声信号を受信する。端末装置１１は、そのように受信した複数の音声信号に対し、上述した音声信号処理サーバ１３における場合と同様の音声信号の選択処理、音像位置の割当処理等を行う。 Furthermore, a plurality of terminal devices having the same configuration as the audio signal processing server 13 described above may be interconnected peer-to-peer to constitute an audio conference system. In that case, for example, each terminal device 11 transmits the audio signal of the participant 19 to all the other terminal devices 11 and receives the audio signal of each participant 19 from all the other terminal devices 11. The terminal device 11 performs a sound signal selection process, a sound image position assignment process, and the like similar to the case of the sound signal processing server 13 described above for the plurality of sound signals thus received.

[2. Second Embodiment]
In the audio conference system 1 of the first embodiment described above, different sound image positions are assigned to the audio signals of different participants 19, producing an audio signal in which listeners do not confuse the speech of different participants 19. In the audio conference system 2 of the second embodiment, different frequency characteristics are assigned to the audio signals of different participants 19 instead, to the same end. The audio conference system 2 has much in common with the audio conference system 1 in configuration and operation. The following description therefore covers only the points in which the audio conference system 2 differs from the audio conference system 1. Components of the audio conference system 2 that are the same as or correspond to components of the audio conference system 1 are given the same reference numerals.

[2.1. Configuration and Operation of the Audio Conference System]
FIG. 8 is a block diagram showing the configuration of the audio conference system 2. The audio signal processing server 13 of the audio conference system 2 includes a frequency characteristic assignment unit 235 in place of the sound image position assignment unit 135 of the audio signal processing server 13 of the audio conference system 1. The frequency characteristic assignment unit 235 assigns a different frequency characteristic to each connection ID indicated by the selection data. There are various ways to make several frequency characteristics differ from one another. In this embodiment, as an example, the audio signals of different speakers are processed by parametric equalizers with different center frequencies so that the frequency characteristics of the audio signals differ greatly from one another. The frequency characteristic assignment unit 235 therefore assigns a different center frequency to each connection ID. FIG. 9 schematically shows how the frequency characteristic assignment unit 235 assigns different center frequencies to connection IDs.

周波数特性割当部２３５は、端末装置選択部１３３から受け取る選択データおよび周波数特性特定部１３４から受け取る周波数特性データに基づき、条件データベース１３９１に含まれる条件データに従って各中心周波数に対するコネクションＩＤの割当を行うと、その結果を示す周波数特性割当データを音声信号加工部１３６に引き渡す。図９に示される中心周波数の割当が行われる場合、周波数特性割当データは［００２３：５．０ｋＨｚ、００１５：５．９ｋＨｚ、０００９（一時的保留）：７．０ｋＨｚ、（常設保留）：８．５ｋＨｚ、０００４：１０．６ｋＨｚ、００３４：１３．４ｋＨｚ］となる。 When the frequency characteristic allocating unit 235 allocates connection IDs for the respective center frequencies in accordance with the condition data included in the condition database 1391 based on the selection data received from the terminal device selection unit 133 and the frequency characteristic data received from the frequency characteristic specifying unit 134. Then, the frequency characteristic allocation data indicating the result is delivered to the audio signal processing unit 136. When the center frequency allocation shown in FIG. 9 is performed, the frequency characteristic allocation data are [0023: 5.0 kHz, 0015: 5.9 kHz, 0009 (temporary hold): 7.0 kHz, (permanent hold): 8. 5 kHz, 0004: 10.6 kHz, 0034: 13.4 kHz].

音声信号加工部１３６は、データバッファ１３９２に順次記憶される音声信号のうち、振幅が所定の閾値を超えたものに対し、最後に受け取った周波数特性割当データに従い、パラメトリックイコライザの処理、すなわち図９において各中心周波数を有するグラフに示されるような周波数特性のフィルタ処理を施し、生成した音声信号をミキシング部１３７に引き渡す。ミキシング部１３７は各々の参加者１９に関し異なる周波数特性のフィルタ処理の施された音声信号を受け取ると、それらをミキシングして１つの音声信号を生成する。そのように音声信号処理サーバ１３により生成された音声信号は、ネットワーク１０を介して全ての端末装置１１に送信され、ヘッドセット１２を介して音として発音される。ここで、音声会議システム２における音声信号処理サーバ１３から端末装置１１に送信される音声信号はモノラルの音声信号でよく、従ってヘッドセット１２が備えるヘッドフォンもモノラルヘッドフォンでよい。 The audio signal processing unit 136 processes the parametric equalizer according to the frequency characteristic assignment data received last, that is, the audio signal whose amplitude exceeds a predetermined threshold among the audio signals sequentially stored in the data buffer 1392, that is, FIG. In FIG. 5, the filter processing of the frequency characteristics as shown in the graph having the respective center frequencies is performed, and the generated audio signal is delivered to the mixing unit 137. When the mixing unit 137 receives audio signals that have been subjected to filter processing with different frequency characteristics for each participant 19, the mixing unit 137 mixes them to generate one audio signal. The audio signal generated by the audio signal processing server 13 is transmitted to all the terminal devices 11 via the network 10 and is generated as sound via the headset 12. Here, the audio signal transmitted from the audio signal processing server 13 to the terminal device 11 in the audio conference system 2 may be a monaural audio signal, and thus the headphones included in the headset 12 may be monaural headphones.
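A parametric equalizer band of the kind described above is commonly realized as a second-order peaking filter. The sketch below uses the widely published "Audio EQ Cookbook" peaking-filter coefficients as a stand-in; the specification does not state which filter design is used, and the sampling rate (48 kHz), boost (6 dB), and Q value (2.0) are assumptions chosen only for illustration. The center frequency of 7.0 kHz is the one assigned to connection ID “0009” in the FIG. 9 example.

```python
import cmath
import math


def peaking_biquad(f0_hz, fs_hz=48000.0, gain_db=6.0, q=2.0):
    """Normalized (b, a) coefficients of a peaking EQ biquad centered at f0_hz."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    b = [1.0 + alpha * amp, -2.0 * cos_w0, 1.0 - alpha * amp]
    a = [1.0 + alpha / amp, -2.0 * cos_w0, 1.0 - alpha / amp]
    return [x / a[0] for x in b], [x / a[0] for x in a]


def apply_biquad(samples, b, a):
    """Filter a mono block with a direct-form I biquad."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def gain_db_at(f_hz, b, a, fs_hz=48000.0):
    """Magnitude response of the filter at f_hz, in decibels."""
    z = cmath.exp(-1j * 2.0 * math.pi * f_hz / fs_hz)
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20.0 * math.log10(abs(h))


b, a = peaking_biquad(7000.0)          # center frequency assigned to "0009"
print(gain_db_at(7000.0, b, a))        # boost at the center frequency
print(gain_db_at(0.0, b, a))           # response far below the band
filtered = apply_biquad([1.0, 0.0, 0.0, 0.0], b, a)
```

Each selected connection ID would get its own filter built from its assigned center frequency, and the filtered mono signals would then be summed by the mixing unit 137.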

As described above, the sound emitted from the headphones of the headset 12 is a mix of the voices of the participants speaking in the conference. In the mixed audio, however, the voices of different participants 19 have been adjusted to have markedly different frequency characteristics, so the listener can easily tell apart statements made by different participants 19. Even when many participants 19 take part in the conference, a predetermined number of frequency characteristics are assigned as appropriate to the voices of different participants 19, so that the difference between the frequency characteristics specified by adjacent center frequencies is always kept at or above a certain level. This consistently prevents the listener from confusing the voices of different participants 19.

In the audio conference system 2, as with the sound image position allocation unit 135 in the audio conference system 1, when the assignment of a center frequency changes, the frequency characteristic allocation unit 235 generates frequency characteristic allocation data and passes it to the audio signal processing unit 136 such that the center frequency shifts gradually over time from its value before the change to its value after the change. As a result, the listener can easily keep track of the same speaker's voice even when the frequency characteristic of that voice is changed.
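One simple way to realize such a gradual shift is to interpolate the center frequency between its old and new values over a fixed transition period. The sketch below assumes linear interpolation and a 2-second transition; both are illustrative choices, and the function name `glide_center` is hypothetical. The patent states only that the change is gradual over time.

```python
def glide_center(old_hz, new_hz, elapsed_s, duration_s=2.0):
    """Center frequency to use `elapsed_s` seconds after a reassignment.

    Linear interpolation from old_hz to new_hz over duration_s seconds
    (an assumed transition shape and length).
    """
    if duration_s <= 0 or elapsed_s >= duration_s:
        return new_hz
    if elapsed_s <= 0:
        return old_hz
    t = elapsed_s / duration_s
    return old_hz + (new_hz - old_hz) * t


# A speaker moved from 5.0 kHz to 7.0 kHz, sampled every half second.
steps = [glide_center(5000.0, 7000.0, s * 0.5) for s in range(6)]
```

The allocation unit could regenerate the allocation data with the interpolated value at each step, so the processing unit only ever sees ordinary allocation data.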

In the second embodiment, as in the first, the audio signal processing server 13 and the terminal devices 11 may be implemented in dedicated hardware, or by having a general-purpose computer capable of audio signal input and output execute processing in accordance with an application program. The audio conference system 2 may also be configured with an additional function that notifies the listener, by display or by sound, so that the listener can identify the speaker whose voice is assigned to each center frequency. The audio signals in the audio conference system 2 may also be encrypted. Furthermore, modifications similar to those described for the first embodiment may be applied to the second embodiment.

The sound image position assignment of the first embodiment may also be combined with the frequency characteristic assignment of the second embodiment. In that case, the audio signal processing server 13 includes both the sound image position allocation unit 135 and the frequency characteristic allocation unit 235, and assigns both a different sound image position and a different frequency characteristic to each of the audio signals selected by the terminal device selection unit 133. As a result, even if audio signals with similar timbres are placed at nearby sound image positions, they are processed to have different frequency characteristics, making it easier for the listener to tell apart the voices of different speakers. It is also possible to configure an audio conference system that provides even more easily distinguishable voices by assigning markedly different frequency characteristics to audio signals placed at sound image positions close to each other.
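The idea of giving neighboring sound image positions markedly different frequency characteristics can be illustrated with a simple interleaving heuristic. The sketch below is an assumption for illustration, not a method stated in the patent: it sorts the center frequencies, splits them into a low half and a high half, and alternates between the halves while walking across the positions. The function name `spread_frequencies` and the azimuth values are hypothetical; the center frequencies are those of the FIG. 9 example.

```python
def spread_frequencies(azimuths_deg, centers_khz):
    """Map each sound image position to a center frequency so that
    neighboring positions never receive neighboring frequencies.

    Heuristic: alternate between the lower and upper halves of the
    sorted frequency list while moving across the sorted positions.
    """
    freqs = sorted(centers_khz)
    half = (len(freqs) + 1) // 2
    low, high = freqs[:half], freqs[half:]
    interleaved = []
    for i in range(half):
        interleaved.append(low[i])
        if i < len(high):
            interleaved.append(high[i])
    return dict(zip(sorted(azimuths_deg), interleaved))


# Hypothetical positions (degrees left/right of center).
positions = [-75, -45, -15, 15, 45, 75]
centers = [5.0, 5.9, 7.0, 8.5, 10.6, 13.4]
assignment = spread_frequencies(positions, centers)
```

With six frequencies, adjacent positions end up at least two steps apart in the sorted frequency list, so speakers who are hard to separate by direction are easier to separate by timbre.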

FIG. 1 is a block diagram showing the configuration of the audio conference system according to the first embodiment of the present invention.
FIG. 2 is a diagram illustrating the contents of the condition database according to the first embodiment of the present invention.
FIG. 3 is a diagram illustrating the correspondence table between connection IDs and terminal IDs according to the first embodiment of the present invention.
FIG. 4 is a diagram schematically showing the relationship between audio signals and data packets according to the first embodiment of the present invention.
FIG. 5 is a diagram schematically showing how sound image positions are assigned according to the first embodiment of the present invention.
FIG. 6 is a diagram schematically showing the relationship between audio signals and data packets according to the first embodiment of the present invention.
FIG. 7 is a diagram schematically showing how a sound image position is moved according to the first embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the audio conference system according to the second embodiment of the present invention.
FIG. 9 is a diagram schematically showing how frequency characteristics are assigned according to the second embodiment of the present invention.

Explanation of symbols

1, 2: audio conference system; 10: network; 11: terminal device; 12: headset; 13: audio signal processing server; 111: audio signal processing unit; 112, 138: audio signal transmission unit; 113, 131: audio signal reception unit; 114, 139: storage unit; 132: cumulative time data generation unit; 133: terminal device selection unit; 134: frequency characteristic specifying unit; 135: sound image position allocation unit; 136: audio signal processing unit; 137: mixing unit; 235: frequency characteristic allocation unit; 1141, 1392: data buffer; 1391: condition database.

Claims

An audio signal processing apparatus comprising:
input means for receiving audio signals from a plurality of terminal devices;
accumulated time data generating means for generating accumulated time data indicating, for each of the plurality of terminal devices, the accumulated duration of the audio signals received from that terminal device by the input means within a predetermined past period;
selection means for selecting a predetermined number of terminal devices from the plurality of terminal devices based on the accumulated time data;
assigning means for assigning a different sound image position to each of the terminal devices selected by the selection means;
processing means for processing, when an audio signal is received by the input means from a terminal device selected by the selection means, the audio signal into an audio signal having the sound image position assigned to that terminal device by the assigning means, and for processing, when an audio signal is received by the input means from a terminal device not selected by the selection means, the audio signal into an audio signal having a predetermined sound image position; and
output means for outputting the audio signal processed by the processing means.

The audio signal processing apparatus according to claim 1, wherein, when changing one sound image position assigned to one terminal device to another sound image position, the assigning means assigns to the one terminal device a sound image position that changes from the one sound image position to the other sound image position with the passage of time.

The audio signal processing apparatus according to claim 1, further comprising frequency characteristic specifying means for generating, for each of the terminal devices selected by the selection means, frequency characteristic data indicating the frequency characteristic of the audio signal received from that terminal device,
wherein the assigning means assigns a sound image position to each of the terminal devices selected by the selection means such that the similarity between the frequency characteristic data of the audio signals received from the terminal devices to which any two sound image positions are assigned satisfies a predetermined condition.

An audio signal processing apparatus comprising:
input means for receiving audio signals from a plurality of terminal devices;
accumulated time data generating means for generating accumulated time data indicating, for each of the plurality of terminal devices, the accumulated duration of the audio signals received from that terminal device by the input means within a predetermined past period;
selection means for selecting a predetermined number of terminal devices from the plurality of terminal devices based on the accumulated time data;
assigning means for assigning a different frequency characteristic to each of the terminal devices selected by the selection means;
processing means for processing, when an audio signal is received by the input means from a terminal device selected by the selection means, the audio signal into an audio signal having the frequency characteristic assigned to that terminal device by the assigning means, and for processing, when an audio signal is received by the input means from a terminal device not selected by the selection means, the audio signal into an audio signal having a predetermined frequency characteristic; and
output means for outputting the audio signal processed by the processing means.

The audio signal processing apparatus according to claim 4, wherein, when changing one frequency characteristic assigned to one terminal device to another frequency characteristic, the assigning means assigns to the one terminal device a frequency characteristic that changes from the one frequency characteristic to the other frequency characteristic with the passage of time.

A program for causing a computer to execute:
a process of receiving audio signals from a plurality of terminal devices;
a process of generating accumulated time data indicating, for each of the plurality of terminal devices, the accumulated duration of the audio signals received from that terminal device within a predetermined past period;
a process of selecting a predetermined number of terminal devices from the plurality of terminal devices based on the accumulated time data;
a process of assigning a different sound image position to each of the selected terminal devices;
a process of processing, when an audio signal is received from one of the selected terminal devices, the audio signal into an audio signal having the sound image position assigned to that terminal device, and processing, when an audio signal is received from a terminal device other than the selected terminal devices, the audio signal into an audio signal having a predetermined sound image position; and
a process of outputting the processed audio signal.

A program for causing a computer to execute:
input means for receiving audio signals from a plurality of terminal devices;
a process of generating accumulated time data indicating, for each of the plurality of terminal devices, the accumulated duration of the audio signals received from that terminal device within a predetermined past period;
a process of selecting a predetermined number of terminal devices from the plurality of terminal devices based on the accumulated time data;
a process of assigning a different frequency characteristic to each of the selected terminal devices;
a process of processing, when an audio signal is received from one of the selected terminal devices, the audio signal into an audio signal having the frequency characteristic assigned to that terminal device, and processing, when an audio signal is received from a terminal device other than the terminal devices selected by the selection means, the audio signal into an audio signal having a predetermined frequency characteristic; and
a process of outputting the processed audio signal.