JP2010093328A

JP2010093328A - Sound signal communication system, speech synthesis device, method and program for speech synthesis processing, and recording medium stored with the program

Info

Publication number: JP2010093328A
Application number: JP2008258293A
Authority: JP
Inventors: Hiroyuki Ikawa; 浩幸井河; Manabu Ichimaru; 学一丸
Original assignee: Nippon Systemware Co Ltd
Current assignee: Nippon Systemware Co Ltd
Priority date: 2008-10-03
Filing date: 2008-10-03
Publication date: 2010-04-22
Anticipated expiration: 2028-10-03
Also published as: JP5210788B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound signal communication system in which three or more speech terminals can converse simultaneously.

SOLUTION: The sound signal communication system (100) includes a plurality of terminals (10, 20 and 30) having communication functions and an IP network (40) interconnecting the terminals, and allows the terminals to exchange sound signals with one another by RTP communication. Each of the terminals has a speech synthesis means (120), which includes an information extracting means (122), a transmission time determining means (123), a sound signal summing means (125), and a signal limiting means (126). The information extracting means (122) extracts time information from a plurality of received sound signals. Based on the extracted time information, the transmission time determining means (123) determines the transmission time of each received signal. Using the determined transmission times as a reference, the sound signal summing means (125) adds the plurality of sound signals. The signal limiting means (126) compares the value of the summed sound signal with a predetermined limit value and, when the summed value is larger, reduces it to the limit value.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an audio signal communication system in which three or more call terminals can talk simultaneously, a speech synthesizer provided in each terminal, a speech synthesis processing method performed by the speech synthesizer, a speech synthesis processing program for executing the method, and a recording medium storing the program.

Conventional telephone systems provide one-to-one calls over a circuit-switched telephone network, whereas IP telephone systems, which have become widespread in recent years, provide simultaneous calls among many parties over Internet connections (for example, Patent Document 1).

In such a multi-party call service, however, the users (callers) of the call terminals must take turns speaking. If users of several call terminals speak at the same time, the audio signals overlap inside the receiving terminal, and when the sound pressure difference between these signals is large, a crackling playback noise occurs. This phenomenon stems from the characteristics of human hearing; the noise causes the user to miss parts of the conversation and is also unpleasant.

Furthermore, a simultaneous multi-party call requires several audio signals to be processed at once, which increases the amount of data to be processed. Existing mobile phones and similar devices lack sufficient processing capability, so the audio may not be reproduced normally.

An alternative is to process and output the audio signals serially, in the order received. With this method, however, playback of the audio signals is delayed, which makes smooth conversation difficult.

Patent Document 2, on the other hand, discloses an audio mixing technique that compares, sample by sample, the absolute amplitudes of the packets of several simultaneously received audio signals and selectively processes only the sample with the largest amplitude.
JP 2008-172420 A
JP 2005-151044 A

The technique disclosed in Patent Document 2 rests on the premise that "a quiet sound is drowned out by a loud one": of the voices received at the same time, only the loudest is used and the others are discarded. When three or more callers actually converse using this technique, normal conversation is difficult because only one particular caller's voice is heard all the time, or the voice heard switches abruptly from one caller to another. Moreover, the problem remains that a crackling playback noise occurs at the moment of switching between voices if the sound pressure difference between the audio signals is large.

As a result of intensive research into the above problems, the present inventors confirmed that simply adding a plurality of audio signals digitally to generate a composite signal reproduces those signals with sufficient sound quality.

However, an audio signal reaches the receiving terminal later than the time it was sent, by an amount that depends on the terminal's processing structure, the propagation path, and so on. In addition, audio signals sent from different source terminals have different delays. If the signals are simply added with their reception times as the reference, the differences in delay may therefore prevent normal reproduction.

The present invention was devised in view of the above drawbacks. An object of the present invention is to provide an audio signal communication system in which three or more call terminals can converse simultaneously, a speech synthesizer provided in each terminal of the system, a speech synthesis processing method performed by the speech synthesizer, a speech synthesis processing program for executing the method, and a recording medium storing the program.

Another object of the present invention is to provide a speech synthesis system that can synthesize a plurality of received audio signals with high quality and without placing an excessive load on the audio processing unit of the call terminal, together with a speech synthesizer provided in each terminal of the system, a speech synthesis processing method performed by the speech synthesizer, a speech synthesis processing program for executing the method, and a recording medium storing the program.

Still another object of the present invention is to provide an audio signal communication system that synthesizes a plurality of received audio signals and can correct the differences in delay time arising between those signals, together with a speech synthesizer provided in each terminal of the system, a speech synthesis processing method performed by the speech synthesizer, a speech synthesis processing program for executing the method, and a recording medium storing the program.

To solve the above problems, the invention of claim 1 is an audio signal communication system comprising a plurality of terminals having a communication function and an IP network interconnecting the plurality of terminals, the system enabling RTP communication of audio signals among the plurality of terminals, wherein
the receiving unit of each of the plurality of terminals has speech synthesis means, and
the speech synthesis means comprises:
(a) information extracting means for extracting time information from the time stamps of the RTP headers of a plurality of received audio signals;
(b) transmission time determining means for obtaining the transmission time of each received signal based on the extracted time information;
(c) audio signal adding means for adding the plurality of audio signals with the obtained transmission times as a reference; and
(d) signal limiting means for comparing the value of the added audio signal with a predetermined limit value and, when the value of the audio signal is larger, reducing it to the limit value.

To solve the above problems, the invention of claim 2 is a speech synthesizer capable of synthesizing, in real time, audio signals received from a plurality of sources via an RTP communication system that includes an IP network, the speech synthesizer comprising:
(e) information extracting means for extracting predetermined time information from the time stamps of the RTP headers of the plurality of received audio signals;
(f) transmission time determining means for obtaining the transmission time of each received signal based on the extracted time information;
(g) audio signal adding means for adding the plurality of audio signals with the obtained times as a reference; and
(h) audio output means for outputting the added audio signal.

To solve the above problems, the invention of claim 3 is the speech synthesizer of claim 2, further comprising signal limiting means for comparing the value of the audio signal added by the audio signal adding means with a predetermined limit value and, when the value of the audio signal is larger, reducing it to the limit value.

To solve the above problems, the invention of claim 4 is a speech synthesis processing method for synthesizing, in real time, audio signals received from a plurality of sources via an RTP communication system that includes an IP network, the method comprising:
(i) an information extracting step of extracting time information from the time stamps of the RTP headers of the plurality of received audio signals;
(j) a transmission time determining step of obtaining the transmission time of each received signal based on the extracted time information;
(k) an audio signal adding step of adding the plurality of audio signals with the obtained times as a reference; and
(l) an audio signal output step of outputting the added audio signal.

To solve the above problems, the invention of claim 5 is the speech synthesis processing method of claim 4, further comprising, between the audio signal adding step and the audio signal output step, a signal limiting step of comparing the value of the added audio signal with a predetermined limit value and, when the value of the audio signal is larger, reducing it to the limit value.

To solve the above problems, the invention of claim 6 is an electronic circuit that performs the speech synthesis processing method of claim 4 or 5.

To solve the above problems, the invention of claim 7 is a program that causes a processing device to execute the speech synthesis processing method of claim 4 or 5.

To solve the above problems, the invention of claim 8 is a computer-readable medium storing the program of claim 7.

The audio signal communication system of claim 1, the speech synthesizer of claim 2, and the speech synthesis processing method of claim 4 synthesize a plurality of audio signals by simple digital addition. The computation is therefore simple and places no excessive load on the processing unit of the terminal, so the invention is well suited even to devices with relatively low computing power, such as mobile phones.

In addition, the audio signals are added only after the transmission time of each signal has been determined from its header, with the signals arranged on the time axis using those transmission times as the reference. This corrects the differences in delay between the audio signals and allows them to be synthesized accurately.

In the audio signal communication system of claim 1, the speech synthesizer of claim 3, and the speech synthesis processing method of claim 5, the value of the synthesized audio signal is compared with a predetermined limit value, and when the value of the audio signal is larger, it is reduced to the limit value.
When many pieces of audio data are simply added, the summed value can become so large that it exceeds the allowable input of the terminal and degrades the reproduced sound quality. Applying a limiter to the synthesized data prevents this problem.

Claims 6 to 8 provide the speech synthesis processing of the present invention in the form of an electronic circuit, a program, and a computer-readable medium, respectively.

The present invention makes it possible to provide an audio signal communication system for multi-party calls that corrects the differences in delay time between received audio signals, avoids complicated arithmetic processing of the audio signals, and reproduces high-quality audio. It also makes it possible to provide a speech synthesizer for the call terminals used in the system, a speech synthesis processing method performed by the speech synthesizer, a speech synthesis processing program for executing the method, and a recording medium storing the program.

A speech synthesis system according to an embodiment of the present invention is described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an audio signal communication system 100 according to an embodiment of the present invention. The audio signal communication system 100 includes three terminals 10, 20, and 30 and an IP (Internet Protocol) network 40. The terminals 10, 20, and 30 are existing call devices such as telephones, portable terminals with a call function, or computers. Each terminal has a transmitting unit, which captures call audio as an analog signal at an input unit (a microphone or the like), A/D-converts it, and outputs it as digital data to the other terminals via the IP network 40, and a receiving unit, which D/A-converts the digital signal received from the other terminals via the IP network 40 and outputs it as an analog signal from an output unit (a speaker, a headphone jack, or the like). Although FIG. 1 shows three terminals, the number of terminals may be any number of two or more. The IP network 40 interconnects the terminals 10, 20, and 30 so that they can talk to one another, and may be wireless, wired, or a combination of the two.

The audio signal communication system 100 is a multi-party call system in which the terminals 10, 20, and 30 can talk at the same time. For example, audio input at the input unit of the terminal 10 is output almost simultaneously from the output units of both terminals 20 and 30. Conversely, audio input simultaneously at the input units of both terminals 20 and 30 is output almost simultaneously from the output unit of the terminal 10. The system is serverless and does not control the right of each terminal to transmit.
In this embodiment, the audio signal communication system 100 conforms to SIP (Session Initiation Protocol) for establishing and terminating sessions and to RTP (Real-time Transport Protocol) for communicating audio signals.

FIG. 2 shows the structure of the RTP header of the RTP packets that make up the audio signals handled by the audio signal communication system 100 of this embodiment. As described in detail later, the audio data synthesis unit 120 (see FIG. 3) of the terminals 10, 20, and 30 processes audio data using three pieces of information in this header: the sequence number, the time stamp, and the synchronization source (SSRC) identifier. These header fields are therefore described briefly before the detailed functions of the terminals 10, 20, and 30.

[Sequence number]
A piece of audio data is transmitted as a series of packets, and the sequence number indicates the position of a packet within that audio data. The initial value is random, and the sequence number increases by one each time a packet is sent.
[Time stamp]
Indicates the sampling time of the first byte of the packet. In this embodiment it is used to determine the transmission time of the received packet data.
[Synchronization source (SSRC) identifier]
Identifies the source. It is used to determine which terminal sent the received packet data.
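As an illustration of how these three fields could be read from a received packet, the following is a minimal sketch. It assumes the standard 12-byte fixed RTP header layout with no special handling of CSRC entries or extensions; the names `RtpHeaderInfo` and `parse_rtp_header` are hypothetical and do not appear in the publication.

```python
import struct
from typing import NamedTuple


class RtpHeaderInfo(NamedTuple):
    """The three header fields used by the audio data synthesis unit."""
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeaderInfo:
    """Extract sequence number, time stamp and SSRC from the fixed RTP header."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: V/P/X/CC byte, M/PT byte, 16-bit sequence number,
    # 32-bit time stamp, 32-bit SSRC identifier.
    _vpxcc, _mpt, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeaderInfo(seq, ts, ssrc)


# Build a synthetic packet (version 2, payload type 0) purely for illustration.
example = struct.pack("!BBHII", 0x80, 0, 4660, 160, 0xDEADBEEF) + b"\x00" * 4
info = parse_rtp_header(example)
```

The SSRC value can then serve as the key under which per-source information is kept.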

Next, the audio data processing unit of the terminals 10, 20, and 30 is described with reference to FIG. 3. All three terminals have the same audio data processing functions. Broadly, this processing unit consists of a packet data transmission/reception unit 400, a receiving unit 140, a transmitting unit 200, and a communication session control unit 300.

The packet data transmission/reception unit 400 is connected to the IP network 40. It establishes and terminates sessions among the terminals 10, 20, and 30 and transmits and receives audio data.

The receiving unit 140 includes an audio data receiving unit 110, an audio data synthesis unit 120, and an audio data output unit 130. The audio data receiving unit 110 is a functional element that receives PCM-format audio data from the packet data transmission/reception unit 400 and decodes it into digital data in a predetermined format for processing. The audio data synthesis unit 120 is a functional element that synthesizes the signals from a plurality of terminals, received from the audio data receiving unit 110, using their transmission times as the reference; it consists mainly of an arithmetic unit such as a processor. The audio data output unit 130 is a functional element that receives the signal synthesized by the audio data synthesis unit 120 and outputs it externally; it consists of a D/A converter, an analog circuit, a speaker or headphone jack, and the like.

The transmitting unit 200 includes an audio data input unit 210 and an audio data transmitting unit 220. The audio data input unit 210 generates a digital data signal from the voice uttered by the terminal user and consists of a microphone, an analog circuit, an A/D converter, and the like. The audio data transmitting unit 220 is a functional element that converts the data received from the audio data input unit 210 into compressed code (a PCM signal) with a PCM codec, packetizes it as an RTP payload, and sends it to the packet data transmission/reception unit 400; it consists mainly of an arithmetic unit such as a processor.

The communication session control unit 300 is a functional element that controls the timing of session establishment and termination and of data exchange by the packet data transmission/reception unit 400; it consists mainly of an arithmetic unit such as a processor.

Of the functional elements above, all except the audio data synthesis unit 120 of the receiving unit 140 are known in the art, so their detailed description is omitted. Only the functions of the audio data synthesis unit 120 are described in detail below.

The detailed functions of the audio data synthesis unit 120 are described with reference to FIG. 4. The audio data synthesis unit 120 includes a data receiving unit 121, a header information extraction unit 122, a time difference determination unit 123, a clock unit 124, a data addition and restriction unit 125, a data output unit 126, an output control signal generation unit 127, a header information management unit 128, and an audio data buffer unit 129.

[Data receiving unit 121]
When the data receiving unit 121 receives audio data from the audio data receiving unit 110 (see FIG. 3) packet by packet, it transfers each packet to the data addition and restriction unit 125. At the same time, it accesses the clock unit 124 to obtain the current time and treats that time as the reception time R1 of the packet. In addition, if the packet is the first one received from its source terminal after session establishment, the unit passes the information in the packet's RTP header to the header information extraction unit 122.

In addition to the above processing, the data receiving unit 121 controls the output timing of the processed packets. Specifically, on receiving the first packet after session establishment, it instructs the output control signal generation unit 127 to generate an output control signal. It also accesses the clock unit 124 periodically to obtain the current time and computes the time elapsed since the latest packet was received; if no further packet arrives within a preset period Tp, it regards the communication as finished and instructs the output control signal generation unit 127 to stop the output control signal. The period Tp is not limited to any particular value and is preferably chosen by whoever configures the terminal according to the required specifications. It may also be made variable so that the user can choose the most suitable value.
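The start/stop behaviour described above could be sketched as follows. This is only an illustration under stated assumptions: the class name `OutputController`, the use of seconds as the time unit, and the choice to restart output when a packet arrives after a timeout are all hypothetical and not taken from the publication.

```python
class OutputController:
    """Tracks whether the output control signal should be active."""

    def __init__(self, tp_seconds: float):
        self.tp = tp_seconds          # preset period Tp (assumed to be in seconds)
        self.last_rx = None           # reception time of the latest packet
        self.output_active = False

    def on_packet(self, now: float) -> None:
        """Called for every received packet with the current clock time."""
        if not self.output_active:
            # First packet: request generation of the output control signal.
            self.output_active = True
        self.last_rx = now

    def poll(self, now: float) -> None:
        """Called periodically; stops output if no packet arrived within Tp."""
        if self.output_active and now - self.last_rx > self.tp:
            self.output_active = False


ctl = OutputController(tp_seconds=0.5)
ctl.on_packet(now=10.0)
ctl.poll(now=10.3)
active_before_timeout = ctl.output_active
ctl.poll(now=10.6)
active_after_timeout = ctl.output_active
```

Passing the clock value in as an argument keeps the sketch independent of any particular timer source.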

[Header information extraction unit 122]
The header information extraction unit 122 extracts the predetermined information (sequence number, time stamp, and synchronization source (SSRC) identifier) from the RTP header information received from the data receiving unit 121 and sends it to the time difference determination unit 123.

[Time difference determination unit 123]
The time difference determination unit 123 refers to the RTP header information (sequence number, time stamp, and synchronization source (SSRC) identifier) sent from the header information extraction unit 122. It first identifies the source of the packet received by the data receiving unit 121 from the SSRC identifier and records this as the source information. It then obtains the transmission time T1 of the packet from the time stamp and computes the time difference Td between T1 and the previously recorded reception time R1, where Td = R1 - T1. The time difference Td is determined by the signal processing time in the sending terminal, the propagation delay of the transmission path, packet queuing delay in routers, and so on. Because this value is almost constant as long as the source terminal and the transmission path stay the same, in this embodiment it is computed only for the first packet from each source terminal after session establishment.
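The relation Td = R1 - T1 and its reuse for later packets (T1 = R1 - Td) can be illustrated with a short sketch. The publication does not say how the time stamp is converted into a time, so this example assumes, as a simplification, that the time stamp counts sample periods at 8000 Hz on a clock comparable with the receiver's. `HeaderInfoStore` and its methods are hypothetical names.

```python
SAMPLE_RATE_HZ = 8000  # assumption: time stamp advances once per audio sample


class HeaderInfoStore:
    """Keeps the per-source time difference Td, keyed by SSRC identifier."""

    def __init__(self):
        self._td = {}  # ssrc -> Td in seconds

    def register_first_packet(self, ssrc: int, rtp_timestamp: int, r1: float) -> float:
        """Compute Td = R1 - T1 once per source and return the stored value."""
        if ssrc not in self._td:
            t1 = rtp_timestamp / SAMPLE_RATE_HZ  # assumed time stamp to seconds mapping
            self._td[ssrc] = r1 - t1
        return self._td[ssrc]

    def transmission_time(self, ssrc: int, r1: float) -> float:
        """Recover T1 = R1 - Td for a later packet from the same source."""
        return r1 - self._td[ssrc]


store = HeaderInfoStore()
td = store.register_first_packet(ssrc=1, rtp_timestamp=8000, r1=1.25)
later_t1 = store.transmission_time(ssrc=1, r1=2.25)
```

Computing Td only once per source follows the embodiment's observation that the delay is almost constant for a given terminal and path.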

Next, the time difference determination unit 123 stores in the header information management unit 128 the packet sequence number from the RTP header sent by the header information extraction unit 122, together with the source information and the time difference Td obtained as described above.

[Clock unit 124]
The clock unit 124 is a functional element that supplies the current time to the data receiving unit 121. This current time is used to determine the reception time R1 and the elapsed time that is checked against the period Tp.

[Data addition and restriction unit 125]
When the data addition and restriction unit 125 receives a packet from the data receiving unit 121, it first accesses the header information management unit 128 to obtain the time difference Td for that packet. It then obtains the transmission time T1 from the previously recorded reception time R1 and this time difference Td. Next, it stores the payload (audio data) of the packet in the area of the audio data buffer unit 129 that corresponds to the transmission time T1, after first checking whether data is already stored in that area. If data is already stored there, the unit adds the stored data and the new data and stores the sum in the area. Simple addition suffices here.
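The store-or-add behaviour of the buffer can be sketched as below. The buffer layout is an assumption made for illustration: slots are indexed by sample position on the transmission-time axis at an assumed 8000 Hz rate, and `MixBuffer` is a hypothetical name.

```python
from collections import defaultdict


class MixBuffer:
    """Audio buffer whose slots are indexed by position on the transmission-time axis."""

    def __init__(self, sample_rate_hz: int = 8000):
        self.sample_rate_hz = sample_rate_hz
        # An empty slot reads as 0, so "store if empty, add if occupied"
        # reduces to a single addition per sample.
        self.slots = defaultdict(int)

    def store(self, t1_seconds: float, samples) -> None:
        """Place a payload at transmission time T1, summing with existing data."""
        start = round(t1_seconds * self.sample_rate_hz)
        for offset, value in enumerate(samples):
            self.slots[start + offset] += value


buf = MixBuffer()
buf.store(0.0, [1, 2, 3])          # payload from one source
buf.store(1 / 8000, [10, 20])      # payload from another source, one sample later
```

Because every payload is placed by its transmission time rather than its arrival time, packets from different sources can arrive in any order and still be summed at the right positions.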

A supplementary explanation of the data addition method performed by the data addition and restriction unit 125 is given with reference to FIG. 5.
Data A and data B on the left side of the figure are waveforms of audio data input to the data addition and restriction unit 125 from different transmission source terminals. The transmission time T1 of data A is tA, and the transmission time T1′ of data B is tB. Here, the difference between tA and tB is assumed to be twice the sampling period of these waveforms. In this case, the two sets of data are aligned on the time axis with reference to their respective transmission times T1 and T1′, that is, the waveform of data B is shifted to the right by two samples relative to the waveform of data A, and samples at the same time are added together. The resulting synthesized waveform is shown on the right.
In this way, the data addition and restriction unit 125 has a concept of a time axis and performs addition with the input data arranged on the time axis according to its transmission time T1.
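The two-sample shift described for FIG. 5 can be sketched in Python as follows. The buffer is modeled as a dictionary keyed by sample index, which is only one possible representation; the function name add_aligned and the sample values are hypothetical.

```python
def add_aligned(buffer, samples, start_index):
    """Add samples into buffer, placing the first sample at start_index.

    buffer maps a sample index on the common time axis to the summed value.
    Simple addition is used, as the description suggests.
    """
    for offset, value in enumerate(samples):
        index = start_index + offset
        buffer[index] = buffer.get(index, 0) + value
    return buffer


# Hypothetical data: data B starts two sampling periods after data A.
data_a = [1, 2, 3, 4]
data_b = [10, 20, 30, 40]

mixed = {}
add_aligned(mixed, data_a, start_index=0)  # tA
add_aligned(mixed, data_b, start_index=2)  # tB = tA + 2 samples

result = [mixed[i] for i in sorted(mixed)]
# result is [1, 2, 13, 24, 30, 40]
```

Only the two overlapping positions (indices 2 and 3) are sums of both sources; the remaining samples pass through unchanged.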

Note also that if packets are received from many terminals at the same time, the payloads of all these packets are added together, so the resulting data values can become considerably large. If this data is output to a subsequent processing unit as is, it may exceed the allowable input of that unit and be distorted, significantly degrading the quality of the reproduced sound or possibly damaging the subsequent processing unit. To prevent this, when a data value exceeds a reference value, the data addition and restriction unit 125 changes that value to the upper limit value (hereinafter, this process is referred to as the data restriction process).
Here, the upper limit value Th serves as a limiter and may be set, for example, to the upper limit of the valid numerical range of the PCM data. In that case, the upper limit value Th is set to 255 when the payload is 8 bits and to 32767 when it is 16 bits.
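The data restriction process can be sketched in Python as a simple upper clamp. The values 255 and 32767 are taken from the text; the function names are hypothetical, and the sketch limits only the upper side because the description mentions only an upper limit value.

```python
def upper_limit_for(payload_bits):
    """Upper limit value Th for the payload widths named in the text."""
    limits = {8: 255, 16: 32767}
    return limits[payload_bits]


def restrict(samples, th):
    """Replace every value larger than Th with Th; leave others unchanged."""
    return [min(value, th) for value in samples]


th16 = upper_limit_for(16)
limited = restrict([100, 40000, 32767, -5], th16)
# limited is [100, 32767, 32767, -5]
```

A practical mixer for signed PCM would probably clamp the negative side as well; that is an extension beyond what is described here.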

[Data output unit 126]
On receiving the output control signal from the output control signal generation unit 127, the data output unit 126 reads the data stored in the audio data buffer unit 129 and outputs it to the subsequent audio data output unit 130 (see FIG. 3).

[Output control signal generation unit 127]
This is a functional element that generates an output control signal in response to a command from the data receiving unit 121 and outputs it to the data output unit 126. It generates the output control signal when audio data output is to start and stops the signal when audio data output is to stop.

[Header information management unit 128]
This is a functional element for storing and managing information about received packets (transmission source, sequence number, time difference Td), and is composed of registers, memory, a hard disk, or the like.

[Audio data buffer unit 129]
This is a functional element for temporarily storing the data processed by the data addition and restriction unit 125, and is composed of registers, memory, a hard disk, or the like. The buffer unit may be provided as a dedicated buffer, or a jitter buffer used to correct data jitter may be reused for this purpose. The storage area inside the buffer unit has a concept of a time axis, and stored data is managed in association with its transmission time T1.

FIG. 6 shows an example of the audio data waveform output when two sets of audio data are input to the audio data synthesis unit 120 having the functions described above. In this figure, waveform 1 and waveform 2 are waveforms input to the microphones or the like of two different transmission source terminals. Waveform 1 is the waveform produced when a person repeatedly utters "ah" at a constant interval, and waveform 2 is a 495 Hz sine wave produced by a signal generator. Waveform 3 is the waveform obtained by synthesizing waveforms 1 and 2 and output from the speaker or the like of the receiving terminal. As waveform 3 shows, waveforms 1 and 2 are correctly synthesized over the entire period, and there are no points where the sound level changes abruptly in a way that would produce crackling noise.

The functions of the audio signal communication system 100 according to an embodiment of the present invention have been described above. Next, the procedure of the data synthesis process performed by the audio data synthesis unit 120 in each terminal when the terminals 10, 20, and 30 communicate with one another in this system is described.
The data synthesis process is broadly divided into two processes: a data addition process and a data output process. These are described in order below. FIG. 7 shows the flow of the data addition process, and FIG. 8 shows the flow of the data output process.

[Data addition process]
This process is described with reference to the flowchart of FIG. 7.
First, the terminals 10, 20, and 30 shown in FIG. 1 establish sessions with the IP network 40. Because the audio signal communication system 100 of this embodiment has no server that performs communication control, as shown in the figure, each terminal establishes a session with itself in addition to the other terminals. Once the sessions are established, the three terminals can exchange audio data with one another in real time.

After the sessions are established, one of the terminals 10, 20, and 30 (the receiving terminal) receives audio data packets from the other two terminals (the transmitting terminals) via the IP network 40. Each packet passes through the packet data transmission/reception unit 400 and the audio data reception unit 110 of the transmitting terminal and is input to the audio data synthesis unit 120. The data receiving unit 121 in the audio data synthesis unit 120 then receives the input packet (step S10).

The data receiving unit 121 refers to the synchronization source (SSRC) identifier and the sequence number in the header of the received packet and checks whether the packet is the first packet from that transmission source terminal since the start of the session (step S11). If it is the first packet ("YES" in step S11), the procedure proceeds to step S12; otherwise ("NO" in step S11), it proceeds to step S16.

If the received packet is the first packet ("YES" in step S11), the data receiving unit 121 accesses the clock unit 124, obtains the current time, and sets it as the reception time R1 (step S12). It also passes the packet's header information to the header information extraction unit 122. The header information extraction unit 122 extracts the sequence number, time stamp, and synchronization source (SSRC) identifier from the received header information (step S13). The time difference Td (= R1 − T1) is then obtained from the transmission time T1 of the data packet, derived from the time stamp, and the previously obtained reception time R1 (step S14). Next, information on the transmission source, packet sequence number, and time difference Td is stored in the header information management unit 128 (step S15).
Note that steps S12 to S15 are performed only when the result of step S11 is "YES", that is, only for the first packet from a transmission source terminal after the session starts.

Next, in step S16, the data addition and restriction unit 125 receives the packet from the data receiving unit 121 and accesses the header information management unit 128 to obtain the time difference Td for the received packet. It then obtains the transmission time T1 of the received packet from the previously obtained reception time R1 and the obtained time difference Td (step S16). It then accesses the audio data buffer unit 129 and checks whether data is already stored in the area corresponding to the transmission time T1 (step S17).

If data is already stored ("YES" in step S17), the stored data is read out and added to the data being processed, with both arranged on the time axis according to their transmission times T1 (step S18), and the procedure proceeds to step S19.
If no data is stored ("NO" in step S17), the procedure proceeds directly to step S19.

Next, the values of the data obtained by the above procedure are compared with the predetermined limit value Th (step S19). If any value exceeds the limit value Th ("YES" in step S19), that value is changed to the limit value Th (step S20). If none does ("NO" in step S19), the data is left unchanged.
The data is then stored in the area of the audio data buffer unit 129 corresponding to its transmission time T1 (step S21).
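Steps S10 to S21 can be put together in one short Python sketch. It makes several assumptions that are not in the description: times are integer sample ticks, the reception time is measured for every packet, T1 is recovered as the reception time minus Td, and the payload is a list of integer samples. The class name AdditionProcess and all values are hypothetical.

```python
class AdditionProcess:
    """Hypothetical sketch of the data addition process (steps S10 to S21)."""

    def __init__(self, limit_th=32767):
        self.td_by_source = {}  # plays the role of header information management unit 128
        self.buffer = {}        # plays the role of audio data buffer unit 129 (tick -> value)
        self.limit_th = limit_th

    def on_packet(self, ssrc, rtp_timestamp, payload, reception_tick):
        # S11 to S15: compute and store Td only for the first packet of a source.
        if ssrc not in self.td_by_source:
            self.td_by_source[ssrc] = reception_tick - rtp_timestamp  # Td = R1 - T1
        # S16: recover the transmission time from the reception time and Td.
        t1 = reception_tick - self.td_by_source[ssrc]
        for offset, sample in enumerate(payload):
            slot = t1 + offset
            # S17, S18: add to whatever is already stored in this slot.
            total = self.buffer.get(slot, 0) + sample
            # S19 to S21: limit to Th, then store.
            self.buffer[slot] = min(total, self.limit_th)
        return t1


mixer = AdditionProcess(limit_th=100)
mixer.on_packet(ssrc=1, rtp_timestamp=0, payload=[60, 60, 60, 60], reception_tick=5)
mixer.on_packet(ssrc=2, rtp_timestamp=2, payload=[30, 50, 10, 10], reception_tick=9)

mixed = [mixer.buffer[i] for i in sorted(mixer.buffer)]
# mixed is [60, 60, 90, 100, 10, 10]; slot 3 was limited from 110 to 100
```

Keying the buffer by transmission tick is what lets packets from different sources land in the same slot and be summed, regardless of the order in which they arrive.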

[Data output process]
This process is described with reference to the flowchart of FIG. 8. Note that the data output process runs in parallel with the data addition process described above.
First, when a session starts, the data receiving unit 121 of the receiving terminal causes the output control signal generation unit 127 to generate the output control signal, which causes the data output unit 126 to start its output operation (step S50). The data output unit 126 accesses the audio data buffer unit 129, retrieves the data stored there by the data addition and restriction unit 125 during the data addition process (step S51), and outputs it to the audio data output unit 130 (see FIG. 3) (step S52). At the same time, the data receiving unit 121 periodically obtains the current time from the clock unit 124 and monitors the time elapsed since the most recent packet was received (step S53). While the elapsed time is within the predetermined period Tp ("NO" in step S54), steps S51 to S54 are repeated.

If, on the other hand, the elapsed time exceeds the predetermined period Tp ("YES" in step S54), the receiving terminal regards data reception as finished. Specifically, the data receiving unit 121 causes the output control signal generation unit 127 to stop generating the output control signal (step S55). When the data output unit 126 no longer receives the output control signal from the output control signal generation unit 127, it stops the data output operation (step S56). The receiving terminal then returns to the initial state at session start and waits to receive audio data packets.
This completes the series of steps of the audio data synthesis process in the audio signal communication system 100 according to an embodiment of the present invention.
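The output loop of steps S50 to S56 can be simulated in Python with a tick counter in place of a real clock. The sketch assumes integer ticks, one sample output per tick, and silence (0) for empty buffer slots; none of these details are given in the description, and the function name run_output is hypothetical.

```python
def run_output(buffer, packet_arrival_ticks, tp, max_ticks):
    """Emit one buffered sample per tick until no packet has arrived for more than tp ticks."""
    arrivals = set(packet_arrival_ticks)
    last_arrival = 0  # S50: output starts when the session starts (tick 0)
    emitted = []
    for now in range(max_ticks):
        if now in arrivals:
            last_arrival = now           # a new packet resets the elapsed time
        if now - last_arrival > tp:      # S53, S54: elapsed time exceeds Tp
            break                        # S55, S56: stop the output operation
        # S51, S52: read the slot for this tick; assume silence if it is empty.
        emitted.append(buffer.pop(now, 0))
    return emitted


# Hypothetical run: packets arrive at ticks 0 and 1, and Tp is 2 ticks.
emitted = run_output({0: 5, 1: 6, 2: 7}, packet_arrival_ticks=[0, 1], tp=2, max_ticks=10)
# emitted is [5, 6, 7, 0]; output stops at tick 4, three ticks after the last packet
```

In a real terminal this loop would run concurrently with the data addition process, so access to the shared buffer would need some form of synchronization.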

Note that the functions of the audio signal communication system 100 according to an embodiment of the present invention are not limited to specific hardware resources or software processing. That is, the audio data synthesis unit 120 of the terminals 10, 20, and 30 may be realized using any hardware (electronic circuits or the like), software (programs), or combination thereof, as long as its functions can be achieved.

When the audio signal synthesis method according to an embodiment of the present invention described above is implemented as a program, the program is preferably downloaded from an external server or the like to the information processing apparatus that executes the method, or distributed in the form of a computer-readable medium. Examples of computer-readable media include CD-ROM, DVD, magnetic tape, flexible disk, magneto-optical disk, and hard disk.

The present invention has been described above using the embodiments shown in the drawings, but these are merely illustrative, and those skilled in the art will understand that various changes and modifications are possible without departing from the scope and spirit of the invention. Accordingly, the scope of the invention should be determined not by the described embodiments but by the technical spirit set forth in the claims.

FIG. 1 is a diagram showing the configuration of the audio signal communication system 100 according to an embodiment of the present invention.
FIG. 2 is a diagram showing the structure of the header of the RTP data on which communication in the audio signal communication system 100 is based.
FIG. 3 is a block diagram showing the functions of the audio data processing units of the terminals 10, 20, and 30 of the audio signal communication system 100.
FIG. 4 is a block diagram showing the functions of the audio data synthesis unit 120 of FIG. 3.
FIG. 5 is a diagram for explaining the data addition method of the audio data synthesis unit 120.
FIG. 6 shows examples of input and output waveforms of the terminals 10, 20, and 30.
FIG. 7 is a flowchart showing the data addition process of the audio synthesis processing performed by the audio data synthesis unit 120.
FIG. 8 is a flowchart showing the data output process of the audio synthesis processing performed by the audio data synthesis unit 120.

Explanation of symbols

10 Terminal
20 Terminal
30 Terminal
40 IP network
100 Audio signal communication system
120 Audio data synthesis unit
121 Data receiving unit
122 Header information extraction unit
123 Time difference determination unit
124 Clock unit
125 Data addition and restriction unit
126 Data output unit
127 Output control signal generation unit
128 Header information management unit
129 Audio data buffer unit

Claims

An audio signal communication system comprising a plurality of terminals having a communication function and an IP network interconnecting the plurality of terminals, the system being capable of RTP communication of audio signals among the plurality of terminals, wherein
each of the receiving units of the plurality of terminals has speech synthesis means, and
the speech synthesis means comprises:
information extracting means for extracting time information from the time stamps of the RTP headers of a plurality of received audio signals;
transmission time determination means for obtaining a transmission time of each received signal based on the extracted time information;
audio signal adding means for adding the plurality of audio signals based on the determined transmission times; and
signal limiting means for comparing the value of the added audio signal with a predetermined limit value and, if the value of the audio signal is greater, reducing the value of the audio signal to the limit value.

A speech synthesizer capable of synthesizing, in real time, audio signals received from a plurality of transmission sources via an RTP communication system including an IP network, the speech synthesizer comprising:
information extracting means for extracting predetermined time information from the time stamps of the RTP headers of the plurality of received audio signals;
transmission time determination means for obtaining a transmission time of each received signal based on the extracted time information;
audio signal adding means for adding the plurality of audio signals based on the determined times; and
audio output means for outputting the added audio signal.

The speech synthesizer according to claim 2, further comprising signal limiting means for comparing the value of the audio signal added by the audio signal adding means with a predetermined limit value and, when the value of the audio signal is larger, reducing the value of the audio signal to the limit value.

A speech synthesis processing method for synthesizing, in real time, audio signals received from a plurality of transmission sources via an RTP communication system including an IP network, the method comprising:
an information extracting step of extracting time information from the time stamps of the RTP headers of the plurality of received audio signals;
a transmission time determining step of obtaining a transmission time of each received signal based on the extracted time information;
an audio signal addition step of adding the plurality of audio signals based on the determined times; and
an audio signal output step of outputting the added audio signal.

The speech synthesis processing method according to claim 4, further comprising, between the audio signal addition step and the audio signal output step, a signal limiting step of comparing the value of the added audio signal with a predetermined limit value and, when the value of the audio signal is larger, reducing the value of the audio signal to the limit value.

An electronic circuit that performs the speech synthesis processing method according to claim 4.

A program for causing a processing device to execute the speech synthesis processing method according to claim 4 or 5.

A computer-readable medium storing the program according to claim 7.