JP3492561B2

JP3492561B2 - Communication voice processing device and storage medium storing voice processing program

Info

Publication number: JP3492561B2
Application number: JP21538699A
Authority: JP
Inventors: 良昌梶原
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-07-29
Filing date: 1999-07-29
Publication date: 2004-02-03
Anticipated expiration: 2019-07-29
Also published as: JP2001045055A

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention

The present invention relates to a communication voice processing apparatus suited to decoding and encoding voice when voice communication is performed over a packet communication network, and to a storage medium storing a voice processing program.

[0002]

Description of the Related Art

ITU-T Recommendation H.323 is one standard to be observed when performing voice communication over a packet communication network. ITU-T Recommendation H.323 relates to a standard for multimedia communication over packet communication networks. When multimedia information such as voice and images is communicated in real time, a communication terminal compliant with ITU-T Recommendation H.323 is required.

[0003] FIG. 6 shows an example of a communication device used in a voice communication terminal compliant with ITU-T Recommendation H.323. The communication device shown in FIG. 6 includes a terminal control unit 11, a connection control unit 12, a voice input unit 13, a voice output unit 14, a packet transmitting/receiving unit 15, an asynchronous communication network 16, and a voice processing unit 40. The terminal control unit 11 manages and controls the operations of the connection control unit 12 and the voice processing unit 40. Under instruction from the terminal control unit 11, the connection control unit 12 negotiates with the partner terminal at connection time, determines the connection conditions, and performs the connection processing. The voice input unit 13 receives voice data from a voice input means such as a microphone (not shown) and transfers it to the voice processing unit 40. The voice output unit 14 reproduces the voice data from the voice processing unit 40 as sound through a voice output device such as a speaker (not shown). The packet transmitting/receiving unit 15 sends and receives packets through the asynchronous communication network 16. The asynchronous communication network 16 is a communication network of the kind used in packet communication networks, and packets are transferred between terminals through it.

[0004] Under instruction from the terminal control unit 11, the voice processing unit 40 encodes and decodes voice data according to the ITU-T Recommendation G.723.1 voice coding scheme. G.723.1 is a voice data encoding/decoding scheme defined by the ITU-T. Because encoding involves predicting the voice waveform, a high compression ratio can be achieved, but the encoding process takes time. G.723.1 encodes and decodes voice data in units called frames. One frame is 30 msec long (240 samples of PCM data).
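The frame figures above can be checked with a little arithmetic. The sketch below assumes the 8 kHz sampling rate that G.723.1 uses for its input; that rate is not stated in this text, and the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000  # G.723.1 input sampling rate (assumption: not stated in the text)
FRAME_MS = 30          # frame length given in the text


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of PCM samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


def frames_for_duration(duration_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Whole frames needed to cover duration_ms (rounded up)."""
    return -(-duration_ms // frame_ms)


print(samples_per_frame())       # 240, matching the figure in the text
print(frames_for_duration(100))  # 4 frames are needed to cover 100 ms
```

Every timing quantity later in the description (packet arrival interval, jitter, encode and decode time) is naturally expressed in multiples of this 30 msec frame.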

[0005] As shown in FIG. 7, the conventional voice processing unit 40 consists of a voice control unit 41, a voice encoding unit 42, and a voice decoding unit 43. The voice control unit 41 manages the voice encoding unit 42 and the voice decoding unit 43 and issues operation instructions to them alternately. Following an instruction from the voice control unit 41, the voice encoding unit 42 encodes the voice data input from the voice input unit 13 according to the G.723.1 coding scheme, then packetizes the encoded bitstream and writes it to the packet transmitting/receiving unit 15. Following an instruction from the voice control unit 41, the voice decoding unit 43 reads a packet from the packet transmitting/receiving unit 15 and extracts the encoded bitstream from it, then decodes the bitstream and outputs the decoded voice data as sound from the voice output unit 14.

[0006]

Problems to Be Solved by the Invention

Conventionally, when reproducing voice from packets, the voice decoding unit 43 read one packet from the packet transmitting/receiving unit 15 and decoded it each time it received an operation instruction from the voice control unit 41. In this case, however, as shown in FIG. 8, if the interval between one operation instruction to the voice decoding unit 43 and the next becomes longer than the interval at which packets arrive from the partner terminal, reading of received packets cannot keep up. Received packets therefore accumulate in the reception buffer of the packet transmitting/receiving unit 15. Once the reception buffer is full, received packets that cannot be stored are lost, which causes breaks in the reproduced voice. The smaller the capacity of the packet transmitting/receiving unit 15, the more pronounced this phenomenon becomes.
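The overflow described here can be reproduced with a toy simulation. The following is a minimal sketch, not the patent's method: it assumes a fixed packet arrival interval, a fixed interval between decoder instructions, one packet read per instruction, and that an arrival in a given millisecond is handled before a read in the same millisecond. All names and numbers are hypothetical.

```python
def simulate_one_packet_per_instruction(arrival_ms: int, instruction_ms: int,
                                        capacity: int, duration_ms: int) -> int:
    """Return how many packets are lost when the decoder reads one packet
    per instruction from a reception buffer holding `capacity` packets."""
    buffered = 0
    dropped = 0
    for t in range(1, duration_ms + 1):
        if t % arrival_ms == 0:              # a packet arrives from the network
            if buffered < capacity:
                buffered += 1
            else:
                dropped += 1                 # reception buffer full: packet lost
        if t % instruction_ms == 0 and buffered > 0:
            buffered -= 1                    # decoder reads exactly one packet
    return dropped


# Instructions as frequent as arrivals: nothing is lost.
print(simulate_one_packet_per_instruction(30, 30, 2, 600))  # 0
# Instructions half as frequent as arrivals: packets are lost,
# and a smaller buffer loses more.
print(simulate_one_packet_per_instruction(30, 60, 2, 600))  # 9
print(simulate_one_packet_per_instruction(30, 60, 1, 600))  # 10
```

In this model a larger reception buffer only delays the first loss; as long as one packet is read per instruction and instructions are rarer than arrivals, losses continue at the same rate.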

[0007] In addition, as noted in the description of the related art, in a voice encoding/decoding scheme that performs predictive coding, such as G.723.1, encoding takes longer than decoding. Consequently, if the voice encoding unit 42 and the voice decoding unit 43 are operated at a 1:1 ratio, decoding may fail to keep up. FIG. 9 is a scheduling diagram for the case where encoding is performed every three frames and decoding every one frame. As shown in FIG. 9, encoding of the voice signal ends up being performed partway through decoding of the voice packets from the partner terminal, so output of voice data stalls. Conventional voice encoding/decoding schemes that perform predictive coding therefore have the drawback that the voice breaks up during reproduction.

[0008] The present invention has been made in view of the above circumstances. Its object is to provide a communication voice processing apparatus, and a storage medium storing a voice processing program, that prevent loss of received packets so that the voice does not break up during reproduction.

[0009]

Means for Solving the Problems

The first invention is a communication voice processing apparatus that encodes and decodes voice signals in order to perform voice communication over a packet communication network, comprising: a voice encoding unit that encodes a voice signal for transmission; a packet storage unit that temporarily stores received voice signal packets; a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them; and a voice control unit that sends operation instructions to each unit.

[0010] The second invention is the communication voice processing apparatus according to the first invention, wherein, when there are more received packets than can be stored in the packet storage unit at one time, the voice control unit deletes packets in the packet storage unit in order, starting from the one with the oldest storage time, and stores the new received packets.

[0011] The third invention is the communication voice processing apparatus according to the first invention, wherein, when one or more packets have been received, the voice control unit stores all the received packets in the packet storage unit at one time.

[0012] The fourth invention is the communication voice processing apparatus according to the first invention, wherein the packet storage unit includes a ring buffer that stores the received voice signal packets, the ring buffer having as many stages as the maximum number of packets that may be received.

[0013] The fifth invention is the communication voice processing apparatus according to the fourth invention, wherein, where $Recv_{Interval}$ is the interval at which received voice signal packets are read into the ring buffer, $J_{Max}$ is the maximum jitter of packet arrival, $P_{Frame}$ is the amount of voice information contained in a received packet, and $Recv_{Maxl}$ is the number of stages of the ring buffer, the following relation holds:

$Recv_{Maxl} = \{(J_{Max} + Recv_{Interval}) / P_{Frame}\} + 1$

[0014] The sixth invention is the communication voice processing apparatus according to the fifth invention, wherein the jitter $J_{Max}$ is determined using RTCP statistical information.

[0015] The seventh invention is the communication voice processing apparatus according to the first invention, wherein the voice control unit sets the ratio between the numbers of times the encoding process and the decoding process are executed such that the time spent encoding the voice signal equals the time spent decoding it.

[0016] The eighth invention is the communication voice processing apparatus according to the seventh invention, wherein the voice control unit determines the ratio of the times required for the encoding process and the decoding process from the amount of voice data.

[0017] The ninth invention is a storage medium recording a program for a communication voice processing apparatus that comprises a voice encoding unit that encodes a voice signal for transmission, a packet storage unit that temporarily stores received voice signal packets, a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them, and a voice control unit that sends operation instructions to each unit. When there are more received packets than can be stored in the packet storage unit at one time, the program causes the voice control unit to delete packets in the packet storage unit in order, starting from the one with the oldest storage time, and to store the new received packets.

[0018] The tenth invention is the storage medium according to the ninth invention, recording a program that causes the voice control unit to store all received packets in the packet storage unit at one time when one or more packets have been received.

[0019] The eleventh invention is a storage medium recording a program for a communication voice processing apparatus that comprises a voice encoding unit that encodes a voice signal for transmission, a packet storage unit that temporarily stores received voice signal packets, a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them, and a voice control unit that sends operation instructions to each unit. The program causes the voice control unit to ensure that the packet storage unit includes a ring buffer that stores the received voice signal packets and has as many stages as the maximum number of packets that may be received.

[0020] The twelfth invention is the storage medium according to the eleventh invention, recording a program that causes the voice control unit to ensure that, where $Recv_{Interval}$ is the interval at which received voice signal packets are read into the ring buffer, $J_{Max}$ is the maximum jitter of packet arrival, $P_{Frame}$ is the amount of voice information contained in a received packet, and $Recv_{Maxl}$ is the number of stages of the ring buffer, the following relation holds:

$Recv_{Maxl} = \{(J_{Max} + Recv_{Interval}) / P_{Frame}\} + 1$

[0021] The invention of claim 13 is the storage medium according to the eleventh invention, recording a program that causes the voice control unit to determine the jitter $J_{Max}$ using RTCP statistical information.

[0022] The fourteenth invention is a storage medium recording a program for a communication voice processing apparatus that comprises a voice encoding unit that encodes a voice signal for transmission, a packet storage unit that temporarily stores received voice signal packets, a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them, and a voice control unit that sends operation instructions to each unit. The program causes the voice control unit to set the ratio between the numbers of times the encoding process and the decoding process are executed such that the time spent encoding the voice signal equals the time spent decoding it.

[0023] The fifteenth invention is the storage medium according to the fourteenth invention, recording a program that causes the voice control unit to determine the ratio of the times required for the encoding process and the decoding process from the amount of voice data.

[0024] To achieve the above object, the present invention adds to the voice processing unit, in addition to the voice control unit, the voice encoding unit, and the voice decoding unit, a new packet storage unit having a buffer. Packets in the packet transmitting/receiving unit are transferred to and stored in the buffer of the packet storage unit before decoding, and are read from that buffer and decoded at decoding time. With this arrangement, when one or more packets have been received by the packet transmitting/receiving unit, all of them can be transferred at one time to the buffer in the packet storage unit, for example a ring buffer having one or more stages, and stored there. Packet loss in the packet transmitting/receiving unit is thereby suppressed.

[0025] Furthermore, when a communication device whose voice processing unit has a voice control unit, a voice encoding unit, and a voice decoding unit determines the numbers of times the encoding and decoding processes are executed from the ratio of the times they require, received packets can be processed efficiently even when a predictive voice encoding/decoding scheme is used. That is, the number of times the voice control unit issues operation instructions to the voice encoding unit and to the voice decoding unit is set according to the time required for voice encoding in the voice encoding unit and for voice decoding in the voice decoding unit, so that the voice decoding unit operates more often. By keeping processing from being skewed toward either encoding or decoding, voice data can be processed efficiently. Because the communication device of the present invention includes the voice processing unit described above, the voice does not break up during reproduction even when a large amount of voice data is received from the partner terminal.

[0026]

Embodiments of the Invention

Embodiments of the present invention are described in detail below with reference to the drawings.

[0027] FIG. 1 is a block diagram showing the configuration of a communication device having the voice processing apparatus according to the present invention. Except for the voice processing unit, this communication device has the same configuration as the conventional voice communication device (FIG. 7), so common parts are given the same reference numerals and their detailed description is omitted. Like the conventional voice communication device, this communication device is used in a voice communication terminal compliant with ITU-T Recommendation H.323. As shown in FIG. 1, the voice processing unit 20 includes a voice control unit 21, a voice encoding unit 22, a packet storage unit 23, and a voice decoding unit 24.

[0028] The operation (function) of the voice processing unit 20 is as follows. To perform the packet encoding process, the voice control unit 21 issues an operation instruction to the voice encoding unit 22. Following that instruction, the voice encoding unit 22 reads the input voice data from the voice input unit 13 and encodes it according to the G.723.1 coding scheme. It then packetizes the encoded bitstream and writes it to the packet transmitting/receiving unit 15.

[0029] To perform the packet decoding process, the voice control unit 21 first issues an operation instruction to the packet storage unit 23. Following that instruction, the packet storage unit 23 reads the voice signal packets from the partner terminal out of the reception buffer of the packet transmitting/receiving unit 15 into the buffer of the packet storage unit 23, for example a ring buffer. Next, the voice control unit 21 issues an operation instruction to the voice decoding unit 24. Following that instruction, the voice decoding unit 24 reads a packet from the buffer of the packet storage unit 23 and extracts the encoded bitstream from it. The voice decoding unit 24 then decodes the bitstream, and the decoded voice data is output from the voice output unit 14.

[0030] The processing of the packet storage unit 23 and the voice control unit 21 is now described in detail. First, the processing of the packet storage unit 23 is described with reference to FIG. 2. FIG. 2 is a block diagram showing an example of the packet storage unit 23. The packet storage unit 23 includes a packet monitoring unit 31 and a ring buffer 32. On an instruction from the voice control unit 21, the packet monitoring unit 31 checks whether any packets are present in the packet transmitting/receiving unit 15. If one or more packets have been received by the packet transmitting/receiving unit 15, all of them are transferred to the ring buffer 32 at one time and stored. If, when packets are being stored in the ring buffer 32, the packet transmitting/receiving unit 15 has received more packets than the ring buffer 32 has free space for, packets in the ring buffer 32 are deleted in order, starting from the one with the oldest storage time, and the new packets are stored.

[0031] FIG. 3 is an explanatory diagram showing how packets are stored in the ring buffer 32. As shown in FIG. 3, the reception buffer 17 can hold seven packets at a time and the ring buffer 32 can hold four. FIG. 3(a) shows the case where fewer packets than the number of stages of the ring buffer 32 are stored in it. On an instruction from the voice control unit 21, the packet storage unit 23 stores packets 1a, 2a, and 3a from the reception buffer 17 in the packet transmitting/receiving unit 15 into the ring buffer 32 in the packet storage unit.

[0032] In contrast, consider the case where there are more packets than stages in the ring buffer 32. For example, as shown in FIG. 3(b), suppose that voice signal packets 1a, 2a, 3a, 4a, 5a, and 6a from the asynchronous communication network 16 are stored in the reception buffer 17 of the packet transmitting/receiving unit 15. When the packets are next taken in, the ring buffer 32 can hold only four of them, so packets 1a, 2a, 3a, and 4a are stored in the ring buffer 32 first. Then packets 1a and 2a, which have the oldest storage times in the ring buffer 32, are deleted, and packets 5a and 6a are stored in their storage areas.
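The store-all, drop-oldest behavior of FIG. 3(b) can be sketched as follows. This is an illustrative model, not the patent's implementation: the class name `PacketStore` and its methods are hypothetical, and a `collections.deque` stands in for the ring buffer 32, so it tracks the logical oldest-to-newest order rather than physical slot positions.

```python
from collections import deque


class PacketStore:
    """Toy model of the packet storage unit 23 with a ring buffer of fixed size."""

    def __init__(self, stages: int):
        self.ring = deque(maxlen=stages)

    def store_all(self, receive_buffer: list) -> list:
        """Move every pending packet out of receive_buffer into the ring.
        When the ring is full, the oldest stored packet is deleted first.
        Returns the packets that were deleted to make room."""
        deleted = []
        while receive_buffer:
            if len(self.ring) == self.ring.maxlen:
                deleted.append(self.ring.popleft())   # oldest storage time goes first
            self.ring.append(receive_buffer.pop(0))
        return deleted

    def read(self):
        """Hand the oldest remaining packet to the decoder, or None if empty."""
        return self.ring.popleft() if self.ring else None


store = PacketStore(stages=4)
rx = ["1a", "2a", "3a", "4a", "5a", "6a"]   # contents of reception buffer 17
print(store.store_all(rx))   # ['1a', '2a'] were deleted
print(list(store.ring))      # ['3a', '4a', '5a', '6a']
```

After the transfer the reception buffer is empty, which is the point of the scheme: the loss, if any, is taken on the oldest audio rather than on packets still arriving from the network.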

[0033] Next, the number of stages of the ring buffer 32 is determined as follows. Let $Recv_{Interval}$ be the interval at which packets are read from the packet transmitting/receiving unit 15, $J_{Max}$ the maximum jitter of packet arrival, and $P_{Frame}$ the amount of voice information (number of encoded frames) contained in a received packet. The maximum number of packets $Recv_{Maxl}$ that the packet transmitting/receiving unit may receive at one time is then

$Recv_{Maxl} = \{(J_{Max} + Recv_{Interval}) / P_{Frame}\} + 1$

The maximum number of packets that may be transferred to the ring buffer 32 at one time is therefore $Recv_{Maxl}$, so this value is used as the number of stages of the ring buffer 32. As the value of $J_{Max}$, the maximum audio delay jitter (maximum Audio Delay Jitter) of ITU-T Recommendation H.245, the RTCP interarrival jitter (interarrival Jitter) parameter of ITU-T Recommendation H.225, or the like is used.
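A worked example may help with the sizing formula. The sketch below rests on several assumptions that the text does not spell out: the braces are read as taking the integer part, the jitter and read interval are expressed in milliseconds, and the per-packet voice information (a frame count) is converted to time using the 30 msec G.723.1 frame. The function name and the sample values are hypothetical.

```python
FRAME_MS = 30  # G.723.1 frame length from the description


def ring_buffer_stages(j_max_ms: int, recv_interval_ms: int,
                       p_frame: int, frame_ms: int = FRAME_MS) -> int:
    """Recv_Maxl = {(J_Max + Recv_Interval) / P_Frame} + 1,
    with P_Frame converted from frames to milliseconds (assumption)
    and the braces read as integer division (assumption)."""
    packet_ms = p_frame * frame_ms
    return (j_max_ms + recv_interval_ms) // packet_ms + 1


# 60 ms worst-case jitter, packets read every 90 ms, one frame per packet:
print(ring_buffer_stages(60, 90, 1))  # 6
# Same network, three frames per packet:
print(ring_buffer_stages(60, 90, 3))  # 2
```

Packing more frames into each packet lowers the number of stages needed, because fewer packets can arrive within the same window of jitter plus read interval.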

[0034] With the number of stages of the ring buffer 32 set in this way, the received packets stored in the reception buffer 17 of the packet transmitting/receiving unit 15 can always be transferred to the ring buffer 32 at one time.

[0035] FIG. 4 shows the means for solving the conventional scheduling problem shown in FIG. 8. The interval between one operation instruction from the terminal control unit 11 to the voice processing unit 20 and the next may become longer than the interval at which packets arrive from the partner terminal. In that case, in the present invention, as shown in FIG. 4, packets that have been received by the packet transmitting/receiving unit 15 but not yet transferred to the packet storage unit 23 can be transferred at one time to the ring buffer 32 in the packet storage unit 23. The storage area of the reception buffer 17 of the packet transmitting/receiving unit 15 is therefore less likely to fill with packets, and packets can be prevented from overflowing the reception buffer 17.

[0036] Next, the processing of the voice control unit 21 is described in detail. The number of times the voice control unit 21 issues operation instructions to the voice encoding unit 22 and to the voice decoding unit 24 can be determined as follows. Let $T_E$ be the time the voice encoding unit 22 needs to encode one frame, $F_E$ the amount of voice data (number of frames) encoded in one encoding operation, $T_D$ the time the voice decoding unit 24 needs to decode one frame's worth of bitstream in accordance with G.723.1, and $F_D$ the amount of voice data (number of frames) decoded in one decoding operation. The time $E_{Time}$ required for the operation of the voice encoding unit 22 and the time $D_{Time}$ required for the operation of the voice decoding unit 24 are then given by:

$E_{Time} = T_E \times F_E$

$D_{Time} = T_D \times F_D$

[0037] Here, let E_Count be the number of operations of the voice encoding unit 22 and D_Count the number of operations of the voice decoding unit 24. For the total operating times of the voice encoding unit 22 and the voice decoding unit 24 to be equal, the following must hold:
E_Time × E_Count = D_Time × D_Count
Setting the operation count E_Count of the voice encoding unit 22 to 1, the operation count D_Count of the voice decoding unit 24 becomes:
D_Count = E_Time / D_Time
The voice control unit determines the ratio between the number of encoding operations and the number of decoding operations so that the time spent on voice encoding equals the time spent on decoding. For example, when E_Time : D_Time = 7 : 1, the voice control unit 21 should issue seven operation commands to the voice decoding unit 24 between issuing one operation command to the voice encoding unit 22 and issuing the next.
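The relations E_Time = T_E × F_E, D_Time = T_D × F_D and D_Count = E_Time / D_Time can be worked through in a few lines. The following is a minimal sketch, not the patent's implementation: the function names are hypothetical, the time units are arbitrary, and rounding a non-integer ratio to the nearest whole number of decode commands is an assumption, since the text only gives an example with an integer ratio.

```python
from fractions import Fraction


def operation_time(time_per_frame, frames_per_operation):
    """E_Time or D_Time: per-frame time multiplied by frames per operation."""
    return Fraction(time_per_frame) * frames_per_operation


def decode_count_per_encode(t_e, f_e, t_d, f_d):
    """D_Count for E_Count = 1, i.e. E_Time / D_Time, as an exact ratio."""
    e_time = operation_time(t_e, f_e)
    d_time = operation_time(t_d, f_d)
    return e_time / d_time


def command_cycle(d_count):
    """One scheduling cycle: one encode command, then the decode commands.

    Assumption: a non-integer ratio is rounded to the nearest integer and
    at least one decode command is always issued.
    """
    n = max(1, round(d_count))
    return ["encode"] + ["decode"] * n


# Example from the text: E_Time : D_Time = 7 : 1 gives seven decode commands.
ratio = decode_count_per_encode(t_e=7, f_e=1, t_d=1, f_d=1)
print(ratio, command_cycle(ratio))
```

Using an exact fraction keeps the ratio free of floating-point error, so the rounding step is the only place where the sketch departs from the stated equations.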

[0038] Next, FIG. 5 shows a means of solving the problem with the conventional scheduling shown in FIG. 9. FIG. 5 is an explanatory diagram showing the timing relationship between the decoding process and the encoding process. As FIG. 5 shows, the processing time per frame increases in the order of decoding, encoding, and audio playback. The figure shows three voice signal packets from the partner terminal being decoded, after which three frames of the voice signal are encoded. The three decoded packets are played back while the three frames are being encoded. As is clear from FIG. 5, this allows the voice received from the partner terminal to be played back without interruption even when the encoding time is longer than the decoding time.

[0039] Furthermore, the communication device may include a recording medium on which the following programs are recorded: a program that executes the above method; a program that determines the ratio of the times required for encoding and decoding from the amount of audio data; a program that, when one or more packets have been received by the packet transmitting/receiving unit, transfers all of those packets at once from the packet transmitting/receiving unit to the receive buffer in the packet storage unit and stores them; and a program that, where the buffer in the packet storage unit is a ring buffer and more packets arrive than the buffer can hold, deletes entries from the ring buffer in order, starting with the oldest storage time, so that new packets can be stored. Such programs may instead be recorded in the voice control unit of the voice processing unit. As stated above, the present invention is not limited to the above embodiment, and various modifications are possible within the scope of the claims.
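The two buffer behaviours described above (moving every waiting packet in one transfer, and discarding the oldest entries when the ring buffer is full) can be illustrated as follows. This is a minimal sketch under assumptions: the class and method names are hypothetical, packets are represented by plain strings, and Python's `collections.deque` with `maxlen` stands in for the ring buffer.

```python
from collections import deque


class PacketStore:
    """Ring buffer that keeps only the newest `stages` packets."""

    def __init__(self, stages):
        # A deque with maxlen drops its oldest entry when a new one is
        # appended to a full buffer, matching the oldest-first deletion rule.
        self._ring = deque(maxlen=stages)

    def transfer_all(self, receiver_queue):
        """Move every packet waiting in the receiver in a single call."""
        moved = 0
        while receiver_queue:
            self._ring.append(receiver_queue.popleft())
            moved += 1
        return moved

    def read_all(self):
        """Return all stored packets, oldest first, and empty the buffer."""
        packets = list(self._ring)
        self._ring.clear()
        return packets


# Four packets arrive, but the ring buffer has only three stages.
receiver = deque(["p1", "p2", "p3", "p4"])
store = PacketStore(stages=3)
moved = store.transfer_all(receiver)
print(moved, store.read_all())
```

In this run the oldest packet, `p1`, is the one discarded, and the receiver's own queue is left empty after the single transfer.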

[0040]

[Effects of the Invention] According to the present invention, when one or more packets have been received by the packet transmitting/receiving unit, all of them can be transferred at once from the packet transmitting/receiving unit to the receive buffer in the packet storage unit and stored there, which suppresses packets overflowing the receive buffer of the packet transmitting/receiving unit and being dropped. In addition, because several packets can be decoded at one time, interruptions in playback caused by delayed packet decoding can be suppressed, and comfortable voice communication can be achieved.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a voice communication device having a voice processing device according to the present invention.

FIG. 2 is a block diagram showing the configuration of the packet storage unit.

FIG. 3 is an explanatory diagram showing how packets are stored in the ring buffer of the packet storage unit.

FIG. 4 is an explanatory diagram showing the timing at which the packet storage unit reads packets.

FIG. 5 is an explanatory diagram showing the timing at which the voice decoding unit decodes voice.

FIG. 6 is a block diagram showing the basic configuration of a conventional voice communication device.

FIG. 7 is a block diagram of a voice communication device including the configuration of a conventional voice processing unit.

FIG. 8 is an explanatory diagram showing the timing at which the packet storage unit of a conventional communication device reads packets.

FIG. 9 is an explanatory diagram showing the timing at which the voice decoding unit of a conventional communication device decodes voice.

[Explanation of symbols]

11 Terminal control unit
12 Connection control unit
13 Voice input unit
14 Voice output unit
15 Packet transmitting/receiving unit
16 Asynchronous communication network
20 Voice processing unit
21 Voice control unit
22 Voice encoding unit
23 Packet storage unit
24 Voice decoding unit
21 Packet monitoring unit
22 Ring buffer

Claims

(57) [Claims]

1. A communication voice processing apparatus that encodes and decodes voice signals in order to perform voice communication using a packet communication network, comprising: a voice encoding unit that encodes a voice signal for transmission; a packet storage unit that temporarily stores received voice signal packets; a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them; and a voice control unit that sends operation instructions to each unit, wherein the packet storage unit is equipped with a ring buffer for storing received voice signal packets, the ring buffer having the same number of stages as the maximum number of packets that may be received, and wherein, where Recv_Interval is the interval at which received voice signal packets are read into the ring buffer, J_Max is the maximum packet arrival jitter, P_Frame is the amount of voice information contained in a received packet, and Recv_Maxl is the number of stages of the ring buffer, the relation Recv_Maxl = {(J_Max + Recv_Interval) / P_Frame} + 1 holds.

2. The communication voice processing apparatus according to claim 1, wherein the jitter J_Max is defined using RTCP statistical information.

3. A communication voice processing apparatus that encodes and decodes voice signals in order to perform voice communication using a packet communication network, comprising: a voice encoding unit that encodes a voice signal for transmission; a packet storage unit that temporarily stores received voice signal packets; a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them; and a voice control unit that sends operation instructions to each unit, wherein, where T_E is the time required to encode one frame, F_E is the amount of audio data encoded in one encoding operation, T_D is the time required to decode the bit stream for one frame, and F_D is the amount of audio data decoded in one decoding operation, the time E_Time required for the encoding process and the time D_Time required for the decoding process are E_Time = T_E × F_E and D_Time = T_D × F_D, and wherein, where E_Count is the number of operations of the voice encoding process and D_Count is the number of operations of the voice decoding process, the voice control unit performs control so that E_Time × E_Count = D_Time × D_Count.

4. A storage medium on which is recorded a program for a communication voice processing apparatus comprising a voice encoding unit that encodes a voice signal for transmission, a packet storage unit that temporarily stores received voice signal packets, a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them, and a voice control unit that sends operation instructions to each unit, the program causing the voice control unit to execute: providing the packet storage unit with a ring buffer for storing packets that has the same number of stages as the maximum number of packets that may be received; and, where Recv_Interval is the interval at which received voice signal packets are read into the ring buffer, J_Max is the maximum packet arrival jitter, P_Frame is the amount of voice information contained in a received packet, and Recv_Maxl is the number of stages of the ring buffer, making the relation Recv_Maxl = {(J_Max + Recv_Interval) / P_Frame} + 1 hold.

5. The storage medium according to claim 4, on which is recorded a program for causing the voice control unit to define the jitter J_Max using RTCP statistical information.

6. A storage medium on which is recorded a program for a communication voice processing apparatus comprising a voice encoding unit that encodes a voice signal for transmission, a packet storage unit that temporarily stores received voice signal packets, a voice decoding unit that reads voice signal packets from the packet storage unit and decodes them, and a voice control unit that sends operation instructions to each unit, wherein, where T_E is the time required to encode one frame, F_E is the amount of audio data encoded in one encoding operation, T_D is the time required to decode the bit stream for one frame, and F_D is the amount of audio data decoded in one decoding operation, the time E_Time required for the encoding process and the time D_Time required for the decoding process are E_Time = T_E × F_E and D_Time = T_D × F_D, and wherein, where E_Count is the number of operations of the voice encoding process and D_Count is the number of operations of the voice decoding process, the program causes the voice control unit to execute control so that E_Time × E_Count = D_Time × D_Count.