JP4117301B2

JP4117301B2 - Audio data interpolation apparatus and audio data interpolation method

Info

Publication number: JP4117301B2
Application number: JP2005064556A
Authority: JP
Inventors: 肇小日向; 徹星; 温松下
Original assignee: 株式会社エイビット
Priority date: 2005-03-08
Filing date: 2005-03-08
Publication date: 2008-07-16
Anticipated expiration: 2025-03-08
Also published as: JP2006253843A

Abstract

PROBLEM TO BE SOLVED: To provide a voice data interpolation apparatus and interpolation method that conceal the audio dropout caused by a delay and convey the content of a conversation accurately, even when a delay too long to be covered by an interpolation waveform occurs during the conversation. SOLUTION: Voice packets or voice data of the talk spurt period are temporarily stored up to the moment a delay occurs, and the stored audio is reproduced immediately before the audio of the voice packets received late, once the delay has cleared. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an audio data interpolation apparatus and an audio data interpolation method that conceal audio dropouts caused by interruptions in the continuous reception of voice packets. More particularly, it relates to an apparatus and method that let a conversation continue without the listener noticing a dropout, even when a relatively long dropout occurs because of a network spike delay.

In recent years, VoIP (Voice over IP), in which routers forward voice packets to the destination according to their IP addresses over an IP network such as the Internet, has come into use because it can form a speech path inexpensively without existing telephone exchanges. It is used for Internet telephony and for calling schemes such as mobile phones that form a speech path by connecting an IP network and the switched telephone network through a gateway.

Because an IP network provides best-effort communication with no bandwidth guarantee, voice packets may be delayed and some may be lost, depending on the state of the network. These delay and packet-loss phenomena fall broadly into three types: base delay, fluctuation delay, and spike delay.

Base delay is the steady delay incurred by passing through the IP network, and is small enough to be ignored in a call. Fluctuation delay is delay whose duration rises and falls within a roughly constant range according to network traffic; a playout buffer on the receiving side absorbs this fluctuation so that continuous audio can be reproduced from the voice packets.
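The role of the playout buffer described above can be illustrated with a minimal sketch. It assumes each packet carries a sequence number, as the claims later state; the class name `PlayoutBuffer`, the `depth` parameter, and the string payloads are hypothetical and chosen only for illustration.

```python
import heapq


class PlayoutBuffer:
    """Reorders packets by sequence number and releases them in transmission order."""

    def __init__(self, depth=2):
        self.depth = depth   # hypothetical: number of packets held back to absorb fluctuation
        self._heap = []      # min-heap of (sequence_number, payload)

    def push(self, seq, payload):
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Release the oldest packet once more than `depth` packets are queued."""
        if len(self._heap) > self.depth:
            return heapq.heappop(self._heap)
        return None

    def __len__(self):
        return len(self._heap)


buf = PlayoutBuffer(depth=2)
played = []
for seq in [1, 3, 2, 5, 4, 6]:          # arrival order disturbed by fluctuation delay
    buf.push(seq, f"frame-{seq}")
    pkt = buf.pop_ready()
    if pkt is not None:
        played.append(pkt[0])
print(played)                            # packets leave in sequence order
```

Holding back a couple of packets is enough to restore order here, but the same approach cannot cover a gap of several hundred milliseconds without adding that much latency to every packet.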

Spike delay is infrequent, but when it occurs the reception of voice packets stops without warning, and 400 msec to 1 sec later the voice packets that should have arrived in the meantime are received all at once, sometimes with some packets lost. Possible causes include network traffic concentrating beyond what a router can handle, a router needing time to establish a new route for some reason, and periodic router maintenance.

If the playout buffer were made deep enough to absorb a spike delay, the resulting playback delay would be large enough to hinder two-way conversation, so there is a limit to what the playout buffer can handle.

For this reason, methods have been proposed that make a dropout less noticeable when a delay occurs, either by inserting the preceding or following voice packets into the gap where packets are missing, or by generating an interpolation waveform from the voice data of the preceding and following packets and filling the gap with it (see, for example, Patent Document 1).

FIG. 13 is a block diagram of the conventional audio data interpolation apparatus 100 described in Patent Document 1. A momentary interruption of the digital audio signal input from the input means 101, caused by a delay or the like, is detected by the interruption detection means 102. While the interruption detection means 102 detects no interruption, the digital audio signal from the input means 101 is passed by the digital audio signal selection means 103 to the ring buffer means 104 and stored there. The digital audio signal stored in the ring buffer means 104 is encoded by the encoding means 105 and output through the output means 106.

When the interruption detection means 102 detects an interruption, the attenuation/increase means 107 reads the digital audio signal from immediately before the interruption out of the ring buffer means 104 and generates an interpolation signal. The attenuation/increase means 107 comprises attenuation means, increase means, and silence signal generation means. The interpolation signal generated at this point is an attenuated signal, obtained by multiplying the digital audio signal from immediately before the interruption by an attenuation coefficient so that it decays gradually.

The digital audio signal selection means 103 stores the attenuated signal generated by the attenuation/increase means 107 in the ring buffer means 104 directly after the digital audio signal from immediately before the interruption, so that after the interruption the output means 106 continues to output a digital audio signal that decays according to the attenuation coefficient. If the interruption detection means 102 detects that the interruption has lasted for a certain time or longer, the attenuation/increase means 107 generates a silence signal following the attenuated signal and outputs it to the ring buffer means 104 in the same way. The output means 106 therefore outputs a silence signal directly after the decaying digital audio signal.

When the interruption detection means 102 detects the digital audio signal arriving late from the input means 101 after the interruption, the attenuation/increase means 107 first attenuates this signal heavily, then raises it gradually over time by multiplying it by a constant increase coefficient, and outputs it to the ring buffer means 104. As a result, once the interruption ends, the output means 106 follows the silence signal with a digital audio signal whose level rises gradually.
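The fade-out and fade-in behaviour of this prior-art apparatus can be sketched as follows. This is only an illustration of the described principle, not the circuit of Patent Document 1; the function names, the coefficient values, and the representation of a frame as a list of samples are all assumptions.

```python
def fade_out(last_frame, coeff=0.5, steps=3):
    """Repeat the last frame before the gap, scaled down by `coeff` at each step."""
    frames, gain = [], 1.0
    for _ in range(steps):
        gain *= coeff                      # attenuation coefficient applied cumulatively
        frames.append([s * gain for s in last_frame])
    return frames


def fade_in(frames, start_gain=0.25, growth=2.0):
    """Scale late-arriving frames up from `start_gain`, multiplying by `growth` per frame."""
    out, gain = [], start_gain
    for frame in frames:
        out.append([s * gain for s in frame])
        gain = min(1.0, gain * growth)     # never exceed the original level
    return out


decayed = fade_out([8.0, -8.0])
restored = fade_in([[8.0]] * 4)
print(decayed)
print(restored)
```

A silence signal would sit between the two stages when the interruption lasts longer than the fade-out. The scheme smooths the edges of the gap but does not restore any of the speech that the gap interrupted.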

Thus, with the conventional audio data interpolation apparatus 100, even if the input of the digital audio signal is momentarily interrupted by a network delay, the output moves smoothly from sound to silence and from silence back to sound, so the break in the call is not jarring to the listener.

Patent Documents 2 and 3 disclose audio data interpolation apparatuses that determine, from features contained in each received voice packet, whether the packet contains speech or is silent. When a delay occurs, they generate a pseudo audio signal from the packets containing speech and reproduce it as the interpolation signal played during the delay.

Patent Document 1: Japanese Unexamined Patent Publication No. H10-282995 (paragraphs 0064 to 0068 of the specification, FIG. 5)
Patent Document 2: Japanese Unexamined Patent Publication No. H6-202696 (paragraph 0008 of the specification, FIG. 1)
Patent Document 3: Japanese Unexamined Patent Publication No. H8-123497 (paragraphs 0011 to 0018 of the specification, FIG. 1)

However, while each of the conventional apparatuses described above can conceal an interruption caused by a short delay of up to about 20 msec so that the dropout is not noticeable in the call, a long delay of 400 msec or more, such as a spike delay, spans at least one phoneme. Waveform interpolation from the preceding and following voice packets is then difficult, the interruption cannot be concealed, and the content of the conversation is not conveyed accurately.

For example, take the utterance "Watashi wa, ashita, Shinjuku-eki de matte imasu." ("I will be waiting at Shinjuku Station tomorrow."), and suppose that between the voice packets carrying "Shin" and "juku" a spike delay holds back the packets carrying "juku" by 400 msec or more. If the late packets are discarded, the utterance is reproduced as "Watashi wa, ashita, Shin--eki de matte imasu."; even if the late packets are not discarded, it is reproduced as "Watashi wa, ashita, Shin--juku-eki de matte imasu." In either case the interruption is conspicuous, and in addition the word "Shinjuku-eki" is split apart, so the content of the conversation is not conveyed accurately.

The present invention was made in view of these problems of the prior art. Its object is to provide an audio data interpolation apparatus and an audio data interpolation method that conceal the dropout caused by a delay and convey the content of the conversation accurately, even when a spike delay of 400 msec or more occurs during the conversation.

To achieve the above object, the audio data interpolation apparatus of claim 1 comprises: a network connection unit that receives, over a network, voice packets to which the transmitting apparatus has assigned packet sequence numbers in transmission order; a playout buffer that temporarily stores the voice packets received by the network connection unit in transmission order according to their packet sequence numbers; a voice decoding circuit that reads the voice packet stored at the head of the playout buffer and decodes it into voice data; a voiced-speech determination unit that identifies voiced voice data by comparing, with a predetermined threshold, a feature quantity that quantifies a characteristic of speech in the voice data output by the voice decoding circuit; a talk spurt detection unit that determines, from a period in which voiced voice data continues, a talk spurt period presumed to be a period of conversation; a talk spurt holding buffer that is reset each time the talk spurt detection unit detects the start of a talk spurt period and that temporarily stores, in sequence, the voice data of the talk spurt period newly output from the voice decoding circuit; a voice reproduction unit that reproduces audio from voice data; a switching control unit that selectively connects either the output of the voice decoding circuit or the output of the talk spurt holding buffer to the input of the voice reproduction unit; and a delay detection unit that detects a spike delay of voice packets caused by a network fault from the storage state of voice packets in the playout buffer.
When the delay detection unit detects a spike delay during a talk spurt period, the switching control unit, before the voice packets received late because of the spike delay are decoded into voice data and output to the voice reproduction unit, switches the input of the voice reproduction unit from the output of the voice decoding circuit to the output of the talk spurt holding buffer for as long as the voice data of the talk spurt period held in the talk spurt holding buffer is being output. The voice reproduction unit thus reproduces the audio of the late voice packets directly after the voice data of the talk spurt period held in the talk spurt holding buffer.
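One way a delay detection unit might infer a spike delay from the storage state of the playout buffer is to watch for the buffer staying empty longer than ordinary fluctuation would explain. The sketch below assumes exactly that rule; the class name `SpikeDelayDetector`, the 20 msec frame period, and the threshold value are hypothetical, since this passage does not state a specific detection criterion.

```python
class SpikeDelayDetector:
    """Flags a spike delay when the playout buffer stays empty for too long."""

    def __init__(self, threshold_ms=60, frame_ms=20):
        self.threshold_ms = threshold_ms   # hypothetical limit beyond normal fluctuation
        self.frame_ms = frame_ms           # hypothetical frame period
        self.empty_ms = 0

    def tick(self, buffered_packets):
        """Call once per frame period with the current playout buffer occupancy."""
        if buffered_packets == 0:
            self.empty_ms += self.frame_ms
        else:
            self.empty_ms = 0              # packets are flowing again
        return self.empty_ms >= self.threshold_ms


detector = SpikeDelayDetector(threshold_ms=60, frame_ms=20)
flags = [detector.tick(n) for n in [2, 1, 0, 0, 0, 0, 3]]
print(flags)
```

The flag clears as soon as the delayed packets land in the playout buffer, which is the moment the switching control unit can hand playback to the talk spurt holding buffer.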

When the delay detection unit detects a spike delay during a talk spurt period, the talk spurt holding buffer holds the decoded voice data of the talk spurt period up to the point at which the spike delay occurred. By switching the input of the voice reproduction unit to the output of the talk spurt holding buffer before the late voice packets are decoded and output to the voice reproduction unit, the voice data of the talk spurt period from before the break and the voice data from after the break are reproduced as continuous audio within a single talk spurt period.

Because a talk spurt period is presumed to be a unit of conversation delimited by silent portions, audio is reproduced continuously over a stretch of conversation at least longer than a single word.
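The voiced-speech determination, talk spurt detection, and holding buffer reset can be sketched together. The feature quantity used here is mean frame energy, which is an assumption: the text only requires some quantified feature compared with a threshold. The names `frame_energy`, `is_voiced`, and `TalkSpurtHoldingBuffer` and the threshold value are likewise hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (assumed feature quantity)."""
    return sum(s * s for s in frame) / len(frame)


def is_voiced(frame, threshold=0.01):
    return frame_energy(frame) > threshold


class TalkSpurtHoldingBuffer:
    """Keeps the frames of the current talk spurt; resets when a new spurt starts."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.in_spurt = False
        self.frames = []

    def feed(self, frame):
        voiced = is_voiced(frame, self.threshold)
        if voiced and not self.in_spurt:
            self.frames = []               # start of a new talk spurt: reset
        if voiced:
            self.frames.append(frame)
        self.in_spurt = voiced
        return voiced


loud, quiet = [0.5, -0.5], [0.0, 0.0]
holder = TalkSpurtHoldingBuffer()
for frame in [loud, loud, quiet, loud]:
    holder.feed(frame)
print(len(holder.frames))                  # only the frames of the latest talk spurt remain
```

Because the buffer is cleared at every spurt start, what it holds at the moment of a spike delay is the beginning of the phrase that was cut off.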

In the audio data interpolation apparatus of claim 2, the switching control unit switches the input of the voice reproduction unit between the output of the voice decoding circuit and the output of the talk spurt holding buffer so that all the voice data of the talk spurt period held in the talk spurt holding buffer has been output by the time the expected spike delay time elapses.

The delay time of a spike delay is a roughly constant time that can be predicted to some extent from the network environment. Therefore, if all the voice data of the talk spurt period held in the talk spurt holding buffer has been output by the time the expected delay time elapses, the voice data of the late voice packets is output next and the whole is reproduced as continuous audio.
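The timing implied here is simple arithmetic: the replay should begin early enough that it finishes just as the expected delay runs out. A minimal sketch, assuming times in milliseconds measured on a common clock; the function name and the clamping rule for a held phrase longer than the expected delay are assumptions, not something the text specifies.

```python
def replay_start_ms(gap_start_ms, expected_delay_ms, held_duration_ms):
    """Time at which to start the replay so that it ends when the expected delay elapses."""
    start = gap_start_ms + expected_delay_ms - held_duration_ms
    # Assumption: if the held audio is longer than the expected delay,
    # start at once and accept that the replay overruns.
    return max(start, gap_start_ms)


print(replay_start_ms(1000, 400, 150))
print(replay_start_ms(1000, 400, 600))
```

With a 400 msec expected delay and 150 msec of held audio, the replay begins 250 msec into the gap, so only that first part of the gap is silent.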

In the audio data interpolation apparatus of claim 3, the switching control unit switches the input of the voice reproduction unit from the output of the voice decoding circuit to the output of the talk spurt holding buffer on condition that a voice packet received late because of the spike delay has been stored in the playout buffer. After all the voice data of the talk spurt period held in the talk spurt holding buffer has been output to the voice reproduction unit, the switching control unit switches the input of the voice reproduction unit back to the output of the voice decoding circuit, and the voice data decoded from the voice packets stored in the playout buffer is output to the voice reproduction unit.

Even when the delay time of a spike delay cannot be predicted, the voice data of the talk spurt period held in the talk spurt holding buffer is output only once a late voice packet has been stored in the playout buffer. The voice data of the late packets can therefore reliably be output afterwards, and the whole can be reproduced as continuous audio.

The audio data interpolation apparatus of claim 4 comprises: a network connection unit that receives, over a network, voice packets to which the transmitting apparatus has assigned packet sequence numbers in transmission order; a playout buffer that temporarily stores the voice packets received by the network connection unit in transmission order according to their packet sequence numbers; a voice decoding circuit that decodes voice packets into voice data; a voiced-speech determination unit that identifies voiced voice data by comparing, with a predetermined threshold, a feature quantity that quantifies a characteristic of speech in the voice data output by the voice decoding circuit; a talk spurt detection unit that determines, from a period in which voiced voice data continues, a talk spurt period presumed to be a period of conversation; a voice reproduction unit that reproduces audio from the voice data output by the voice decoding circuit; a talk spurt holding buffer that is reset each time the talk spurt detection unit detects the start of a talk spurt period and that temporarily stores, in sequence, the voice packets of the talk spurt period newly read from the head of the playout buffer; a switching control unit that selectively connects either the output of the playout buffer or the output of the talk spurt holding buffer to the input of the voice decoding circuit; and a delay detection unit that detects a spike delay of voice packets caused by a network fault from the storage state of voice packets in the playout buffer.
When the delay detection unit detects a spike delay during a talk spurt period, the switching control unit, before the voice packets received late because of the spike delay are output to the voice decoding circuit, switches the input of the voice decoding circuit from the output of the playout buffer to the output of the talk spurt holding buffer for as long as the voice packets of the talk spurt period held in the talk spurt holding buffer are being output. The voice decoding circuit decodes the late voice packets into voice data directly after the voice packets of the talk spurt period output from the talk spurt holding buffer, and the voice reproduction unit reproduces audio from that voice data.

When the delay detection unit detects a spike delay during a talk spurt period, the talk spurt holding buffer holds the voice packets of the talk spurt period up to the point at which the spike delay occurred. By switching the input of the voice decoding circuit to the output of the talk spurt holding buffer before the late voice packets are output to the voice decoding circuit, the voice data of the talk spurt period from before the break and the voice data from after the break are reproduced as continuous audio within a single talk spurt period.

In the audio data interpolation apparatus of claim 5, the switching control unit switches the input of the voice decoding circuit between the output of the playout buffer and the output of the talk spurt holding buffer so that all the voice packets of the talk spurt period held in the talk spurt holding buffer have been output by the time the expected spike delay time elapses.

The delay time of a spike delay is a roughly constant time that can be predicted to some extent from the network environment. Therefore, if all the voice packets of the talk spurt period held in the talk spurt holding buffer have been output by the time the expected delay time elapses, the late voice packets are output next and the whole is reproduced as continuous audio.

In the audio data interpolation apparatus of claim 6, the switching control unit switches the input of the voice decoding circuit from the output of the playout buffer to the output of the talk spurt holding buffer on condition that a voice packet received late because of the spike delay has been stored in the playout buffer. After the voice decoding circuit has decoded all the voice packets of the talk spurt period held in the talk spurt holding buffer, the switching control unit switches the input of the voice decoding circuit back to the output of the playout buffer, and the voice decoding circuit decodes the voice packets stored in the playout buffer.

Even when the delay time of a spike delay cannot be predicted, the voice packets of the talk spurt period held in the talk spurt holding buffer are output only once a late voice packet has been stored in the playout buffer. The late voice packets can therefore reliably be output afterwards, and the whole can be reproduced as continuous audio.

In the audio data interpolation method of claim 7, voice packets received from the network are decoded into voice data in the order in which the transmitting apparatus sent them, and audio is reproduced from the decoded voice data. A talk spurt period presumed to be a period of conversation is determined from a period in which voice data containing speech continues. From the start of each talk spurt period, the voice packets newly received during that period, or the voice data decoded from them, are temporarily stored in sequence in a talk spurt holding buffer, which is reset at the start of every talk spurt period so that the temporary storage is repeated. When a spike delay of voice packets occurs during a talk spurt period, audio is reproduced from the voice packets or voice data of the talk spurt period held in the talk spurt holding buffer, and then, without a break, from the voice data of the voice packets received late because of the spike delay.

When a spike delay of voice packets occurs during a talk spurt period, the talk spurt holding buffer holds the voice packets of the talk spurt period up to the point at which the spike delay occurred, or the voice data decoded from them. If audio is reproduced from the held voice packets or voice data immediately before audio is reproduced from the voice data of the late voice packets, the two are played back to back.

Because a talk spurt period is presumed to be a unit of conversation in which voiced speech continues, even if a spike delay splits the reproduction of a talk spurt period, the audio of the talk spurt period before and after the split is joined, and audio is reproduced continuously over a stretch of conversation at least longer than a single word.

The audio data interpolation method of claim 8 is characterized in that the voice packets received after the spike delay are reproduced in transmission order with speech rate conversion applied, thereby eliminating the spike delay.

Speech rate conversion absorbs the delay by shortening the silent periods in the call or by raising the playback speed of the audio to an unobtrusive degree, so the call continues without the parties noticing and the lag caused by the spike delay is eliminated.
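Of the two techniques named, shortening silent periods is the easier to sketch: silent frames are dropped until the accumulated lag has been recovered. The function name `catch_up`, the frame-count measure of the backlog, and the use of an empty string for a silent frame are assumptions for illustration; raising the playback speed would need a time-scale modification algorithm that is not shown.

```python
def catch_up(frames, backlog_frames, is_silent):
    """Drop silent frames until `backlog_frames` frames of lag have been recovered."""
    kept = []
    for frame in frames:
        if backlog_frames > 0 and is_silent(frame):
            backlog_frames -= 1            # one frame period of lag recovered
            continue
        kept.append(frame)
    return kept, backlog_frames


frames = ["a", "", "", "b", "", "c"]       # "" stands for a silent frame
kept, remaining = catch_up(frames, 2, lambda f: f == "")
print(kept, remaining)
```

Voiced frames are never dropped, so the words themselves are untouched; only the pauses between them shrink until the lag is gone.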

In the audio data interpolation method of claim 9, voice packets of the non-speech periods outside the talk spurt periods, or the voice data decoded from them, are temporarily stored. After a spike delay is detected, ambient noise reproduced from the voice packets or voice data of the non-speech period is played before audio is reproduced from the stored voice packets or voice data of the talk spurt period.

When talk spurt periods are determined, the voice packets of the non-speech periods between them, or the voice data decoded from those packets, can be temporarily stored. Because these arise from pauses for breath and gaps between utterances during the call, they contain the noise around the speaker. The speaker's ambient noise can therefore be reproduced from these voice packets or voice data.
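Storing non-speech frames and playing them into the gap can be sketched with a bounded queue. The class name `AmbientNoiseStore`, the `max_frames` limit, and the choice to loop the stored frames when the gap is longer than the stored noise are all assumptions; the text says only that the stored non-speech audio is reproduced.

```python
from collections import deque
from itertools import cycle, islice


class AmbientNoiseStore:
    """Keeps the most recent non-speech frames for use as gap filler."""

    def __init__(self, max_frames=50):
        self.frames = deque(maxlen=max_frames)   # oldest frames fall out automatically

    def add(self, frame):
        self.frames.append(frame)

    def fill(self, n):
        """Return n frames of stored noise, looping over the store if it is shorter."""
        if not self.frames:
            return []
        return list(islice(cycle(self.frames), n))


store = AmbientNoiseStore()
store.add("noise-1")
store.add("noise-2")
print(store.fill(5))
```

The filler is real background sound from the far end rather than synthesized silence, which is why it can pass for a pause in speech.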

According to the inventions of claims 1, 4, and 7, even if a spike delay splits the audio of a talk spurt period, the audio of the talk spurt period up to the onset of the delay and the audio of the talk spurt period reproduced from the late voice packets are played back to back, in units of conversation delimited by silent portions. Therefore, even when a spike delay occurs, playback proceeds one conversational phrase at a time and is at least never broken off in the middle of a word, so the content of the conversation is conveyed accurately.

The audio of the talk spurt period up to the spike delay is played once, interrupted by the spike delay, and then played again from the start of the talk spurt period when it is output from the talk spurt holding buffer ahead of the audio of the late voice packets. Audio repeated within a spike delay of 400 msec to 1 sec sounds as though the speaker restated the words during the conversation. Restating occurs frequently in everyday conversation and the human ear is accustomed to it, so the repetition is perceived as natural conversation and the fact that a spike delay caused a dropout is concealed.
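The restating effect can be reproduced with a toy simulation of the Shinjuku example. Everything here is a simplification labelled as such: each frame is a syllable string, an empty string stands for a silent frame, `None` stands for a frame period in which no packet is available, and voiced detection is reduced to "non-empty". The function name `play_with_replay` is hypothetical, and the replay is triggered by the arrival of the delayed frames.

```python
def play_with_replay(events):
    """Simulate playback; on recovery from a gap inside a talk spurt, replay the held frames."""
    output, held = [], []
    in_spurt, gap = False, False
    for frame in events:
        if frame is None:                  # no packet to decode: spike delay in progress
            if not gap:
                output.append("--")        # mark the dropout once
            gap = True
            continue
        voiced = frame != ""
        if gap and in_spurt:
            output.extend(held)            # holding buffer is played first
        gap = False
        if voiced and not in_spurt:
            held = []                      # new talk spurt: reset the holding buffer
        if voiced:
            held.append(frame)
        in_spurt = voiced
        output.append(frame if voiced else " ")
    return "".join(output)


events = ["a", "shi", "ta", "", "shin", None, None, "ju", "ku", "e", "ki"]
result = play_with_replay(events)
print(result)
```

The listener hears "shin", a pause, and then "shinjukueki" in full, so the place name arrives intact instead of being split across the gap.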

According to the inventions of claims 2 and 5, the talk spurt voice packets or voice data held temporarily in the talk spurt holding buffer are played back as speech during the delay time caused by the spike delay. The repeated playback therefore adds no delay to playback, and the time for which sound is interrupted during the delay is shortened.

According to the inventions of claims 3 and 6, even when the voice packets delayed by the spike delay are not received, it never happens that only the speech of the talk spurt voice packets or voice data held in the talk spurt holding buffer is played repeatedly; speech that is continuous over the talk spurt period is reliably reproduced.

According to the invention of claim 8, when a spike delay occurs the listener simply hears what sounds like rephrasing during the conversation, and the delay is subsequently eliminated by speech speed conversion without the listener noticing. The disturbance is thus resolved in the course of natural conversation without the listener becoming aware of any impairment caused by the spike delay.

According to the invention of claim 9, whereas sound reproduced from an interpolation signal generated from the features of voiced voice packets, or from a silence signal, is artificial, ambient noise reproduced from silent-period voice packets or from the voice data decoded from them sounds as if the other party were pausing for breath or leaving a gap between utterances. Therefore, by reproducing ambient noise from silent-period voice packets or voice data during the interruption caused by a spike delay, before speech is played from the talk spurt holding buffer, the interruption can be covered without the listener noticing.

The audio data interpolation apparatus 1 and the audio data interpolation method according to the present invention are described below with reference to FIGS. 1 to 11. FIG. 1 is a block diagram of the audio data interpolation apparatus 1. Of the functions for two-way calls over an IP network (a communication network that communicates using the Internet Protocol), it mainly shows the configuration of the receiving function that reproduces the other party's voice.

The audio data interpolation apparatus 1 connects to an IP network 3, here the Internet, through a network connection unit 2 that has a LAN interface. A transmitting device (not shown) compresses speech with a voice codec encoder and groups the voice data for a certain talk time (for example, 10 msec to 80 msec) into one voice packet; the apparatus can receive these voice packets via the Internet 3.

The transmitting device attaches to each voice packet an IP header containing the IP address of the transmitting device (the source), the IP address of the audio data interpolation apparatus 1 (the destination), and a packet sequence number indicating the transmission order, and sends the packet out onto the Internet 3. Each router on the Internet 3 forwards the voice packet toward the network connection unit 2 based on the destination IP address.

A voice packet received by the network connection unit 2 has its IP address stripped and is then input to the playout buffer 4 connected to the output side of that unit. The playout buffer 4 is basically a FIFO: voice packets input first from the network connection unit 2 are output first. Packets are therefore normally output in packet sequence number order, but the buffer also has a function to sort packets into sequence number order, that is, the order in which the transmitting device sent them, when network conditions cause them to arrive out of order.
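As a non-limiting illustration, the reordering behavior of the playout buffer 4 might be sketched in Python as follows. The class name `PlayoutBuffer` and its methods are hypothetical and do not appear in the specification; the sketch assumes sequence numbers never wrap around and uses a min-heap rather than a literal FIFO with a sort step.

```python
import heapq


class PlayoutBuffer:
    """Hypothetical sketch: releases voice packets in sequence-number order."""

    def __init__(self):
        # Min-heap of (sequence_number, payload); smallest number is the head.
        self._heap = []

    def push(self, seq, payload):
        """Store a received packet, whatever order it arrived in."""
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Return the head packet as (seq, payload), or None if the buffer is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)
```

Packets pushed as 0, 2, 1, 3 would be popped as 0, 1, 2, 3, and an empty buffer returns `None`, which corresponds to the condition the interpolation control unit 8 treats as a delay.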

The storage capacity of the playout buffer 4, that is, the number of voice packets held temporarily, is preferably large enough to absorb the network's jitter delay when that delay stays within a short time. However, trying to hold every voice packet up to the maximum jitter delay makes the number of held packets large and increases the overall playback delay. A playout buffer may therefore be used that makes the number of held packets track changes in the jitter delay and discards late-arriving voice packets at a fixed rate. Such a playout buffer is described in "Low-Delay Real-Time Voice Communication System Using Load-Adaptive Control in a LAN Environment", published in the Transactions of the Information Processing Society of Japan, July 1999.

The audio codec decoder 5 is connected to the output side of the playout buffer 4. When the voice data stored in the D/A output buffer 7, described later, is about to run out, the audio codec decoder 5 reads the voice packet at the head of the playout buffer 4 and decodes it into voice data covering the actual speech time compressed into that packet. As long as the playout buffer 4 is not empty, one voice packet is read from it each time the voice data in the D/A output buffer 7 is about to run out, and the audio codec decoder 5 outputs the result as voice data continuous with the voice data decoded from the previously read packet.

Also connected to the playout buffer 4 is the packet loss detection unit 6, which monitors the packet sequence number of the voice packet at the head of the playout buffer 4. When the packet sequence number of the voice packet last read by the audio codec decoder 5 and the sequence number of the packet newly at the head are not consecutive in transmission order, the packet loss detection unit 6 identifies the lost voice packets from the missing sequence numbers and notifies the interpolation control unit 8.
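The gap check performed by the packet loss detection unit 6 can be illustrated with a short Python sketch. The function name `missing_sequence_numbers` is hypothetical, and the sketch assumes plain increasing integers without wraparound.

```python
def missing_sequence_numbers(last_read_seq, head_seq):
    """Hypothetical sketch: list the sequence numbers skipped between the
    packet last read by the decoder and the packet now at the buffer head."""
    return list(range(last_read_seq + 1, head_seq))
```

For consecutive packets the list is empty; if packet 3 was read last and packet 6 is now at the head, packets 4 and 5 would be reported as lost.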

The output of the audio codec decoder 5 is connected to the interpolation waveform generation unit 9 and to the first switching terminal 10A of the first changeover switch 10. From the features of the voice data output by the audio codec decoder 5, the interpolation waveform generation unit 9 generates an interpolation waveform that continues from, and resembles, the speech waveform reproduced from that voice data, in order to cover momentary interruptions of the reproduced speech caused by short delays of up to about 20 msec to 40 msec. For example, it detects the pitch of the speech waveform immediately before the gap and generates an interpolation waveform with the same pitch whose phase is also matched so that it continues smoothly from that waveform. As described later, when the waveform that follows the interpolation waveform is already known, the unit also matches the pitch and phase of that following waveform, so that the interpolation waveform joins smoothly to the waveform immediately after it as well. The output of the interpolation waveform generation unit 9 is connected to the second switching terminal 10B of the first changeover switch 10, which is switched under the control of the interpolation control unit 8; to output the voice data of the interpolation waveform, the common output terminal 10C of the first changeover switch 10 is switched to the second switching terminal 10B.

The common output terminal 10C of the first changeover switch 10, from which voice data is output, is connected to the first switching terminal 11A of the second changeover switch 11, to the voice determination unit 12, to the talk spurt holding buffer 13, and, via the input switch 14, to the ambient noise holding buffer 15.

The voice determination unit 12 expresses the characteristics of voice data that contains speech as a numerical feature value and compares it with a predetermined threshold to determine whether the voice data output from the audio codec decoder 5 contains voiced speech. Here it is assumed that the signal level is higher when speech is present than when it is absent. As shown in FIG. 2, the signal level of the voice data is computed every 5 msec as the mean square value over that interval and taken as the signal power, which is compared with a threshold level. Voice data at or above the threshold level is classified as voiced, and voice data below the threshold level as silent, containing only ambient noise.
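The per-frame power measurement and threshold comparison described for the voice determination unit 12 might look like the following Python sketch. The function names and the 8000 Hz sampling rate are assumptions made for illustration; the specification states only the 5 msec interval and the mean-square definition of signal power.

```python
def frame_powers(samples, sample_rate=8000, frame_ms=5):
    """Hypothetical sketch: mean-square power of each complete frame.

    samples is a sequence of PCM sample values; sample_rate is an assumed value.
    """
    frame_len = sample_rate * frame_ms // 1000
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


def is_voiced(power, threshold):
    """Voiced when the power is at or above the threshold level."""
    return power >= threshold
```

With these assumptions one frame is 40 samples, so 40 zero samples followed by 40 samples of amplitude 100 give powers of 0.0 and 10000.0.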

Because the signal power changes as the ambient noise level rises and falls, the threshold level is made to track the signal power with which it is compared. Here, as shown in FIG. 3, the threshold level is the minimum of the signal power values obtained every 5 msec plus a constant α. When the minimum signal power has not been updated for, say, 30 msec (t2 in the figure, 30 msec after t1), the minimum is raised so that the threshold level follows the signal power and voiced speech can still be detected when the ambient noise level increases.
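One possible reading of this tracking rule is sketched below in Python. The class name `AdaptiveThreshold` is hypothetical, and the way the minimum is raised (resetting it to the smallest power seen in the last 30 msec) is an assumption; the specification says only that the minimum is raised when it has not been updated for about 30 msec.

```python
from collections import deque


class AdaptiveThreshold:
    """Hypothetical sketch: threshold = tracked minimum power + alpha."""

    def __init__(self, alpha, hold_frames=6):
        self.alpha = alpha
        self.hold_frames = hold_frames  # 30 msec / 5 msec per frame
        self.minimum = None
        self.frames_since_update = 0
        self.recent = deque(maxlen=hold_frames)  # powers of the last 30 msec

    def update(self, power):
        """Feed one 5 msec power value and return the current threshold."""
        self.recent.append(power)
        if self.minimum is None or power < self.minimum:
            self.minimum = power
            self.frames_since_update = 0
        else:
            self.frames_since_update += 1
            if self.frames_since_update >= self.hold_frames:
                # Assumption: raise the minimum to the lowest recent power.
                self.minimum = min(self.recent)
                self.frames_since_update = 0
        return self.minimum + self.alpha
```

If the noise floor jumps from a power of 5 to 50 and stays there, the threshold in this sketch remains at 5 + α for six frames and then moves to 50 + α.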

The determination result of the voice determination unit 12 is output to the talk spurt detection unit 20. As shown in FIG. 4, the talk spurt detection unit 20 treats a period in which voice data judged to be voiced continues as a talk spurt period. However, even during voiced speech, momentary silences (T_S) occur between phonemes, so silences no longer than the maximum silence period (MaxT_S) within a run of voiced voice data are ignored and counted as part of the talk spurt period. The maximum silence period (MaxT_S) is 100 msec by default, but the appropriate value depends on the speaker's manner of speech, so the operation of the talk spurt detection unit 20 is monitored for a certain time and, for example, if no talk spurt start is detected after several seconds, the maximum silence period (MaxT_S) is shortened.

Thus, even if silent voice data begins to continue during a talk spurt period, the talk spurt period is not judged to have ended until the maximum silence period (MaxT_S) has elapsed; it is judged to have ended only when silent voice data continues beyond the maximum silence period (MaxT_S).

Conversely, any period other than a talk spurt period, that is, a period in which silent voice data continues, is a silent period. If a brief voiced interval occurs within it, for example because of noise, the interval is not ignored but is judged to be the start of a talk spurt period. This causes no problem: even if noise triggers storage of the repeated talk spurt voice data described later, the stored noise data is reset when a talk spurt period starts with actual speech. It also allows the start of a talk spurt period caused by actual speech to be detected promptly, without first deciding whether the sound is noise.
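The start and end rules for a talk spurt period can be summarized as a small state machine. The Python sketch below is illustrative only: the class name `TalkSpurtDetector` and its interface are hypothetical, it operates on one voiced/silent decision per 5 msec frame, and it omits the adaptive shortening of MaxT_S.

```python
class TalkSpurtDetector:
    """Hypothetical sketch of the talk spurt start/end rules."""

    def __init__(self, max_silence_ms=100, frame_ms=5):
        self.max_silence_frames = max_silence_ms // frame_ms
        self.in_spurt = False
        self.silent_run = 0  # consecutive silent frames inside a talk spurt

    def update(self, voiced):
        """Feed one frame decision; return (in_spurt, started)."""
        started = False
        if voiced:
            if not self.in_spurt:
                # Any voiced frame in a silent period starts a talk spurt at once.
                self.in_spurt = True
                started = True
            self.silent_run = 0
        elif self.in_spurt:
            self.silent_run += 1
            if self.silent_run > self.max_silence_frames:
                # Silence has continued beyond MaxT_S: the talk spurt ends.
                self.in_spurt = False
                self.silent_run = 0
        return self.in_spurt, started
```

The `started` flag plays the role of the talk spurt start signal, and `in_spurt` that of the talk spurt period signal.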

Throughout a talk spurt period, the talk spurt detection unit 20 outputs a talk spurt period signal to the interpolation control unit 8 and the input switch 14. When it detects the transition from a silent period to a talk spurt period, it also outputs a talk spurt start signal to the talk spurt holding buffer 13.

Because the talk spurt holding buffer 13 is connected to the common output terminal 10C of the first changeover switch 10, from which voice data is output, it continuously stores the voice data being reproduced as speech. The voice data stored in the talk spurt holding buffer 13 is erased entirely each time a talk spurt period starts, with the talk spurt start signal from the talk spurt detection unit 20 acting as a reset signal. At any given moment, therefore, the talk spurt holding buffer 13 holds, in order from its first address, the voice data corresponding to the speech from the start of the talk spurt period currently being reproduced (hereinafter, the repeated talk spurt voice data). The output of the talk spurt holding buffer 13 is connected to the second switching terminal 11B of the second changeover switch 11, which the interpolation control unit 8 switches between this terminal and the first switching terminal 11A. At the timing described later, under the control of the interpolation control unit 8, the repeated talk spurt voice data stored in the talk spurt holding buffer 13 is output to the common output terminal 11C.
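The reset-on-start behavior of the talk spurt holding buffer 13 can be illustrated with the following Python sketch. The class name `TalkSpurtHoldBuffer` and its method names are hypothetical; frames are treated as opaque objects.

```python
class TalkSpurtHoldBuffer:
    """Hypothetical sketch: holds the voice data of the current talk spurt."""

    def __init__(self):
        self._frames = []

    def on_talk_spurt_start(self):
        """Talk spurt start signal: erase everything stored so far."""
        self._frames.clear()

    def store(self, frame):
        """Append the frame currently being reproduced."""
        self._frames.append(frame)

    def replay(self):
        """Return the stored frames from the start of the current talk spurt."""
        return list(self._frames)
```

Anything stored before the start signal, such as noise frames, is discarded, so a replay always begins at the start of the talk spurt being reproduced.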

While no talk spurt period signal is input from the talk spurt detection unit 20, that is, during a silent period, the input switch 14 connects the common output terminal 10C of the first changeover switch 10, from which voice data is output, to the ambient noise holding buffer 15. The ambient noise holding buffer 15 therefore receives, out of the voice data reproduced as speech, the silent voice data from the gaps in the conversation. The ambient noise holding buffer 15 is a ring buffer: newly input silent voice data overwrites the oldest stored voice data. The ambient noise heard when conversation pauses carries no linguistic information and has a complex waveform, so it is relatively difficult to generate a waveform that resembles natural noise, and also relatively difficult to generate an interpolation waveform that can be played for a long time. If a generated interpolation waveform were used for an interruption of one second or more, such as one caused by a spike delay, a short-period interpolation waveform would be played over and over and would be heard as a harsh, buzzer-like noise.

This embodiment exploits the fact that, in the process of detecting talk spurt periods, the silent periods outside them, which consist only of ambient noise, are also easily detected. By temporarily storing silent-period voice data in the ambient noise holding buffer 15, ambient noise can be reproduced to fill long interruptions without generating any synthetic waveform, concealing the interruption as if the other party had simply paused.
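A ring buffer of silent-period frames that can be looped to fill an interruption might be sketched in Python as follows. The class name `AmbientNoiseBuffer`, its capacity parameter and the `fill` method are hypothetical; the specification does not state the buffer size.

```python
from collections import deque
from itertools import cycle, islice


class AmbientNoiseBuffer:
    """Hypothetical sketch: ring buffer of recent silent-period frames."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest frame when a new one is stored.
        self._ring = deque(maxlen=capacity)

    def store(self, frame):
        """Store one silent frame (called only while no talk spurt is active)."""
        self._ring.append(frame)

    def fill(self, n_frames):
        """Return n_frames of stored noise, repeating the ring as needed."""
        if not self._ring:
            return []
        return list(islice(cycle(list(self._ring)), n_frames))
```

With a capacity of three frames, storing four frames keeps the newest three, and a request for five frames repeats them from the oldest retained frame.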

The silent-period voice data stored in the ambient noise holding buffer 15 is output at the timing described later under the control of the interpolation control unit 8. The output voice data is multiplied by a predetermined gain or attenuation factor supplied by the interpolation control unit 8 and then added to the voice data output from the common output terminal 11C of the second changeover switch 11.

Between the adder 16, which adds the outputs of the common output terminal 11C and the ambient noise holding buffer 15, and the D/A output buffer 7 on its output side, there is a speech speed conversion switch 18 that selects whether the two are connected directly or via the speech speed acceleration unit 17. The speech speed conversion switch 18 is also controlled by the interpolation control unit 8: it normally selects the direct connection, and it is switched to route through the speech speed acceleration unit 17 during the periods, described later, in which accumulated delay must be eliminated.
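The gain multiplication and the addition performed at the adder 16 can be illustrated sample by sample. In the Python sketch below, the function name `mix_with_noise` is hypothetical, and clipping to the 16-bit PCM range is an assumption not stated in the specification.

```python
def mix_with_noise(main, noise, gain):
    """Hypothetical sketch: scale noise samples by gain and add them to main.

    Both inputs are sequences of integer PCM samples; the result is clipped
    to the signed 16-bit range (an assumed sample format).
    """
    mixed = []
    for m, n in zip(main, noise):
        s = m + int(round(n * gain))
        mixed.append(max(-32768, min(32767, s)))
    return mixed
```

A gain below 1 attenuates the noise and a gain above 1 amplifies it; when the main input is all zeros, the output is simply the scaled noise.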

The D/A output buffer 7 is a FIFO, and the D/A converter 19 connected to its output side continuously reads the voice data held in the D/A output buffer 7 in input order. The D/A converter 19 converts the voice data to analog at a fixed sampling rate, and the result is reproduced as sound from a speaker or the like (not shown).

The interpolation control unit 8 controls the operation of the circuit units described above while monitoring how voice packets are stored in the playout buffer 4. The operation of the audio data interpolation apparatus 1, which differs according to the delay occurring on the Internet 3, is described below with a focus on the behavior of the interpolation control unit 8. Each example assumes that the other party says "Shinjuku-eki de, matte" ("At Shinjuku Station, wait") and that this utterance is sent from the other party's transmitting device to the audio data interpolation apparatus 1 via the Internet 3 as voice packets 0 to 10. Specifically, the packets encode the following: packet 0, silent voice data; packets 1 to 3, voice data containing the voiced speech "shin"; packets 4 to 6, voice data containing the voiced speech "juku-eki de"; packets 7 and 8, silent voice data for the pause that separates "Shinjuku-eki de" from "matte"; and packets 9 and 10, voice data containing the voiced speech "matte".

(1) First example (no delay; see FIG. 5)
First, suppose the voice packets sent by the transmitting device are received normally, without delay on the Internet 3. The voice packets received through the network connection unit 2 are stored in the playout buffer 4 in the order in which they were sent, that is, from packet 0 to packet 10, and are decoded in turn into voice data by the audio codec decoder 5.

Normally, while voice packets are received without delay, the common output terminal 10C of the first changeover switch 10 and the common output terminal 11C of the second changeover switch 11 are switched to the first switching terminals 10A and 11A respectively, and the speech speed conversion switch 18 is set to connect the adder 16 directly to the D/A output buffer 7. The decoded voice data is therefore stored as-is in the D/A output buffer 7, converted by the D/A converter 19, and reproduced with the speech waveform shown in FIG. 5. In what follows, the output path by which voice packets stored in the playout buffer 4 are reproduced as normal speech is the same as in this first example, so unless stated otherwise the positions of the first changeover switch 10, the second changeover switch 11 and the speech speed conversion switch 18 are taken to be the same as in the first example and their description is omitted.

At this time, the voice data decoded by the audio codec decoder 5 and output from the common output terminal 10C of the first changeover switch 10 is also input to the voice determination unit 12, which determines whether it is voiced and outputs the result to the talk spurt detection unit 20. The voice data decoded from packets 0 and 7 is silent, so the interval in which it is output to the D/A output buffer 7 and reproduced is a silent period, and the talk spurt detection unit 20 outputs no talk spurt period signal. The input switch 14 therefore closes, and the voice data decoded from packets 0 and 7, output from the common output terminal 10C, is stored in the ambient noise holding buffer 15. In this example, however, no relatively long delay occurs, so the voice data stored in the ambient noise holding buffer 15 is never output.

By contrast, the voice data decoded from packets 1 to 6 is voiced, so while it is output to the D/A output buffer 7 and reproduced, the talk spurt detection unit 20 outputs the talk spurt period signal to the interpolation control unit 8. When the transition from a silent period to a talk spurt period is detected, that is, when the voice data decoded from packet 1 is output, a talk spurt start signal is output to the talk spurt holding buffer 13. The talk spurt holding buffer 13 is reset on receiving the talk spurt start signal and begins storing the voice data newly output from the common output terminal 10C, so the voice data reproduced as "Shinjuku-eki de" is stored in order from its first address. In this example, however, no relatively long spike delay occurs, so this voice data is never output and is reset when the voice data of packet 9, in the next talk spurt period, is output.

(2) Second example (delay during a silent period; see FIG. 6)
This example assumes that packet 8 and the subsequent voice packets are delayed, so that a delay occurs between packets 7 and 8, within the silent period. Voice packet delay is detected by the interpolation control unit 8, which monitors how voice packets are stored in the playout buffer 4: when the voice data in the D/A output buffer 7 is about to run out and the packet at the head of the playout buffer 4 is to be read, but no packet is stored there to read, a delay is detected.

Packets 0 to 7 are assumed to have been reproduced without delay, as in the first example. When the voice data decoded from packet 7 is output from the D/A output buffer 7, packet 8 has been delayed and is not yet stored in the playout buffer 4. On detecting a delay, the interpolation control unit 8 determines whether it occurred during a talk spurt period or a silent period according to whether the talk spurt period signal is being input from the talk spurt detection unit 20 at the moment of detection. In this example the voice data output from the common output terminal 10C just before the delay is the silent voice data decoded from packet 7, so no talk spurt period signal is output and the delay is judged to have occurred during a silent period.

On determining that the delay occurred during a silent period, the interpolation control unit 8 outputs the silent voice data stored so far in the ambient noise holding buffer 15 from earlier silent periods (for example, the voice data decoded from packets 0 and 7 of the first example) to the adder 16, and outputs it to the D/A output buffer 7 so that it follows on from the voice data decoded from packet 7. At this time the speech speed conversion switch 18 is set to connect the adder 16 directly to the D/A output buffer 7. Because the ambient noise holding buffer 15 is a ring buffer, this output is repeated until the late packet 8 is stored in the playout buffer 4. Ambient noise is thus reproduced repeatedly for as long as the delay would otherwise cause an interruption. Moreover, the ambient noise is inserted in the silent period between the utterances "Shinjuku-eki de" and "matte", so the listener hears it as an ordinary pause in the conversation and does not notice even a relatively long delay.

Packet 8 and the subsequent voice packets, received late and stored in the playout buffer 4, are immediately decoded into voice data by the audio codec decoder 5 in packet sequence number order and output to the D/A output buffer 7. By this point playback is lagging by the time spent reproducing ambient noise, so the speech speed conversion switch 18 is switched to route through the speech speed acceleration unit 17, and the voice data of the delayed packet 8 onward reaches the D/A output buffer 7 via the speech speed acceleration unit 17. The speech speed acceleration unit 17 converts the input voice data so that it is played back faster, but not noticeably so: for example, the voice data decoded from packet 8, which is silent, is deleted, and the voiced voice data decoded from packet 9 onward is converted into voice data with a compressed playback time. The speech speed conversion switch 18 stays on the side that routes through the speech speed acceleration unit 17 until the packet sequence numbers of the voice packets newly stored in the playout buffer 4 show that the playback delay has been eliminated.
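The silent-frame deletion performed by the speech speed acceleration unit 17 can be illustrated as below. This Python sketch covers only that half of the processing; the time compression of voiced data is omitted, and the function name `drop_silence_to_catch_up` and the notion of a backlog counted in frames are assumptions made for illustration.

```python
def drop_silence_to_catch_up(frames, backlog_frames):
    """Hypothetical sketch: delete silent frames until the backlog is cleared.

    frames is a list of (payload, voiced) pairs in playback order.
    Returns the frames to play and the backlog that still remains.
    """
    out = []
    remaining = backlog_frames
    for payload, voiced in frames:
        if not voiced and remaining > 0:
            remaining -= 1  # this silent frame is skipped to recover one frame of delay
            continue
        out.append((payload, voiced))
    return out, remaining
```

For the second example, with packet 8 silent and packets 9 and 10 voiced, a backlog of two frames would remove packet 8 and leave one frame of delay to be recovered by compressing the voiced data.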

(3) Third example (a short delay occurs during a talk spurt period; see FIG. 7)
Suppose that packet 4 and the following voice packets are delayed, so that a short delay of up to 40 msec occurs between packet 3 and packet 4, within a talk spurt period. Packets 0 to 3 are reproduced without delay, as in the first example. When the audio data decoded from packet 3 is output from the D/A output buffer 7, however, packet 4 has not yet arrived and is therefore not stored in the playout buffer 4. The audio data output from the common output terminal 10C immediately before the delay is the voiced audio data decoded from packet 3, so the talk spurt period signal is being output. Because that signal is present when the delay is detected, the interpolation control unit 8 determines that the delay occurred during a talk spurt period.

On determining that the delay occurred during a talk spurt period, the interpolation control unit 8 first switches the common terminal 10C of the first changeover switch 10 from the first changeover terminal 10A to the second changeover terminal 10B, so that the audio data of the interpolated waveform generated by the interpolated waveform generation unit 9 is output. The interpolated-waveform audio data appearing at the common terminal 10C goes directly to the D/A output buffer 7 and is reproduced as the interpolated sound.

As described above, the interpolated waveform is generated so that it continues the pitch and phase of the audio data output immediately beforehand from the audio codec decoder 5, that is, the audio data decoded from packet 3, and resembles that waveform. There is a limit to how long interpolation can go on before the listener notices, so here the maximum waveform interpolable time (WCT) is set to 40 msec, and the interpolated-waveform audio data is reproduced for no longer than WCT.

If the delayed packet 4 is received while the interpolated waveform is being reproduced, that is, within the maximum waveform interpolable time (WCT) after the interpolation control unit 8 detected the delay, and a predetermined number of voice packets starting with packet 4 are stored in the playout buffer 4 in packet sequence number order, the interpolation control unit 8 judges that a short delay has ended. In that case it does not immediately read the late voice packet 4 out to the audio codec decoder 5. Instead, it deletes from the playout buffer 4 the voice packets that should have been reproduced while the interpolated waveform was playing (packet 4 in this example) and resumes reading from the next voice packet, packet 5. There are two reasons for this: if playback resumed with the late packet itself, sound similar to the interpolated waveform would be repeated and the speech would sound drawn out, and playback would lag by the time spent reproducing the interpolated waveform.

Accordingly, packet 5, the next packet in packet sequence number order after the deletion, is read out to the audio codec decoder 5 and decoded into audio data. Since this audio data must follow on from the interpolated waveform that the interpolated waveform generation unit 9 is generating and that is currently being reproduced, the trailing part of the interpolated waveform is shaped to approximate the waveform of packet 5, with continuous pitch and phase. As a result, the interpolated-waveform audio data reproduces a phoneme close to "ju", derived from the waveforms of the neighbouring packets 3 and 5.

The audio data decoded from packet 5 onward is passed on by switching the common output terminal 10C of the first changeover switch 10 back to the first changeover terminal 10A, held temporarily in the D/A output buffer 7 in packet sequence number order, and reproduced as sound. Because packet 4 has been deleted, there is no playback lag, so the audio is not routed through the speech speed acceleration unit 17. Thus, even if a delay occurs during a talk spurt period, as long as it is short, speech almost identical to the original utterance "shinjuku eki de" is reproduced, as shown in FIG. 7.
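
The rule for resuming after a short delay (discard the packets whose playback slot was already covered by the interpolated waveform, then continue with the next one) can be sketched as below. The function name `resume_sequence_number`, the use of ceiling division, and the per-packet duration parameter `packet_ms` are assumptions for illustration; the specification does not state the packet duration.

```python
def resume_sequence_number(first_late_seq, concealed_ms, packet_ms):
    """Return the sequence number at which decoding resumes after a short delay."""
    if packet_ms <= 0:
        raise ValueError("packet_ms must be positive")
    # Number of packets whose slot was filled by the interpolated waveform
    # (ceiling division, so a partly covered packet is also skipped).
    skipped = -(-concealed_ms // packet_ms)
    return first_late_seq + skipped
```

If, purely as an assumption, one packet spans 40 ms and 40 ms were concealed starting at packet 4, decoding resumes at packet 5, which matches the third example.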

(4) Fourth example (a long delay occurs during a talk spurt period, and the late voice packets are received within the expected spike-delay time (SPT); see FIG. 8)
Suppose that packet 4 and the following voice packets are delayed, so that a long delay of 40 msec or more, that is, at least the maximum waveform interpolable time (WCT), occurs between packet 3 and packet 4, within a talk spurt period. As in the third example, from detection of the delay until WCT has elapsed, the interpolated-waveform audio data generated by the interpolated waveform generation unit 9 is output and reproduced.

If the late voice packets have still not arrived when WCT is exceeded, the following delay processing sequence is executed. As shown in FIG. 8, the sequence is timed so that, as far as possible, the whole of it, including WCT, completes within the expected spike-delay time (SPT); if the repeated talk spurt playback time is longer than anticipated, however, the sequence may slightly exceed SPT. Because a spike delay can be accompanied by packet loss, SPT is defined as the time needed to request retransmission of the lost voice packets and to have all the delayed voice packets stored in the playout buffer 4 in packet sequence number order; here it is set to 1 second. The repeated talk spurt playback time included in the sequence is fixed, at least while the sequence is running, by the amount of audio data stored in the talk spurt holding buffer 13, so the duration of the sequence is adjusted through the ambient noise interpolation time, that is, the time for which ambient noise is reproduced.

Once WCT has elapsed after the delay was detected, the interpolated waveform generation unit 9 generates new interpolated-waveform audio data in which the interpolated waveform decays gradually. At the same time, the silent audio data stored in the ambient noise holding buffer 15 from silent periods is output after a predetermined addition process that raises the ambient noise waveform gradually from 0. The adder 16 sums the two. As shown in FIG. 8, the output of the adder 16 thus consists solely of interpolated-waveform audio data at the moment WCT elapses and solely of silent audio data once the feed-out time has elapsed.

Raising the ambient noise gradually from 0 compensates for the gradually attenuated interpolated waveform, and for the attenuation of the ambient noise component that the interpolated waveform itself contains. This avoids abrupt gain changes and softens the unnatural impression at the switch from the interpolated waveform to the ambient noise waveform.
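
The hand-over from interpolated waveform to ambient noise during the feed-out time amounts to a cross-fade summed in the adder 16. A minimal sketch follows; the linear, complementary gain ramp and the function name `cross_fade` are assumptions, since the specification only says that one signal decays gradually while the other rises gradually from 0.

```python
def cross_fade(interpolated, noise):
    """Mix two equal-length sample lists, fading from the first to the second."""
    n = len(interpolated)
    if n != len(noise):
        raise ValueError("both inputs must cover the same feed-out interval")
    mixed = []
    for i in range(n):
        # Noise gain rises from 0 to 1; interpolated-waveform gain is its complement.
        g = i / (n - 1) if n > 1 else 1.0
        mixed.append(interpolated[i] * (1.0 - g) + noise[i] * g)
    return mixed
```

Because the two gains always sum to 1 in this sketch, a component common to both inputs (such as background noise) keeps a constant level across the transition, which is the effect the passage describes as avoiding abrupt gain changes.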

After the feed-out time has elapsed, the ambient noise audio data stored in the ambient noise holding buffer 15, which is a ring buffer, is output repeatedly and ambient noise is reproduced. As noted above, this continues for the time that remains when WCT, the feed-out time and the repeated talk spurt playback time are subtracted from the expected spike-delay time (SPT).
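
The time budget for ambient noise playback follows directly from that subtraction. In the sketch below only SPT = 1000 ms and WCT = 40 ms come from the specification; the function name `ambient_noise_time_ms` and the feed-out and repeated talk spurt durations used in the usage note are illustrative assumptions.

```python
def ambient_noise_time_ms(spt_ms, wct_ms, feed_out_ms, repeat_ms):
    """Time left for looping ambient noise inside the delay processing sequence."""
    remaining = spt_ms - wct_ms - feed_out_ms - repeat_ms
    # If the repeated talk spurt is longer than anticipated, nothing is left for
    # ambient noise and the whole sequence overruns SPT slightly.
    return max(remaining, 0)
```

For example, with an assumed 60 ms feed-out and a 300 ms repeated talk spurt, `ambient_noise_time_ms(1000, 40, 60, 300)` gives 600 ms of ambient noise.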

Next, under the control of the interpolation control unit 8, the common output terminal 11C of the second changeover switch 11 is switched from the first changeover terminal 11A to the second changeover terminal 11B, and the repeated talk spurt audio data stored in the talk spurt holding buffer 13 is output. The repeated talk spurt audio data output at this point is the audio data decoded from packets 1 to 3, which reproduces the phoneme "shin".

In this fourth example the delayed voice packets are assumed to arrive within SPT, so they are stored in the playout buffer 4 during the ambient noise playback time or the repeated talk spurt playback time that follows. If the packet loss detection unit 6 detects, from the packet sequence number of the late voice packet stored at the head of the playout buffer 4, that voice packets were lost along with the spike delay, the interpolation control unit 8 has the network connection unit 2 send the transmitting device a packet requesting retransmission of the lost packets. The spike delay is regarded as ended when the retransmitted voice packets have been received and stored in the playout buffer 4. In this example, as shown in FIG. 11, the first voice packet stored at the head of the playout buffer 4 after the spike delay is packet 7, so packets 4, 5 and 6 are treated as lost and their retransmission is requested. A retransmitted voice packet normally arrives about 100 msec after the request, so all the delayed voice packets are obtained within SPT.
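
The loss check on the head of the playout buffer can be sketched as a comparison of sequence numbers. The function names `lost_before_head` and `spike_delay_resolved` are hypothetical, and the sketch assumes sequence numbers that do not wrap around.

```python
def lost_before_head(last_played_seq, head_seq):
    """Sequence numbers missing between the last played packet and the buffer head."""
    return list(range(last_played_seq + 1, head_seq))


def spike_delay_resolved(last_played_seq, head_seq, buffered_seqs):
    """True once every packet found missing has been stored (e.g. after retransmission)."""
    return all(seq in buffered_seqs
               for seq in lost_before_head(last_played_seq, head_seq))
```

With packet 3 played last and packet 7 first to arrive, `lost_before_head(3, 7)` returns `[4, 5, 6]`, the packets for which retransmission is requested in the example of FIG. 11.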

Accordingly, immediately after SPT has elapsed, that is, immediately after the repeated talk spurt audio data has been reproduced as shown in FIG. 8, packet 4 and the following delayed voice packets are held in the playout buffer 4 in transmission order, and the audio data decoded from them in turn by the audio codec decoder 5 is output and reproduced as sound. In this example, "shin" is reproduced, then the interpolated waveform and ambient noise are reproduced during the spike delay, and then "shinjuku de" is reproduced. The listener hears this as if the other party had restated the phrase as "shin, shinjuku de". Even though a long delay occurs in the middle of the word "shinjuku", the meaning of the word is conveyed accurately, and the listener does not notice that a long delay occurred.

Because playback lags as a result of running the delay processing sequence during the spike delay, the audio data decoded from the late voice packets (packet 4 onward) is routed through the speech speed acceleration unit 17 by setting the speech speed conversion switch 18 accordingly. It is reproduced with speech speed conversion, as shown in FIG. 11, until the playback lag is judged to have been eliminated.

(5) Fifth example (a long delay occurs during a talk spurt period, and the late voice packets are received after the expected spike-delay time (SPT); see FIG. 9)
In the fourth example all the delayed voice packets were received within SPT. The fifth example is the case in which the delayed voice packets are not received within SPT.

Up to the point where the series of delay processing steps has run and the repeated talk spurt playback time has elapsed, processing is the same as in the fourth example, so that part is not described again. When SPT has elapsed, the interpolated waveform generation unit 9 generates interpolated-waveform audio data matched to the pitch and phase of the final sound of the repeated talk spurt audio data, as new interpolated-waveform audio data in which the waveform decays gradually. At the same time, the silent audio data stored in the ambient noise holding buffer 15 from silent periods is output after a predetermined addition process that raises the ambient noise waveform gradually from 0. The adder 16 sums the two, so that, as shown in FIG. 9, its output consists solely of interpolated-waveform audio data at the moment SPT elapses and solely of silent audio data once the feed-out time has elapsed. This is the same processing described in the fourth example for moving from interpolated-waveform audio data to ambient noise, and it makes the transition sound natural. After the feed-out time has elapsed, only the ambient noise audio data is output repeatedly, in the same way as in the fourth example, and ambient noise is reproduced.

In this fifth example, the repeated talk spurt is reproduced only on condition that a delayed voice packet has been received, so that the repeated talk spurt is not played again on its own while the delayed voice packets are still missing. The ambient noise interpolation time shown in FIG. 9 is therefore open-ended. When reception of a delayed voice packet is detected during this time, that is, when a voice packet is stored in the playout buffer 4, the common output terminal 11C of the second changeover switch 11 is switched from the first changeover terminal 11A to the second changeover terminal 11B, and the repeated talk spurt audio data stored in the talk spurt holding buffer 13 is output.

If packet loss is involved, the presence of a voice packet at the head of the playout buffer 4 at this point does not necessarily mean that all the post-delay voice packets have been received. Nevertheless, the arrival of any voice packet after the delay allows the spike delay to be presumed ended. By requesting retransmission of the lost packets as in the fourth example, all the delayed voice packets, including the lost ones, are therefore stored in the playout buffer 4 during the repeated talk spurt playback period.

To make more certain that the audio data of the delayed voice packets follows the repeated talk spurt audio data without a break, the repeated talk spurt audio data may instead be output only after retransmission of the lost packets has been requested and it has been detected that all the post-delay voice packets have been received.

In either case, the same processing as described in the fourth example is performed: the audio data of the delayed voice packets is output following on from the repeated talk spurt audio data, and the two are reproduced continuously. In this fifth example the listener hears the other party as having said "shin, shin -- shinjuku de", and even though a long delay occurs in the middle of the word "shinjuku", the meaning of the word is conveyed accurately.

(6) Sixth example (a long delay occurs during a talk spurt period, and the delayed voice packets are never received; see FIG. 10)
The fifth example is the case in which the delayed voice packets are received after SPT has elapsed. The sixth example is the case in which, because of a network failure or the like, the delayed voice packets never arrive, however long the apparatus waits while reproducing ambient noise.

As shown in FIG. 10, processing is the same as in the fourth and fifth examples from the series of delay processing steps described in the fourth example up to the reproduction of ambient noise audio data described in the fifth example, so that part is not described again.

The interpolation control unit 8 counts the time elapsed since the delay was detected as the off time. When the off time exceeds the allowable standby time, it switches the common terminal 10C of the first changeover switch 10 to the second changeover terminal 10B, outputs completely silent audio data generated by the interpolated waveform generation unit 9, and erases the audio data stored in the talk spurt holding buffer 13. The allowable standby time is set so that the caller can be made aware of a fault that will not clear easily, such as equipment failure, as distinct from an ordinary delay. Here it is set to 5 seconds, which is longer than at least the time taken to reproduce the repeated talk spurt and then begin reproducing ambient noise. Consequently, if the delayed voice packets are not received and the off time passes 5 seconds, the ambient noise gives way to silence, and the user can tell that a fault other than a delay has occurred.
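
The thresholds used across the second to sixth examples can be summarised as a small decision function. The values 40 ms (WCT) and 5 s (allowable standby time) are taken from the text; the function name `concealment_action`, the returned labels, and the reduction of the whole delay processing sequence to a single label are simplifying assumptions.

```python
WCT_MS = 40             # maximum waveform interpolable time
ALLOWED_WAIT_MS = 5000  # allowable standby time (off-time limit)


def concealment_action(in_talk_spurt, off_time_ms):
    """Pick the concealment mode for a delay that has lasted off_time_ms so far."""
    if not in_talk_spurt:
        # Delay during a silent period: loop ambient noise until packets return.
        return "ambient_noise_loop"
    if off_time_ms <= WCT_MS:
        # Short delay inside a talk spurt: play the interpolated waveform.
        return "waveform_interpolation"
    if off_time_ms <= ALLOWED_WAIT_MS:
        # Long delay: cross-fade, ambient noise, repeated talk spurt.
        return "delay_processing_sequence"
    # Off time exceeded: output silence and clear the talk spurt holding buffer.
    return "mute_and_clear"
```

A caller would evaluate this each time the off-time counter advances, switching the changeover switches whenever the returned label changes.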

In the fourth example above, the repeated talk spurt audio data was output and reproduced after the expected delay time (SPT) had elapsed, without waiting for the delayed voice packets to arrive. Alternatively, from the moment the delay is first detected, output and reproduction of the repeated talk spurt audio data may be made conditional on reception of the delayed voice packets, as described in the fifth example.

In the embodiment above, the talk spurt holding buffer 13 and the ambient noise holding buffer 15 each store audio data decoded from voice packets, but either or both may instead store the voice packets themselves, before decoding.

FIG. 12 is a block diagram of an audio data interpolation apparatus 30 according to another embodiment, in which voice packets are stored in the talk spurt holding buffer and the ambient noise holding buffer. Components that are identical to, or act in the same way as, those of the audio data interpolation apparatus 1 of the first embodiment carry the same reference numerals and are not described again.

In the audio data interpolation apparatus 30, the common terminal 31D of a third changeover switch 31, which is switched under the control of the interpolation control unit 8, is connected to the input of the audio codec decoder 5 as shown in the figure. Its first changeover terminal 31A, second changeover terminal 31B and third changeover terminal 31C are connected to the outputs of the ambient noise holding buffer 33, the talk spurt holding buffer 32 and the playout buffer 4, respectively, so that the voice packets output from each buffer can be fed selectively to the audio codec decoder 5.

The talk spurt holding buffer 32 is connected to the output of the playout buffer 4 and therefore continuously stores the voice packets output from the playout buffer 4 to the audio codec decoder 5. All the voice packets stored in the talk spurt holding buffer 32 are erased each time a talk spurt period begins, using the talk spurt start signal input from the talk spurt detection unit 20 as a reset signal. During a talk spurt period, the talk spurt holding buffer 32 therefore holds the voice packets output since the start of that talk spurt period (hereinafter, the repeated talk spurt voice packets), stored in order from the head address. At the same timing at which the repeated talk spurt audio data is output in the first embodiment, the common terminal 31D is switched to the second changeover terminal 31B, and the repeated talk spurt voice packets are decoded by the audio codec decoder 5 and output as repeated talk spurt audio data.
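
The reset-on-start behaviour of the talk spurt holding buffer can be modelled in a few lines. The class name `TalkSpurtHoldingBuffer` and its methods are hypothetical, and the talk spurt start signal is represented by a boolean flag passed with the packet.

```python
class TalkSpurtHoldingBuffer:
    """Hypothetical model: keeps the packets output since the current talk spurt began."""

    def __init__(self):
        self._packets = []

    def on_packet(self, packet, starts_talk_spurt=False):
        # The talk spurt start signal acts as a reset before the packet is stored.
        if starts_talk_spurt:
            self._packets.clear()
        self._packets.append(packet)

    def repeated_talk_spurt(self):
        # Packets to replay, in the order they were stored.
        return list(self._packets)
```

Feeding packets 0 to 3 with the start flag raised on packet 1 leaves packets 1 to 3 in the buffer, which corresponds to the "shin" segment replayed in the fourth example.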

Of the voice packets output from the playout buffer 4, those output during silent periods, when the conversation has paused, are input to the ambient noise holding buffer 33. The ambient noise holding buffer 33 is likewise a ring buffer and overwrites the oldest stored voice packet with each newly input silent-period voice packet. It corresponds to the ambient noise holding buffer 15 of the first embodiment: at the same timing at which the silent audio data stored in the ambient noise holding buffer 15 is output in the first embodiment, the common terminal 31D is switched to the first changeover terminal 31A, and the stored packets are decoded by the audio codec decoder 5 into silent audio data and reproduced as ambient noise.

The operation of the audio data interpolation apparatus 30 is the same as that described for the audio data interpolation apparatus 1, with the repeated talk spurt voice packets taking the place of the repeated talk spurt audio data and the silent-period voice packets taking the place of the silent audio data. In this other embodiment too, the late voice packets are output to the audio codec decoder 5 following on from the repeated talk spurt voice packets, and continuous speech is reproduced.

In each of the embodiments above the Internet was used as an example of an IP network, but any network in which routers forward voice packets on the basis of IP addresses, such as a LAN, may be used. The invention can also be applied to communication systems that use other telephones, such as mobile phones, as the communication device, provided the system places calls to the other party's transmitting device over an IP network.

The present invention is suited to telephony systems that carry calls over a network in which delays too long to be concealed by an interpolated waveform are expected.

FIG. 1 is a block diagram of the audio data interpolation apparatus 1.
FIG. 2 is an explanatory diagram of the method of distinguishing voiced from unvoiced audio.
FIG. 3 is a waveform diagram showing changes in the threshold level.
FIG. 4 is an explanatory diagram of how talk spurt periods and silent periods are determined.
FIG. 5 is a waveform diagram of the first example, in which no delay occurs.
FIG. 6 is a waveform diagram of the second example, in which a delay occurs during a silent period.
FIG. 7 is a waveform diagram of the third example, in which a short delay occurs during a talk spurt period.
FIG. 8 is a waveform diagram of the fourth example, in which a long delay occurs during a talk spurt period and the late voice packets are received within the expected spike-delay time (SPT).
FIG. 9 is a waveform diagram of the fifth example, in which a long delay occurs during a talk spurt period and the late voice packets are received after the expected spike-delay time (SPT).
FIG. 10 is a waveform diagram of the sixth example, in which a long delay occurs during a talk spurt period and the delayed voice packets are never received.
FIG. 11 is an explanatory diagram of the processing in the fourth example when packet loss is involved.
FIG. 12 is a block diagram of the audio data interpolation apparatus 30 according to another embodiment of the present invention.
FIG. 13 is a block diagram of a conventional audio data interpolation apparatus 100.

Explanation of symbols

1, 30 Audio data interpolation apparatus
2 Network connection unit
3 IP network
4 Playout buffer
5 Audio decoding circuit
8 Delay detection unit (interpolation control unit)
11 Switching control unit (second changeover switch)
12 Voiced-speech determination unit
13, 32 Talk spurt holding buffer
19 Audio reproduction unit (D/A converter)
20 Talk spurt detection unit
31 Switching control unit (third changeover switch)

Claims

1. An audio data interpolation device comprising:
a network connection unit that receives, via a network, voice packets to which the transmitting device has assigned packet sequence numbers in transmission order;
a playout buffer that temporarily stores the voice packets received by the network connection unit in transmission order according to their packet sequence numbers;
an audio decoding circuit that reads the voice packet stored at the head of the playout buffer and decodes it into audio data;
a voiced-audio determination unit that compares a voice characteristic, which appears when the audio data output from the audio decoding circuit contains voice, with a predetermined threshold and thereby identifies voiced audio data;
a talk spurt detection unit that determines, from a period in which voiced audio data continues, a talk spurt period presumed to be a period of conversation;
a talk spurt holding buffer that is reset each time the talk spurt detection unit detects the start of a talk spurt period and that temporarily stores the audio data of the talk spurt period newly output from the audio decoding circuit;
an audio reproduction unit that reproduces sound from audio data;
a switching control unit that selectively connects either the output of the audio decoding circuit or the output of the talk spurt holding buffer to the input of the audio reproduction unit; and
a delay detection unit that detects, from the storage state of the voice packets in the playout buffer, a spike delay of the voice packets caused by a network fault,
wherein, when the delay detection unit detects a spike delay during a talk spurt period, the switching control unit switches the input of the audio reproduction unit from the output of the audio decoding circuit to the output of the talk spurt holding buffer while the audio data of the talk spurt period temporarily stored in the talk spurt holding buffer is being output, before the voice packet received late because of the spike delay is decoded into audio data and output to the audio reproduction unit, and
the audio reproduction unit reproduces sound from the audio data of the voice packet received late because of the spike delay, following on from the audio data of the talk spurt period temporarily stored in the talk spurt holding buffer.
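
The interplay of the talk spurt holding buffer and the switching control unit in the device above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the mean-square energy measure, the threshold value, the per-frame talk spurt decision (no hangover time) and all names (`TalkSpurtInterpolator`, `on_frame`, `on_late_frames`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


def frame_energy(frame):
    """Mean squared amplitude of one decoded, non-empty frame (list of floats)."""
    return sum(x * x for x in frame) / len(frame)


@dataclass
class TalkSpurtInterpolator:
    threshold: float = 0.1  # hypothetical voiced/unvoiced energy threshold
    in_spurt: bool = False
    hold: list = field(default_factory=list)    # talk spurt holding buffer
    played: list = field(default_factory=list)  # frames sent to the reproduction unit

    def on_frame(self, frame):
        """Normal path: decoder output is connected to the reproduction unit."""
        voiced = frame_energy(frame) >= self.threshold
        if voiced and not self.in_spurt:
            self.hold.clear()        # reset at the start of each talk spurt
        self.in_spurt = voiced       # simplification: one voiced frame = talk spurt
        if voiced:
            self.hold.append(frame)  # keep the talk spurt heard so far
        self.played.append(frame)

    def on_late_frames(self, late_frames):
        """Called once frames delayed by a spike delay become available."""
        if self.in_spurt:
            # Switch the reproduction unit to the holding buffer and replay
            # the talk spurt that was interrupted by the delay.
            self.played.extend(self.hold)
        for frame in late_frames:
            # Switch back to the decoder output and continue with the late audio.
            self.on_frame(frame)
```

With one silent frame, two voiced frames A and B, and then one late frame C, the reproduction order becomes silence, A, B, A, B, C, so the listener hears the interrupted talk spurt again immediately before the delayed audio.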

2. The audio data interpolation device according to claim 1, wherein the switching control unit switches the input of the audio reproduction unit between the output of the audio decoding circuit and the output of the talk spurt holding buffer so that all the audio data of the talk spurt period temporarily stored in the talk spurt holding buffer has been output when the expected delay time due to the spike delay elapses.
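
The timing condition in this claim amounts to simple arithmetic, sketched below in Python. The function name, the millisecond units and the clamping rule for the case where more audio is held than fits into the expected delay time are assumptions, not taken from the specification.

```python
def replay_start_ms(delay_start_ms, spt_ms, held_ms):
    """Time at which replay of the held talk spurt should begin so that it
    finishes when the expected delay time (SPT) has elapsed.

    delay_start_ms: time at which the spike delay began
    spt_ms:         expected delay time of the spike delay
    held_ms:        duration of audio in the talk spurt holding buffer
    """
    # Assumption: if the held audio is longer than SPT, start replay at once.
    return delay_start_ms + max(spt_ms - held_ms, 0)
```

For example, with a delay starting at 1000 ms, an SPT of 500 ms and 300 ms of held audio, replay starts at 1200 ms and ends at 1500 ms, exactly when the SPT elapses.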

3. The audio data interpolation device according to claim 1, wherein the switching control unit, on condition that a voice packet received late because of the spike delay has been temporarily stored in the playout buffer,
switches the input of the audio reproduction unit from the output of the audio decoding circuit to the output of the talk spurt holding buffer and outputs all the audio data of the talk spurt period temporarily stored in the talk spurt holding buffer to the audio reproduction unit, and then
switches the input of the audio reproduction unit from the output of the talk spurt holding buffer back to the output of the audio decoding circuit and outputs, to the audio reproduction unit, the audio data decoded from the voice packet temporarily stored in the playout buffer.

4. An audio data interpolation device comprising:
a network connection unit that receives, via a network, voice packets to which the transmitting device has assigned packet sequence numbers in transmission order;
a playout buffer that temporarily stores the voice packets received by the network connection unit in transmission order according to their packet sequence numbers;
an audio decoding circuit that decodes voice packets into audio data;
a voiced-audio determination unit that compares a voice characteristic, which appears when the audio data output from the audio decoding circuit contains voice, with a predetermined threshold and thereby identifies voiced audio data;
a talk spurt detection unit that determines, from a period in which voiced audio data continues, a talk spurt period presumed to be a period of conversation;
an audio reproduction unit that reproduces sound from the audio data output from the audio decoding circuit;
a talk spurt holding buffer that is reset each time the talk spurt detection unit detects the start of a talk spurt period and that temporarily stores the voice packets of the talk spurt period newly read from the head of the playout buffer;
a switching control unit that selectively connects either the output of the playout buffer or the output of the talk spurt holding buffer to the input of the audio decoding circuit; and
a delay detection unit that detects, from the storage state of the voice packets in the playout buffer, a spike delay of the voice packets caused by a network fault,
wherein, when the delay detection unit detects a spike delay during a talk spurt period, the switching control unit switches the input of the audio decoding circuit from the output of the playout buffer to the output of the talk spurt holding buffer while the voice packets of the talk spurt period temporarily stored in the talk spurt holding buffer are being output, before the voice packet received late because of the spike delay is output to the audio decoding circuit, and
the audio decoding circuit decodes into audio data the voice packet received late because of the spike delay, following on from the voice packets of the talk spurt period output from the talk spurt holding buffer, and the audio reproduction unit reproduces sound from that audio data.

5. The audio data interpolation device according to claim 4, wherein the switching control unit switches the input of the audio decoding circuit between the output of the playout buffer and the output of the talk spurt holding buffer so that all the voice packets of the talk spurt period temporarily stored in the talk spurt holding buffer have been output when the expected delay time due to the spike delay elapses.

6. The audio data interpolation device according to claim 4, wherein the switching control unit, on condition that a voice packet received late because of the spike delay has been temporarily stored in the playout buffer,
switches the input of the audio decoding circuit from the output of the playout buffer to the output of the talk spurt holding buffer and causes the audio decoding circuit to decode all the voice packets of the talk spurt period temporarily stored in the talk spurt holding buffer, and then
switches the input of the audio decoding circuit from the output of the talk spurt holding buffer back to the output of the playout buffer and causes the audio decoding circuit to decode the voice packet temporarily stored in the playout buffer.

7. An audio data interpolation method comprising:
decoding voice packets received from a network into audio data in the transmission order of the transmitting device;
while reproducing sound from the decoded audio data, determining, from a period in which audio data containing voice continues, a talk spurt period presumed to be a period of conversation;
from the start of the talk spurt period, continuously and temporarily storing in a talk spurt holding buffer the voice packets newly received during the talk spurt period or the audio data decoded from those voice packets, and repeating this temporary storage with the buffer reset at the start of each talk spurt period; and
when a spike delay of voice packets occurs during a talk spurt period, reproducing sound from the voice packets or audio data of the talk spurt period temporarily stored in the talk spurt holding buffer and then, following on, reproducing sound from the audio data of the voice packets received late because of the spike delay.

8. The audio data interpolation method according to claim 7, wherein the voice packets received after the spike delay are reproduced in the transmission order of the transmitting device with speech speed conversion applied, thereby eliminating the spike delay.
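
How long speech speed conversion must be applied before the delay is eliminated can be estimated with the following Python sketch. It assumes a constant playback speed factor greater than 1 and new audio arriving in real time; the function name and these assumptions are illustrative and do not come from the claims.

```python
def catch_up_time_ms(lag_ms, speed):
    """Wall-clock time needed to remove a playback lag of lag_ms when audio
    is reproduced at a constant speed factor (1.25 means 25 % faster).

    Playback consumes `speed` ms of audio per ms of wall-clock time while
    1 ms of new audio arrives, so the lag shrinks by (speed - 1) per ms.
    """
    if speed <= 1.0:
        raise ValueError("speed must be greater than 1 to reduce the lag")
    return lag_ms / (speed - 1.0)
```

A lag of 600 ms reproduced at 1.25 times normal speed is removed after 2400 ms of playback.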

9. The audio data interpolation method according to claim 7, wherein voice packets, or audio data decoded from voice packets, outside the talk spurt period are temporarily stored, and, after a spike delay is detected and before sound is reproduced from the voice packets or audio data temporarily stored for the talk spurt period, ambient noise is reproduced from the voice packets or audio data of a silent period.
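
One way to reproduce ambient noise during the wait is sketched below in Python. Looping the stored silent-period frames to cover the gap is an assumption made for illustration, as are the function and parameter names; the claim only states that ambient noise is reproduced from stored silent-period packets or audio data.

```python
from itertools import cycle, islice


def fill_gap_with_noise(noise_frames, gap_len):
    """Return gap_len frames of ambient noise taken from stored
    silent-period frames, repeating them if the gap is longer."""
    if not noise_frames:
        return []  # nothing stored yet, so the gap stays silent
    return list(islice(cycle(noise_frames), gap_len))
```

The returned frames would be reproduced first, followed by the held talk spurt and then the late audio.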