JP4364555B2 - Voice packet transmitting apparatus and method

Info

Publication number: JP4364555B2
Application number: JP2003151462A
Authority: JP
Inventors: Noboru Harada (原田 登)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-28
Filing date: 2003-05-28
Publication date: 2009-11-18
Anticipated expiration: 2023-05-28
Also published as: JP2004356898A

Abstract

PROBLEM TO BE SOLVED: To provide a speech packet transmitting device and method, a speech packet receiving device, and a speech packet communication device that improve speech reproduction quality at the boundary between a voiced state and a silent state.

SOLUTION: When the level of an input speech signal 4 rises from a silent state, at or below a specified threshold level Sth, to a voiced state above Sth, the silent speech data frame immediately before the signal became voiced is converted into speech data whose level gradually increases. When the speech changes from the voiced state to the silent state, the speech data frame immediately after the signal became silent is converted into speech data whose level gradually decreases, and the resulting frame is packetized. Fade-in and fade-out processing is thus applied and the speech waveform has no discontinuity, so no abnormal sound is generated at the transition and deterioration in speech quality is reduced.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

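【０００１】
【Technical field of the invention】
The present invention relates to a voice packet transmitting apparatus and method that improve speech reproduction quality at the boundary between a voiced state and a silent state, and at the boundary between the voice transmission state and the transmission pause state when a transmission function button is used.
【０００２】
【Prior art】
With the digitization of electronic equipment, it has become common in information communication to packetize the information to be transferred. For example, when a speech signal is transferred, the transmitting side samples the speech at a predetermined sampling frequency, distributes the resulting speech data in fixed amounts over separate packets, and transfers it packet by packet. The receiving side extracts the speech data from the received packets, joins the extracted data together, and reproduces it.
【０００３】
That is, in an electronic device performing such packet communication, the transmitting side forms and sends a packet as soon as one packet's worth of data has been obtained, and the receiving side reads out the data in a received packet at intervals equal to the time required to reproduce that data. In this way, in real-time transfer of speech data for example, the receiving side can reproduce continuous speech from the multiple packets it receives.
【０００４】
To reduce the volume of speech data transmitted, silence compression is used: of the speech data obtained by sampling, the transmitting side does not actually send the portions judged to be silent.
【０００５】
Similarly, instead of detecting silence automatically, another technique obtains the user's intention to speak explicitly through a function button or the like: speech data is sent only while the talk function button is pressed and is not sent while the button is released, which likewise reduces the data volume.
【０００６】
Such packet communication is in most cases carried out using computer devices, and is used in, for example, mobile telephones using wireless communication, the well-known IP telephony over communication networks such as the Internet, systems that deliver content such as music from a distribution server to user terminals, and remote conference systems.
【０００７】
【Patent Document 1】
Japanese Unexamined Patent Application Publication No. 2000-83050
【Non-Patent Document 1】
ITU-T Recommendation G.729 Annex B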
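【Problems to be solved by the invention】
However, in the packet-based transfer of speech data described above, the waveform becomes discontinuous where the signal changes from the voiced state to the silent state, so an abnormal sound can occur at this transition and degrade speech quality. To reduce this degradation, a technique called CNG (comfort noise generation) is known, in which the transmitting side sends information about the background sound of the portions judged silent and the receiving side generates background sound for the silent portions from the received voiced-portion information and the background-sound information. However, this raises the processing load on both the transmitting and receiving sides.
【０００９】
Furthermore, if the receiving side applies the well-known packet loss concealment (PLC: Packet Loss Concealment) at the moment the signal changes from voiced to silent, it cannot tell whether packets are absent because they were lost or because the signal is silent. Consequently, even when the transmitting side has judged the signal silent and stopped sending, the receiving side applies packet loss concealment and generates pseudo-speech from the last packet it received.
【００１０】
G.711 Appendix I is a known example of packet loss concealment.
【００１１】
In addition, when silence compression is used, the voiced-to-silent transition can be tuned conservatively by applying hysteresis to the decision so that the end of an utterance is not cut off; on returning from silence to the voiced state, however, the beginning of the utterance is lost. Providing a safety margin against such clipping of the utterance onset requires constant look-ahead and tolerance of the resulting delay, and at present almost no system implements this.
【００１２】
In view of the above problems, the object of the present invention is to provide a voice packet transmitting apparatus and method that improve speech reproduction quality at the boundary between the voiced state and the silent state.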
【００１３】
【課題を解決するための手段】
本発明は上記の目的を達成するために、連続して入力した音声に基づく音声データを所定時間間隔で切り取った音声データフレームを生成する音声データフレーム生成手段を備え、該生成した音声データフレームを含んだパケットを生成し、該パケットを通信網を介して送信する音声パケット送信装置において、前記入力された音声データフレームを保持する手段と、前記入力した音声データフレームが有音であるか無音であるかを判定する有音無音判定手段と、前記有音無音判定手段によって入力された音声データフレームが無音と判定された状態が一定時間以上連続する場合にパケットの送信を停止する手段と、前記入力された音声データフレームが無音であると判定された状態が一定時間以上連続し、パケットの送信を停止している状態で、前記入力された音声データフレームが前記有音判定手段で有音状態であると判定されたときに、パケットの送信を再開する手段と、前記パケットの送信を再開するにあたって、有音と判定された音声データフレームの少なくとも１つ前の無音データフレームを再分析フレームとして、有音と判定された音声フレームの情報と前記再分析フレームまでの無音と判定された音声データフレームの情報とを用いて再分析する手段と、前記再分析の結果、前記再分析フレームが有音に近いと判定された場合には、前記再分析フレームの１つ前の音声データフレームを、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換し、有音状態の先頭として該変換された音声データフレームを含むパケットを送信し、次に前記有音状態であると判定された音声データフレームと前記再分析フレームの１つ前のフレームとの間の無音データフレームを送信し、次に前記有音状態であると判定された音声データフレームを送信する手段と、前記再分析の結果、前記再分析フレームが無音に近いと判定された場合には、該再分析フレームを、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換し、有音状態の先頭として該変換された音声データフレームを含むパケットを送信し、次に前記有音であると判定された音声データフレームを送信する手段とを備えた音声パケット送信装置を提案する。
【００２７】
本発明の音声パケット送信装置によれば、入力された音声データフレームが無音であると判定された状態が一定時間以上連続してパケットの送信が停止されている状態で、入力された音声データが有音状態に変わったときに、パケットの送信が再開される。無音状態から有音状態に変化してパケットの送信が再開されるとき、有音と判定された音声データフレームの少なくとも１つ前の無音データフレームが再分析フレームとされ、有音と判定された音声フレームの情報と再分析フレームまでの無音と判定された音声データフレームの情報とを用いて再分析される。この再分析の結果、前記再分析フレームが有音に近いと判定された場合には、前記再分析フレームの１つ前の音声データフレームが、末尾から先頭に向かって音声レベルを徐々に減少させたフェードイン処理が施された音声データに変換される。さらに、このフェードイン処理が施された音声データフレームが有音状態の先頭とされて、該音声データフレームを含むパケットが送信される。この後、前記有音状態であると判定された音声データフレームまでの無音データフレームを含むパケットが送信され、次に前記有音状態であると判定された音声データフレームを含むパケットが送信される。
【００２８】
また、前記再分析の結果、前記再分析フレームが無音に近いと判定された場合には、該無音データフレームが、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換され、有音状態の先頭として該変換された音声データフレームを含むパケットが送信された後に、前記有音であると判定された音声データフレームが送信される。
【００２９】
また、本発明は上記の音声パケット送信装置において、前記パケットの送信を再開するにあたって、有音と判定された音声データフレームの少なくとも１つ前のフレームを、有音状態の先頭として送信した場合に、余分に送信した無音フレームによって増加した遅延に相当するサンプル数だけ後続のサンプルを短縮する手段を有する音声パケット送信装置を提案する。
【００３０】
本発明の音声パケット送信装置によれば、パケットの送信を再開するにあたって、有音と判定された音声データフレームの少なくとも１つ前のフレームから有音状態として送信した場合に、余分に送信した無音フレームによって増加した遅延に相当するサンプル数だけ後続のサンプルが短縮される。これにより、遅延量が調整され、リアルタイムのデータ送信が保持される。
【００３１】
また、本発明は上記の音声パケット送信装置において、前記無音判定手段は、入力された音声フレームの音声レベルが所定の閾値レベル以下であるときに無音状態であると判定する手段を備えた音声パケット送信装置を提案する。
【００３２】
本発明の音声パケット送信装置によれば、無音判定手段によって、入力された音声フレームの音声レベルが所定の閾値レベル以下であるときに無音状態であると判定される。
【００３３】
また、本発明は上記の音声パケット送信装置において、前記有音判定手段は、入力された音声フレームの音声レベルが所定の閾値レベル以上であるときに有音状態であると判定する手段を備えた音声パケット送信装置を提案する。
【００３４】
本発明の音声パケット送信装置によれば、有音判定手段によって、入力された音声フレームの音声レベルが所定の閾値レベル以上であるときに有音状態であると判定される。
【００３９】
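To achieve the above object, the present invention also proposes a voice packet transmitting apparatus comprising speech data frame generating means for generating speech data frames by cutting speech data, based on continuously input speech, at predetermined time intervals, and a talk function button for obtaining the user's intention to speak, the apparatus generating packets containing the generated speech data frames and transmitting those packets over a communication network only while the talk function button is pressed, wherein the apparatus comprises: button-press determining means for determining whether the talk function button is pressed; means for stopping packet transmission when the button-press determining means determines that the state has changed from the button being pressed to an utterance pause state in which it is not pressed; and means for, on stopping transmission, converting the last speech data frame to be transmitted into speech data whose level gradually decreases from the first sample toward the last sample, and transmitting the converted speech data frame as the final packet.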
【００４０】
本発明の音声パケット送信装置によれば、発話機能ボタンが押されている状態から発話機能ボタンが押されていない発話休止状態になるとパケットの送信が停止される。このとき、送信する最後の音声データフレームの音声レベルは先頭サンプルから末尾サンプルに向けて徐々に減少され、最後に送信される音声データフレームにはフェードアウト処理が施される。
【００４１】
これにより、有音状態から無音状態になって最後に送信される音声データ部分すなわち話尾部分の音声レベルが徐々に減少されるフェードアウト処理が施されるため、受信側において有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【００４２】
また、本発明は上記の音声パケット送信装置において、前記発話機能ボタン押下判定手段によって発話機能ボタンが押されている状態から押されていない発話休止状態になったと判定された場合にパケットの送信を停止する手段と、送信を停止するにあたって、発話休止状態になったと判定された後に入力された音声フレームを少なくとも１つ以上送信する手段と、送信する最後の音声データフレームの音声レベルを、先頭サンプルから末尾サンプルに向かって徐々に減少させた音声データに変換し、該変換された音声データフレームを最終パケットとして送信する手段とを備えた音声パケット送信装置を提案する。
【００４３】
本発明の音声パケット送信装置によれば、発話機能ボタンが押されている状態から押されていない発話休止状態になるとパケットの送信が停止される。このとき、発話機能ボタンが押されていない状態になったと判定された後に入力された音声フレームのうちの１つ以上の音声フレームが送信されると共に、最後に送信される音声データフレームにはその音声レベルが先頭サンプルから末尾サンプルに向かって徐々に減少されるフェードアウト処理が施される。
【００４４】
これにより、発話機能ボタンが押されていない状態になったと判定されてから１つ以上の音声データフレームが送信されるので、話尾部分が突然切れることがなくなると共に音声レベルが徐々に減少されるフェードアウト処理が施されるため、受信側において有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【００４５】
また、本発明は上記の音声パケット送信装置において、前記音声レベルを先頭サンプルから末尾サンプルに向けて徐々に減少させた音声データフレームを含むパケットを送信した後に、さらに１つの無音音声データフレームを生成し、該無音音声データフレームを含むパケットを送信する手段を備えた音声パケット送信装置を提案する。
【００４６】
本発明の音声パケット送信装置によれば、音声レベルが先頭サンプルから末尾サンプルに向けて徐々に減少された音声データフレームを含むパケットを送信した後に、さらに１つの無音音声データフレームを含むパケットが送信されるので、有音状態から無音状態になり、送信が停止されたことを受信側において確実に認識することができる。
【００４７】
また、本発明は上記の音声パケット送信装置において、前記発話機能ボタン押下判定手段によって発話機能ボタンが押されていない状態から押されている状態になったと判定された場合に、発話開始状態として、前記停止していたパケットの送信を再開する手段と、パケットの送信を再開するにあたって、送信する最初の音声データフレームを、末尾サンプルから先頭サンプルに向かって音声レベルを徐々に減少させた音声データに変換し、発話状態の先頭として該変換された音声データフレームを含むパケットを送信する手段とを備えた音声パケット送信装置を提案する。
【００４８】
本発明の音声パケット送信装置によれば、発話機能ボタンが押されずにパケットの送信が停止されている状態で、発話機能ボタンが押された状態に変わったときに、パケットの送信が再開される。また、このとき、パケットの送信を再開するにあたって、送信する最初の音声データフレームは、先頭サンプルから末尾サンプルに向かって音声レベルが徐々に増加させた音声データに変換され、有音状態の先頭として該変換された音声データフレームを含むパケットが送信される。
【００４９】
これにより、発話機能ボタンが押されていない状態から押された状態になったときの音声データ部分すなわち話頭部分の音声レベルが徐々に増大されるフェードイン処理が施されるため、受信側において無音状態から有音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることが無く、音声品質の劣化が低減される。
【００５０】
また、本発明は上記の音声パケット送信装置において、前記発話機能ボタン押下判定手段によって発話機能ボタンが押されていない状態から押されている状態になったと判定された場合に、発話開始状態として、前記停止していたパケットの送信を再開する手段と、パケットの送信を再開するにあたって、発話機能ボタンが押された状態になった後に入力された最初の音声データフレームより前の音声データフレームを少なくとも１つ以上送信し、次に前記発話機能ボタンが押された状態になった後に入力された最初の音声データフレームを送信する手段とを備えた音声パケット送信装置を提案する。
【００５１】
本発明の音声パケット送信装置によれば、発話機能ボタンが押されていない状態から押されている状態になり、停止していたパケットの送信が再開されるとき、発話機能ボタンが押された状態になった後に入力された最初の音声データフレームより前の音声データフレームが少なくとも１つ以上送信され、次に前記発話機能ボタンが押された状態になった後に入力された最初の音声データフレームが送信される。これにより、話頭切れが防止される。
【００５２】
また、本発明は上記の音声パケット送信装置において、前記発話開始状態になった場合にパケットの送信を再開するにあたって、発話開始状態と判定された音声データフレームの少なくとも１つ前のフレームを、送信フレームの先頭として送信した場合に、余分に送信した音声データフレームによって増加した遅延に相当するサンプル数だけ後続のサンプルを短縮する手段を有する音声パケット送信装置を提案する。
【００５３】
本発明の音声パケット送信装置によれば、パケットの送信を再開するにあたって、発話開始状態と判定された音声データフレームの少なくとも１つ前のフレームから送信した場合に、余分に送信した音声データフレームによって増加した遅延に相当するサンプル数だけ後続のサンプルが短縮される。これにより、遅延量が調整され、リアルタイムのデータ送信が保持される。
【００５４】
また、本発明は上記の音声パケット送信装置において、発話機能ボタンを押下されない状態で、パケット送信停止状態となっていて、発話機能ボタン押下により発話開始状態としてパケットを送信するに際して、送信する最初の音声フレームを符号化する場合に、音声符号化器の内部状態を初期化した後に音声フレームを符号化処理する手段と、最初のフレームをパケット化して送信するにあたって、パケット内に無音から復帰した最初のフレームであることを表す情報を含めて送信する手段とを有する音声パケット送信装置を提案する。
【００５５】
本発明の音声パケット送信装置によれば、送信停止状態から送信状態に移る際に音声符号化器の内部状態が初期化されてから音声データフレームが符号化処理される。これにより、送信停止前の符号化処理に用いられたデータ等の内部状態が初期化されるので、最適な符号化処理を行うことができる。
【００５６】
さらに、送信停止状態から送信状態に移り、最初のフレームをパケット化して送信するにあたって、パケット内に無音から復帰した最初のフレームであることを表す情報を含めて送信されるため、該情報を受信側において参照することにより、最適な復号化処理を行うことができる。
【００５７】
また、本発明は上記の音声パケット送信装置において、前記符号化処理手段は、当該フレームを符号化処理するにあたって、符号化器の内部状態を初期化せずに前のフレームに続けて当該フレームを符号化した場合の符号化誤差と、符号化器の内部状態を初期化した後に当該フレームを符号化した場合の符号化誤差とを比較し、誤差の少ない方の符号化結果を送信する手段と、内部状態をリセットした後に当該フレームを符号化した結果を選択した場合には、無音から復帰した最初のフレームであるという情報を送信パケット内に含めて送信する手段とを有する音声パケット送信装置を提案する。
【００５８】
本発明の音声パケット送信装置によれば、前記符号化誤差が小さい方の符号化音声データフレームが用いられて、該符号化音声データフレームを含むパケットが送信される。さらに、符号化器の内部状態がリセットされた状態で符号化された符号化音声データフレームが用いられるときには、無音から復帰した最初のフレームであることを表す情報がパケットに含められて送信されるので、受信側において的確な復号化処理を行うことができる。
【００５９】
また、本発明は上記の目的を達成するために、音声入力手段によって入力した音声を音声データに変換する手段を有するコンピュータ装置を用いて、連続して入力した音声に基づく音声データを所定時間間隔で切り取った音声データフレームを生成すると共に該音声データフレームを含んだパケットを生成し、該パケットを通信網を介して送信する音声パケット送信方法において、前記コンピュータ装置は、前記入力した音声データフレームを保持し、前記入力した音声データフレームが有音であるか無音であるかを判定し、前記入力された音声データフレームが無音と判定された状態が一定時間以上連続する場合にパケットの送信を停止し、前記入力された音声データフレームが無音と判定された状態が一定時間以上連続し、パケットの送信を停止している状態で、前記入力された音声データフレームが有音状態であると判定されたときに、パケットの送信を再開し、前記パケットの送信を再開するにあたって、有音と判定された音声データフレームの少なくとも１つ前の無音データフレームを再分析フレームとして、有音と判定された音声フレームの情報と前記再分析フレームまでの無音と判定された音声データフレームの情報とを用いて再分析し、前記再分析の結果、前記再分析フレームが有音に近いと判定された場合には、前記再分析フレームの１つ前の音声データフレームを、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換し、有音状態の先頭として該変換された音声データフレームを含むパケットを送信し、次に前記有音状態であると判定された音声データフレームと前記再分析フレームの１つ前のフレームとの間の無音データフレームを送信し、次に前記有音状態であると判定された音声データフレームを送信し、前記再分析の結果、前記再分析フレームが無音に近いと判定された場合には、該再分析フレームを、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換し、有音状態の先頭として該変換された音声データフレームを含むパケットを送信し、次に前記有音であると判定された音声データフレームを送信する音声パケット送信方法を提案する。
【００６０】
本発明の音声パケット送信方法によれば、入力された音声データフレームが無音であると判定された状態が一定時間以上連続してパケットの送信が停止されている状態で、入力された音声データが有音状態に変わったときに、パケットの送信が再開される。無音状態から有音状態に変化してパケットの送信が再開されるとき、有音と判定された音声データフレームの少なくとも１つ前の無音データフレームが再分析フレームとされ、有音と判定された音声フレームの情報と再分析フレームまでの無音と判定された音声データフレームの情報とを用いて再分析される。この再分析の結果、前記再分析フレームが有音に近いと判定された場合には、前記再分析フレームの１つ前の音声データフレームが、末尾から先頭に向かって音声レベルを徐々に減少させたフェードイン処理が施された音声データに変換される。さらに、このフェードイン処理が施された音声データフレームが有音状態の先頭とされて、該音声データフレームを含むパケットが送信される。この後、前記有音状態であると判定された音声データフレームまでの無音データフレームを含むパケットが送信され、次に前記有音状態であると判定された音声データフレームを含むパケットが送信される。
【００６１】
また、前記再分析の結果、前記再分析フレームが無音に近いと判定された場合には、該無音データフレームが、末尾から先頭に向かって音声レベルを徐々に減少させた音声データに変換され、有音状態の先頭として該変換された音声データフレームを含むパケットが送信された後に、前記有音であると判定された音声データフレームが送信される。
【００６５】
また、本発明は上記の目的を達成するために、音声入力手段によって入力した音声を音声データに変換する手段を有するコンピュータ装置を用いて、連続して入力した音声に基づく音声データを所定時間間隔で切り取った音声データフレームを生成すると共に、該生成した音声データフレームを含んだパケットを生成し、ユーザの発話の意志を取得する発話機能ボタンが押されている間だけ音声データフレームを含んだパケットを、通信網を介して送信する音声パケット送信方法において、前記コンピュータ装置は、前記発話機能ボタンが押されている状態か否かを判定し、前記発話機能ボタンが押されている状態から、押されていない発話休止状態になったと判定された場合にパケットの送信を停止し、送信を停止するにあたって、送信する最後の音声データフレームの音声レベルを、先頭サンプルから末尾サンプルに向かって徐々に減少させた音声データに変換し、前記変換した音声データフレームを最終パケットとして送信する音声パケット送信方法を提案する。
【００６６】
本発明の音声パケット送信方法によれば、発話機能ボタンが押されている状態から発話機能ボタンが押されていない発話休止状態になるとパケットの送信が停止される。このとき、送信する最後の音声データフレームの音声レベルは先頭サンプルから末尾サンプルに向けて徐々に減少され、最後に送信される音声データフレームにはフェードアウト処理が施される。
【００６７】
これにより、有音状態から無音状態になって最後に送信される音声データ部分すなわち話尾部分の音声レベルが徐々に減少されるフェードアウト処理が施されるため、受信側において有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【００６８】
また、本発明は上記の音声パケット送信方法において、前記コンピュータ装置は、前記発話機能ボタンが押されていない状態から押されている状態になったと判定された場合に、発話開始状態として、前記停止していたパケットの送信を再開し、パケットの送信を再開するにあたって、送信する最初の音声データフレームを、末尾サンプルから先頭サンプルに向かって音声レベルを徐々に減少させた音声データに変換し、発話状態の先頭として前記変換した音声データフレームを含むパケットを送信する音声パケット送信方法を提案する。
【００６９】
本発明の音声パケット送信方法によれば、発話機能ボタンが押されずにパケットの送信が停止されている状態で、発話機能ボタンが押された状態に変わったときに、パケットの送信が再開される。また、このとき、パケットの送信を再開するにあたって、送信する最初の音声データフレームは、先頭サンプルから末尾サンプルに向かって音声レベルが徐々に増加させた音声データに変換され、有音状態の先頭として該変換された音声データフレームを含むパケットが送信される。
【００７０】
これにより、発話機能ボタンが押されていない状態から押された状態になったときの音声データ部分すなわち話頭部分の音声レベルが徐々に増大されるフェードイン処理が施されるため、受信側において無音状態から有音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることが無く、音声品質の劣化が低減される。
【００７５】
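【Embodiments of the invention】
An embodiment of the present invention is described below with reference to the drawings.
【００７６】
FIG. 1 is a block diagram showing the functional configuration of the voice packet communication system in the first embodiment of the present invention; FIG. 2 illustrates the packetization of a speech signal by the voice packet transmitting apparatus in the first embodiment; and FIG. 3 illustrates the Real-time Transport Protocol (hereinafter RTP) header used in the first embodiment. In the figures, 1 is the voice packet transmitting apparatus (hereinafter simply the transmitting apparatus), 2 is the voice packet receiving apparatus (hereinafter simply the receiving apparatus), and 3 is a communication network such as the Internet. This embodiment describes, as an example, a system that transfers voice packets in real time from the transmitting apparatus 1 to the receiving apparatus 2 over the communication network 3 using UDP/IP.
【００７７】
The transmitting apparatus 1 is a well-known computer device that operates according to a preinstalled program, and comprises a speech input unit 11, an analog/digital (A/D) conversion unit 12, a voiced/silent determination unit 13, a switch unit 14, a fade-in/fade-out processing unit 15, an encoding unit 16, a packet generation unit 17, and a transmission unit 18. Each of these parts of the transmitting apparatus 1 is implemented by both hardware and software.
【００７８】
The receiving apparatus 2 is a well-known computer device that operates according to a preinstalled program, and comprises a reception unit 21, a packet analysis unit 22, a decoding unit 23, a digital/analog (D/A) conversion unit 24, and a speech output unit 25. Each of these parts of the receiving apparatus 2 is implemented by both hardware and software.
【００７９】
The speech input unit 11 converts the speech signal into an analog electrical signal 4 as shown in FIG. 2 and outputs it to the A/D conversion unit 12. The A/D conversion unit 12 converts it into a digital signal at a predetermined sampling interval, and the resulting speech data (samples) are stored sequentially in a buffer in the voiced/silent determination unit 13.
【００８０】
As shown in FIG. 2, the voiced/silent determination unit 13 cuts the buffered speech data into speech data frames of a predetermined period T and determines, one frame at a time in order from the head, whether each frame is voiced or silent.
【００８１】
Based on this determination, the voiced/silent determination unit 13 switches the switch unit 14 so that its output signal goes to the fade-in/fade-out processing unit 15 for fade-in processing when the state changes from silent to voiced, and likewise for fade-out processing when the state changes from voiced to silent. While the voiced state continues, the voiced/silent determination unit 13 switches the switch unit 14 so that the output signal goes to the encoding unit 16. As shown in FIG. 2, the state is judged voiced when the signal exceeds a predetermined threshold Sth.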
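【００８２】
The fade-in/fade-out processing unit 15 performs two kinds of processing. Fade-in processing is applied when the speech input changes from the silent state, in which transmission is paused, to the voiced state and transmission starts: the level of the speech data frame is gradually decreased from the last sample toward the first sample (that is, the level rises from the head of the frame to its tail). Fade-out processing is applied when the speech input changes from the voiced state, in which transmission is active, to the silent state and transmission pauses: the level of the speech data frame is gradually decreased from the first sample toward the last sample.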
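The two operations of the fade-in/fade-out processing unit 15 can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not specify the shape of the gain curve, so a linear ramp is assumed here, and the function names `fade_in` and `fade_out` are hypothetical.

```python
def fade_in(frame):
    """Return a copy of the frame whose level rises from the head to the tail.

    Assumption: a linear gain ramp from 0.0 (first sample) to 1.0 (last sample).
    """
    n = len(frame)
    if n < 2:
        return list(frame)
    return [s * i / (n - 1) for i, s in enumerate(frame)]


def fade_out(frame):
    """Return a copy of the frame whose level falls from the head to the tail.

    Assumption: a linear gain ramp from 1.0 (first sample) to 0.0 (last sample).
    """
    n = len(frame)
    if n < 2:
        return list(frame)
    return [s * (n - 1 - i) / (n - 1) for i, s in enumerate(frame)]
```

A smoother window (for example a raised cosine) could be substituted for the linear ramp without changing the rest of the processing.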
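【００８３】
The encoding unit 16 encodes the speech data frames input from the voiced/silent determination unit 13 or the fade-in/fade-out processing unit 15. In doing so it retains the internal state resulting from encoding the previous frame and predicts from the past, which improves the coding gain.
【００８４】
In this embodiment, to reduce the quality degradation that arises when packet loss leaves the internal states of the sender's encoder and the receiver's decoder mismatched, the encoder's internal state is reset to its initial values when the state changes from silent to voiced, which reduces quality loss caused by transmission errors.
【００８５】
The encoding unit 16 then encodes the target speech data frame on the basis of the analysis result and passes it to the packet generation unit 17.
【００８６】
As shown in FIG. 2, a speech data frame that becomes voiced following the silent state is thereby turned into speech data 31 to which fade-in processing, with a gradually increasing level, has been applied. Likewise, a speech data frame that becomes silent following the voiced state is turned into speech data 31 to which fade-out processing, with a gradually decreasing level, has been applied.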
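【００８７】
The packet generation unit 17 generates RTP packets containing the encoded speech data received from the encoding unit 16 and passes them to the transmission unit 18. An RTP header as shown in FIG. 3 is attached to each RTP packet.
【００８８】
As is well known, the RTP header contains 2-bit Version information V, 1-bit Padding information P, 1-bit Extension information X, 4-bit CSRC-Count information CC, 1-bit Marker information M (hereinafter the marker bit), 7-bit Payload-Type information PT, a 16-bit sequence number, a 32-bit timestamp, a 32-bit synchronization source (SSRC) identifier, 32-bit contributing source (CSRC) identifiers, and so on.
【００８９】
In this embodiment, the marker bit M is set to "1" in the first packet sent when the signal becomes voiced after packet transmission has been stopped during silence, and to "0" in all other packets.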
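As an illustration of where the marker bit M sits, the sketch below packs the 12-byte fixed part of an RTP header with Python's standard `struct` module. It assumes no padding, no extension, and no CSRC entries (P = X = CC = 0); the helper names `build_rtp_header` and `marker_bit` are hypothetical and not taken from the patent.

```python
import struct


def build_rtp_header(payload_type, sequence_number, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header (assumes P=0, X=0, CC=0)."""
    version = 2
    byte0 = version << 6                                   # V(2) P(1) X(1) CC(4)
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)     # M(1) PT(7)
    return struct.pack(
        "!BBHII",
        byte0,
        byte1,
        sequence_number & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )


def marker_bit(header):
    """Return True if the marker bit M is set in a packed RTP header."""
    return bool(header[1] & 0x80)
```

In the scheme described above, the sender would pass `marker=True` only for the first packet after a transmission pause.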
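【００９０】
The transmission unit 18 sends the RTP packets received from the packet generation unit 17 to the receiving apparatus 2 over the communication network 3.
【００９１】
Meanwhile, the reception unit 21 of the receiving apparatus 2 receives the RTP packets sent from the transmitting apparatus 1 over the communication network 3 and passes them to the packet analysis unit 22.
【００９２】
The packet analysis unit 22 analyzes each RTP packet received from the reception unit 21, separates it into the header and the encoded speech data frame, analyzes the header contents, and outputs the encoded speech data frames to the decoding unit 23 in the order in which they were sent, based on the RTP timestamp. The packet analysis unit 22 also notifies the decoding unit 23 of the value of the marker bit M in the RTP header.
【００９３】
The decoding unit 23 decodes the encoded speech data frames received from the packet analysis unit 22 into digital speech data and outputs this to the D/A conversion unit 24. When decoding, it analyzes the encoded speech data frame and temporarily stores the analysis result, and when performing the analysis it refers either to the stored analysis result or to the initial analysis values. Using the stored result from the previous frame allows analysis and decoding that take the correlation between consecutive frames into account.
【００９４】
When the marker bit M received from the packet analysis unit 22 is "1", the decoding unit 23 resets its decoder to the initial internal state. As a result, a voiced speech data frame that follows a silent period is decoded from the initialized internal state, so that even if a transmission error such as packet loss has occurred, the system recovers from any mismatch between the internal states of the sender's encoder and the receiver's decoder, and degradation of speech quality is reduced.
【００９５】
The D/A conversion unit 24 receives the digital speech data decoded by the decoding unit 23, converts it into an analog speech signal, and outputs it to the speech output unit 25.
【００９６】
The speech output unit 25 converts the analog speech signal received from the D/A conversion unit 24 into sound and outputs it.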
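【００９７】
Next, the operation of the voice packet communication system configured as above is described, mainly with respect to the transmitting apparatus, with reference to the flowcharts in FIGS. 4 to 6.
【００９８】
Immediately after start-up, the transmitting apparatus 1 performs initialization (SA1), setting the silence determination counter SilentCount to "0" and RTPTimeStamp to "0".
【００９９】
When processing starts, the transmitting apparatus 1 stores the speech signal input through the speech input unit 11 sequentially, via the A/D conversion unit 12, in the buffer of the voiced/silent determination unit 13 (SA2).
【０１００】
Next, working in order from the first speech data stored in the buffer of the voiced/silent determination unit 13, the transmitting apparatus 1 determines whether the power of the speech data frame under examination is at or below the threshold, that is, whether the frame is silent or voiced (SA3). If the power exceeds the threshold (voiced), processing moves to SA19, described later.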
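The SA3 decision can be sketched as a comparison of frame power against a threshold. The patent does not define how frame power is computed, so the mean of the squared samples is assumed here; `frame_power` and `is_voiced` are hypothetical names.

```python
def frame_power(frame):
    """Mean squared sample value of one frame (assumed definition of 'power')."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def is_voiced(frame, power_threshold):
    """SA3: voiced if the power exceeds the threshold, silent if at or below it."""
    return frame_power(frame) > power_threshold
```

Note that a frame whose power exactly equals the threshold is treated as silent, matching the "at or below the threshold" wording of step SA3.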
【０１０１】
If the power of the speech data frame is at or below the threshold (silent), SilentCount is incremented by 1 (SA4), and it is determined whether SilentCount exceeds the count threshold SilentThres, that is, whether transmission is paused (SA5).
【０１０２】
If SilentCount exceeds SilentThres (transmission paused), processing moves to SA12, described later. If SilentCount is at or below SilentThres, it is determined whether SilentCount equals SilentThres, that is, whether the transmission pause is about to begin (SA6).
【０１０３】
If SilentCount does not equal SilentThres, processing moves to SA8, described below; if it does, the current speech data frame is faded out (SA7).
【０１０４】
Next, the speech data frame (faded out where SA7 was applied) is encoded (SA8), and a packet containing the encoded frame and the corresponding RTP timestamp RTPTimeStamp is generated and sent (SA9).
【０１０５】
After this, RTPTimeStamp is advanced by one frame length (SA10); that is, the new value of RTPTimeStamp is its current value plus the frame length FrameLen.
【０１０６】
The current speech data frame is then held in the buffer (SA11), and processing returns to SA2.
【０１０７】
If, on the other hand, SA5 finds that SilentCount exceeds SilentThres (transmission paused), it is determined whether SilentCount equals SilentThres + 1, that is, whether the pause has only just begun (SA12).
【０１０８】
If SilentCount does not equal SilentThres + 1, processing moves to SA16, described later. If it does, an ideal silent speech data frame is generated (SA13) and encoded (SA14), and a packet containing the encoded ideal silent frame and the corresponding RTPTimeStamp is generated and sent (SA15). Processing then moves to SA10.
【０１０９】
If SA12 finds that SilentCount does not equal SilentThres + 1, the delay counter DelayCount is reduced by the frame length FrameLen (SA16), and it is determined whether DelayCount is 0 or less (SA17).
【０１１０】
If DelayCount is greater than 0, processing moves to SA10; if it is 0 or less, DelayCount is set to 0 (SA18) and processing moves to SA10.
【０１１１】
If, on the other hand, SA3 finds that the power of the speech data frame exceeds the threshold (voiced), it is determined whether SilentCount exceeds SilentThres, that is, whether transmission was paused at the previous frame (SA19).
【０１１２】
If SilentCount is at or below SilentThres, processing moves to SA27, described later. If SilentCount exceeds SilentThres, the previous speech data frame held in the buffer is faded in (SA20) and the faded-in frame is encoded (SA21).
【０１１３】
Next, a packet containing the faded-in speech data frame and its RTP timestamp (RTPTimeStamp - FrameLen) is generated and sent (SA22), with the marker bit M of the RTP header set to "1".
【０１１４】
After this, the current speech data frame, that is, the frame following the faded-in frame, is encoded (SA23), and a packet containing the encoded frame and the current RTP timestamp RTPTimeStamp is generated and sent (SA24).
【０１１５】
Next, the delay increase counter DelayCount is increased by the frame length FrameLen (SA25) and SilentCount is reset to "0" (SA26). Processing then moves to SA10.
【０１１６】
If SA19 finds that SilentCount is at or below SilentThres, the current speech data frame is encoded (SA27), and a packet containing the encoded frame and the current RTPTimeStamp is generated and sent (SA28). SilentCount is then reset to "0" (SA29) and processing moves to SA10.
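The control flow of steps SA1 to SA29 can be summarized as a small state machine. The sketch below is an illustrative reading of the flowchart description, not the patented implementation: it takes a precomputed sequence of per-frame voiced/silent decisions instead of audio, records which kind of packet would be sent rather than encoding anything, and assumes a frame length of 160 timestamp units. The function name `run_sender` and the action labels are hypothetical.

```python
FRAME_LEN = 160  # assumed frame length in RTP timestamp units


def run_sender(frame_is_voiced, silent_thres):
    """Return (action, rtp_timestamp, marker) for every packet that would be sent."""
    silent_count = 0          # SA1
    rtp_timestamp = 0         # SA1
    delay_count = 0
    sent = []
    for voiced in frame_is_voiced:                       # SA2, SA3
        if not voiced:
            silent_count += 1                            # SA4
            if silent_count > silent_thres:              # SA5: transmission paused
                if silent_count == silent_thres + 1:     # SA12: pause just began
                    sent.append(("silence", rtp_timestamp, False))   # SA13-SA15
                else:
                    delay_count = max(0, delay_count - FRAME_LEN)    # SA16-SA18
            elif silent_count == silent_thres:           # SA6
                sent.append(("fade_out", rtp_timestamp, False))      # SA7-SA9
            else:
                sent.append(("frame", rtp_timestamp, False))         # SA8-SA9
        else:
            if silent_count > silent_thres:              # SA19: resuming after a pause
                # SA20-SA22: previous buffered frame, faded in, marker bit set
                sent.append(("fade_in_prev", rtp_timestamp - FRAME_LEN, True))
                sent.append(("frame", rtp_timestamp, False))         # SA23-SA24
                delay_count += FRAME_LEN                 # SA25
            else:
                sent.append(("frame", rtp_timestamp, False))         # SA27-SA28
            silent_count = 0                             # SA26 / SA29
        rtp_timestamp += FRAME_LEN                       # SA10
    return sent
```

For example, with `silent_thres=2` the input `[True, False, False, False, False, True]` yields a normal frame, one more frame during the hangover, a faded-out frame, one ideal silent frame, no packet for the fifth frame, and then a faded-in earlier frame with the marker set followed by the current frame.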
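【０１１７】
According to the above embodiment, at the transition from silent to voiced, a packet containing the speech data frame obtained by fading in the frame immediately before the voiced state is sent, so the beginning of the utterance is no longer lost on returning from silence.
【０１１８】
Furthermore, after sending the packet containing the speech data frame whose level gradually decreases from the first sample toward the last, the transmitting apparatus 1 sends one more packet containing an ideal silent speech data frame, so the receiving side can reliably recognize that the state has changed from voiced to silent and that transmission has stopped.
【０１１９】
Also, because fade-out processing gradually reduces the level of the speech data frame at the voiced-to-silent change, that is, the end of the utterance, the speech waveform at the receiving side is not discontinuous at the voiced-to-silent transition; no abnormal sound occurs there, and degradation of speech quality is reduced.
【０１２０】
Moreover, according to the above embodiment, one or more speech data frames are sent after the state is judged to have changed from voiced to silent, so the end of the utterance is not cut off abruptly. Combined with the fade-out processing that gradually reduces the level, this keeps the waveform at the receiving side continuous at the voiced-to-silent transition, so no abnormal sound occurs there and degradation of speech quality is reduced.
【０１２１】
In addition, when transmission resumes from the paused state and the first speech data frame is packetized and sent, the marker bit M of the RTP header is set to "1" to indicate that this is the first speech data frame after returning from silence. The receiving side refers to this marker bit and resets the decoder's internal state, which improves robustness against transmission errors.
【０１２２】
The fade-in and fade-out processing may span multiple speech data frames. The fade-in processing may also be applied when a voiced speech data frame follows a run of multiple silent speech data frames.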
【０１２３】
次に、本発明の第２実施形態を説明する。
【０１２４】
図７は第２実施形態における音声パケット通信システムの機能構成を示すブロック図である。図において、前述した第１実施形態と同一構成部分は同一符号を持って表しその説明を省略する。
【０１２５】
また、第２実施形態と第１実施形態との相違点は、第２実施形態では前述した第１実施形態における有音無音判定部１３に代えて発話機能ボタン制御情報を入力してこの発話機能ボタン制御情報に基づいて有音無音の判定を行う送信判定処理部１９を設けたことである。
【０１２６】
第２実施形態における送信判定処理部１９は、第１実施形態における有音無音判定部１３が有する機能に加えて、発話者が発話するときに押下する発話機能ボタン（図示せず）から入力した発話機能ボタン制御情報に基づいて、発話制御ボタンが押下されている（オンされている）ことを認識すると共に、発話制御ボタンが押下された時点から数フレーム過去にさかのぼった音声データフレームから送信を開始する機能を備えている。このときもフェードイン処理を行うことは第１実施形態と同様である。ここで、発話者が発話中であるときは発話機能ボタンは押下され続ける。
【０１２７】
以下に、第２実施形態における処理の詳細を図８乃至図１０のフローチャートを参照して説明する。
【０１２８】
送信装置１においては、駆動開始直後に初期化処理を行う（ＳＢ１）。この初期化処理では、変数である無音判定カウントSilentCountの値を「０」に設定すると共にRTPTimeStampの値を「０」に設定する。
【０１２９】
次に、送信装置１は、処理を開始すると、音声入力部１１を介して入力した音声信号は順次Ａ／Ｄ変換部１２を介して有音無音判定部１３のバッファに格納する（ＳＢ２）。
【０１３０】
次いで、送信装置１は、発話機能ボタンが押下されているか否かを判定し（ＳＢ３）、発話機能ボタンが押下中であるときは後述する前記ＳＢ１９の処理に移行する。
【０１３１】
また、発話機能ボタンが押下されていないときは、無音判定カウントSilentCountの値を「１」増加し（ＳＢ４）、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるか否かを判定する（ＳＢ５）。
【０１３２】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるときは後述するＳＢ１２の処理に移行し、無音判定カウントSilentCountの値がカウント閾値SilentThres以下のときは、無音判定カウントSilentCountの値がカウント閾値SilentThresに等しいか否かすなわち送信休止状態が開始されたか否かを判定する（ＳＢ６）。
【０１３３】
If the silence determination counter SilentCount does not equal the count threshold SilentThres, processing moves to SB8, described below; if it does, the current speech data frame is faded out (SB7).
【０１３４】
次に、フェードアウト処理した音声データフレームを符号化処理し（ＳＢ８）、この符号化処理した音声データフレームとこれに対応するＲＴＰタイムスタンプRTPTimeStampとを含むパケットを生成して、このパケットを送信する（ＳＢ９）。
【０１３５】
この後、ＲＴＰタイムスタンプRTPTimeStampをフレーム長分増加する（ＳＢ１０）。即ち、ＲＴＰタイムスタンプRTPTimeStampの値にフレーム長FrameLenの値を加算した値を新たなＲＴＰタイムスタンプRTPTimeStampの値とする。
【０１３６】
次いで、現在の音声データフレームをバッファに保持して（ＳＢ１１）、前記ＳＢ２の処理に移行する。
【０１３７】
一方、前記ＳＢ５の判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるときは、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しく送信休止状態になったばかりであるか否かを判定する（ＳＢ１２）。
【０１３８】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しくないときは後述するＳＢ１６の処理に移行し、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しいときは理想的な無音音声データフレームを生成して（ＳＢ１３）、この無音音声データフレームを符号化し（ＳＢ１４）、この符号化した理想的な無音音声データフレームとこれに対応するＲＴＰタイムスタンプRTPTimeStampとを含むパケットを生成して、このパケットを送信する（ＳＢ１５）。この後、前記ＳＢ１０の処理に移行する。
【０１３９】
また、前記ＳＢ１２の判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しくないときは、遅延カウントDelayCountの値からフレーム長FrameLenの値を減算した値を新たな遅延カウントDelayCountの値とし（ＳＢ１６）、遅延カウントDelayCountの値が０以下であるか否かを判定する（ＳＢ１７）。
【０１４０】
この判定の結果、遅延カウントDelayCountの値が０よりも大きいときは前記ＳＢ１０の処理に移行し、遅延カウントDelayCountの値が０以下であるときは遅延カウントDelayCountの値を０に設定して（ＳＢ１８）、前記ＳＢ１０の処理に移行する。
【０１４１】
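If, on the other hand, SB3 finds that the talk function button is pressed, it is determined whether the silence determination counter SilentCount exceeds the count threshold SilentThres, that is, whether transmission was paused at the previous frame (SB19).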
【０１４２】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres以下のときは後述するＳＢ２２の処理に移行する。また、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きいときは、後述する送信再開処理を実行し（ＳＢ２０）、無音判定カウントSilentCountの値を「０」に初期化する（ＳＢ２６）。この後、前記ＳＢ１０の処理に移行する。
【０１４３】
一方、前記ＳＢ１９の判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres以下のときは現在の音声データフレームを符号化処理し（ＳＢ２２）、この符号化処理した音声データフレームとこれに対応する現在のＲＴＰタイムスタンプRTPTimeStampとを含むパケットを生成してこれを送信する（ＳＢ２３）。この後、無音判定カウントSilentCountの値を「０」に設定して初期化した後（ＳＢ２４）、前記ＳＢ１０の処理に移行する。
【０１４４】
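(First example of transmission resume processing)
FIG. 11 illustrates the packetization of a speech signal in the transmission resume processing of the first example, and FIG. 12 is a flowchart of that processing.
【０１４５】
In the resume processing of the first example, it is first determined whether SilentCount is greater than the constant StartFrames, that is, whether transmission has been stopped for long enough (SC1). If SilentCount is greater than StartFrames, the number N of speech data frames to go back and send on resuming is set to StartFrames (SC2). If SilentCount is equal to or less than StartFrames, N is set to two less than SilentCount (N = SilentCount - 2) (SC3).
【０１４６】
Next, the speech data frame held in the buffer N frames back is faded in (SC4), and the faded-in frame is encoded (SC5).
【０１４７】
A packet containing the faded-in frame and the RTP timestamp of N frames back (RTPTimeStamp - FrameLen * N, where * denotes multiplication) is then generated and sent (SC6), with the marker bit M of the RTP header set to "1".
【０１４８】
Next, the buffered speech data frames from N-1 frames back up to the current frame are encoded in turn, and packets containing each encoded frame and its RTP timestamp (RTPTimeStamp - (N-1-i) * FrameLen) are generated and sent in turn (SC7). Here i is an integer from 1 to N-1.
【０１４９】
After this, the delay increase counter DelayCount is increased by N frame lengths (DelayCount += FrameLen * N) (SC8), and the resume processing ends.
【０１５０】
Of the N frames, silent speech data frames and stationary portions may be thinned out by signal processing, and future frames may also be thinned out while DelayCount is positive, to limit the increase in delay. In that case DelayCount is reduced by the amount thinned out.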
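The choice of the look-back count N (SC1 to SC3) and the timestamps attached to the re-sent frames can be sketched as below. The function names are hypothetical, and the timestamp list is an assumption about the intended result: it covers the faded-in frame N frames back, every buffered frame after it, and the current frame, oldest first.

```python
def lookback_count(silent_count, start_frames):
    """SC1-SC3: number N of buffered frames to go back when transmission resumes."""
    if silent_count > start_frames:      # SC1: pause was long enough
        return start_frames              # SC2
    return silent_count - 2              # SC3


def resume_timestamps(rtp_timestamp, frame_len, n):
    """Timestamps for the frame N back (faded in) through the current frame."""
    return [rtp_timestamp - frame_len * k for k in range(n, -1, -1)]


def delay_after_resume(delay_count, frame_len, n):
    """SC8: the N extra frames add N frame lengths of delay."""
    return delay_count + frame_len * n
```

With `rtp_timestamp=800`, `frame_len=160` and `n=3`, the first timestamp is 320, which matches RTPTimeStamp - FrameLen * N from step SC6.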
【０１５１】
上記第２実施形態の第１実施例によれば、発話機能ボタンが押下されて発話が開始され、無音状態から有音状態に遷移するときに、発話機能ボタンが押下された瞬間のフレームからＮ個前のフレームにさかのぼって、フェードイン処理して得られた音声データフレームを含むパケットから送信するので、無音状態から有音状態への復帰時に話頭が消失してしまうことがなくなる。
【０１５２】
さらに、有音状態から無音状態になったときの音声データフレームすなわち話尾部分の音声レベルが徐々に減少されるフェードアウト処理も第１実施形態と同様に施されるため、受信側において有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【０１５３】
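(Second example of transmission resume processing)
FIG. 13 illustrates the packetization of a speech signal in the transmission resume processing of the second example, and FIG. 14 is a flowchart of that processing.
【０１５４】
In the resume processing of the second example, it is first determined whether SilentCount is greater than the constant StartFrames, that is, whether transmission has been stopped for long enough (SD1). If SilentCount is greater than StartFrames, the number N of speech data frames to go back and send on resuming is set to StartFrames (SD2). If SilentCount is equal to or less than StartFrames, N is set to two less than SilentCount (N = SilentCount - 2) (SD3).
【０１５５】
Next, the variable i is set to 1 (SD4). The power p(i) of the buffered speech data frame i frames before the current frame is calculated (SD5), and it is determined whether p(i) is at or below a predetermined threshold or i is (N-1) or more (SD6).
【０１５６】
If p(i) is greater than the threshold and i is less than (N-1), i is incremented by 1 (SD7) and processing returns to SD5. If p(i) is at or below the threshold or i is (N-1) or more, the frame one before the frame i frames back from the current frame is faded in (SD8), and the faded-in frame is encoded (SD9).
【０１５７】
A packet containing the faded-in frame and the RTP timestamp of i frames back (RTPTimeStamp - FrameLen * i, where * denotes multiplication) is then generated and sent (SD10), with the marker bit M of the RTP header set to "1".
【０１５８】
Next, the buffered speech data frames from i frames back up to the current frame are encoded in turn, and packets containing each encoded frame and its RTP timestamp (RTPTimeStamp - (i-j) * FrameLen) are generated and sent in turn (SD11). Silent speech data frames and stationary portions may be thinned out here by signal processing to limit the increase in delay. Here j is an integer from 1 to i.
【０１５９】
The delay increase counter DelayCount is then increased by i frame lengths (DelayCount += FrameLen * i) (SD12), and the resume processing ends.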
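The backward search of steps SD4 to SD7 can be sketched as a loop that walks back through the buffered frame powers until it finds a quiet frame or reaches the look-back limit. This is an illustration under assumptions: the powers are supplied as a precomputed list, and `find_fade_in_offset` is a hypothetical name.

```python
def find_fade_in_offset(powers_back, threshold, n):
    """Return i, the number of frames to go back before fading in.

    powers_back[i - 1] is the power p(i) of the frame i frames before the
    current one (assumed to hold at least max(1, n - 1) entries).
    """
    i = 1                                                   # SD4
    # SD5-SD7: keep going back while the frame is still loud and the limit allows
    while powers_back[i - 1] > threshold and i < n - 1:
        i += 1
    return i
```

The loop continues only while both conditions hold, which is the logical complement of the stop condition in SD6 (power at or below the threshold, or i at least N-1).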
【０１６０】
上記第２実施形態の第２実施例によれば、発話機能ボタンが押下されて発話が開始され、無音状態から有音状態に遷移するときに、発話機能ボタンが押下された瞬間のフレームから音声データのパワーが閾値以上になる１つまえのフレーム、すなわち現時点のフレームからｉ個前のフレームにさかのぼって、フェードイン処理して得られた音声データフレームを含むパケットから送信するので、無音状態から有音状態への復帰時に話頭が消失してしまうことがなくなる。
【０１６１】
さらに、有音状態から無音状態になったときの音声データフレームすなわち話尾部分の音声レベルが徐々に減少されるフェードアウト処理も第１実施形態と同様に施されるため、受信側において有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【０１６２】
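The first and second examples above can also be combined with CNG, in which the receiving side estimates the background noise and outputs pseudo background noise that it generates during silent intervals when no packets are received. In this case, the pseudo background noise is faded in and added to the last, faded-out frame received, giving a continuous transition from the voiced interval to the pseudo-background-noise interval. Likewise, the pseudo background noise is faded out and added to the first packet received as voiced, giving a continuous transition from the pseudo-background-noise interval to the voiced interval.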
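The combination with comfort noise can be sketched as an overlap-add at the receiver. The sketch assumes the received frame has already been faded out (or faded in) by the sender, that the noise frame has the same length, and that the noise gain follows a linear ramp; none of these details are fixed by the patent, and the function names are hypothetical.

```python
def crossfade_to_noise(faded_out_frame, noise_frame):
    """Add faded-in pseudo background noise to the last (faded-out) speech frame."""
    n = len(faded_out_frame)
    if n < 2:
        return [s + v for s, v in zip(faded_out_frame, noise_frame)]
    return [s + v * i / (n - 1)
            for i, (s, v) in enumerate(zip(faded_out_frame, noise_frame))]


def crossfade_from_noise(faded_in_frame, noise_frame):
    """Add faded-out pseudo background noise to the first (faded-in) speech frame."""
    n = len(faded_in_frame)
    if n < 2:
        return [s + v for s, v in zip(faded_in_frame, noise_frame)]
    return [s + v * (n - 1 - i) / (n - 1)
            for i, (s, v) in enumerate(zip(faded_in_frame, noise_frame))]
```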
【０１６３】
次に、本発明の第３実施形態を説明する。
【０１６４】
図１５は本発明の第３実施形態における音声パケット通信システムの機能構成を示すブロック図である。図において、前述した第１実施形態と同一構成部分は同一符号を持って表しその説明を省略する。
【０１６５】
また、第３実施形態と第１実施形態との相違点は、第３実施形態では前述した第１実施形態における符号化処理部１６に代えて符号化処理部１６’を備えると共に、パケット解析部２２及び復号化処理部２３に代えてパケット解析部２２’及び復号化処理部２３’を設けたことである。
【０１６６】
符号化処理部１６’は後述するように、無音状態から有音状態に遷移したときに用いる分析結果として、分析初期値を参照してデータ分析及び符号化を行った場合のＳ／Ｎ（信号雑音比）と、前記一時記憶されている分析結果を参照してデータ分析及び符号化を行った場合のＳ／Ｎとを比較して、良好なＳ／Ｎをもつ符号化音声データフレームを使用する。
【０１６７】
パケット解析部２２’は、パケット解析部２２が有する機能に加えて、受信パケットを解析してＲＴＰヘッダのマーカービットＭが「１」のときにリセット情報を復号化処理部２３’に送出する機能を備えている。
【０１６８】
復号化処理部２３’は、復号化処理部２３が有する機能に加えて、パケット解析部２２’からリセット情報を受けたときにだけ一時記憶されている分析結果ではなく分析初期値を参照してデータ分析を行い、データの復号化処理を行う機能を備えている。
【０１６９】
図１６は符号化処理部１６’を示す機能ブロック図である。図に示すように、符号化処理部１６’は、入力音声データ保持部161と、符号化部162,163、内部データ保持部164、符号化音声データ保持部165,166、ローカル復号化部167,168、第１誤差計算部169、第２誤差計算部170、誤差比較部171、スイッチ部172とから構成されている。
【０１７０】
入力音声データ保持部161は、入力した音声データフレームを保持し、この音声データフレームを符号化部162,163と、第１誤差計算部169及び第２誤差計算部170に供給する。
【０１７１】
符号化部162は、内部データ保持部164に保持されているデータに基づいて、入力音声データ保持部161から供給された音声データフレームを符号化し、これを符号化音声データ保持部165に出力する。
【０１７２】
符号化部163は、入力音声データ保持部161から供給された音声データフレームを符号化し、これを符号化音声データ保持部166に出力する。ここで、符号化するときは、常に内部状態がリセットされた状態、すなわち前のデータ符号化の状態を参照しないで符号化を行う。
【０１７３】
内部データ保持部164は、符号化部162において音声データフレームを符号化した符号化音声データを保持し、次の音声データフレームが符号化部162において符号化される際に保持している符号化音声データを符号化部162に供給する。
【０１７４】
符号化音声データ保持部165は、符号化部162によって符号化された符号化音声データを一時的に保持すると共に、この符号化音声データをローカル復号化部167とスイッチ部172に出力する。
【０１７５】
符号化音声データ保持部166は、符号化部163によって符号化された符号化音声データを一時的に保持すると共に、この符号化音声データをローカル復号化部168とスイッチ部172に出力する。
【０１７６】
ローカル復号化部167は、符号化音声データ保持部165から供給された符号化音声データを復号して得られた音声データを第１誤差計算部169に出力する。
【０１７７】
ローカル復号化部168は、符号化音声データ保持部166から供給された符号化音声データを復号して得られた音声データを第２誤差計算部170に出力する。
【０１７８】
第１誤差計算部169は、入力音声データ保持部161から供給される音声データとローカル復号化部167から入力した音声データとの誤差分（符号化誤差（Ｓ／Ｎ））を求めて、これを誤差比較部171に出力する。さらに、第１誤差計算部169は、無音状態から有音状態に遷移したときに内部データ保持部164に保持されているデータを消去して初期化する。
【０１７９】
第２誤差計算部170は、入力音声データ保持部161から供給される音声データとローカル復号化部168から入力した音声データとの誤差分（符号化誤差（Ｓ／Ｎ））を求めて、これを誤差比較部171に出力する。
【０１８０】
誤差比較部171は、第１誤差計算部169から入力した誤差分（符号化誤差（Ｓ／Ｎ））と第２誤差計算部170から入力した誤差分（符号化誤差（Ｓ／Ｎ））とを比較して、この比較結果に基づいて、符号化誤差（Ｓ／Ｎ）が良好な（小さい）方の符号化音声データフレームをパケット生成部１７に出力するようにスイッチ部172を切り替える。さらに、誤差比較部171は、前述したように無音状態から有音状態になったときにＲＴＰヘッダのマーカービットＭを「１」に設定するようにパケット生成部１７に通知する。
【０１８１】
次に、上記構成よりなる第３実施形態における送信装置１及び受信装置２の動作を図１７乃至図２０に示すフローチャートを参照して詳細に説明する。
【０１８２】
送信装置１においては、駆動開始直後に初期化処理を行う（ＳＥ１）。この初期化処理では、変数である無音判定カウントSilentCountの値を「０」に設定すると共にRTPTimeStampの値を「０」に設定する。
【０１８３】
次に、送信装置１は、処理を開始すると、音声入力部１１を介して入力した音声信号は順次Ａ／Ｄ変換部１２を介して有音無音判定部１３のバッファに格納する（ＳＥ２）。
【０１８４】
次いで、送信装置１は、有音無音判定部１３のバッファに格納されている先頭の音声データから順に、判定対象となる音声データフレームのパワーが閾値以下であるか否かすなわち無音状態であるか有音状態であるかを判定し（ＳＥ３）、音声データフレームのパワーが閾値よりも大きい有音状態のときは後述する前記ＳＥ１９の処理に移行する。
【０１８５】
また、音声データフレームのパワーが閾値以下の無音状態のときは、無音判定カウントSilentCountを「１」増加し（ＳＥ４）、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるか否かを判定する（ＳＥ５）。
【０１８６】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるときは後述するＳＥ１２の処理に移行し、無音判定カウントSilentCountの値がカウント閾値SilentThres以下のときは、無音判定カウントSilentCountの値がカウント閾値SilentThresに等しいか否かすなわち送信休止状態が開始されたか否かを判定する（ＳＥ６）。
【０１８７】
If the silence determination counter SilentCount does not equal the count threshold SilentThres, processing moves to SE8, described below; if it does, the current speech data frame is faded out (SE7).
【０１８８】
次に、フェードアウト処理した音声データフレームを符号化処理し（ＳＥ８）、この符号化処理した音声データフレームとこれに対応するＲＴＰタイムスタンプRTPTimeStampとを含むパケットを生成して、このパケットを送信する（ＳＥ９）。
【０１８９】
この後、ＲＴＰタイムスタンプRTPTimeStampをフレーム長分増加する（ＳＥ１０）。即ち、ＲＴＰタイムスタンプRTPTimeStampの値にフレーム長FrameLenの値を加算した値を新たなＲＴＰタイムスタンプRTPTimeStampの値とする。
【０１９０】
次いで、現在の音声データフレームをバッファに保持して（ＳＥ１１）、前記ＳＥ２の処理に移行する。
【０１９１】
一方、前記ＳＥ５の判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きい送信休止状態であるときは、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しいか否か、すなわち送信休止状態になったばかりであるか否かを判定する（ＳＥ１２）。
【０１９２】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しくないときは後述するＳＥ１６の処理に移行し、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しいときは理想的な無音音声データフレームを生成して（ＳＥ１３）、この無音音声データフレームを符号化し（ＳＥ１４）、この符号化した理想的な無音音声データフレームとこれに対応するＲＴＰタイムスタンプRTPTimeStampとを含むパケットを生成して、このパケットを送信する（ＳＥ１５）。この後、前記ＳＥ１０の処理に移行する。
【０１９３】
また、前記ＳＥ１２の判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres＋１に等しくないときは、遅延カウントDelayCountの値からフレーム長FrameLenの値を減算した値を新たな遅延カウントDelayCountの値とし（ＳＥ１６）、遅延カウントDelayCountの値が０以下であるか否かを判定する（ＳＥ１７）。
【０１９４】
この判定の結果、遅延カウントDelayCountの値が０よりも大きいときは前記ＳＥ１０の処理に移行し、遅延カウントDelayCountの値が０以下であるときは遅延カウントDelayCountの値を０に設定して（ＳＥ１８）、前記ＳＥ１０の処理に移行する。
【０１９５】
一方、前記ＳＥ３の判定の結果、音声データフレームのパワーが閾値よりも大きい有音状態のときは、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きいか否かすなわち１つ前のフレームは送信休止状態であるか否かを判定する（ＳＥ１９）。
【０１９６】
この判定の結果、無音判定カウントSilentCountの値がカウント閾値SilentThres以下のときは後述するＳＥ２８の処理に移行する。また、無音判定カウントSilentCountの値がカウント閾値SilentThresよりも大きいときは、バッファに保持されている１つ前の音声データフレームをフェードイン処理する（ＳＥ２０）と共に、符号化処理部１６’を初期化し（ＳＥ２１）、さらに前記フェードイン処理した音声データフレームを符号化処理する（ＳＥ２２）。
【０１９７】
次に、フェードイン処理した音声データフレームと該音声データフレームに対応するＲＴＰタイムスタンプ（RTPTimeStamp-FrameLen）とを含むパケットを生成して該パケットを送信する（ＳＥ２３）。このとき、ＲＴＰヘッダのマーカービットＭを「１」に設定しておく。
【０１９８】
この後、現在の音声データフレームすなわちフェードイン処理した音声データフレームの次の音声データフレームを符号化処理し（ＳＥ２４）、該符号化処理した音声データフレームと現在のＲＴＰタイムスタンプRTPTimestampとを含むパケットを生成してこれを送信する（ＳＥ２５）。
【０１９９】
次に、遅延増加量カウンタDelayCountの値をフレーム長FrameLenの値分だけ増加させる（ＳＥ２６）と共に、無音判定カウントSilentCountの値を「０」に初期化する（ＳＥ２７）。この後、前記ＳＥ１０の処理に移行する。
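[0200]
On the other hand, if the determination in SE19 shows that SilentCount is equal to or less than SilentThres, the current audio data frame is encoded by the encoding unit 162 (SE28) and is also encoded by the encoding unit 163 (SE29). Because the internal state of the encoding unit 163 has been reset, it encodes the frame without referring to the encoded data of the previous encoding.
[0201]
Next, the S/N of the audio data frame encoded by the encoding unit 162 is compared with the S/N of the audio data frame encoded by the encoding unit 163 (SE30), and a packet containing the encoded audio data frame with the better S/N and the corresponding RTP timestamp is generated and transmitted (SE31). When the audio data frame encoded by the encoding unit 163 is used, the marker bit M of the RTP header is set to "1" to raise the silence-return flag.
[0202]
After that, SilentCount is set to "0" (SE32), and processing moves to SE10.
[0203]
According to the transmitting device of the above embodiment, the S/N (signal-to-noise ratio) of the audio data in the frame encoded on the basis of the retained analysis results is compared with the S/N of the audio data in the frame encoded on the basis of the initial values of the analysis results, and the frame with the larger S/N is adopted for packet generation. Encoding can therefore be performed so that the correlation between the frame being encoded and the preceding frame is favorable, and the receiving side can obtain natural-sounding audio data on reproduction.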
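The selection between the two encoder outputs can be sketched as follows. This is an illustration under assumptions, not the codec's actual error measure: S/N is computed here as signal energy over squared reconstruction error in dB, the locally decoded frames are passed in as plain sample lists, and ties are resolved in favor of the continued (non-reset) encoder. All function names are hypothetical.

```python
import math


def snr_db(original, decoded):
    """Signal-to-noise ratio of a locally decoded frame, in dB."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


def choose_encoding(original, decoded_continued, decoded_reset):
    """Return (choice, marker_bit) for the frame to transmit.

    decoded_continued: local decode of the encoder that kept its state.
    decoded_reset:     local decode of the encoder that was reset.
    """
    if snr_db(original, decoded_reset) > snr_db(original, decoded_continued):
        return "reset", 1      # receiver must reset its decoder too
    return "continued", 0      # assumption: ties keep the continued state
```

Setting the marker bit only when the reset encoder wins keeps the decoder state aligned with whichever encoder produced the transmitted frame.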
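[0204]
Next, the operation of the receiving device 2 will be described with reference to the flowchart shown in FIG. 20.
[0205]
When the receiving device 2 starts reception processing, it analyzes the received packet (SF1), separating the RTP header from the audio data frame and analyzing the information in the RTP header.
[0206]
Based on this analysis, it determines whether the marker bit M (silence-return flag) of the RTP header is set to "1", that is, whether the flag is on (SF2). If the marker bit M is "0", processing moves to SF4 described below. If the marker bit M is "1", the packet analysis unit 22' notifies the decoding processing unit 23' of reset information, and the internal state of the decoding processing unit is initialized (reset) (SF3).
[0207]
After that, the received encoded audio data frame is decoded (SF4), and the decoded audio data frame is reproduced as audio (SF5).
[0208]
According to the above reception processing, the marker bit M makes it possible to recognize a reset of the transmitting-side encoder's internal state not only where the audio data transitions from the silent state to the voiced state, but also at points in the audio data where resetting the encoder's internal state does not affect the coding gain. The internal state of the transmitting-side encoder and that of the receiving-side decoder can therefore be kept synchronized. Performing the reset also makes it possible to recover from a mismatch between the transmitting-side and receiving-side internal states caused by transmission errors such as packet loss, which reduces the degradation in quality.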
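The receiving loop (SF1 to SF5) can be illustrated with a toy stateful decoder. The `ToyDecoder` class is a hypothetical stand-in whose output depends on the previous frame; it is not a real codec and serves only to show when the reset takes effect.

```python
class ToyDecoder:
    """Hypothetical stateful decoder: output depends on the previous input."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.prev = 0

    def decode(self, value):
        out = value + self.prev
        self.prev = value
        return out


def receive(packets, decoder):
    """packets: iterable of (marker_bit, payload) pairs already parsed (SF1)."""
    played = []
    for marker, payload in packets:
        if marker == 1:       # silence-return flag is on (SF2)
            decoder.reset()   # initialize the decoder's internal state (SF3)
        played.append(decoder.decode(payload))  # decode and play (SF4-SF5)
    return played
```

Because the reset is driven purely by the marker bit, the decoder resynchronizes at every flagged packet even if earlier packets were lost.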
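[0209]
The above embodiments are specific examples of the present invention, and the present invention is not limited to the configurations of these embodiments.
[0210]
[Effects of the Invention]
As described above, according to the present invention, fade-in processing gradually increases the audio level of the audio data portion that enters the voiced state, so on the receiving side the audio waveform does not become discontinuous where the audio transitions from the silent state to the voiced state. No abnormal sound occurs at this transition, and degradation of voice quality can be reduced.
[0214]
Further, according to the present invention, the first packet sent when packet transmission restarts after being stopped in the silent state includes information signifying the voiced start state, so the receiving side can recognize from this information that the audio has returned from the silent state to the voiced state.
[0216]
Further, according to the present invention, the information signifying the voiced start state is included in the transmitted packet, so the receiving side can easily recognize the start of the voiced state.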
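[Brief description of the drawings]
[FIG. 1] Block diagram showing the functional configuration of the voice packet communication system according to the first embodiment of the present invention
[FIG. 2] Diagram explaining packetization of an audio signal by the voice packet transmitting device according to the first embodiment of the present invention
[FIG. 3] Diagram explaining the real-time transfer protocol header used in the first embodiment of the present invention
[FIG. 4] Flowchart explaining the packet transmission processing of the transmitting device according to the first embodiment of the present invention
[FIG. 5] Flowchart explaining the packet transmission processing of the transmitting device according to the first embodiment of the present invention
[FIG. 6] Flowchart explaining the packet transmission processing of the transmitting device according to the first embodiment of the present invention
[FIG. 7] Block diagram showing the functional configuration of the voice packet communication system according to the second embodiment of the present invention
[FIG. 8] Flowchart explaining the packet transmission processing of the transmitting device according to the second embodiment of the present invention
[FIG. 9] Flowchart explaining the packet transmission processing of the transmitting device according to the second embodiment of the present invention
[FIG. 10] Flowchart explaining the packet transmission processing of the transmitting device according to the second embodiment of the present invention
[FIG. 11] Diagram explaining packetization of an audio signal in the transmission resumption processing of the first example of the second embodiment of the present invention
[FIG. 12] Flowchart explaining the transmission resumption processing of the first example of the second embodiment of the present invention
[FIG. 13] Diagram explaining packetization of an audio signal in the transmission resumption processing of the second example of the second embodiment of the present invention
[FIG. 14] Flowchart explaining the transmission resumption processing of the second example of the second embodiment of the present invention
[FIG. 15] Block diagram showing the functional configuration of the voice packet communication system according to the third embodiment of the present invention
[FIG. 16] Functional block diagram showing the encoding processing unit according to the third embodiment of the present invention
[FIG. 17] Flowchart explaining the operation of the transmitting device according to the third embodiment of the present invention
[FIG. 18] Flowchart explaining the operation of the transmitting device according to the third embodiment of the present invention
[FIG. 19] Flowchart explaining the operation of the transmitting device according to the third embodiment of the present invention
[FIG. 20] Flowchart explaining the operation of the receiving device according to the third embodiment of the present invention
[Explanation of reference numerals]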
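1: voice packet transmitting device; 2: voice packet receiving device; 3: communication network; 11: audio input unit; 12: analog/digital (A/D) conversion unit; 13: voiced/silent determination unit; 14: switch unit; 15: fade-in/fade-out processing unit; 16, 16': encoding processing unit; 17: packet generation unit; 18: transmission unit; 19: transmission determination processing unit; 21: reception unit; 22, 22': packet analysis unit; 23, 23': decoding processing unit; 24: analog/digital (A/D) conversion unit; 25: audio output unit; 161: input audio data holding unit; 162, 163: encoding units; 164: internal data holding unit; 165, 166: encoded audio data holding units; 167, 168: local decoding units; 169: first error calculation unit; 170: second error calculation unit; 171: error comparison unit; 172: switch unit.
[0001]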
BACKGROUND OF THE INVENTION
  The present invention relates to a voice packet transmitting apparatus, and a method therefor, that improve voice reproduction quality at the boundary between a voiced state and a silent state, and at the boundary between a voice transmission state and a voice transmission pause state when a transmission function button is used.
[0002]
[Prior art]
Conventionally, with the digitization of electronic devices, it has become common in information communication to packetize the information to be transferred. For example, when an audio signal is transferred, the transmitting side distributes audio data sampled at a predetermined sampling frequency into separate packets, a predetermined amount at a time, and transfers it packet by packet. The receiving side extracts the audio data from the received packets, joins the extracted audio data together, and reproduces it.
[0003]
That is, in an electronic device that performs packet communication as described above, the transmitting side forms and transmits a packet each time one packet's worth of data has been obtained, and the receiving side reads the data stored in each received packet at the intervals required to reproduce it. In real-time transfer of audio data, for example, this lets the receiving side reproduce continuous audio from multiple separately received packets.
[0004]
In such transfers, silence compression is used to reduce the amount of audio data transmitted: of the audio data obtained by sampling on the transmitting side, the portions determined to be silent are not actually transmitted.
[0005]
Similarly, instead of determining silence automatically, some techniques acquire the user's intention to speak explicitly through a function button or the like. Voice data is transmitted only while the speech function button is pressed, and the amount of data communication is reduced by not transmitting audio data while the button is not pressed.
[0006]
Such packet communication is almost always performed using computer devices. It is used, for example, in mobile phones using wireless communication, well-known IP phones using a communication network such as the Internet, systems that distribute content such as music from a distribution server to user terminal devices, and remote conference systems.
[0007]
[Patent Document 1]
JP 2000-83050 A
[Non-Patent Document 1]
ITU-T Recommendation G.729 Annex B
[0008]
[Problems to be solved by the invention]
However, in the packet-based transfer of voice data described above, the waveform becomes discontinuous where the audio transitions from the voiced state to the silent state, so abnormal sound can occur at the transition and degrade voice quality. One way to reduce this degradation is for the transmitting side to send information on the background sound of the portion determined to be silent, and for the receiving side to generate the silent portion from the received voiced portion and background sound information. However, this imposes a high computational load on both the transmitting side and the receiving side.
[0009]
In addition, when well-known packet loss concealment (PLC) is performed on the receiving side, the receiver cannot tell at a transition from the voiced state to the silent state whether a packet was lost or simply was not transmitted because of silence. Consequently, even when the transmitting side has determined that the audio is silent and stopped transmitting, PLC is applied on the receiving side and pseudo speech is generated from the last packet received.
[0010]
ITU-T G.711 Appendix I is a known example of packet loss concealment.
[0011]
In addition, when silence compression is performed, hysteresis in the transition from the voiced state to the silent state can be used to err on the safe side so that the end of speech is not cut off. On the return from the silent state to the voiced state, however, the beginning of speech is lost. Providing a safety margin against this clipping requires always looking ahead and accepting the resulting delay, which most systems currently do not do.
[0012]
  In view of the above problems, an object of the present invention is to provide a voice packet transmitting apparatus and method that improve voice reproduction quality at the boundary between a voiced state and a silent state.
[0013]
[Means for Solving the Problems]
  In order to achieve the above object, the present invention proposes a voice packet transmitting apparatus that comprises audio data frame generation means for generating audio data frames by cutting out, at predetermined time intervals, audio data based on continuously input audio, and that generates packets including the audio data frames and transmits the packets via a communication network, the apparatus comprising: means for holding the input audio data frames; voiced/silent determination means for determining whether an input audio data frame is voiced or silent; means for stopping packet transmission when the state in which the input audio data frames are determined to be silent by the voiced/silent determination means continues for a certain period of time or longer; means for resuming packet transmission when, with packet transmission stopped because that state has continued for the certain period of time or longer, an input audio data frame is determined to be voiced; means for, when packet transmission is resumed, taking as a reanalysis frame the silent data frame at least one frame before the audio data frame determined to be voiced, and performing reanalysis using information on the audio frame determined to be voiced and information on the audio data frames determined to be silent up to the reanalysis frame; means for, when the reanalysis determines that the reanalysis frame is close to voiced, converting the audio data frame immediately before the reanalysis frame into audio data whose audio level is gradually reduced from the end toward the beginning, transmitting a packet including the converted audio data frame as the beginning of the voiced state, transmitting the silent data frames between the frame immediately before the reanalysis frame and the audio data frame determined to be voiced, and then transmitting the audio data frame determined to be voiced; and means for, when the reanalysis determines that the reanalysis frame is close to silent, converting the reanalysis frame into audio data whose audio level is gradually reduced from the end toward the beginning, transmitting a packet including the converted audio data frame as the beginning of the voiced state, and then transmitting the audio data frame determined to be voiced.
[0027]
According to the voice packet transmitting apparatus of the present invention, packet transmission is resumed when an input audio data frame changes to the voiced state while packet transmission is stopped because the input audio data frames have been determined to be silent for a certain period of time or longer. When transmission is resumed after the change from the silent state to the voiced state, the silent data frame at least one frame before the audio data frame determined to be voiced is taken as a reanalysis frame, and reanalysis is performed using information on the audio frame determined to be voiced and information on the audio data frames determined to be silent up to the reanalysis frame. If the reanalysis determines that the reanalysis frame is close to voiced, the audio data frame immediately before the reanalysis frame is converted into fade-in-processed audio data whose audio level is gradually reduced from the end toward the beginning. A packet including this faded-in audio data frame is transmitted as the beginning of the voiced state. Packets including the silent data frames up to the audio data frame determined to be voiced are then transmitted, followed by a packet including the audio data frame determined to be voiced.
[0028]
Further, if the reanalysis determines that the reanalysis frame is close to silent, the reanalysis frame itself is converted into audio data whose audio level is gradually reduced from the end toward the beginning. A packet including the converted audio data frame is transmitted as the beginning of the voiced state, and then the audio data frame determined to be voiced is transmitted.
[0029]
Further, the present invention proposes the above voice packet transmitting apparatus having means for, when packet transmission is resumed and transmission begins, as the beginning of the voiced state, from a frame at least one frame before the audio data frame determined to be voiced, shortening subsequent samples by the number of samples corresponding to the delay added by the extra silent frames transmitted.
[0030]
According to the voice packet transmitting apparatus of the present invention, when packet transmission is resumed and transmission begins, as the voiced state, from a frame at least one frame before the audio data frame determined to be voiced, subsequent samples are shortened by the number of samples corresponding to the delay added by the extra silent frames transmitted. The delay is thereby adjusted, and real-time data transmission is maintained.
[0031]
Further, the present invention proposes the above voice packet transmitting apparatus in which the silence determination unit includes means for determining that the audio is in the silent state when the audio level of an input audio frame is equal to or lower than a predetermined threshold level.
[0032]
According to the voice packet transmitting apparatus of the present invention, the silence determination unit determines that the audio is in the silent state when the audio level of the input audio frame is equal to or lower than the predetermined threshold level.
[0033]
In the voice packet transmitting apparatus according to the present invention, the voice determination unit includes a unit that determines that a voice state is present when a voice level of an input voice frame is equal to or higher than a predetermined threshold level. A voice packet transmitter is proposed.
[0034]
According to the voice packet transmitting apparatus of the present invention, the voice determination unit determines that the voice is in a voiced state when the voice level of the input voice frame is equal to or higher than the predetermined threshold level.
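A minimal sketch of the threshold test follows. The text describes the silent state as a level at or below the threshold and the voiced state as a level at or above it, which leaves the case of exact equality ambiguous; this sketch assumes equality counts as silent. It also assumes the audio level is measured as mean-square power, and both function names are hypothetical.

```python
def frame_power(frame):
    """Mean-square power of one frame of samples (assumed level measure)."""
    return sum(s * s for s in frame) / len(frame)


def is_voiced(frame, threshold):
    """True if the frame is voiced; equality is treated as silent (assumption)."""
    return frame_power(frame) > threshold
```

For example, a frame of alternating +3 and -3 samples has power 9 and is voiced against a threshold of 1.0, while an all-zero frame is silent.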
[0039]
In order to achieve the above object, the present invention also proposes a voice packet transmitting apparatus that comprises audio data frame generation means for generating audio data frames by cutting out, at predetermined time intervals, audio data based on continuously input audio, and a speech function button for acquiring the user's intention to speak, and that generates packets including the generated audio data frames and transmits them via a communication network only while the speech function button is pressed, the apparatus comprising: speech function button pressing determining means for determining whether the speech function button is pressed; means for stopping packet transmission when the speech function button pressing determining means determines that the state has changed from the speech function button being pressed to a speech pause state in which it is not pressed; and means for converting the last audio data frame to be transmitted into audio data whose audio level is gradually reduced from the first sample toward the last sample, and transmitting the converted audio data frame as the final packet.
[0040]
According to the voice packet transmitting apparatus of the present invention, packet transmission is stopped when the state changes from the speech function button being pressed to the speech pause state in which it is not pressed. At this time, the last audio data frame to be transmitted undergoes fade-out processing in which its audio level is gradually reduced from the first sample toward the last sample.
[0041]
As a result, fade-out processing gradually reduces the audio level of the last audio data portion transmitted before the change from the voiced state to the silent state, that is, the level at the end of speech. On the receiving side the audio waveform therefore does not become discontinuous at the transition from the voiced state to the silent state, so no abnormal sound occurs at the transition and degradation of voice quality is reduced.
[0042]
Further, the present invention proposes the above voice packet transmitting apparatus comprising: means for stopping packet transmission when the speech function button pressing determining means determines that the state has changed from the speech function button being pressed to the speech pause state in which it is not pressed; means for transmitting at least one audio frame input after the speech pause state is determined; and means for converting the last audio data frame to be transmitted into audio data whose audio level is gradually reduced from the first sample toward the last sample, and transmitting the converted audio data frame as the final packet.
[0043]
According to the voice packet transmitting apparatus of the present invention, packet transmission is stopped when the state changes from the speech function button being pressed to the speech pause state in which it is not pressed. At this time, one or more of the audio frames input after the button is determined to be no longer pressed are transmitted, and the last audio data frame transmitted undergoes fade-out processing in which its audio level is gradually reduced from the first sample toward the last sample.
[0044]
Because one or more audio data frames are transmitted after the button is determined to be no longer pressed, the end of speech is not cut off abruptly. And because fade-out processing gradually reduces the audio level, the audio waveform on the receiving side does not become discontinuous at the transition from the voiced state to the silent state, so no abnormal sound occurs at the transition and degradation of voice quality is reduced.
[0045]
Further, the present invention proposes the above voice packet transmitting apparatus having means for, after transmitting a packet including the audio data frame whose audio level is gradually reduced from the first sample toward the last sample, generating one further silent audio data frame and transmitting a packet including that silent audio data frame.
[0046]
According to the voice packet transmitting apparatus of the present invention, after transmitting a packet including a voice data frame whose voice level is gradually decreased from the first sample to the last sample, a packet including one silent voice data frame is further transmitted. Therefore, the reception side can reliably recognize that the sound state has changed to the silent state and transmission has been stopped.
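The closing sequence (a faded-out last frame followed by one silent frame) can be sketched as below. The text does not specify the shape of the fade, so a linear gain ramp from 1 at the first sample to 0 at the last sample is assumed here, and the function names are hypothetical.

```python
def fade_out(frame):
    """Scale samples by a linear gain falling from 1 (first) to 0 (last)."""
    n = len(frame)
    span = max(n - 1, 1)  # avoids division by zero for a one-sample frame
    return [s * (n - 1 - i) / span for i, s in enumerate(frame)]


def closing_frames(last_frame):
    """Frames sent when transmission stops: faded-out frame, then silence."""
    silent = [0.0] * len(last_frame)
    return [fade_out(last_frame), silent]
```

For a constant frame `[4, 4, 4, 4, 4]`, `fade_out` yields `[4.0, 3.0, 2.0, 1.0, 0.0]`, so the waveform reaches zero exactly where the trailing silent frame begins.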
[0047]
Further, the present invention proposes the above voice packet transmitting apparatus comprising: means for resuming the stopped packet transmission, as a speech start state, when the speech function button pressing determining means determines that the state has changed from the speech function button not being pressed to being pressed; and means for, when resuming packet transmission, converting the first audio data frame to be transmitted into audio data whose audio level is gradually reduced from the last sample toward the first sample, and transmitting a packet including the converted audio data frame as the beginning of the speech state.
[0048]
According to the voice packet transmitting apparatus of the present invention, packet transmission is resumed when the state changes from the speech function button not being pressed to being pressed. When transmission is resumed, the first audio data frame to be transmitted is converted into audio data whose audio level gradually increases from the first sample toward the last sample, and a packet including the converted audio data frame is transmitted as the beginning of the voiced state.
[0049]
As a result, fade-in processing gradually increases the audio level at the moment the state changes from the speech function button not being pressed to being pressed, that is, the level at the beginning of speech. On the receiving side the audio waveform therefore does not become discontinuous at the transition from the silent state to the voiced state, so no abnormal sound occurs at the transition and degradation of voice quality is reduced.
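As an illustration of the fade-in applied to the first frame, the sketch below assumes a linear gain ramp, since the text does not fix the ramp shape; the function name is hypothetical. A level that is "gradually reduced from the last sample toward the first" is the same ramp as one that "gradually increases from the first sample toward the last".

```python
def fade_in(frame):
    """Scale samples by a linear gain rising from 0 (first) to 1 (last)."""
    n = len(frame)
    span = max(n - 1, 1)  # avoids division by zero for a one-sample frame
    return [s * i / span for i, s in enumerate(frame)]
```

For a constant frame `[4, 4, 4, 4, 4]` this yields `[0.0, 1.0, 2.0, 3.0, 4.0]`: the frame starts at zero, matching the preceding silence, and ends at full level, matching the next unmodified frame.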
[0050]
Further, the present invention proposes the above voice packet transmitting apparatus comprising: means for resuming the stopped packet transmission, as a speech start state, when the speech function button pressing determining means determines that the state has changed from the speech function button not being pressed to being pressed; and means for, when resuming packet transmission, transmitting at least one audio data frame preceding the first audio data frame input after the speech function button was pressed, and then transmitting that first audio data frame.
[0051]
According to the voice packet transmitting apparatus of the present invention, when the state changes from the speech function button not being pressed to being pressed and the stopped packet transmission is resumed, at least one audio data frame preceding the first audio data frame input after the button was pressed is transmitted, and then that first audio data frame is transmitted. This prevents the beginning of speech from being cut off.
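One way to picture this pre-roll behavior is a small history buffer that keeps the most recent unsent frames and flushes them when the button is pressed. The class name, the `pre_roll` parameter, and the use of strings as stand-ins for frames are all assumptions made for this sketch; fade processing and the stop-side behavior are omitted.

```python
from collections import deque


class PreRollSender:
    """Keeps the last `pre_roll` frames captured while the button is released."""

    def __init__(self, pre_roll=1):
        self.history = deque(maxlen=pre_roll)
        self.sending = False

    def push(self, frame, button_pressed):
        """Return the list of frames to transmit for this input frame."""
        out = []
        if button_pressed:
            if not self.sending:
                out.extend(self.history)  # frames captured before the press
                self.history.clear()
            out.append(frame)
        else:
            self.history.append(frame)
        self.sending = button_pressed
        return out
```

With `pre_roll=1`, pushing frames `f0` and `f1` with the button released sends nothing, and pushing `f2` with the button pressed sends `f1` followed by `f2`, so speech that started slightly before the press is not lost.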
[0052]
Further, the present invention proposes the above voice packet transmitting apparatus having means for, when packet transmission is resumed in the speech start state and transmission begins from a frame at least one frame before the audio data frame determined to be in the speech start state, shortening subsequent samples by the number of samples corresponding to the delay added by the extra audio data frames transmitted.
[0053]
According to the voice packet transmitting apparatus of the present invention, when packet transmission is resumed and transmission begins from a frame at least one frame before the audio data frame determined to be in the speech start state, subsequent samples are shortened by the number of samples corresponding to the delay added by the extra audio data frames transmitted. The delay is thereby adjusted, and real-time data transmission is maintained.
[0054]
Further, the present invention proposes the above voice packet transmitting apparatus having: means for, when packet transmission has been stopped because the speech function button is not pressed and transmission then begins in the speech start state because the button is pressed, initializing the internal state of the audio encoder before encoding the first audio frame to be transmitted; and means for, when the first frame is packetized and transmitted, including in the packet information indicating that it is the first frame after the return from silence.
[0055]
According to the voice packet transmitting apparatus of the present invention, the audio data frame is encoded after the internal state of the audio encoder has been initialized at the change from the transmission stop state to the transmission state. The internal state, such as the data used for encoding before transmission stopped, is thereby initialized, so that optimal encoding can be performed.
[0056]
Furthermore, when the first frame is packetized and transmitted at the change from the transmission stop state to the transmission state, information indicating that it is the first frame after the return from silence is included in the packet. By referring to this information, the receiving side can perform optimal decoding.
[0057]
Further, the present invention proposes the above voice packet transmitting apparatus in which the encoding processing unit has: means for comparing, when encoding a frame, the encoding error obtained by encoding the frame as a continuation of the preceding frame without initializing the internal state of the encoder with the encoding error obtained by encoding the frame after initializing the internal state of the encoder, and transmitting the encoding result with the smaller error; and means for, when the result encoded after resetting the internal state is selected, including in the transmission information indicating that the frame is the first frame after the return from silence.
[0058]
According to the voice packet transmitting apparatus of the present invention, the coded voice data frame having the smaller coding error is used, and a packet including the coded voice data frame is transmitted. Further, when an encoded audio data frame encoded with the internal state of the encoder being reset is used, information indicating that it is the first frame returned from silence is included in the packet and transmitted. Therefore, an accurate decoding process can be performed on the receiving side.
[0059]
  In order to achieve the above object, the present invention proposes a voice packet transmission method that uses a computer device having means for converting audio input by audio input means into audio data, generates audio data frames by cutting out, at predetermined time intervals, audio data based on continuously input audio, and generates packets including the audio data frames and transmits the packets via a communication network, wherein the computer device: holds the input audio data frames; determines whether an input audio data frame is voiced or silent; stops packet transmission when the state in which the input audio data frames are determined to be silent continues for a certain period of time or longer; resumes packet transmission when, with packet transmission stopped because that state has continued for the certain period of time or longer, an input audio data frame is determined to be voiced; when resuming packet transmission, takes as a reanalysis frame the silent data frame at least one frame before the audio data frame determined to be voiced, and performs reanalysis using information on the audio frame determined to be voiced and information on the audio data frames determined to be silent up to the reanalysis frame; when the reanalysis determines that the reanalysis frame is close to voiced, converts the audio data frame immediately before the reanalysis frame into audio data whose audio level is gradually reduced from the end toward the beginning, transmits a packet including the converted audio data frame as the beginning of the voiced state, transmits the silent data frames between the frame immediately before the reanalysis frame and the audio data frame determined to be voiced, and then transmits the audio data frame determined to be voiced; and when the reanalysis determines that the reanalysis frame is close to silent, converts the reanalysis frame into audio data whose audio level is gradually reduced from the end toward the beginning, transmits a packet including the converted audio data frame as the beginning of the voiced state, and then transmits the audio data frame determined to be voiced.
[0060]
  According to the voice packet transmission method of the present invention, packet transmission is resumed when an input audio data frame changes to the voiced state while packet transmission is stopped because the input audio data frames have been determined to be silent for a certain period of time or longer. When transmission is resumed after the change from the silent state to the voiced state, the silent data frame at least one frame before the audio data frame determined to be voiced is taken as a reanalysis frame, and reanalysis is performed using information on the audio frame determined to be voiced and information on the audio data frames determined to be silent up to the reanalysis frame. If the reanalysis determines that the reanalysis frame is close to voiced, the audio data frame immediately before the reanalysis frame is converted into fade-in-processed audio data whose audio level is gradually reduced from the end toward the beginning. A packet including this faded-in audio data frame is transmitted as the beginning of the voiced state. Packets including the silent data frames up to the audio data frame determined to be voiced are then transmitted, followed by a packet including the audio data frame determined to be voiced.
[0061]
  Further, if the reanalysis determines that the reanalysis frame is close to silent, the reanalysis frame itself is converted into audio data whose audio level is gradually reduced from the end toward the beginning. A packet including the converted audio data frame is transmitted as the beginning of the voiced state, and then the audio data frame determined to be voiced is transmitted.
[0065]
In order to achieve the above object, the present invention also proposes a voice packet transmission method that uses a computer device having means for converting audio input by audio input means into audio data and a speech function button for acquiring the user's intention to speak, generates audio data frames by cutting out, at predetermined time intervals, audio data based on continuously input audio, and transmits packets including the audio data frames via a communication network only while the speech function button is pressed, wherein the computer device: determines whether the speech function button is pressed; stops packet transmission when it determines that the state has changed from the speech function button being pressed to a speech pause state in which it is not pressed; and converts the last audio data frame to be transmitted into audio data whose audio level is gradually reduced from the first sample toward the last sample, and transmits the converted audio data frame as the final packet.
[0066]
According to the voice packet transmission method of the present invention, packet transmission is stopped when the state changes from the speech function button being pressed to the speech pause state in which it is not pressed. At this time, the last audio data frame to be transmitted undergoes fade-out processing in which its audio level is gradually reduced from the first sample toward the last sample.
[0067]
As a result, fade-out processing gradually reduces the audio level of the last audio data portion transmitted before the change from the voiced state to the silent state, that is, the level at the end of speech. On the receiving side the audio waveform therefore does not become discontinuous at the transition from the voiced state to the silent state, so no abnormal sound occurs at the transition and degradation of voice quality is reduced.
[0068]
Further, in the voice packet transmission method according to the present invention, when the computer device determines that the speech function button has changed from not pressed to pressed, it treats this as a speech start state and resumes the stopped packet transmission. On resuming, it converts the first audio data frame to be transmitted into audio data whose audio level is gradually reduced from the last sample toward the first sample, and transmits a packet including the converted voice data frame as the head of the voiced state.
[0069]
According to the voice packet transmission method of the present invention, packet transmission is resumed when the speech function button changes from not pressed to pressed. When transmission is resumed, the first audio data frame to be transmitted is converted into audio data whose audio level is gradually increased from the first sample toward the last sample, and a packet including the converted voice data frame is transmitted as the head of the voiced state.
[0070]
As a result, a fade-in process is applied at the moment the speech function button changes from not pressed to pressed; that is, the voice level of the head of the speech is gradually increased. Since the speech waveform is not discontinuous at the transition from the silent state to the voiced state, no abnormal sound is generated there, and degradation of voice quality is reduced.
[0075]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0076]
FIG. 1 is a block diagram showing a functional configuration of a voice packet communication system according to the first embodiment of the present invention. FIG. 2 is a diagram for explaining voice signal packetization by a voice packet transmitting apparatus according to the first embodiment of the present invention. FIG. 3 is a diagram for explaining the real-time transport protocol (hereinafter referred to as RTP) header used in the first embodiment of the present invention. In the figures, 1 is a voice packet transmitting device (hereinafter simply referred to as a transmitting device), 2 is a voice packet receiving device (hereinafter simply referred to as a receiving device), and 3 is a communication network such as the Internet. In this embodiment, as an example, a system that transfers voice packets from the transmission device 1 to the reception device 2 in real time using UDP/IP via the communication network 3 will be described.
[0077]
The transmission device 1 is composed of a well-known computer device and operates according to a preset program. It includes a voice input unit 11, an analog/digital (A/D) conversion unit 12, a sound/silence determination unit 13, a switch unit 14, a fade-in/fade-out processing unit 15, an encoding processing unit 16, a packet generation unit 17, and a transmission unit 18. Each of the units constituting the transmission device 1 is implemented by a combination of hardware and software.
[0078]
The receiving device 2 is composed of a well-known computer device and operates according to a preset program. It includes a receiving unit 21, a packet analysis unit 22, a decoding processing unit 23, a digital/analog (D/A) conversion unit 24, and an audio output unit 25. Each of the units constituting the receiving device 2 is implemented by a combination of hardware and software.
[0079]
The audio input unit 11 converts input speech into an analog electric signal 4 as shown in FIG. 2 and outputs it to the A/D conversion unit 12. The A/D conversion unit 12 converts this signal into digital audio data (samples) at a predetermined sampling period, and the samples are sequentially stored in a buffer provided in the sound/silence determination unit 13.
[0080]
Also, as shown in FIG. 2, the sound/silence determination unit 13 cuts the voice data stored in the buffer into voice data frames every predetermined period T and, taking one frame at a time from the beginning, determines whether each voice data frame is in a voiced state or a silent state.
[0081]
Furthermore, based on this determination result, the sound/silence determination unit 13 switches the switch unit 14 so that its output signal goes to the fade-in/fade-out processing unit 15 for fade-in processing when the state changes from silent to voiced, and likewise for fade-out processing when the state changes from voiced to silent. While the voiced state continues, the sound/silence determination unit 13 switches the switch unit 14 so that its output signal goes to the encoding processing unit 16. As shown in FIG. 2, a frame is determined to be in the voiced state when it exceeds a predetermined threshold value Sth.
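The framing and the sound/silence decision described above can be sketched as follows. This is only an illustrative sketch: the function names, the use of mean squared amplitude as the frame "power", and the handling of a trailing partial frame are assumptions, not details taken from the embodiment.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Cut buffered samples into consecutive frames of frame_len samples.

    A trailing partial frame is dropped here (assumption); a real device
    would keep it buffered until enough samples arrive.
    """
    count = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(count)]


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude, used as the frame power (assumption)."""
    return sum(s * s for s in frame) / len(frame)


def is_voiced(frame: Sequence[float], threshold: float) -> bool:
    """A frame is voiced only when its power exceeds the threshold."""
    return frame_power(frame) > threshold
```

For example, with a frame length of 4 samples, a run of zeros followed by a run of 0.5-valued samples yields one silent frame and one voiced frame for a threshold of 0.01.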
[0082]
The fade-in/fade-out processing unit 15 performs two processes. When the audio input becomes voiced while transmission is paused, so that transmission starts, it performs a fade-in process that gradually decreases the audio level of the first audio data frame from the last sample toward the first sample. When the audio input becomes silent during transmission, so that transmission is paused, it performs a fade-out process that gradually decreases the audio level of the last audio data frame from the first sample toward the last sample.
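A minimal sketch of the two fades on a single frame is given below. The linear gain ramp is an assumption; the text only requires that the level change gradually across the frame, and the function names are hypothetical.

```python
from typing import List, Sequence


def fade_in(frame: Sequence[float]) -> List[float]:
    """Scale samples by a gain rising linearly toward 1 at the last sample."""
    n = len(frame)
    return [s * (i + 1) / n for i, s in enumerate(frame)]


def fade_out(frame: Sequence[float]) -> List[float]:
    """Scale samples by a gain falling linearly to 0 at the last sample."""
    n = len(frame)
    return [s * (n - 1 - i) / n for i, s in enumerate(frame)]
```

Applied to a constant frame `[1.0, 1.0, 1.0, 1.0]`, `fade_in` gives `[0.25, 0.5, 0.75, 1.0]` and `fade_out` gives `[0.75, 0.5, 0.25, 0.0]`, so neither frame starts or ends with a step against the neighbouring silence.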
[0083]
The encoding processing unit 16 encodes the audio data frame input from the sound/silence determination unit 13 or the fade-in/fade-out processing unit 15. When encoding, it retains the internal state that results from encoding the previous frame and predicts from the past, which improves coding efficiency.
[0084]
In the present embodiment, to reduce quality degradation caused by a mismatch between the internal states of the transmitting-side encoder and the receiving-side decoder after packet loss, the internal state of the encoder is reset to its initial value when the state changes from silent to voiced. This reduces quality degradation due to transmission errors.
[0085]
Further, the encoding processing unit 16 encodes the audio data frame to be encoded based on the analysis result and sends it to the packet generation unit 17.
[0086]
As shown in FIG. 2, among the audio data frames generated in this way, the audio data frame at the start of a voiced state following a silent state has undergone fade-in processing in which the audio level is gradually increased. Likewise, the audio data frame that enters the silent state following a voiced state becomes audio data 31 that has undergone fade-out processing in which the audio level is gradually reduced.
[0087]
The packet generator 17 generates an RTP packet including the encoded voice data input from the encoding processor 16 and sends it to the transmitter 18. An RTP header as shown in FIG. 3 is added to the RTP packet at this time.
[0088]
As is well known, the RTP header consists of 2-bit version information V, 1-bit padding information P, 1-bit extension information X, 4-bit CSRC count information CC, 1-bit marker information M (hereinafter referred to as the marker), 7-bit payload type information PT, a 16-bit sequence number, a 32-bit timestamp, a 32-bit synchronization source (SSRC) identifier, 32-bit contributing source (CSRC) identifiers, and so on.
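As an illustration of the header layout listed above, the following sketch packs the 12-byte fixed part of an RTP header with Python's standard `struct` module, following the field layout of RFC 3550. The function names and the choice to leave padding, extension, and CSRC count at zero are assumptions for the example.

```python
import struct


def pack_rtp_header(marker: bool, payload_type: int, sequence: int,
                    timestamp: int, ssrc: int) -> bytes:
    """Build the fixed 12-byte RTP header (no CSRC list, no extension)."""
    version, padding, extension, csrc_count = 2, 0, 0, 0
    # First byte: V (2 bits), P (1 bit), X (1 bit), CC (4 bits in RFC 3550).
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
    # Second byte: M (1 bit) followed by PT (7 bits).
    byte1 = ((1 if marker else 0) << 7) | (payload_type & 0x7F)
    # Network byte order: 2 bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    return struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def read_marker(header: bytes) -> int:
    """Return the marker bit M from a packed RTP header."""
    return (header[1] >> 7) & 1
```

A receiver-side helper such as `read_marker` is all that is needed to recover M from the second byte of the header.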
[0089]
In the present embodiment, the marker bit M is set to “1” in the first packet transmitted on entering the voiced state after packet transmission was stopped in the silent state, and to “0” in all other packets.
[0090]
The transmission unit 18 transmits the RTP packet input from the packet generation unit 17 to the reception device 2 via the communication network 3.
[0091]
On the other hand, the reception unit 21 of the reception device 2 receives the RTP packet transmitted from the transmission device 1 via the communication network 3 and sends it to the packet analysis unit 22.
[0092]
The packet analysis unit 22 analyzes the RTP packet input from the reception unit 21, separates it into a header portion and an encoded audio data frame, analyzes the contents of the header portion, and outputs the encoded audio data frames to the decoding processing unit 23 in the order specified by the RTP time stamp. Further, the packet analysis unit 22 notifies the decoding processing unit 23 of the value of the marker bit M in the RTP header.
[0093]
The decoding processing unit 23 decodes the encoded audio data frame input from the packet analysis unit 22, converts it into digital audio data, and outputs this digital audio data to the D/A conversion unit 24. When decoding, the decoding processing unit 23 analyzes the encoded speech data frame and temporarily stores the analysis result; the analysis itself refers either to the temporarily stored analysis result or to the initial analysis value. By using the temporarily stored analysis result of the previous frame, optimal analysis and decoding can be performed in consideration of the correlation between consecutive frames.
[0094]
In addition, when the value of the marker bit M input from the packet analysis unit 22 is “1”, the decoding processing unit 23 resets and initializes the internal state of the decoder. As a result, the first voiced audio data frame after a silent state is decoded from an initialized internal state. Even when a transmission error has occurred, the receiver can thus recover from a state in which the internal states of the transmitting-side encoder and the receiving-side decoder do not coincide, and deterioration in voice quality is reduced.
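The effect of resetting the decoder on M = “1” can be illustrated with a toy predictive decoder. The differential scheme below is purely an assumption chosen for brevity; the embodiment does not specify the codec, and the class and method names are hypothetical.

```python
from typing import List, Sequence


class ToyPredictiveDecoder:
    """Toy decoder whose only internal state is the last decoded sample."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Return the internal state to its initial value."""
        self.last_sample = 0

    def decode(self, deltas: Sequence[int], marker: int) -> List[int]:
        """Decode one frame of sample differences.

        When marker is 1 the state is reset first, so a state that drifted
        because of a lost packet cannot leak into the new talk spurt.
        """
        if marker == 1:
            self.reset()
        out = []
        for d in deltas:
            self.last_sample += d
            out.append(self.last_sample)
        return out
```

If the encoder also resets at the start of each voiced period, both sides begin the talk spurt from the same state regardless of what was lost during the preceding one.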
[0095]
The D/A conversion unit 24 receives the digital audio data obtained by decoding in the decoding processing unit 23, converts it into an analog audio signal, and outputs it to the audio output unit 25.
[0096]
The audio output unit 25 converts the analog audio signal input from the D/A conversion unit 24 into sound and outputs it.
[0097]
Next, regarding the operation of the voice packet communication system configured as described above, processing flowcharts mainly relating to the operation of the transmission apparatus will be described with reference to FIGS.
[0098]
In the transmission apparatus 1, initialization processing is performed immediately after the start of driving (SA1). In this initialization process, the value of the silent determination count SilentCount, which is a variable, is set to “0” and the value of RTPTimeStamp is set to “0”.
[0099]
Next, when the transmission apparatus 1 starts processing, the audio signal input via the audio input unit 11 is sequentially stored in the buffer of the sound/silence determination unit 13 via the A/D conversion unit 12 (SA2).
[0100]
Next, the transmitting apparatus 1 takes the audio data frames stored in the buffer of the sound/silence determination unit 13 in order from the head and determines whether the power of the frame under test is equal to or lower than the threshold value, that is, whether the frame is in the silent state or the voiced state (SA3). If the power of the audio data frame is larger than the threshold value, so that the frame is voiced, the process proceeds to SA19 described later.
[0101]
When the power of the audio data frame is equal to or lower than the threshold, so that the frame is silent, the silence determination count SilentCount is increased by “1” (SA4), and it is determined whether or not the transmission pause state, in which the value of SilentCount is larger than the count threshold SilentThres, has been reached (SA5).
[0102]
As a result of the determination, when the value of the silence determination count SilentCount is larger than the count threshold SilentThres (the transmission pause state), the process proceeds to SA12 described later. When SilentCount is equal to or less than SilentThres, it is determined whether or not SilentCount is equal to SilentThres, that is, whether or not the transmission pause state is about to begin (SA6).
[0103]
As a result of this determination, when the silence determination count SilentCount is not equal to the count threshold SilentThres, the process proceeds to SA8 described later. When SilentCount is equal to SilentThres, the current audio data frame is faded out (SA7).
[0104]
Next, the audio data frame subjected to the fade-out process is encoded (SA8), a packet including the encoded audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated, and this packet is transmitted (SA9).
[0105]
Thereafter, the RTP time stamp RTPTimeStamp is increased by the frame length (SA10). That is, a value obtained by adding the frame length FrameLen value to the RTP time stamp RTPTimeStamp value is set as a new RTP time stamp RTPTimeStamp value.
[0106]
Next, the current audio data frame is held in the buffer (SA11), and the process proceeds to SA2.
[0107]
On the other hand, when the determination of SA5 finds the transmission pause state, in which the silence determination count SilentCount is larger than the count threshold SilentThres, it is determined whether or not SilentCount is equal to SilentThres + 1, that is, whether the transmission pause state has only just been entered (SA12).
[0108]
As a result of this determination, when the value of the silence determination count SilentCount is not equal to the count threshold SilentThres + 1, the process proceeds to SA16 described later. When SilentCount is equal to SilentThres + 1, an ideal silent audio data frame is generated (SA13), the silent audio data frame is encoded (SA14), and a packet including the encoded ideal silent audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated and transmitted (SA15). Thereafter, the process proceeds to SA10.
[0109]
If the silence determination count SilentCount is not equal to the count threshold SilentThres + 1 as a result of the determination in SA12, a value obtained by subtracting the frame length FrameLen from the delay count DelayCount is set as the new delay count DelayCount (SA16), and it is determined whether or not the value of DelayCount is 0 or less (SA17).
[0110]
As a result of the determination, when the value of the delay count DelayCount is larger than 0, the process proceeds to SA10. When the value of DelayCount is 0 or less, DelayCount is set to 0 (SA18), and then the process proceeds to SA10.
[0111]
On the other hand, when the determination of SA3 finds the voiced state, in which the power of the audio data frame is greater than the threshold, it is determined whether or not the silence determination count SilentCount is greater than the count threshold SilentThres, that is, whether or not transmission was paused at the previous frame (SA19).
[0112]
As a result of this determination, when the value of the silence determination count SilentCount is equal to or less than the count threshold SilentThres, the process proceeds to SA27 described later. When SilentCount is larger than SilentThres, the previous audio data frame held in the buffer is faded in (SA20), and the faded-in audio data frame is then encoded (SA21).
[0113]
Next, a packet including an audio data frame subjected to fade-in processing and an RTP time stamp (RTPTimeStamp-FrameLen) corresponding to the audio data frame is generated and transmitted (SA22). At this time, the marker bit M of the RTP header is set to “1”.
[0114]
Thereafter, the current audio data frame, that is, the audio data frame following the faded-in audio data frame, is encoded (SA23), and a packet including the encoded audio data frame and the current RTP time stamp RTPTimeStamp is generated and transmitted (SA24).
[0115]
Next, the value of the delay increase counter DelayCount is increased by the frame length FrameLen (SA25), and the silence determination count SilentCount is initialized to “0” (SA26). Thereafter, the process proceeds to SA10.
[0116]
On the other hand, when the determination in SA19 finds that the value of the silence determination count SilentCount is equal to or less than the count threshold SilentThres, the current audio data frame is encoded (SA27), and a packet including the encoded audio data frame and the corresponding current RTP time stamp RTPTimeStamp is generated and transmitted (SA28). Thereafter, SilentCount is initialized to “0” (SA29), and the process proceeds to SA10.
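The per-frame flow of steps SA1 to SA29 can be summarized in a compact sketch such as the one below. It is a reading of the flowchart description rather than the actual implementation: encoding is replaced by passing samples through unchanged, the fades are linear, DelayCount is assumed to start at 0, and the class, method, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Packet:
    timestamp: int
    marker: int
    payload: List[float]


def power(frame: Sequence[float]) -> float:
    return sum(s * s for s in frame) / len(frame)


def fade_in(frame: Sequence[float]) -> List[float]:
    n = len(frame)
    return [s * (i + 1) / n for i, s in enumerate(frame)]


def fade_out(frame: Sequence[float]) -> List[float]:
    n = len(frame)
    return [s * (n - 1 - i) / n for i, s in enumerate(frame)]


class Transmitter:
    def __init__(self, frame_len: int, power_threshold: float, silent_thres: int):
        self.frame_len = frame_len
        self.power_threshold = power_threshold
        self.silent_thres = silent_thres
        self.silent_count = 0          # SA1
        self.rtp_timestamp = 0         # SA1
        self.delay_count = 0           # assumed initial value
        self.previous: Optional[List[float]] = None

    def process(self, frame: Sequence[float]) -> List[Packet]:
        """Handle one frame and return the packets that would be sent."""
        sent: List[Packet] = []
        ts = self.rtp_timestamp
        if power(frame) > self.power_threshold:                 # SA3: voiced
            if self.silent_count > self.silent_thres:           # SA19: resuming
                # SA20-SA22: faded-in previous frame, marker bit set.
                sent.append(Packet(ts - self.frame_len, 1, fade_in(self.previous)))
                sent.append(Packet(ts, 0, list(frame)))         # SA23-SA24
                self.delay_count += self.frame_len              # SA25
            else:
                sent.append(Packet(ts, 0, list(frame)))         # SA27-SA28
            self.silent_count = 0                               # SA26 / SA29
        else:
            self.silent_count += 1                              # SA4
            if self.silent_count > self.silent_thres:           # SA5: paused
                if self.silent_count == self.silent_thres + 1:  # SA12
                    # SA13-SA15: one ideal silent frame closes the talk spurt.
                    sent.append(Packet(ts, 0, [0.0] * self.frame_len))
                else:
                    # SA16-SA18: count the delay down, never below zero.
                    self.delay_count = max(0, self.delay_count - self.frame_len)
            else:
                last = self.silent_count == self.silent_thres   # SA6
                body = fade_out(frame) if last else list(frame) # SA7
                sent.append(Packet(ts, 0, body))                # SA8-SA9
        self.rtp_timestamp += self.frame_len                    # SA10
        self.previous = list(frame)                             # SA11
        return sent
```

Note that the time stamp advances by FrameLen for every frame, including frames that are not sent, so the receiver can infer the length of each pause from the gap in time stamps.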
[0117]
According to the above-described embodiment, when the state changes from silent to voiced, a packet including the voice data frame obtained by fading in the frame immediately preceding the start of the voiced state is transmitted. Therefore, the beginning of the speech is not lost when the silent state returns to the voiced state.
[0118]
Further, after transmitting the packet including the audio data frame in which the audio level is gradually decreased from the first sample to the last sample, the transmitting device 1 transmits another packet including one ideal silent audio data frame. Therefore, the reception side can reliably recognize that the sound state has changed to the silent state and transmission has been stopped.
[0119]
In addition, a fade-out process is applied to the audio data frame at the change from the voiced state to the silent state; that is, the voice level of the tail of the speech is gradually reduced. Therefore, the sound waveform is not discontinuous, no abnormal sound is generated at the transition, and deterioration of the sound quality is reduced.
[0120]
Furthermore, according to the above-described embodiment, one or more audio data frames are transmitted after it is determined that the voiced state has changed to the silent state, so the end of the speech is not cut off abruptly, and the audio level is gradually reduced by the fade-out process. On the receiving side, therefore, the speech waveform is not discontinuous at the transition from the voiced state to the silent state, and quality degradation is reduced.
[0121]
Further, according to the above embodiment, when the first audio data frame is packetized and transmitted at the change from the transmission pause state to the transmission state, the marker bit M of the RTP header is set to “1” to indicate that this is the first audio data frame after recovery from silence. The receiving side refers to the marker bit M and resets the internal state of the decoder, which improves tolerance against transmission errors.
[0122]
The fade-in process and the fade-out process may be performed across a plurality of audio data frames. In addition, the fade-in process may be performed when a voice data frame in a voiced state exists after a plurality of voice data frames in a silent state continue.
[0123]
Next, a second embodiment of the present invention will be described.
[0124]
FIG. 7 is a block diagram showing a functional configuration of the voice packet communication system in the second embodiment. In the figure, the same components as those of the first embodiment described above are denoted by the same reference numerals, and description thereof is omitted.
[0125]
The second embodiment differs from the first embodiment in that, in place of the sound/silence determination unit 13 of the first embodiment, it provides a transmission determination processing unit 19 that receives speech function button control information and determines the voiced or silent state based on this information.
[0126]
In addition to the function of the sound/silence determination unit 13 in the first embodiment, the transmission determination processing unit 19 in the second embodiment recognizes, from the speech function button control information input from a speech function button (not shown) that the speaker presses when speaking, that the button has been pressed (turned on), and has a function of starting transmission from an audio data frame several frames before the moment the button was pressed. At this time, the fade-in process is performed as in the first embodiment. The speech function button is kept pressed while the speaker is speaking.
[0127]
Details of the processing in the second embodiment will be described below with reference to the flowcharts of FIGS.
[0128]
In the transmission device 1, initialization processing is performed immediately after the start of driving (SB1). In this initialization process, the value of the silent determination count SilentCount, which is a variable, is set to “0” and the value of RTPTimeStamp is set to “0”.
[0129]
Next, when the transmission apparatus 1 starts processing, the audio signal input via the audio input unit 11 is sequentially stored in the buffer of the utterance / silence determination unit 13 via the A / D conversion unit 12 (SB2).
[0130]
Next, the transmitting apparatus 1 determines whether or not the speech function button is pressed (SB3), and when the speech function button is pressed, the process proceeds to the processing of SB19 described later.
[0131]
If the speech function button is not pressed, the silence determination count SilentCount is incremented by “1” (SB4), and it is determined whether or not the transmission pause state, in which the value of SilentCount is greater than the count threshold SilentThres, has been reached (SB5).
[0132]
As a result of this determination, when the value of the silence determination count SilentCount is greater than the count threshold SilentThres (the transmission pause state), the process proceeds to SB12 described later. When SilentCount is equal to or less than SilentThres, it is determined whether or not SilentCount is equal to SilentThres, that is, whether or not the transmission pause state is about to begin (SB6).
[0133]
If the silence determination count SilentCount is not equal to the count threshold SilentThres as a result of this determination, the process proceeds to SB8 described later. If SilentCount is equal to SilentThres, the current audio data frame is faded out (SB7).
[0134]
Next, the audio data frame subjected to the fade-out process is encoded (SB8), a packet including the encoded audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated, and this packet is transmitted (SB9).
[0135]
Thereafter, the RTP time stamp RTPTimeStamp is increased by the frame length (SB10). That is, a value obtained by adding the frame length FrameLen value to the RTP time stamp RTPTimeStamp value is set as a new RTP time stamp RTPTimeStamp value.
[0136]
Next, the current audio data frame is held in the buffer (SB11), and the process proceeds to SB2.
[0137]
On the other hand, when the determination of SB5 finds the transmission pause state, in which the silence determination count SilentCount is larger than the count threshold SilentThres, it is determined whether or not SilentCount is equal to SilentThres + 1, that is, whether the transmission pause state has only just been entered (SB12).
[0138]
As a result of the determination, when the value of the silence determination count SilentCount is not equal to the count threshold SilentThres + 1, the process proceeds to SB16 described later. When SilentCount is equal to SilentThres + 1, an ideal silent audio data frame is generated (SB13), the silent audio data frame is encoded (SB14), and a packet including the encoded ideal silent audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated and transmitted (SB15). Thereafter, the process proceeds to SB10.
[0139]
As a result of the determination in SB12, when the value of the silence determination count SilentCount is not equal to the count threshold SilentThres + 1, a value obtained by subtracting the frame length FrameLen from the delay count DelayCount is set as the new delay count DelayCount (SB16), and it is determined whether or not the value of DelayCount is 0 or less (SB17).
[0140]
As a result of this determination, when the value of the delay count DelayCount is larger than 0, the process proceeds to SB10. When the value of DelayCount is 0 or less, DelayCount is set to 0 (SB18), and then the process proceeds to SB10.
[0141]
On the other hand, when the determination of SB3 finds that the speech function button is pressed, it is determined whether or not the silence determination count SilentCount is greater than the count threshold SilentThres, that is, whether or not transmission was paused at the previous frame (SB19).
[0142]
As a result of this determination, when the value of the silent determination count SilentCount is equal to or less than the count threshold SilentThres, the process proceeds to SB22 described later. When the value of the silence determination count SilentCount is larger than the count threshold SilentThres, a transmission resumption process described later is executed (SB20), and the value of the silence determination count SilentCount is initialized to “0” (SB26). Thereafter, the process proceeds to SB10.
[0143]
On the other hand, when the determination in SB19 finds that the value of the silence determination count SilentCount is equal to or less than the count threshold SilentThres, the current audio data frame is encoded (SB22), and a packet including the encoded audio data frame and the corresponding current RTP time stamp RTPTimeStamp is generated and transmitted (SB23). Thereafter, SilentCount is initialized to “0” (SB24), and the process proceeds to SB10.
[0144]
(First Example of Transmission Resume Processing)
FIG. 11 is a diagram for explaining packetization of an audio signal in the transmission resumption process of the first example, and FIG. 12 is a flowchart for explaining the transmission resumption process of the first example.
[0145]
In the transmission resumption process of the first example, it is determined whether or not the value of the silence determination count SilentCount is larger than the value of the constant StartFrames, that is, whether or not the transmission stop time is sufficiently long (SC1). When SilentCount is larger than StartFrames, the number N of audio data frames to go back and transmit when transmission is resumed is set to the value of StartFrames (SC2). When SilentCount is equal to or less than StartFrames, N is set to the value of SilentCount minus 2 (N = SilentCount - 2) (SC3).
[0146]
Next, the N-th previous audio data frame held in the buffer is faded in (SC4), and the faded-in audio data frame is encoded (SC5).
[0147]
Thereafter, a packet including the faded-in voice data frame and the RTP time stamp of the N-th previous frame (RTPTimeStamp - FrameLen * N, where * represents multiplication) is generated and transmitted (SC6). At this time, the marker bit M in the RTP header is set to “1”.
[0148]
Next, the audio data frames held in the buffer, from the (N-1)-th previous frame to the current frame, are sequentially encoded, and packets including the encoded audio data frames and the corresponding RTP time stamps (RTPTimeStamp - (N - 1 - i) * FrameLen) are generated and transmitted sequentially (SC7). Here, i is an integer of 1 or more and (N-1) or less.
[0149]
Thereafter, the value of the delay increase counter DelayCount is increased by N frame lengths (DelayCount += FrameLen * N) (SC8), and the transmission resumption process is terminated.
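Steps SC1 to SC8 can be sketched as a single function. The sketch makes several assumptions that are not stated in the text: `history` is a list of buffered frames whose last element is the current frame, the k-th previous frame is stamped RTPTimeStamp - k * FrameLen (one reading of the time stamp expression above), N is clamped so that it is at least 1 and does not exceed the buffered history, and encoding is omitted. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

# (timestamp, marker bit, samples)
PacketTuple = Tuple[int, int, List[float]]


def fade_in(frame: Sequence[float]) -> List[float]:
    n = len(frame)
    return [s * (i + 1) / n for i, s in enumerate(frame)]


def resume_transmission(history: Sequence[Sequence[float]], silent_count: int,
                        start_frames: int, rtp_timestamp: int,
                        frame_len: int) -> Tuple[List[PacketTuple], int]:
    """Return the packets sent on resumption and the DelayCount increase.

    history must hold at least two frames; history[-1] is the current frame.
    """
    # SC1-SC3: choose how many frames to go back.
    n = start_frames if silent_count > start_frames else silent_count - 2
    # Guard added for this sketch only (assumption).
    n = max(1, min(n, len(history) - 1))
    packets: List[PacketTuple] = []
    # SC4-SC6: the N-th previous frame is faded in and sent with M = 1.
    packets.append((rtp_timestamp - frame_len * n, 1, fade_in(history[-1 - n])))
    # SC7: the remaining frames up to the current one follow unchanged.
    for k in range(n - 1, -1, -1):
        packets.append((rtp_timestamp - frame_len * k, 0, list(history[-1 - k])))
    # SC8: the caller adds this amount to DelayCount.
    return packets, frame_len * n
```

With StartFrames = 3, a long pause, a current time stamp of 100, and FrameLen = 4, the function emits four packets stamped 88, 92, 96, and 100, and only the first carries the marker bit.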
[0150]
At this time, it is possible to suppress the increase in delay by thinning out the silent audio data frame or the steady portion of the N frames by signal processing, or by thinning out the future frame only when the DelayCount is positive. In that case, DelayCount is decreased by the thinned amount.
[0151]
According to the first example of the second embodiment, when the speech function button is pressed, speech is started, and the state changes from silent to voiced, transmission goes back N frames from the frame at the moment the button was pressed and begins with a packet including the voice data frame obtained by fading in that earlier frame. Therefore, the beginning of the speech is not lost when the silent state returns to the voiced state.
[0152]
Furthermore, as in the first embodiment, a fade-out process is applied to the audio data frame at the change from the voiced state to the silent state; that is, the voice level of the tail of the speech is gradually reduced. Since the speech waveform is not discontinuous at the transition from the voiced state to the silent state, no abnormal sound is generated there, and deterioration of the speech quality is reduced.
[0153]
(Second Example of Transmission Resume Processing)
FIG. 13 is a diagram for explaining packetization of an audio signal in the transmission resumption process of the second example, and FIG. 14 is a flowchart for explaining the transmission resumption process of the second example.
[0154]
In the transmission resumption process of the second example, it is determined whether or not the value of the silence determination count SilentCount is larger than the value of the constant StartFrames, that is, whether or not the transmission stop time is sufficiently long (SD1). When SilentCount is larger than StartFrames, the number N of audio data frames to go back and transmit when transmission is resumed is set to the value of StartFrames (SD2). When SilentCount is equal to or less than StartFrames, N is set to the value of SilentCount minus 2 (N = SilentCount - 2) (SD3).
[0155]
Next, the variable i is set to 1 (i = 1) (SD4), and the power p(i) of the audio data frame i frames before the current audio data frame, among the audio data frames held in the buffer, is calculated (SD5). It is then determined whether the power p(i) is less than or equal to a predetermined threshold, or whether the value of the variable i is greater than or equal to (N - 1) (SD6).
[0156]
As a result of this determination, when the power p(i) is larger than the threshold and the value of the variable i is smaller than (N - 1), the value of the variable i is increased by 1 (SD7) and the process returns to SD5. When the power p(i) is equal to or less than the threshold, or the value of the variable i is equal to or greater than (N - 1), the audio data frame i frames before the current audio data frame is subjected to the fade-in process (SD8), and the audio data frame subjected to the fade-in process is encoded (SD9).
[0157]
Thereafter, a packet including the audio data frame subjected to the fade-in process and the RTP time stamp of the i-th previous frame (RTPTimeStamp - FrameLen * i) (where * represents multiplication) is generated and transmitted (SD10). At this time, the marker bit M in the RTP header is set to “1”.
[0158]
Next, the remaining audio data frames held in the buffer, from the one following the fade-in frame up to the latest, are sequentially encoded, and packets containing each encoded audio data frame and the corresponding RTP time stamp (RTPTimeStamp - (i - j) * FrameLen) are generated and transmitted in order (SD11). Here, j is an integer from 1 to i. The increase in delay can also be suppressed by using signal processing to thin out silent audio data frames and steady-state portions.
[0159]
Further, the value of the delay increase counter DelayCount is increased by i frame lengths (DelayCount += FrameLen * i) (SD12), and the transmission restart process is terminated.
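The SD1 to SD12 flow can be sketched in Python as follows. This is an illustrative reading of the steps above, not the patented implementation: the function name plan_restart, the list layout of powers, and the numeric values are assumptions, and the sketch only plans which frames to send and with which time stamps rather than encoding anything.

```python
def plan_restart(silent_count, start_frames, powers, threshold,
                 rtp_timestamp, frame_len):
    """Plan the retroactive transmission when sending resumes.

    powers[k - 1] is assumed to hold p(k), the power of the frame k frames
    before the current one; it must cover at least max(1, N - 1) frames.
    """
    # SD1-SD3: choose how far back transmission may go.
    if silent_count > start_frames:
        n = start_frames
    else:
        n = silent_count - 2

    # SD4-SD7: walk back until a quiet frame is found or the limit is reached.
    i = 1
    while not (powers[i - 1] <= threshold or i >= n - 1):
        i += 1

    # SD8-SD10: the frame i frames back is faded in and sent with M = 1.
    fade_in_timestamp = rtp_timestamp - frame_len * i

    # SD11: the following frames are sent in order, j = 1 .. i.
    following_timestamps = [rtp_timestamp - (i - j) * frame_len
                            for j in range(1, i + 1)]

    # SD12: the extra frames sent increase the accumulated delay.
    delay_increase = frame_len * i
    return i, fade_in_timestamp, following_timestamps, delay_increase


# Hypothetical values: 160-sample frames, onset detected two frames late.
plan = plan_restart(silent_count=10, start_frames=5,
                    powers=[9.0, 8.0, 0.5, 0.1], threshold=1.0,
                    rtp_timestamp=1600, frame_len=160)
```

In this example the search stops at i = 3, so the fade-in frame carries time stamp 1120 and the three following frames carry 1280, 1440 and 1600.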
[0160]
According to the second example of the second embodiment, when the utterance function button is pressed to start an utterance and the state changes from silent to voiced, transmission goes back from the frame at the moment the button was pressed through the preceding frames whose voice data power is equal to or greater than the threshold, to the frame i frames before the current frame, and begins with a packet containing the voice data frame that has undergone the fade-in process. The beginning of the utterance is therefore not lost when the silent state returns to the voiced state.
[0161]
Furthermore, as in the first embodiment, the fade-out process is applied to the voice data frame at the transition from the voiced state to the silent state, that is, the voice level of the tail portion is gradually reduced. The speech waveform therefore does not become discontinuous at the transition from the voiced state to the silent state, no abnormal sound is generated there, and the deterioration of speech quality is reduced.
[0162]
The first and second embodiments can be used in combination with CNG (comfort noise generation), in which the receiving side estimates the background noise and outputs pseudo background noise that it generates during silent periods in which no packet is received. In this case, a continuous transition from the voiced section to the pseudo background noise section is obtained by adding the pseudo background noise, while fading it in, to the last received frame that was subjected to the fade-out process. Likewise, a continuous transition from the pseudo background noise section to the voiced section is obtained by adding the pseudo background noise, while fading it out, to the first packet received as voiced.
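The combination with CNG can be illustrated with a short Python sketch. It assumes linear ramps, frames of at least two samples, and a noise frame already generated by the receiver; the function names blend_into_noise and blend_out_of_noise and the sample values are hypothetical.

```python
def blend_into_noise(faded_out_frame, noise_frame):
    """Add pseudo background noise, faded in linearly, to the last voice frame.

    faded_out_frame is the final received frame, which the sender has already
    faded out; frames are assumed to hold at least two samples.
    """
    n = len(faded_out_frame)
    return [v + noise * k / (n - 1)
            for k, (v, noise) in enumerate(zip(faded_out_frame, noise_frame))]


def blend_out_of_noise(first_voiced_frame, noise_frame):
    """Add pseudo background noise, faded out linearly, to the first voiced frame."""
    n = len(first_voiced_frame)
    return [v + noise * (n - 1 - k) / (n - 1)
            for k, (v, noise) in enumerate(zip(first_voiced_frame, noise_frame))]


# Hypothetical frames: a constant noise level of 2.0 stands in for real CNG output.
noise = [2.0] * 5
leaving_voice = blend_into_noise([8.0, 6.0, 4.0, 2.0, 0.0], noise)
entering_voice = blend_out_of_noise([0.0, 2.0, 4.0, 6.0, 8.0], noise)
```

At the frame boundary the output equals the noise level alone, so the noise-only section and the blended frame join without a step.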
[0163]
Next, a third embodiment of the present invention will be described.
[0164]
FIG. 15 is a block diagram showing a functional configuration of a voice packet communication system according to the third embodiment of the present invention. In the figure, the same components as those of the first embodiment described above are denoted by the same reference numerals, and description thereof is omitted.
[0165]
The third embodiment differs from the first embodiment in that it includes an encoding processing unit 16 ′ instead of the encoding processing unit 16 of the first embodiment, and a packet analysis unit 22 ′ and a decoding processing unit 23 ′ instead of the packet analysis unit 22 and the decoding processing unit 23.
[0166]
As will be described later, the encoding processing unit 16 ′ compares the S / N (signal-to-noise ratio) obtained when data analysis and encoding are performed with reference to the analysis initial value, which is the analysis result used at the transition from the silent state to the voiced state, with the S / N obtained when data analysis and encoding are performed with reference to the temporarily stored analysis result, and uses the encoded speech data frame having the better S / N.
[0167]
In addition to the functions of the packet analysis unit 22, the packet analysis unit 22 ′ has a function of analyzing the received packet and sending reset information to the decoding processing unit 23 ′ when the marker bit M of the RTP header is “1”.
[0168]
In addition to the functions of the decoding processing unit 23, the decoding processing unit 23 ′ has a function of performing data analysis and decoding with reference to the analysis initial value, instead of the temporarily stored analysis result, only when the reset information is received from the packet analysis unit 22 ′.
[0169]
FIG. 16 is a functional block diagram showing the encoding processing unit 16 ′. As shown in the figure, the encoding processing unit 16 ′ comprises an input audio data holding unit 161, encoding units 162 and 163, an internal data holding unit 164, encoded audio data holding units 165 and 166, local decoding units 167 and 168, a first error calculation unit 169, a second error calculation unit 170, an error comparison unit 171, and a switch unit 172.
[0170]
The input audio data holding unit 161 holds the input audio data frame, and supplies the audio data frame to the encoding units 162 and 163, the first error calculation unit 169, and the second error calculation unit 170.
[0171]
The encoding unit 162 encodes the audio data frame supplied from the input audio data holding unit 161 based on the data held in the internal data holding unit 164, and outputs the result to the encoded audio data holding unit 165.
[0172]
The encoding unit 163 encodes the audio data frame supplied from the input audio data holding unit 161 and outputs the result to the encoded audio data holding unit 166. Here, encoding is always performed with the internal state reset, that is, without referring to the previous encoding state.
[0173]
The internal data holding unit 164 holds the encoded audio data obtained by encoding the audio data frame in the encoding unit 162 and, when the encoding unit 162 encodes the next audio data frame, supplies the held encoded audio data to the encoding unit 162.
[0174]
The encoded audio data holding unit 165 temporarily holds the encoded audio data encoded by the encoding unit 162, and outputs the encoded audio data to the local decoding unit 167 and the switch unit 172.
[0175]
The encoded audio data holding unit 166 temporarily holds the encoded audio data encoded by the encoding unit 163, and outputs the encoded audio data to the local decoding unit 168 and the switch unit 172.
[0176]
The local decoding unit 167 outputs the audio data obtained by decoding the encoded audio data supplied from the encoded audio data holding unit 165 to the first error calculation unit 169.
[0177]
The local decoding unit 168 outputs the audio data obtained by decoding the encoded audio data supplied from the encoded audio data holding unit 166 to the second error calculation unit 170.
[0178]
The first error calculation unit 169 obtains the error (encoding error (S / N)) between the audio data supplied from the input audio data holding unit 161 and the audio data input from the local decoding unit 167, and outputs it to the error comparison unit 171. Further, the first error calculation unit 169 erases and initializes the data held in the internal data holding unit 164 when transitioning from the silent state to the voiced state.
[0179]
The second error calculation unit 170 obtains the error (encoding error (S / N)) between the audio data supplied from the input audio data holding unit 161 and the audio data input from the local decoding unit 168, and outputs it to the error comparison unit 171.
[0180]
The error comparison unit 171 compares the error input from the first error calculation unit 169 (encoding error (S / N)) with the error input from the second error calculation unit 170 (encoding error (S / N)) and, based on the comparison result, switches the switch unit 172 so that the encoded audio data frame having the better (smaller) encoding error is output to the packet generation unit 17. Furthermore, as described above, the error comparison unit 171 notifies the packet generation unit 17 to set the marker bit M of the RTP header to “1” when the silent state changes to the voiced state.
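The selection performed by the encoding processing unit 16 ′ can be sketched in Python as follows. The toy differential codec, its step size, and all function names are assumptions introduced only so that the example runs; the source does not specify the codec. The sketch encodes each frame twice, once with the held internal state (encoding unit 162) and once from a reset state (encoding unit 163), decodes both locally, and keeps the result with the higher S / N.

```python
import math

STEP = 4  # quantizer step of the toy codec (assumed value)


def encode(frame, state):
    """Toy differential codec: quantize each sample's difference from the prediction."""
    codes, pred = [], state
    for s in frame:
        c = max(-8, min(7, round((s - pred) / STEP)))  # 4-bit code range
        codes.append(c)
        pred += c * STEP
    return codes, pred


def decode(codes, state):
    """Local decoder matching encode(); returns the samples and the final state."""
    out, pred = [], state
    for c in codes:
        pred += c * STEP
        out.append(pred)
    return out, pred


def snr_db(original, decoded):
    """Signal-to-noise ratio in dB between the input and the locally decoded frame."""
    noise = sum((o - d) ** 2 for o, d in zip(original, decoded))
    signal = sum(o * o for o in original)
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


def select_encoding(frame, held_state):
    """Return (codes, new_state, marker), where marker=True means the reset result won."""
    codes_a, state_a = encode(frame, held_state)  # uses the held internal data
    codes_b, state_b = encode(frame, 0)           # internal state always reset
    snr_a = snr_db(frame, decode(codes_a, held_state)[0])
    snr_b = snr_db(frame, decode(codes_b, 0)[0])
    if snr_b > snr_a:
        return codes_b, state_b, True   # marker bit M would be set to 1
    return codes_a, state_a, False


continuing = select_encoding([100, 104, 108], held_state=100)
after_jump = select_encoding([0, 4, 8], held_state=200)
```

In the first call the held state predicts the frame well, so the continuing encoder wins; in the second the held state is far from the signal, so the reset encoder wins and the marker is raised.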
[0181]
Next, operations of the transmission device 1 and the reception device 2 in the third embodiment having the above-described configuration will be described in detail with reference to the flowcharts shown in FIGS.
[0182]
In the transmission apparatus 1, initialization processing is performed immediately after the start of driving (SE1). In this initialization process, the value of the silent determination count SilentCount, which is a variable, is set to “0” and the value of RTPTimeStamp is set to “0”.
[0183]
Next, when the transmission apparatus 1 starts processing, the audio signal input via the audio input unit 11 is sequentially stored in the buffer of the utterance / non-utterance determination unit 13 via the A / D conversion unit 12 (SE2).
[0184]
Next, the transmitting apparatus 1 determines, in order from the head audio data stored in the buffer of the utterance / non-utterance determination unit 13, whether the power of the audio data frame under determination is equal to or lower than the threshold, that is, whether the frame is in the silent state or the voiced state (SE3). If the frame is in the voiced state, with its power greater than the threshold, the process proceeds to SE19 described later.
[0185]
When the audio data frame is in the silent state, with its power at or below the threshold, the silence determination count SilentCount is increased by “1” (SE4), and it is determined whether the value of SilentCount is larger than the count threshold SilentThres, that is, whether the transmission pause state applies (SE5).
[0186]
As a result of this determination, when the value of the silence determination count SilentCount is greater than the count threshold SilentThres, that is, in the transmission pause state, the process proceeds to SE12 described later. When SilentCount is equal to or less than SilentThres, it is determined whether the value of SilentCount is equal to SilentThres, that is, whether the transmission pause state is about to start (SE6).
[0187]
As a result of this determination, when the silence determination count SilentCount is not equal to the count threshold SilentThres, the process proceeds to SE8 described later. When SilentCount is equal to SilentThres, the current audio data frame is subjected to the fade-out process (SE7).
[0188]
Next, the audio data frame subjected to the fade-out process is encoded (SE8), a packet including the encoded audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated, and this packet is transmitted (SE9).
[0189]
Thereafter, the RTP time stamp RTPTimeStamp is increased by the frame length (SE10). That is, a value obtained by adding the frame length FrameLen value to the RTP time stamp RTPTimeStamp value is set as a new RTP time stamp RTPTimeStamp value.
[0190]
Next, the current audio data frame is held in the buffer (SE11), and the process proceeds to SE2.
[0191]
On the other hand, when the determination in SE5 finds that the silence determination count SilentCount is greater than the count threshold SilentThres, that is, the transmission pause state applies, it is determined whether SilentCount is equal to SilentThres + 1, that is, whether the transmission pause state has only just been entered (SE12).
[0192]
As a result of this determination, when the value of the silence determination count SilentCount is not equal to the count threshold SilentThres + 1, the process proceeds to SE16 described later. When SilentCount is equal to SilentThres + 1, an ideal silent audio data frame is generated (SE13), the silent audio data frame is encoded (SE14), and a packet including the encoded ideal silent audio data frame and the corresponding RTP time stamp RTPTimeStamp is generated and transmitted (SE15). Thereafter, the process proceeds to SE10.
[0193]
As a result of the determination in SE12, when the silence determination count SilentCount is not equal to the count threshold SilentThres + 1, the value obtained by subtracting the frame length FrameLen from the delay count DelayCount is set as the new delay count DelayCount (SE16), and it is determined whether the value of DelayCount is 0 or less (SE17).
[0194]
If the result of this determination is that the value of the delay count DelayCount is greater than 0, the process proceeds to SE10. If the value of DelayCount is 0 or less, it is set to 0 (SE18), and the process proceeds to SE10.
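The handling of a silent frame in SE4 to SE18 amounts to a small decision table, sketched below in Python. The function names and the action labels are hypothetical, and the sketch assumes that a frame arriving before the count reaches SilentThres is encoded and sent without fade-out, which is one reading of the SE6 branch.

```python
def silent_frame_action(silent_count, silent_thres):
    """Return (new_silent_count, action) for a frame at or below the power threshold."""
    silent_count += 1                              # SE4
    if silent_count > silent_thres:                # SE5: transmission pause state
        if silent_count == silent_thres + 1:       # SE12: pause has only just begun
            return silent_count, "send_ideal_silence"   # SE13-SE15
        return silent_count, "send_nothing"        # SE16-SE18: only the delay decays
    if silent_count == silent_thres:               # SE6: last frame before the pause
        return silent_count, "send_faded_out"      # SE7-SE9
    return silent_count, "send_normal"             # assumed: SE8-SE9 without fade-out


def decay_delay(delay_count, frame_len):
    """SE16-SE18: reduce the accumulated delay by one frame, never below zero."""
    return max(0, delay_count - frame_len)


# Hypothetical run: five consecutive silent frames with SilentThres = 3.
count, actions = 0, []
for _ in range(5):
    count, action = silent_frame_action(count, 3)
    actions.append(action)
```

The run yields two normally sent frames, one faded-out frame, one ideal silent frame, and then no further packets.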
[0195]
On the other hand, when the determination in SE3 finds that the audio data frame is in the voiced state, with its power greater than the threshold, it is determined whether the silence determination count SilentCount is greater than the count threshold SilentThres, that is, whether the transmission pause state applied up to the previous frame (SE19).
[0196]
As a result of this determination, when the value of the silence determination count SilentCount is equal to or less than the count threshold SilentThres, the process proceeds to SE28 described later. When SilentCount is larger than SilentThres, the previous audio data frame held in the buffer is subjected to the fade-in process (SE20), the encoding processing unit 16 ′ is initialized (SE21), and the audio data frame subjected to the fade-in process is encoded (SE22).
[0197]
Next, a packet including an audio data frame subjected to the fade-in process and an RTP time stamp (RTPTimeStamp-FrameLen) corresponding to the audio data frame is generated and transmitted (SE23). At this time, the marker bit M of the RTP header is set to “1”.
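For illustration, the packet sent in SE23 can be built as in the following Python sketch. The 12-byte fixed header layout follows RFC 3550 (version 2, no padding, no extension, no CSRC entries); the function name build_rtp_packet, the payload type 0, the sequence number, the SSRC, and the frame length of 160 samples are assumptions chosen for the example.

```python
import struct


def build_rtp_packet(payload, seq, timestamp, ssrc, marker, payload_type=0):
    """Build a minimal RTP packet with a 12-byte fixed header (RFC 3550 layout)."""
    byte0 = 2 << 6                                  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # marker bit M and payload type
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


FRAME_LEN = 160        # assumed samples per frame
rtp_timestamp = 1600   # RTPTimeStamp of the current frame (hypothetical value)

# SE23: the faded-in previous frame goes out one frame earlier, with M = 1.
packet = build_rtp_packet(b"\x00" * 4, seq=7,
                          timestamp=rtp_timestamp - FRAME_LEN,
                          ssrc=0x1234, marker=True)
```

The receiver can then detect the return from silence by testing the top bit of the second header byte.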
[0198]
Thereafter, the current audio data frame, that is, the audio data frame following the audio data frame subjected to the fade-in process, is encoded (SE24), and a packet including the encoded audio data frame and the current RTP time stamp RTPTimeStamp is generated and transmitted (SE25).
[0199]
Next, the value of the delay increase counter DelayCount is increased by the frame length FrameLen (SE26), and the silence determination count SilentCount is initialized to “0” (SE27). Thereafter, the process proceeds to SE10.
[0200]
On the other hand, if the result of the determination in SE19 is that the silence determination count SilentCount is less than or equal to the count threshold SilentThres, the current audio data frame is encoded by the encoding unit 162 (SE28), and the current audio data frame is also encoded by the encoding unit 163 (SE29). Here, since the internal state of the encoding unit 163 is reset, its encoding is performed without referring to the encoded data of the previous encoding process.
[0201]
Next, the S / N of the encoded audio data frame encoded by the encoding unit 162 is compared with the S / N of the encoded audio data frame encoded by the encoding unit 163 (SE30), and a packet including the encoded audio data frame having the better S / N and the corresponding RTP time stamp is generated and transmitted (SE31). At this time, when the audio data frame encoded by the encoding unit 163 is used, the marker bit M of the RTP header is set to “1”, that is, the silence return flag is set.
[0202]
Thereafter, the value of the silent determination count SilentCount is set to “0” and initialized (SE32), and then the process proceeds to SE10.
[0203]
According to the transmission apparatus of the above embodiment, the S / N (signal-to-noise ratio) of the audio data in the audio data frame encoded based on the retained analysis result is compared with the S / N of the audio data in the audio data frame encoded based on the initial value of the analysis result, and the audio data frame with the larger S / N is adopted as the audio data frame for packet generation. Because the encoding can thus exploit the correlation with the previous audio data frame whenever doing so is advantageous, audio data that sounds natural when reproduced can be obtained on the receiving side.
[0204]
Next, the operation of the receiving device 2 will be described with reference to the flowchart shown in FIG.
[0205]
When the reception device 2 starts the reception process, the reception device 2 analyzes the received packet (SF1), separates the RTP header and the audio data frame, and analyzes the information of the RTP header.
[0206]
Based on the above analysis, it is determined whether the marker bit M (silence return flag) of the RTP header is set to “1”, that is, whether the flag is on (SF2). When the marker bit M is “0”, the process proceeds to SF4 described later. When the marker bit M is “1”, the packet analysis unit 22 ′ notifies the decoding processing unit 23 ′ of the reset information, and the internal state of the decoding processing unit 23 ′ is initialized (reset) (SF3).
[0207]
Thereafter, the received encoded audio data frame is decoded (SF4), and the audio reproduction process of the decoded audio data frame is performed (SF5).
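The reception steps SF1 to SF4 can be sketched in Python as follows. The ToyDecoder class, its step size, and the one-byte-per-sample payload format are assumptions standing in for the real codec; the header is assumed to be the 12-byte fixed RTP header with no CSRC entries or extension.

```python
class ToyDecoder:
    """Stand-in stateful decoder: each payload byte is a signed delta code."""

    STEP = 4  # assumed quantizer step

    def __init__(self):
        self.state = 0

    def reset(self):
        """Return to the analysis initial value (SF3)."""
        self.state = 0

    def decode(self, payload):
        out = []
        for b in payload:
            delta = b - 256 if b > 127 else b  # interpret the byte as signed
            self.state += delta * self.STEP
            out.append(self.state)
        return out


def receive(packet, decoder):
    """SF1-SF4: read the marker bit, reset the decoder if it is set, then decode."""
    marker = (packet[1] >> 7) & 1      # SF1-SF2: marker bit M of the RTP header
    if marker:
        decoder.reset()                # SF3: initialize the internal state
    return decoder.decode(packet[12:])  # SF4: payload follows the fixed header


# Hypothetical packets: identical payloads, with and without the marker bit.
payload = bytes([1, 1])
with_marker = bytes([0x80, 0x80]) + bytes(10) + payload
without_marker = bytes([0x80, 0x00]) + bytes(10) + payload

decoder = ToyDecoder()
decoder.state = 200                        # stale state left from earlier speech
resynced = receive(with_marker, decoder)   # decoded from the reset state
decoder.state = 200
continued = receive(without_marker, decoder)  # decoded from the held state
```

With the marker set, the decoder starts from the same initial state as the reset encoder, so the stale value 200 does not leak into the output.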
[0208]
According to the above reception process, the marker bit M lets the receiving side recognize a reset of the internal state of the transmitting-side encoder, not only in the voice data portion at the transition from the silent state to the voiced state but also in voice data portions where resetting the internal state of the encoder does not affect the coding gain. The internal state of the transmitting-side encoder and that of the receiving-side decoder can therefore be kept synchronized. In addition, performing this reset process makes it possible to recover from a mismatch between the internal states of the transmitting and receiving sides caused by a transmission error such as packet loss, so that the loss of quality is reduced.
[0209]
Each of the above embodiments is a specific example of the present invention, and the present invention is not limited to the configuration of the above embodiment.
[0210]
[Effect of the invention]
  As explained above, according to the present invention, for example, a fade-in process that gradually increases the sound level of the voice data portion entering the voiced state is performed, so that the speech waveform does not become discontinuous on the receiving side at the transition from the silent state to the voiced state. Therefore, no abnormal noise is generated at the transition portion, and deterioration of voice quality can be reduced.
[0214]
Further, according to the present invention, since information indicating a voiced start state is included in the first packet when packet transmission is resumed after having been stopped in a silent state, the receiving side can recognize that the silent state has returned to the voiced state.
[0216]
In addition, according to the present invention, since the information indicating the voiced start state is included in the transmission packet, the reception side can easily recognize the start of the voiced state.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of a voice packet communication system according to a first embodiment of the present invention.
FIG. 2 is a diagram for explaining packetization of a voice signal by the voice packet transmitting apparatus according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining a real-time transfer protocol header used in the first embodiment of the present invention.
FIG. 4 is a flowchart illustrating packet transmission processing of the transmission device according to the first embodiment of the present invention.
FIG. 5 is a flowchart illustrating packet transmission processing of the transmission device according to the first embodiment of the present invention.
FIG. 6 is a flowchart for explaining packet transmission processing of the transmission apparatus according to the first embodiment of the present invention.
FIG. 7 is a block diagram showing a functional configuration of a voice packet communication system according to the second embodiment of the present invention.
FIG. 8 is a flowchart for explaining packet transmission processing of the transmission device according to the second embodiment of the present invention.
FIG. 9 is a flowchart for explaining packet transmission processing of the transmission device according to the second embodiment of the present invention.
FIG. 10 is a flowchart for explaining packet transmission processing of the transmission device according to the second embodiment of the present invention.
FIG. 11 is a diagram for explaining packetization of a voice signal in the transmission restart process according to the first example of the second embodiment of the present invention;
FIG. 12 is a flowchart illustrating transmission restart processing according to the first example of the second embodiment of the present invention;
FIG. 13 is a diagram for explaining packetization of a voice signal in the transmission resuming process according to the second example of the second embodiment of the present invention;
FIG. 14 is a flowchart illustrating transmission restart processing according to the second example of the second embodiment of the present invention;
FIG. 15 is a block diagram showing a functional configuration of a voice packet communication system according to a third embodiment of the present invention.
FIG. 16 is a functional block diagram showing an encoding processing unit according to the third embodiment of the present invention.
FIG. 17 is a flowchart for explaining the operation of the transmission apparatus according to the third embodiment of the present invention.
FIG. 18 is a flowchart for explaining the operation of the transmission apparatus according to the third embodiment of the present invention.
FIG. 19 is a flowchart for explaining the operation of the transmission apparatus according to the third embodiment of the present invention.
FIG. 20 is a flowchart for explaining the operation of the receiving apparatus according to the third embodiment of the present invention.
[Explanation of symbols]
DESCRIPTION OF SYMBOLS 1 ... Voice packet transmitter, 2 ... Voice packet receiver, 3 ... Communication network, 11 ... Voice input unit, 12 ... Analog / digital (A / D) conversion unit, 13 ... Sound / silence determination unit, 14 ... Switch, 15 ... Fade-in / fade-out processing unit, 16, 16 ′ ... Encoding processing unit, 17 ... Packet generation unit, 18 ... Transmission unit, 19 ... Transmission determination processing unit, 21 ... Reception unit, 22, 22 ′ ... Packet analysis unit, 23, 23 ′ ... Decoding processing unit, 24 ... Analog / digital (A / D) conversion unit, 25 ... Audio output unit, 161 ... Input audio data holding unit, 162, 163 ... Encoding unit, 164 ... Internal data holding unit, 165, 166 ... Encoded audio data holding unit, 167, 168 ... Local decoding unit, 169 ... First error calculation unit, 170 ... Second error calculation unit, 171 ... Error comparison unit, 172 ... Switch unit.

Claims

In a voice packet transmitting apparatus that comprises voice data frame generation means for generating voice data frames by cutting voice data based on continuously input voice at predetermined time intervals, generates a packet including the generated voice data frame, and transmits the packet via a communication network,
Means for holding the input audio data frame;
Voiced / silent determination means for determining whether the input voice data frame is voiced or silent;
Means for stopping transmission of packets when a state in which the voiced / silent determination means determines the input voice data frame to be silent continues for a certain period of time or longer;
Means for resuming packet transmission when, in a state where the input voice data frame has been determined to be silent for the certain period of time or longer and packet transmission is stopped, the voiced / silent determination means determines the input voice data frame to be voiced;
Means for, when resuming the transmission of packets, taking as a reanalysis frame at least one silent data frame preceding the voice data frame determined to be voiced, and reanalyzing the reanalysis frame using information on the voice data frame determined to be voiced and information on the voice data frames determined to be silent;
Means for, when the reanalysis determines that the reanalysis frame is close to voiced, converting the voice data frame immediately before the reanalysis frame into voice data whose voice level is gradually decreased from the end toward the beginning, transmitting a packet including the converted voice data frame as the head of the voiced state, then transmitting the data frames lying between that frame and the voice data frame determined to be in the voiced state, and then transmitting the voice data frame determined to be in the voiced state;
And means for, when the reanalysis determines that the reanalysis frame is close to silent, converting the reanalysis frame into voice data whose voice level is gradually decreased from the end toward the beginning, transmitting a packet including the converted voice data frame as the head of the voiced state, and then transmitting the voice data frame determined to be in the voiced state; a voice packet transmitting apparatus characterized by comprising the above means.

The voice packet transmitting apparatus according to claim 1, further comprising means for, when at least one frame before the voice data frame determined to be voiced has been transmitted as the head of the voiced state on resuming packet transmission, shortening the subsequent audio by the number of samples corresponding to the delay added by the extra silent frames transmitted.

The voiced / silent determination means comprises means for determining that the sound is in a silent state when the voice level of the input voice frame is equal to or lower than a predetermined threshold level. The voice packet transmitting device described.

The voiced / silent determination means comprises means for determining that a voiced state is present when a voice level of an input voice frame is equal to or higher than a predetermined threshold level. The voice packet transmitting device described in 1.

In a voice packet transmitting apparatus that comprises audio data frame generating means for generating audio data frames by cutting audio data based on continuously input audio at predetermined time intervals and an utterance function button for acquiring a user's utterance intention, generates a packet including the generated audio data frame, and transmits packets including audio data frames through a communication network only while the utterance function button is pressed,
An utterance function button pressing determination means for determining whether or not the utterance function button is being pressed;
Means for stopping packet transmission, treating the state as an utterance pause state, when the utterance function button pressing determination means determines that the utterance function button has changed from being pressed to not being pressed;
And means for, when stopping transmission, converting the last audio data frame to be transmitted into audio data whose audio level is gradually decreased from the first sample toward the last sample, and transmitting the converted audio data frame as the last packet; a voice packet transmitting device characterized by comprising the above means.

Means for stopping packet transmission when the utterance function button pressing determining means determines that the utterance function button has changed from being pressed to not being pressed;
Means for transmitting at least one voice frame input after it is determined that the speech has been suspended when stopping transmission;
Means for converting the audio level of the last audio data frame to be transmitted into audio data that is gradually decreased from the first sample toward the last sample, and for transmitting the converted audio data frame as a final packet. The voice packet transmitting apparatus according to claim 5 , wherein:

After transmitting a packet including an audio data frame in which the audio level is gradually decreased from the first sample to the last sample, one more silence audio data frame is generated, and the packet including the silence audio data frame is transmitted. The voice packet transmitting device according to claim 5, further comprising: means.

Means for resuming the stopped packet transmission, treating the state as an utterance start state, when the utterance function button press determination means determines that the utterance function button has changed from not being pressed to being pressed;
When resuming packet transmission, the first audio data frame to be transmitted is converted into audio data whose audio level is gradually decreased from the last sample to the first sample, and the converted audio data is used as the head of the speech state. The voice packet transmitting device according to any one of claims 5 to 7 , further comprising: a unit that transmits a packet including a frame.

Means for resuming the stopped packet transmission, treating the state as an utterance start state, when the utterance function button press determination means determines that the utterance function button has changed from not being pressed to being pressed;
When resuming packet transmission, at least one voice data frame before the first voice data frame input after the voice function button is pressed is transmitted, and then the voice function button is pressed. The voice packet transmitting device according to any one of claims 5 to 8 , further comprising a means for transmitting the first voice data frame input after entering the switched state.

When resuming packet transmission in the utterance start state, an extra transmission was made when at least one frame before the voice data frame determined to be in the utterance start state was transmitted as the head of the transmission frame. The voice packet transmitting device according to claim 8 or 9 , further comprising means for shortening subsequent samples by the number of samples corresponding to the delay increased by the voice data frame.

The voice packet transmitter according to claim 5, further comprising:
means for, when encoding the first speech frame to be transmitted upon changing from the state in which packet transmission is stopped because the speech function button is not pressed to the utterance start state in which packets are transmitted because the speech function button is pressed, encoding the speech frame after initializing the internal state of the encoder; and
means for, when that first frame is packetized and transmitted, including in the packet information indicating that it is the first frame after recovery from silence.

The voice packet transmitting device according to claim 11, wherein the encoding processing means includes:
means for, when encoding the frame, comparing the encoding error obtained when the frame is encoded following the previous frame without initializing the internal state of the encoder with the encoding error obtained when the frame is encoded after initializing the internal state of the encoder, and transmitting the encoding result with the smaller error; and
means for, when the result of encoding the frame after initializing the internal state is selected, including in the transmission packet information indicating that the frame is the first frame after returning from silence.
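The selection between the two encoding results can be illustrated with a toy encoder. Everything about the encoder below is hypothetical: a simple differential quantizer whose only internal state is the last reconstructed sample, with `STEP`, the code range, the squared-error measure, and the `first_after_silence` field chosen for illustration. The claim does not specify the codec, the error measure, or how ties are resolved; this sketch keeps the continued state on a tie.

```python
from typing import Dict, List, Sequence, Tuple

STEP = 4               # hypothetical quantizer step size
Q_MIN, Q_MAX = -8, 7   # hypothetical 4-bit code range


def encode_frame(frame: Sequence[int], state: int) -> Tuple[List[int], int, int]:
    """Differentially encode `frame` starting from predictor `state`.

    Returns (codes, squared_error, new_state).
    """
    pred = state
    codes: List[int] = []
    err = 0
    for s in frame:
        q = max(Q_MIN, min(Q_MAX, round((s - pred) / STEP)))
        pred += q * STEP          # decoder-side reconstruction
        codes.append(q)
        err += (s - pred) ** 2
    return codes, err, pred


def encode_with_optional_reset(
    frame: Sequence[int], state: int
) -> Tuple[Dict[str, object], int]:
    """Encode with and without resetting the state; keep the smaller error."""
    kept_codes, kept_err, kept_state = encode_frame(frame, state)
    reset_codes, reset_err, reset_state = encode_frame(frame, 0)
    if reset_err < kept_err:
        # Flag the packet so the receiver also resets its decoder state.
        return {"codes": reset_codes, "first_after_silence": True}, reset_state
    return {"codes": kept_codes, "first_after_silence": False}, kept_state
```

If the predictor is left at a large stale value from the previous talk spurt (for example 1000) and the new frame starts near zero, the reset path wins and the flag is set. If the state already tracks the signal, the continued path is kept and the flag stays clear.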

In a voice packet transmission method that uses a computer device having means for converting voice input by voice input means into voice data, generates a voice data frame by cutting voice data based on continuously input voice at predetermined time intervals, generates a packet including the voice data frame, and transmits the packet via a communication network,
The computer device includes:
Holding the input audio data frame;
Determining whether the input voice data frame is voiced or silent;
When the state in which the input voice data frame is determined to be silent continues for a certain time or longer, packet transmission is stopped,
When an input audio data frame is determined to be voiced while packet transmission is stopped because input audio data frames have been determined to be silent for the certain time or longer, packet transmission is resumed,
When resuming the transmission of packets, the silent data frame at least one frame before the voice data frame determined to be voiced is used as a reanalysis frame, and the reanalysis frame is re-analyzed using information on the voice frame determined to be voiced and information on the audio data frames determined to be silent,
As a result of the reanalysis, when it is determined that the reanalysis frame is close to voiced, the audio data frame immediately before the reanalysis frame is converted into audio data whose audio level is gradually decreased from the end toward the beginning, a packet including the converted voice data frame is transmitted as the head of the voiced state, the silent data frames between the voice data frame determined to be voiced and the frame immediately before the reanalysis frame are then transmitted, and the voice data frame determined to be voiced is transmitted after that;
The voice packet transmission method, wherein, as a result of the reanalysis, when it is determined that the reanalysis frame is close to silence, the reanalysis frame is converted into audio data whose audio level is gradually decreased from the end toward the beginning, a packet including the converted voice data frame is transmitted as the head of the voiced state, and then the voice data frame determined to be voiced is transmitted.
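The two branches of the resume procedure can be sketched as below. The claim does not state how the reanalysis decides whether the reanalysis frame is "close to voiced", so the criterion used here is only an assumption: the frame's mean absolute level is compared with the level of the voiced frame and with a noise floor estimated from the earlier silent frames. The sketch also assumes the reanalysis frame is the frame directly before the voiced one and uses a linear ramp. All function names are hypothetical.

```python
from typing import List, Sequence


def level(frame: Sequence[int]) -> float:
    """Mean absolute sample value (hypothetical level measure)."""
    return sum(abs(s) for s in frame) / len(frame)


def fade_in_linear(frame: Sequence[int]) -> List[int]:
    """Linear ramp: zero gain at the first sample, full gain at the last."""
    n = len(frame)
    if n < 2:
        return list(frame)
    return [int(round(s * i / (n - 1))) for i, s in enumerate(frame)]


def resume_sequence(
    held_silent: Sequence[Sequence[int]], voiced: Sequence[int]
) -> List[List[int]]:
    """Build the frames to send when transmission resumes.

    `held_silent` holds at least two non-empty silent frames, oldest
    first; the last one is taken as the reanalysis frame.
    """
    reanalysis, before = held_silent[-1], held_silent[-2]
    earlier = held_silent[:-1]
    noise_floor = sum(level(f) for f in earlier) / len(earlier)
    lv = level(reanalysis)
    close_to_voiced = abs(level(voiced) - lv) < abs(lv - noise_floor)
    if close_to_voiced:
        # Fade in one frame earlier, then send the reanalysis frame as is.
        return [fade_in_linear(before), list(reanalysis), list(voiced)]
    # Otherwise the reanalysis frame itself carries the fade-in.
    return [fade_in_linear(reanalysis), list(voiced)]
```

When the reanalysis frame already carries speech-like energy, the fade-in is moved one frame earlier so the onset of the utterance is sent at full level.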

In a voice packet transmission method that uses a computer device having means for converting voice input by voice input means into voice data, generates a voice data frame by cutting voice data based on continuously input voice at predetermined time intervals, generates a packet including the generated voice data frame, and transmits the packet via a communication network only while a speech function button for acquiring a user's intention to speak is pressed,
The computer device includes:
It is determined whether or not the speech function button is pressed,
When it is determined that the speech function button has changed from the pressed state to the not-pressed state, the transmission of packets is stopped,
When stopping transmission, the last audio data frame to be transmitted is converted into audio data whose audio level is gradually decreased from the first sample toward the last sample,
The voice packet transmitting method, wherein the converted voice data frame is transmitted as a final packet.

The computer device includes:
When it is determined that the utterance function button has changed from the not-pressed state to the pressed state, the stopped packet transmission is resumed as the utterance start state,
When resuming packet transmission, the first audio data frame to be transmitted is converted into audio data with the audio level gradually reduced from the last sample to the first sample,
The voice packet transmission method according to claim 14, wherein a packet including the converted voice data frame is transmitted as a head of an utterance state.
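The button-driven behavior of the two method claims above (fade-out on release, fade-in on press) can be sketched as a small state machine. This is an illustration under stated assumptions rather than the claimed method: linear ramps are used, the frame during which the release is detected is treated as the last frame to be transmitted, and the class `PttSender` and its method `on_frame` are hypothetical names.

```python
from typing import List, Sequence


def ramp(frame: Sequence[int], rising: bool) -> List[int]:
    """Apply a linear gain ramp: rising = fade-in, otherwise fade-out."""
    n = len(frame)
    if n < 2:
        return list(frame)
    out = []
    for i, s in enumerate(frame):
        gain = i / (n - 1) if rising else (n - 1 - i) / (n - 1)
        out.append(int(round(s * gain)))
    return out


class PttSender:
    """Decides, frame by frame, what to transmit for a push-to-talk button."""

    def __init__(self) -> None:
        self.was_pressed = False

    def on_frame(self, frame: Sequence[int], pressed: bool) -> List[List[int]]:
        """Return the frames to packetize for this input frame."""
        if pressed and not self.was_pressed:
            to_send = [ramp(frame, rising=True)]    # head of the utterance
        elif pressed:
            to_send = [list(frame)]                 # ordinary frame
        elif self.was_pressed:
            to_send = [ramp(frame, rising=False)]   # final packet
        else:
            to_send = []                            # transmission stopped
        self.was_pressed = pressed
        return to_send
```

Feeding a constant frame `[100, 100, 100, 100, 100]` through the states not pressed, pressed, pressed, not pressed produces nothing, then `[0, 25, 50, 75, 100]`, then the unmodified frame, then `[100, 75, 50, 25, 0]`.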