JP2004053763A

JP2004053763A - Speech encoding transmission system of multipoint controller

Info

Publication number: JP2004053763A
Application number: JP2002208664A
Authority: JP
Inventors: Hisashi Yajima; 矢島　久
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-07-17
Filing date: 2002-07-17
Publication date: 2004-02-19
Anticipated expiration: 2022-07-17
Also published as: JP4108396B2

Abstract

PROBLEM TO BE SOLVED: To achieve high-quality speech transmission even when several speakers talk at once, as can happen in one-to-many or many-to-many teleconferencing, and when speech is uttered in a very noisy environment.
SOLUTION: A speech encoding parameter control part 211 designates one conference terminal, among those connected to an exchange 201, that has been judged to be voiced, and transmits some of the several kinds of speech encoding parameters of that terminal's decoded speech signal to each conference terminal connected to the exchange 201 without performing speech re-encoding on them.
COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、符号励振線形予測（Ｃｏｄｅ　Ｅｘｃｉｔｅｄ　Ｌｉｎｅａｒ　Ｐｒｅｄｉｃｔｉｏｎ：以下、「ＣＥＬＰ」とする）方式をはじめとする情報源符号化方式に基づき符号化された符号化音声信号を用いる電話会議（テレビ会議）システムに適用される多地点制御装置（Ｍｕｌｔｉ−ｐｏｉｎｔ　Ｃｏｎｔｒｏｌ　Ｕｎｉｔ：以下、「ＭＣＵ」とする）に関し、特に、多数話者による発言の混在に対応して復号した音声信号を再符号化する多地点制御装置の音声符号化伝送システムに関するものである。
【０００２】
【従来の技術】
ＭＣＵは、通信ネットワークを介して複数の通信端末を種々の形態で接続し、各通信端末で取り扱われる映像、音声、データ等からなる異種かつ異符号の情報内容を受信および送信の対象とし、各情報内容に合致して交換および分配の処理を施し、処理した情報を複数の通信端末に配信するサービスを提供する装置として開発されてきた。その典型的な適用例は電話会議やテレビ会議のシステムである。ＭＣＵで扱う音声の伝送処理については、圧縮符号化が適用されている。
近年、電話帯域の音声を高能率に圧縮符号化する手法として、ＣＥＬＰ方式をはじめとした情報源符号化方式に基づくものが、主にデジタル携帯電話、国際通信、企業内通信等の分野で実用化されている。その中でも、ＩＴＵ−Ｔ勧告Ｇ．７２９で使用されるＣＳ−ＡＣＥＬＰ（Ｃｏｎｊｕｇａｔｅ　Ｓｔｒｕｃｔｕｒｅ　−　Ａｌｇｅｂｒａｉｃ　ＣｏｄｅＥｘｃｉｔｅｄ　Ｌｉｎｅａｒ　Ｐｒｅｄｉｃｔｉｏｎ　：共役構造の代数符号励振線形予測）方式やＧＳＭ−ＡＭＲ（Ｇｌｏｂａｌ　Ｓｙｓｔｅｍ　ｆｏｒ　Ｍｏｂｉｌｅ　ｃｏｍｍｕｎｉｃａｔｉｏｎｓ　−　Ａｄａｐｔｉｖｅ　Ｍｕｌｔｉ−Ｒａｔｅ）方式等が、国際標準または地域標準方式として採用されている。
【０００３】
音声波の中で、特に有声音は、声帯の振動により発生する振動波に、口腔や鼻腔等の声道の共振特性が加わって生じるものである。元来、ＣＥＬＰ方式はこのような人間の発声機構をモデル化した符号化方式である。そこでは、声帯振動波は、その繰り返し成分を表現するピッチ周期や、変動成分を表現する雑音パラメータで表現する。また、喉、口、鼻を音声が通過する際の声道伝達特性や、唇の放射特性については、線形予測の手法を用いて近似的に表現される。
【０００４】
具体的なＣＥＬＰ方式においては、基本的に２つの符号帳（ｃｏｄｅ　ｂｏｏｋ）である適応符号帳（ａｄａｐｔｉｖｅ　ｃｏｄｅ　ｂｏｏｋ）および雑音符号帳（ｆｉｘｅｄ　ｃｏｄｅ　ｂｏｏｋ）を用いるほか、ＬＳＰ量子化符号帳（Ｌｉｎｅ　Ｓｐｅｃｔｒａｌ　Ｐａｉｒ　ｑｕａｎｔｉｚａｔｉｏｎ　ｃｏｄｅ　ｂｏｏｋ）、利得量子化符号帳（ｇａｉｎ　ｑｕａｎｔｉｚａｔｉｏｎ　ｃｏｄｅ　ｂｏｏｋ）を用いる。適応符号帳は、駆動源信号の周期的な信号成分を表現するものであり、過去の駆動源信号をメモリに蓄積し、フレーム周期毎に更新されるものがよく用いられる。一方、雑音符号帳は、適応符号帳では表現できない非周期的な信号成分を表現するものであり、複数の典型的な信号パターンを固定的に蓄積したものがよく用いられる。なお、このタイプの雑音符号帳は、信号パターンを蓄積するためのメモリ量が膨大になるため、その改良版として、近年、少ない本数のパルスで近似的に非周期的な信号成分を表現する代数符号帳が、よく用いられるようになってきた。また、駆動源信号の周期的・非周期的成分の利得は、利得量子化符号帳を用いてベクトル量子化する。さらに、送信すべき音声を線形予測分析した結果得られた線形予測係数をＬＳＰに変換し、これをＬＳＰ量子化符号帳を用いてベクトル量子化する。
【０００５】
図１６は従来のＣＳ−ＡＣＥＬＰ方式に基づく音声符号化装置の構成例を示すブロック図である。図において、６０４は線形予測分析処理部、６０５はＬＳＰ（Ｌｉｎｅ　Ｓｐｅｃｔｒａｌ　Ｐａｉｒ）量子化処理部、６０６はＬＳＰ量子化符号帳、６０８は多重化部、６０９は逆量子化処理部、６１０は適応符号帳、６１１は代数符号帳、６１２，６１６は加算器、６１３は利得制御増幅部、６１４は利得量子化符号帳、６１５は合成フィルタ、６１７は聴覚重み付フィルタ、６１８は歪最小化部、６２４はＬＰＣ（Ｌｉｎｅａｒ　Ｐｒｅｄｉｃｔｉｖｅ　Ｃｏｄｉｎｇ）・ＬＳＰ変換部である。
【０００６】
次に動作について説明する。
線形予測分析処理部６０４は音声入力からＬＳＰパラメータを得るが、このＬＳＰパラメータは直接ＬＳＰ量子化処理部６０５に入力され、ＬＳＰ量子化符号帳６０６を参照して符号化される。符号化されたＬＳＰパラメータ（符号帳インデックス）は、逆量子化処理部６０９に送出されると共に、多重化部６０８にも送られる。逆量子化処理部６０９では、符号化されたＬＳＰパラメータ（符号帳インデックス）を基にＬＳＰ量子化符号帳６０６を参照して得られたＬＳＰ係数を用いて線形予測係数を計算し、合成フィルタ６１５に供給する。この合成フィルタ６１５を含み、適応符号帳６１０、代数符号帳６１１、加算器６１２，６１６、利得制御増幅部６１３、利得量子化符号帳６１４、聴覚重み付フィルタ６１７で構成される処理ブロック群にて、適応符号帳６１０、代数符号帳６１１、及び利得量子化符号帳６１４の組み合わせを変えることで、複数の音声波形を合成する。歪最小化部６１８において、これらの複数の合成音声波形と入力信号波形との聴覚重み付けエラー電力（＝自乗誤差）を計算し、その中でエラー電力を最小とする適応符号帳６１０、代数符号帳６１１、及び利得量子化符号帳６１４の組み合わせを選択する。いわゆるＡ−ｂ−Ｓ（Ａｎａｌｙｓｉｓ　ｂｙ　Ｓｙｎｔｈｅｓｉｓ）法に基づき音声符号化処理が実行される。このようにして量子化された符号化パラメータ（適応符号帳６１０、代数符号帳６１１、及び利得量子化符号帳６１４等）と、先にＬＳＰ量子化処理部６０５で量子化された量子化パラメータは、多重化部６０８により多重化された後、復号器側に送られる。
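The analysis-by-synthesis (A-b-S) search described in paragraph [0006] can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the G.729 procedure: the codebooks, the gain table, and the one-tap synthesis filter are hypothetical stand-ins, perceptual weighting (filter 617) is omitted, and the search is exhaustive, whereas a practical coder searches the codebooks sequentially.

```python
from itertools import product

def synthesize(excitation, a1):
    """Toy one-pole synthesis filter y[n] = x[n] + a1 * y[n-1] (stand-in for filter 615)."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x + a1 * prev
        out.append(prev)
    return out

def abs_search(target, adaptive_cb, fixed_cb, gain_cb, a1):
    """Try every (adaptive, fixed, gain) combination and keep the one whose
    synthesized waveform has the smallest squared error against the target.
    Returns (error, adaptive index, fixed index, gain index)."""
    best = None
    for (ia, va), (ic, vc), (ig, (gp, gc)) in product(
            enumerate(adaptive_cb), enumerate(fixed_cb), enumerate(gain_cb)):
        excitation = [gp * p + gc * c for p, c in zip(va, vc)]
        synth = synthesize(excitation, a1)
        err = sum((t - s) ** 2 for t, s in zip(target, synth))
        if best is None or err < best[0]:
            best = (err, ia, ic, ig)
    return best
```

Only the three winning indices (plus the LSP index) would be multiplexed and sent; the decoder rebuilds the same excitation from its own copies of the codebooks.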
【０００７】
次に、上記符号化方式を、ＭＣＵを用いた電話会議装置に適用し、１対多、または多対多の通話を行った場合について述べる。
図１７は従来のＭＣＵの構成を示すブロック図であり、図において、２０は会議端末を構成する電話器、２１は交換機、２２は回線インタフェース部、２３は音声復号処理部、２４は音声検出部、２５は雑音抑圧部、２６は音声加算部、２７は分配処理部、２８は自端末音声減算部、２９は自動利得制御部、３０は音声再符号化処理部である。
【０００８】
次に動作について説明する。
まず、回線インタフェース部２２にて受信された符号化音声は、音声復号処理部２３によりチャネル毎に音声信号に復号される。復号された音声信号は、会議端末２０からのそれぞれについて、音声検出部２４により有音／無音の状態が検出される。この検出結果は、音声加算部２６における加算対象端末の決定、雑音抑圧処理部２５の雑音抑圧のための重み付け、自動利得制御部２９による音声レベルの自動調整のために用いられる。
【０００９】
会議参加者数Ｎが多くなると、背景ノイズレベルもＮに比例して大きくなるため、ＳＮ比の低下により通話品質の劣化を招くという問題がある。そこで、雑音抑圧処理部２５では、無音状態にあるチャネルから入力されてくる雑音を小さくするために、各会議参加者の有音／無音の検出結果に従って有音チャネルと無音チャネルのそれぞれに対して会議参加者数Ｎに基づく各重み係数を決定し、この重み係数をそれぞれの復号音声信号に掛ける雑音制御処理を行う。音声加算部２６では、音声検出部２４の検出結果を参照して、加算すべきチャネルの雑音制御処理された復号音声信号を加算する。加算された復号音声信号は、分配処理部２７において、各チャネルに再分配される。
【００１０】
自端末音声減算部２８では、自端末の音声信号の回りこみによるエコーを抑圧し、聴感上のわずらわしさを解消するため、加算された復号音声信号から自端末の信号を減算する。また、多地点の電話会議システムにおいては、多人数の音声を加算することによる飽和歪みを起こす可能性があるため、自動利得制御部２９にて、飽和歪みを防ぎ、なおかつ個人や全体の音声レベルを調整する。自動利得制御部２９の出力信号は、音声再符号化処理部３０にて符号化されて回線に出力される。
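The summing, own-signal subtraction, and saturation guard of paragraphs [0009] and [0010] can be sketched as follows. This is an illustrative sketch only: the function name, the 16-bit limit, and the simple peak scaling used in place of automatic gain control unit 29 are assumptions, and the noise-suppression weighting of unit 25 is left out.

```python
def mix_minus(frames, active, limit=32767):
    """frames: {channel: list of int samples for one frame}
    active: set of channels judged voiced by the voice detector.
    Returns {channel: sum of active channels minus the channel's own signal},
    scaled down when the peak would exceed +/- limit."""
    length = len(next(iter(frames.values())))
    # Sum only the channels selected for addition (voice adder 26).
    total = [sum(frames[ch][n] for ch in active) for n in range(length)]
    out = {}
    for ch in frames:
        # Own-signal subtraction (unit 28); inactive channels contributed nothing.
        own = frames[ch] if ch in active else [0] * length
        mixed = [t - o for t, o in zip(total, own)]
        # Crude saturation guard standing in for automatic gain control (unit 29).
        peak = max((abs(v) for v in mixed), default=0)
        gain = limit / peak if peak > limit else 1.0
        out[ch] = [int(v * gain) for v in mixed]
    return out
```

Each output frame would then go to the re-encoder for its channel, which is where the tandem-connection loss discussed in paragraph [0011] arises.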
【００１１】
図１７に示された多地点制御装置によるシステムの第１の問題点として、復号・再符号化が繰り返されること（以下、「タンデム接続」という）による音声品質の劣化が挙げられる。ＣＥＬＰ方式に基づく音声符号化・復号装置は、非可逆符号化であるため、復号された音声信号は量子化誤差を含んでいる。さらに、これをタンデム接続することにより、再符号化処理にて量子化誤差がさらに蓄積されるため、音声品質の劣化となる。
【００１２】
この問題を解決する方法として、例えば特開２０００−１７４９０９号公報「会議端末制御装置」に示される技術がある。図１８はこの多地点制御装置（ＭＣＵ）の構成を示すブロック図である。図において、２ａ〜２ｍはデマルチプレクサ、４は話者検出回路、６は第一セレクタ、８ａ〜８ｎはデコーダ（音声復号処理部）、１０ａ〜１０ｎは減衰回路、１２は合成回路、１４はエンコーダ（音声再符号化処理部）、１６は第二セレクタ、１８は分配回路である。
【００１３】
次に動作について説明する。
ＭＣＵが各会議端末からの音声情報を受けると、デマルチプレクサ２ａ〜２ｍで有声／無声信号が分離され、その有声／無声信号を用いて、どの会議端末からの圧縮音声符号が有声であるかを話者検出回路４で検出する。また、有声である会議端末の数が計測される。有声、無声の判定がなされると、その情報が第一セレクタ６に送られ、第一セレクタ６において、有声状態の会議端末の数に応じて次のように作動する。
【００１４】
有声状態の会議端末が２台以上あった場合には、第一セレクタ６は、有声状態の会議端末を選択してデコーダ８ａ〜８ｎに対して１対１の関係で接続し、各圧縮音声符号を送出する。デコーダ８ａ〜８ｎの該当するそれぞれは、供給された圧縮音声符号を復号して音声信号を生成する。生成された音声信号は減衰回路１０ａ〜１０ｎで所定の値に減衰される。それぞれの音声信号は、減衰回路１０でレベル調整が行われた後、合成回路により合成される。つまり、有声状態の会議端末からの音声がすべて集められる。合成された音声信号は、エンコーダ１４で再符号化されて第二セレクタ１６へ送られ、分配回路１８により全会議端末に伝送される。
【００１５】
一方、有声状態の会議端末が１台であった場合には、その旨が第一セレクタ６に伝達され、第一セレクタ６でその有声状態の会議端末が選択され、そのまま直接第二セレクタ１６に接続される。第二セレクタ１６に接続された有声状態の会議端末の圧縮音声符号は、そのまま分配回路１８を介して全会議端末に伝送される。
【００１６】
ＭＣＵが図１８に示す構成をとることにより、単一話者のケースでは、符号化音声のデータ全てがパススルーされるため、タンデム接続が回避され、低ビットレートの音声符号化方式を用いても高品質の音声を提供することができる。
【００１７】
また、前述した図１７に示された多地点制御装置によるシステムの第２の問題点として、情報源符号化処理特有の劣化要因、すなわち複数話者による同時発言がある。この場合の劣化は著しく、聴感上聞き苦しくなるという問題があった。
前述した従来のＣＥＬＰ方式に基づく音声符号化装置では、図１６に示したように、元来、単一話者の発声を想定した符号化を行っている。すなわち、声帯音源（駆動ベクトル）、声道情報（線形予測係数）、利得情報などが符号化の過程で抽出されて量子化伝送される。音声の特徴量を示すパラメータはそれぞれ唯一であり、複数話者の混合音声を送信する場合においては、これを精度よく符号化することができない。例えば、複数話者の混合音声の場合、声帯音源は話者毎に異なる複数種類のピッチ周期情報を含んでいるが、その複数種類のピッチ周期を表現する手段が無い。また、複数話者の混合音声の場合、声道情報も単一話者の場合と比較してスペクトル構造が複雑になっている。さらに、それを忠実に表現する量子化パターン（量子化テーブル）が用意されておらず、量子化時の誤差が大きくなる傾向になる。
【００１８】
このような問題を解決する手段として、例えば、特開平１０−２４０２９９号公報「音声符号化及び復号装置」に示される方式がある。図１９はこの公報に開示されたＣＥＬＰ系音声符号化装置の構成を示すブロック図である。図において、３１は複数話者音声分離部、３２_１〜３２_Ｎは長期予測器、３３_１〜３３_Ｎは源音コ―ドブック、３４は反射係数分析部、３５はのど（喉）近似フィルタ、３６_１〜３６_Ｎ，３７は加算器、３８は減算器、３９はエラー分析部である。
【００１９】
次に動作について説明する。
複数話者音声分離部３１は、入力される音声信号の周期的特徴を分析して話者数ｎ（１＜ｎ≦Ｎ）を特定し、この音声信号に含まれる各話者の音声を分離して各話者の源音声Ａ_１〜Ａ_ｎとして出力する。複数話者音声分離部３１で得られた話者数ｎは、反射係数分析部３４に供給される。反射係数分析部３４では、話者数ｎが１人の場合は１０次、２人の場合は１５次、それ以上の場合は２０次というように、話者数ｎに応じた次数で反射係数ｒを算出する。反射係数ｒは、例えば入力音声の自己相関を用いてＦＬＡＴ（固定小数点共分散格子型アルゴリズム）を実行することにより求めることができる。求められた反射係数ｒは、のど近似フィルタ３５の係数として与えられる。
【００２０】
一方、複数話者音声分離部３１で分離された各話者の源音声Ａ_１〜Ａ_ｎは、ｎ個の長期予測器３２_１〜３２_ｎにそれぞれ入力される。長期予測器３２_１〜３２_ｎでは、これらの源音声Ａ_１〜Ａ_ｎと前フレームの源音声との相関関係などから源音声のピッチＬ_１〜Ｌ_ｎを抽出する。これらのピッチＬ_１〜Ｌ_ｎによってそれぞれ復号された信号と源音コードブック３３_１〜３３_ｎからのコードベクトルとが加算器３６_１〜３６_ｎにおいてそれぞれ加算され、各話者についての源音声が復号される。これらの複数話者分の源音声が加算器３７によって加算され、のど近似フィルタ３５で声道の特徴を付与されて局部復号信号となる。この局部復号信号と入力音声とが減算器３８によって減算され、減算器３８からのエラー信号が最小となるようにエラー分析部３９で源音コードブック３３_１〜３３_ｎのインデックスＩ_１〜Ｉ_ｎが順次決定される。
【００２１】
【発明が解決しようとする課題】
従来の多地点制御装置の音声符号化伝送システムは以上のように構成されているので、量子化誤差の蓄積による音声品質の劣化を回避するために、図１８に示した方式を用いた場合、話者検出回路４が、発言者以外の発する短区間の音声（咳払い、相槌など）に反応したり、背景ノイズの変動などによって誤判定を起こしたりする恐れがある。この場合、パススルーと復号・再符号化処理との回路切り替えが頻繁に行われる結果となる。すなわち、図１８において、話者検出回路４の判定結果により、エンコーダ１４の出力（符号化音声信号）とパススルーされた符号化音声信号とがセレクタ６によって切り替わることになる。ところがＣＥＬＰ方式のような低ビットレートの音声符号化方式の場合には、通常、符号器と復号器とが常に一対で動作することにより、高品質の音声を伝送することができる。しかしながら図１８の構成では、セレクタ６の切り替えによって、受信者の会議端末に内蔵されている復号器と対になる符号器が、唯一の発言者の会議端末に内蔵されている符号器とＭＣＵの符号器（エンコーダ１４）とに頻繁にスイッチングされる。このため、音声を受ける端末側では、このスイッチングにより音声が不連続となることにより、また音声品質が頻繁に変動することにより不自然に感じられ、聞き苦しくなるという課題があった。
【００２２】
また、従来の多地点制御装置の音声符号化伝送システムは以上のように構成されているので、情報源符号化処理特有である複数話者の同時発言による著しい劣化を回避するために、図１９に示した方式を用いた場合、話者が増えるたびにピッチ周期情報をその分用意せねばならず、話者の増加に比例してビットレートが増大するため、伝送速度をフレキシブルに変化できる通信網、例えばＡＴＭ網やＩＰ網に代表される非同期伝送網等、適用できる伝送網が限定されてしまうという課題があった。
【００２３】
この発明は、上記の課題を解決するためになされたものであり、１対多または多対多の電話会議で想定される複数話者の同時発言や、高ノイズ環境下での発言、咳払い、相槌などの短区間の発言に対しても高品質な音声伝送を実現できる多地点制御装置の音声符号化伝送システムを得ることを目的とする。
また、この発明は、伝送速度が一定あるいはチャネルあたりの伝送速度に制限を受ける伝送網に対しても適用できる多地点制御装置の音声符号化伝送システムを得ることを目的とする。
【００２４】
【課題を解決するための手段】
この発明に係る多地点制御装置の音声符号化伝送システムは、通信ネットワークを介して複数の通信端末と接続し、各通信端末で取り扱われる音声を符号化した符号化音声信号を受信および送信の対象とし、これら符号化音声信号の情報内容に応じて所定の処理を施し、複数の通信端末に対し処理した符号化音声信号を配信する多地点制御装置の音声符号化伝送システムにおいて、各通信端末から受信した符号化音声信号を復号して復号音声信号を生成し、復号音声信号が有音である通信端末を判定し、有音であると判定した通信端末の１つを指定し、その指定した通信端末の復号音声信号における複数種類の音声符号化パラメータのうち一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、指定した通信端末の復号音声信号における複数種類の音声符号化パラメータのうち他の音声符号化パラメータおよび他の通信端末の復号音声信号における複数種類の音声符号化パラメータについては音声再符号化処理を施した後に各通信端末に送信するように構成したものである。
【００２５】
この発明に係る多地点制御装置の音声符号化伝送システムは、１つの通信端末からの復号音声信号のみが有音である場合には１つの通信端末を指定し、その指定した通信端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信するように構成したものである。
【００２６】
この発明に係る多地点制御装置の音声符号化伝送システムは、複数の通信端末に対して優先順位を設定し、復号音声信号が有音であると判定した通信端末が複数である場合にはその中で優先順位が最も高い１つの通信端末を指定し、その指定した通信端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信するように構成したものである。
【００２７】
この発明に係る多地点制御装置の音声符号化伝送システムは、復号音声信号が有音であると判定した通信端末が複数である場合には先に音声信号を受信した先着順に優先順位を設定するように構成したものである。
【００２８】
この発明に係る多地点制御装置の音声符号化伝送システムは、あらかじめ１つの特定の通信端末を優先的に指定して優先順位を設定するように構成したものである。
【００２９】
この発明に係る多地点制御装置の音声符号化伝送システムにおける一部の音声符号化パラメータは、ピッチ周期情報を担うパラメータであるように構成したものである。
【００３０】
この発明に係る多地点制御装置の音声符号化伝送システムにおける一部の音声符号化パラメータは、スペクトル包絡情報を担うパラメータであるように構成したものである。
【００３１】
この発明に係る多地点制御装置の音声符号化伝送システムは、複数の通信端末から受信した復号音声信号が有音であると判定した場合には、有音の通信端末の数に応じて音声符号化パラメータのフレームのビット配分を適応的に設定するように構成したものである。
【００３２】
この発明に係る多地点制御装置の音声符号化伝送システムは、複数の通信端末のうち優先順位が第１位の通信端末および第２位の通信端末から受信した復号音声信号が有音であると判定した場合には、第１位の通信端末の復号音声信号におけるピッチ周期情報およびスペクトル包絡情報をそれぞれ担う２つの音声符号化パラメータ並びに第２位の通信端末の復号音声信号におけるピッチ周期情報を担う音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、第１位および第２位の通信端末の復号音声信号において音声再符号化処理を施す音声符号化パラメータについては伝送レート制御機能に基づく縮退の量子化ビット制御を行った後に各通信端末に送信するように構成したものである。
【００３３】
この発明に係る多地点制御装置の音声符号化伝送システムは、複数の通信端末のうち主音声の通信端末および副音声の通信端末から受信した復号音声信号が有音であると判定した場合には、主音声の通信端末の復号音声信号におけるピッチ周期情報およびスペクトル包絡情報をそれぞれ担う２つの音声符号化パラメータ並びに副音声の通信端末の復号音声信号におけるピッチ周期情報を担う音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、主音声および副音声の通信端末の復号音声信号において音声再符号化処理を施す音声符号化パラメータについては伝送レート制御機能に基づく縮退の量子化ビット制御を行った後に各通信端末に送信するように構成したものである。
【００３４】
この発明に係る多地点制御装置の音声符号化伝送システムにおける縮退の量子化ビット制御が行われる符号化パラメータは、利得符号帳であるように構成したものである。
【００３５】
この発明に係る多地点制御装置の音声符号化伝送システムにおける縮退の量子化ビット制御が行われる符号化パラメータは、雑音符号帳であるように構成したものである。
【００３６】
この発明に係る多地点制御装置の音声符号化伝送システムは、音声符号化パラメータについて音声再符号化処理を施した後に各通信端末に送信するかまたは音声再符号化処理を施すことなく各通信端末に送信するかを決定する符号化モード情報を所定ビット数からなる符号化パラメータのフレームに含めるように構成したものである。
【００３７】
【発明の実施の形態】
以下、この発明の実施の一形態について図を参照して説明する。
実施の形態１．
図１はこの発明の実施の形態１による音声符号化伝送システムを適用した多地点制御装置の構成を示すブロック図である。図１において、２０１は交換機、２０２は回線インタフェース部、２０３は音声復号処理部、２０４は音声検出部、２０５は雑音抑圧処理部、２０６は音声加算部、２０７は分配処理部、２０８は自端末音声減算部、２０９は自動利得制御部、２１０は音声再符号化処理部、２１１は音声符号化パラメータ制御部である。
【００３８】
図１の構成の基本的な機能は以下の通りである。
各会議端末（通信端末）からの符号化音声が交換機２０１を介して回線インタフェース部２０２で受信される。受信された符号化音声は、音声復号処理部２０３によりチャネル毎に音声信号に復号される。復号された音声信号は、会議端末からのそれぞれについて、音声検出部２０４により有音／無音の状態を検出される。この検出結果は、音声加算部２０６における加算対象端末の決定、雑音抑圧処理部２０５の雑音抑圧のための重み付け、自動利得制御部２０９による音声レベルの自動調整のために用いられる。また、音声復号処理部２０３で復号された復号音声信号および抽出された音声パラメータは、音声符号化パラメータ制御部２１１に入力される。
【００３９】
会議参加者数Ｎが多くなると、雑音が大きくなり通話品質の劣化を招くので、雑音抑圧処理部２０５では、無音状態にあるチャネルから入力されてくる雑音を小さくする。そのために、各会議参加者の有音／無音の検出結果に従って有音チャネルと無音チャネルのそれぞれに対して会議参加者数Ｎに基づく各重み係数を決定し、この重み係数をそれぞれの復号音声信号に掛ける雑音制御処理を行う。音声加算部２０６では、音声検出部２０４の検出結果を参照して、加算すべきチャネルの雑音制御処理された復号音声信号を加算する。加算された復号音声信号は、分配処理部２０７において、各チャネルに再分配される。
【００４０】
自端末音声減算部２０８では、自端末の音声信号の回りこみによるエコーを抑圧し、聴感上のわずらわしさを解消するため、加算された復号音声信号から自端末の信号を減算する。また、多地点の電話会議システムにおいては、多人数の音声を加算することによる飽和歪みを起こす可能性があるため、自動利得制御部２０９にて、飽和歪みを防ぎ、なおかつ個人や全体の音声レベルを調整する。自動利得制御部２０９の出力信号は、音声再符号化処理部２１０において符号化されて回線に出力される。
【００４１】
図２は図１の音声符号化パラメータ制御部２１１のさらに詳細な構成を示すブロック図であり、図において、２１２はセレクタ、２１３は分配処理部、２１４は発言者選択部である。
図２の構成の基本的な機能は以下の通りである。
発言者選択部２１４は、各チャネルの音声検出部２０４の検出結果を集計して、発言者が唯一であると見なせる場合に、その発言者の情報（チャネル番号など）をセレクタ２１２および音声再符号化処理部２１０に出力する。セレクタ２１２は、発言者選択部２１４の選択結果に応じて、音声復号処理部２０３にて抽出された各チャネルの音声符号化パラメータから音声再符号化処理部２１０にパススルーする符号化パラメータを選択する。分配処理部２１３は、セレクタ２１２で選択された符号化パラメータを、各チャネルの音声再符号化処理部２１０に再分配する。
【００４２】
図３は、図１の音声復号処理部２０３のさらに詳細な構成を示すブロック図であり、図において、１２６は多重分離部、１３１ａは適応符号帳、１３１ｂは利得復号部、１３２は復号利得ＭＡ予測部、１３３は代数符号復号部、１３４はピッチプレフィルタ、１３５はＬＳＰ復号部、１３６はＬＳＰ内挿部、１３７はＬＳＰ・ＬＰＣ変換部、１２７，１２８は制御増幅部、１２９は加算器、１３８は合成フィルタ、１３９はポストフィルタである。
【００４３】
図４は、図１の音声再符号化処理部２１０の構成を示すブロック図であり、上述した図１６の音声符号化装置に対応する。図において、１０４は線形予測分析処理部、１０５はＬＳＰ（Ｌｉｎｅ　Ｓｐｅｃｔｒａｌ　Ｐａｉｒ）量子化処理部、１０６はＬＳＰ量子化符号帳、１０８は多重化部、１０９は逆量子化処理部、１１０は適応符号帳、１１１は代数符号帳、１１２，１１６は加算器、１１４は利得量子化符号帳、１１５は合成フィルタ、１１７は聴覚重み付フィルタ、１１８は歪最小化部、１１９，１２０，１２２は切替スイッチ、１１３ａ，１１３ｂは利得制御増幅部、１２１はピッチプレフィルタ、１２５は利得ＭＡ予測部、１２４はＬＰＣ（Ｌｉｎｅａｒ　Ｐｒｅｄｉｃｔｉｖｅ　Ｃｏｄｉｎｇ）・ＬＳＰ変換部である。
【００４４】
次に、動作について説明する。なお、説明のために用いる音声符号化方式については、ＩＴＵ−Ｔ勧告Ｇ．７２９　ＣＳ−ＡＣＥＬＰ方式に基づく。
音声復号処理部２０３では、伝送されてきた音声符号化データを基に復号処理を実行する。それと共に、発言者の音声に固有の特徴量を示すパラメータ、すなわち、声帯振動波の繰返し周期を表現するピッチ周期情報である適応符号帳インデックスと、声道情報を表現するスペクトル包絡情報であるＬＳＰ符号帳インデックスとを音声符号化パラメータ制御部２１１に出力する。この場合において、音声検出部２０４の判定結果により、発言者が唯一に決まった場合は、その発言者の端末装置に割り当てたチャネルから受信した符号化パラメータのうち、適応符号帳インデックスとＬＳＰ符号帳インデックスとをそのまま音声再符号化処理部２１０へパススルーする。
【００４５】
このとき音声再符号化処理部２１０では、切替スイッチ１１９，１２０，１２２を各接点１１９Ａ，１２０Ａ，１２２Ａ側に接続し、パススルーされた符号化パラメータについては、再符号化処理すなわち符号帳探索処理を行わずに、そのまま多重化部１０８に送る。その他のパラメータ（図４においては、代数符号帳インデックスおよび利得符号帳インデックス）については、音声加算された復号音声信号に基づいて再符号化処理を行い、歪最小化部１１８で最小自乗誤差の探索により最適な量子化値を抽出して多重化部１０８に送る。多重化部１０８ではこれらパラメータを多重化して回線インタフェース部２０２に出力する。
【００４６】
ここで、音声検出部２０４について、例えば、あるチャネルで発言中、他のチャネルから割り込んで発言があった場合は、そのチャネルにおいて音声の立ち上がりを検出しても、即座に発言中のチャネルからの切り替えは行わず、所定の待ち時間を持たせて、発言中のチャネルからの切り替えを遅らせることにより、相槌、咳払いなど、比較的短区間の、重要度の低い発言での切り替えを防ぐ。これが比較的長い割り込み発言であった場合には、スイッチングの遅れが発生するが、音声加算部２０６の出力（加算復号音声信号）には反映されているため、音声再符号化処理部２１０で合わせて符号化されるので、頭切れなどの心配はない。但し、音声信号を復号するための重要な情報を含む符号化パラメータである適応符号帳インデックス（ピッチ周期情報）およびＬＳＰ符号帳インデックス（スペクトル包絡情報）が一部欠けているため、元々の発言者に比べて割り込み発言の品質は劣化している。しかし、会議運営上、割り込み発言の重要性は低いケースとなることが多く、また、若干ではあるが代数符号帳インデックスの中にも周波数成分に関する情報が漏れているので、実運用上では、例えば、耳障りな音を復号するような、割込み発言者の音声が異常になるようなことはなく、この劣化は比較的気にならないと言える。
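The hold-off behaviour of paragraph [0046] (do not switch the pass-through channel on a short interruption such as a cough or a back-channel response) can be sketched as a small state machine. The class name, the frame-count timer, and the rule applied once the wait time expires are assumptions for illustration; the text above specifies only that switching is delayed by a predetermined wait time.

```python
class SpeakerSelector:
    """Keeps the current pass-through channel until another channel has
    stayed voiced for hold_frames consecutive frames."""

    def __init__(self, hold_frames):
        self.hold_frames = hold_frames
        self.current = None   # channel whose parameters are passed through
        self.count = 0        # frames of ongoing interruption

    def update(self, voiced):
        """voiced: set of channels flagged voiced in this frame.
        Returns the pass-through channel, or None for full re-encoding."""
        others = voiced - {self.current}
        if self.current is None:
            # Nobody holds the floor: take it only if exactly one channel is voiced.
            self.current = next(iter(voiced)) if len(voiced) == 1 else None
            self.count = 0
        elif not others:
            # No interruption, or it ended before the wait time: keep the speaker.
            self.count = 0
        else:
            self.count += 1
            if self.count >= self.hold_frames:
                # Long interruption: hand over only if a unique speaker remains.
                self.current = next(iter(voiced)) if len(voiced) == 1 else None
                self.count = 0
        return self.current
```

With `hold_frames=3`, an interjection lasting one or two frames leaves the pass-through channel unchanged, while the interjection itself still reaches listeners through the summed signal that is re-encoded.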
【００４７】
また、音声検出部２０４において、話者が唯一に決まらない場合は、音声再符号化処理部２１０の、スイッチ１１９，１２０，１２２をそれぞれ１１９Ｂ，１２０Ｂ，１２２Ｂ側に接続することによって、パススルー動作は行わない。したがって、音声再符号化処理部２１０内において、話者入力のある各チャネルの復号音声について所定の再符号化処理が行われる。
【００４８】
以上のように、この実施の形態１によれば、指定した会議端末の復号音声信号における複数種類の音声符号化パラメータのうち一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、指定した会議端末の復号音声信号における複数種類の音声符号化パラメータのうち、他の音声符号化パラメータ、および他の通信端末の復号音声信号における複数種類の音声符号化パラメータについては、音声再符号化処理を施した後に各通信端末に送信するので、１対多または多対多の電話会議で想定される複数話者の同時発言や、高ノイズ環境下での発言、咳払い、相槌などの短区間の発言に対しても高品質な音声伝送を実現できるという効果が得られる。
【００４９】
すなわち、一部の音声符号化パラメータについてパススルーすることにより、メインの話者（指定した会議端末の話者）に関しては、主要な音声符号化パラメータについては、復号・再符号化を繰り返すことによる劣化を受けずに維持されるため、メイン話者については、パススルー時の音声品質に近いものが再現できるという効果が得られる。また、符号化パラメータが常に音声再符号化処理部を経由しているため、頻繁なスイッチングが発生しても、音声の不連続感は解消されるという効果が得られる。また、品質は常に一定に保たれるため、音声品質が揺らぐことによる不自然感が解消されるという効果が得られる。
【００５０】
また、この実施の形態１によれば、音声再符号化処理を施すことなくパススルーする一部の音声符号化パラメータは、ピッチ周期情報を担うパラメータであるので、発言者の音声に固有な声帯振動数の繰り返し周期を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果が得られる。
【００５１】
また、この実施の形態１によれば、音声再符号化処理を施すことなくパススルーする一部の音声符号化パラメータは、スペクトル包絡情報を担うパラメータであるので、発言者の口腔や鼻腔等の声道を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果が得られる。
【００５２】
実施の形態２．
実施の形態２においては、実施の形態１の図１乃至図３に相当する構成は、ほぼ同じである。ただし、この実施の形態２では、音声復号処理部２０３から音声符号化パラメータ制御部２１１にパススルーされる適応符号帳インデックスは、優先順位が第１位の会議端末の話者からの符号化パラメータのうちの第１位の適応符号帳インデックス、および、優先順位が第２位の会議端末の話者からの符号化パラメータのうちの第２位の適応符号帳インデックスである。図５は、この発明の実施の形態２における音声再符号化処理部の構成を示すブロック図であり、図において、１３０は切替スイッチ、１１０ｂは第２の適応符号帳、１１３ｃは利得制御増幅部、１４０は符号化レート制御部である。他の構成要素は、図４に示した実施の形態１における音声再符号化処理部の構成要素と同じものであるので同一符号を付し、原則としてその説明を省略する。
【００５３】
図６は、交換機２０１に接続される電話機すなわち会議端末の構成を示すブロック図であり、図において、５００は回線インタフェース部、５０１は音声復号処理部、５０２はＤ／Ａコンバータ、５０３はスピーカである。
【００５４】
図７は、図６の会議端末における音声復号処理部５０１のさらに詳細な構成を示すブロック図であり、図において、５０４は多重分離部、５０５は符号化モード解読部である。他の構成要素は、図５に含まれている一部の構成要素と同じものであるので同一符号を付し、原則としてその説明を省略する。
【００５５】
図８は、１話者発言の場合、２話者発言の場合、および３話者発言の場合における符号化パラメータのフレーム構成例を示す説明図である。
【００５６】
次に、動作について説明する。なお、実施の形態１の場合と同様に、説明のために用いる音声符号化方式については、ＩＴＵ−Ｔ勧告Ｇ．７２９　ＣＳ−ＡＣＥＬＰ方式に基づく。
スイッチ１３０は、音声符号化パラメータ制御部２１１からの第２位の適応符号帳インデックスを入力するか入力しないか切り替える。符号化レート制御部１４０は、音声符号化パラメータ制御部２１１と音声再符号化処理部２１０との間に設けられ、音声符号化パラメータ制御部２１１からの発言者選択情報に基づき、パラメータの符号化レートを決定する制御信号を代数符号帳１１１および利得量子化符号帳１１４に出力するとともに、符号化モード情報をスイッチ１１９，１２０，１３０の切替制御信号として与えるとともに、多重化部１０８に出力する。多重分離部５０４は、図５の多重化部１０８で多重化された音声符号化パラメータを分離する。符号化モード解読部５０５は、多重分離部５０４で分離された符号化モード情報（図８のフレームにおける最後の１ビットの値）を解読する。
【００５７】
以下、図１〜図９を参照して、全体的な動作を説明する。
いま、話者が唯一である場合には、図１の音声検出部２０４の検出結果により、図２の発言者選択部２１４は「１話者発言」であることを決定して、その決定内容を示す発言者選択情報をセレクタ２１２に出力するとともに、音声再符号化処理部２１０に転送する。セレクタ２１２は、その１話者の会議端末に対応しているチャネルから受信した符号化パラメータを選択して分配処理部２１３に出力する。分配処理部２１３は、その符号化パラメータ（ＬＳＰ符号化インデックスおよび第１位の適応符号化インデックス）をそのまま音声再符号化処理部２１０へパススルーする。このときの動作は、スイッチ１３０を１３０Ｂ側に接続する。すなわち、第２位の適応符号帳インデックスの入力をオフにする。また、符号化レート制御部１４０から出力されるビットレートが決定する。他の動作は実施の形態１の場合と全く同一である。
【００５８】
また、話者が１人もいない場合、あるいは３者以上が同時に発言した場合は、音声検出部２０４の検出結果により、発言者選択部２１４は「３話者以上発言」であることを決定して、その決定内容を示す発言者選択情報をセレクタ２１２に出力するとともに、音声再符号化処理部２１０に転送する。この場合には、セレクタ２１２は符号化パラメータを選択しない。音声再符号化処理部２１０は、発言者選択情報に応じてタンデム接続による再符号化処理を実行する。すなわち、スイッチ１１９，１２０，１３０をそれぞれ接点１１９Ｂ，１２０Ｂ，１３０Ｂ側に接続して、ＬＳＰ符号化インデックスのパススルーを行わず、第１位および第２位の適応符号帳インデックスの入力をオフにする。そして、全符号化パラメータについて、音声加算された信号に基づいて再符号化処理を行い、最適な量子化値を探索して多重化部１０８に送る。もっとも、話者が１人もいない場合には、符号化パラメータが存在しないので、再符号化処理を行うことはない。
【００５９】
一方、２話者が同時発言した場合には、発言者選択部２１４は「２話者発言」であることを決定して、その決定内容を示す発言者選択情報をセレクタ２１２に出力するとともに、音声再符号化処理部２１０に転送する。セレクタ２１２は、その２話者の会議端末に対応しているチャネルから受信した符号化パラメータを選択して分配処理部２１３に出力する。分配処理部２１３は、その符号化パラメータをそのまま音声再符号化処理部２１０へパススルーする。音声再符号化処理部２１０は、第１の話者のＬＳＰ符号化インデックスおよび第１位の適応符号化インデックスをパススルーするとともに、第２の話者の第２位の適応符号化インデックス（ピッチ周期情報）をパススルーする。
【００６０】
すなわち、スイッチ１１９，１２０，１３０をそれぞれ接点１１９Ａ，１２０Ａ，１３０Ａ側に接続して、ＬＳＰ符号化インデックス、第１位および第２位の適応符号帳インデックスを多重化部１０８に送る。他の符号化パラメータである代数符号化インデックスおよび利得符号化インデックスについては、音声加算された信号に基づいて再符号化処理を行い、最適な量子化値を探索して多重化部１０８に送る。多重化部１０８は、これら複数の符号化パラメータを多重化して回線インタフェース部２０２に出力する。
【００６１】
同時に、符号化レート制御部１４０では、音声符号化パラメータ制御部２１１からの発言者選択情報に応じて、代数符号化インデックスおよび利得符号化インデックスに割り当てられるビット数を伝送速度に見合うように調整して、それぞれ代数符号帳１１１および利得量子化符号帳１１４に出力するとともに、後述する符号化モード情報を生成して、スイッチ１１９，１２０，１３０および多重化部１０８に出力する。
【００６２】
この場合において、符号化モードを２モード設定することとし、「１話者発言」および「３話者以上発言」の符号化モードを「モード０」とし、「２話者発言」の符号化モードを「モード１」とする。すなわち、「０」および「１」の符号化モード情報を設定する。したがって、この符号化モード情報を伝送するには１ビットを必要とする。伝送速度が８キロビット／秒の場合の各モードにおける符号化パラメータの１フレーム（８０ビット）のビット割り当ての例を図８および図９に示す。
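Paragraphs [0057] to [0062] map the number of simultaneous speakers to a coding mode, a set of passed-through parameters, and switch positions. The mapping can be summarized as below; the function name, dictionary keys, and parameter labels are hypothetical, while the mode values and the contact assignments of switches 119, 120 and 130 follow the text.

```python
def coding_mode(n_voiced):
    """Return the mode bit, the parameters passed through without re-encoding,
    and the contacts selected on switches 119, 120 and 130."""
    if n_voiced == 1:
        # One speaker: LSP index and first adaptive codebook index bypass the search.
        return {"mode_bit": 0,
                "pass_through": ("lsp", "adaptive_1"),
                "switches": {"119": "A", "120": "A", "130": "B"}}
    if n_voiced == 2:
        # Two speakers: the second speaker's pitch index is passed through as well.
        return {"mode_bit": 1,
                "pass_through": ("lsp", "adaptive_1", "adaptive_2"),
                "switches": {"119": "A", "120": "A", "130": "A"}}
    # Three or more speakers: every parameter is re-encoded from the summed signal.
    # (With no speaker at all the text notes there is nothing to re-encode.)
    return {"mode_bit": 0,
            "pass_through": (),
            "switches": {"119": "B", "120": "B", "130": "B"}}
```

Because the one-speaker and three-or-more-speaker cases share the same frame layout, a single mode bit is enough for the receiving terminal to decode the frame.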
【００６３】
「モード０」においては、ＩＴＵ−Ｔ勧告Ｇ．７２９に示されているビット割り当てとほぼ同じである。すなわち、図８に示すように、「１話者発言」の場合には、ＬＳＰ符号帳インデックスおよび適応符号帳インデックスの符号化パラメータをパススルーし、「３話者以上発言」の場合には、全ての符号化パラメータをパススルーしない。ただし、伝送速度が８キロビット／秒であるので、標準方式では符号化モード情報を送信する余地がないため、図９に示すように、標準方式でパリティビットとして設定されている１ビットを符号化モード情報に転用して送信する。
【００６４】
一方、「モード１」においては、第２位の適応符号化インデックス（１３ビット）を送信する必要があるので、その１３ビット分だけ他の符号化パラメータのビット割り当てを減らす（これを縮退という）必要がある。したがって、図９および図８に示すように、「２話者発言」の場合には、代数符号帳インデックスを３４ビットから２２ビットに縮退し、利得符号帳インデックスを１４ビットから１３ビットに縮退する。そして、第１話者のＬＳＰ符号帳インデックスおよび適応符号帳インデックス、並びに、第２話者の第２位の適応符号帳インデックスからなる符号化パラメータをパススルーする。
【００６５】
なお、代数符号帳インデックスの縮退方式については、第１サブフレームおよび第２サブフレームともに、ＩＴＵ−Ｔ勧告Ｇ．７２９　Ａｎｎｅｘ　Ｄ　第Ｄ．５．８節に記述されている方式（１１ビット量子化）を用いて実現する。また、利得符号帳インデックスの縮退方式については、第１サブフレームはＩＴＵ−Ｔ勧告Ｇ．７２９本体の第３．９節に記載されている方式（７ビット量子化）を用いて、第２サブフレームはＩＴＵ−Ｔ勧告Ｇ．７２９　Ａｎｎｅｘ　Ｄ　第Ｄ．５．９節に記述されている方式（６ビット量子化）を用いて実現する。
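The bit budget of paragraphs [0062] to [0065] can be checked with a few lines of arithmetic. In this sketch the 18-bit LSP width and the 13-bit (8 + 5) first adaptive codebook width are taken from the standard G.729 allocation rather than from the text above, and the field order is an assumption, except that the mode bit comes last as stated in paragraph [0056]. The `pack` helper is hypothetical.

```python
# (field name, width in bits) for one 10 ms frame at 8 kbit/s
MODE0 = [("lsp", 18), ("adaptive_1", 13),
         ("algebraic", 17 + 17), ("gain", 7 + 7), ("mode", 1)]
MODE1 = [("lsp", 18), ("adaptive_1", 13), ("adaptive_2", 13),
         ("algebraic", 11 + 11), ("gain", 7 + 6), ("mode", 1)]

def frame_bits(layout):
    """Total number of bits in a frame layout."""
    return sum(width for _, width in layout)

def pack(layout, values):
    """Pack the field values MSB-first into one integer; the mode bit ends up
    in the least significant position, i.e. last in the frame."""
    word = 0
    for name, width in layout:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | v
    return word
```

Both layouts come to 80 bits: the 13 bits needed for the second speaker's pitch index in mode 1 are recovered from the algebraic codebook (34 to 22 bits) and the gain codebook (14 to 13 bits).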
【００６６】
多重化部１０８は、これらの符号化パラメータを多重化して、図８に示す１フレームごとに回線インタフェース部２０２および交換機２０１を介して、各会議端末に送信する。そして、図６に示した会議端末の受信部において、音声復号処理部５０１は、交換機２０１および回線インタフェース部５００を介して受信した符号化パラメータのフレームを復号する。すなわち、図７に示した音声復号処理部５０１において、多重分離部５０４は、１フレームの符号化パラメータを各符号化パラメータに分離し、符号化モード情報を符号化モード解読部５０５に出力する。符号化モード解読部５０５は、その符号化モード情報に基づいて、スイッチ１３０に切替制御信号を与え、ビット割当情報を代数符号帳１１１および利得量子化符号帳１１４に与える。
【００６７】
したがって、符号化モード情報が０のフレームを受信したときは、スイッチ１３０は切替制御信号によって接点１３０Ｂ側に接続され、第２位の適応符号帳インデックスの入力をオフにする。また、代数符号帳１１１は、ビット割当情報に応じて、ＩＴＵ−Ｔ勧告Ｇ．７２９本体の第４．１節に示された復号方式を用いて代数符号帳インデックスを復号する。また、利得量子化符号帳１１４も、ビット割当情報に応じて、ＩＴＵ−Ｔ勧告Ｇ．７２９本体の第４．１節に示された復号方式を用いて利得符号帳インデックスを復号する。
【００６８】
一方、符号化モード情報が１のフレームを受信したときは、スイッチ１３０は切替制御信号によって接点１３０Ａ側に接続され、第２位の適応符号帳インデックスの入力をオンにしてその復号を開始する。また、代数符号帳１１１は、ビット割当情報に応じて、第１サブフレームについては、ＩＴＵ−Ｔ勧告Ｇ．７２９本体の第４．１節に示された復号方式を用いて代数符号帳インデックスを復号し、第２サブフレームについては、同勧告Ｇ．７２９　Ａｎｎｅｘ　Ｄ．６章に示された復号方式を用いて代数符号帳インデックスを復号する。次に、第２位の適応符号帳インデックスを復号して得られた第２話者のピッチ周期成分、第１位の適応符号帳インデックスを復号して得られた第１話者のピッチ周期成分、および代数符号帳インデックスを復号して得られた雑音成分を加算器１１２で加算して、励振信号として合成フィルタ１１５に出力する。合成フィルタ１１５は、この励振信号に基づいて声道情報を畳み込み、復号音声を得る。
【００６９】
以上のように、この実施の形態２によれば、複数の会議端末から受信した復号音声信号が有音であると判定した場合には、有音の会議端末の数に応じて符号化パラメータのフレームのビット配分を適応的に設定するので、伝送速度が一定あるいはチャネルあたりの伝送速度に制限を受ける伝送網に対しても適用できるという効果が得られる。
【００７０】
また、この実施の形態２によれば、複数の会議端末のうち優先順位が第１位の会議端末の第１話者および第２位の会議端末の第２話者から受信した復号音声信号が有音であると判定した場合には、第１話者の復号音声信号におけるピッチ周期情報である第１位の適応符号帳インデックス、およびスペクトル包絡情報であるＬＳＰ符号帳インデックスからなる音声符号化パラメータ、並びに、第２話者の復号音声信号におけるピッチ周期情報である第２位の適応符号帳インデックスからなる音声符号化パラメータについては、音声再符号化処理を施すことなく各会議端末にパススルーして送信するので、２話者が同時に発言した場合でも、比較的良好な音声品質の伝送が可能になるという効果が得られる。
【００７１】
また、この実施の形態２によれば、音声再符号化処理を施す音声符号化パラメータについては、伝送レート制御機能に基づく縮退の量子化ビット制御を行うので、音声品質の劣化に対して影響の少ない符号化パラメータを縮退させて、第２話者の復号音声信号におけるピッチ周期情報である第２位の適応符号帳インデックスからなる音声符号化パラメータについて音声再符号化処理を施すことなく各会議端末にパススルーして送信するので、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果が得られる。
【００７２】
また、この実施の形態２によれば、縮退の量子化ビット制御が行われる符号化パラメータを励振利得である利得符号帳インデックスとしたので、音声品質の劣化に対して影響の少ない利得符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果が得られる。
【００７３】
また、この実施の形態２によれば、縮退の量子化ビット制御が行われる符号化パラメータを雑音符号帳である代数符号帳インデックスとしたので、音声品質の劣化に対して影響の少ない代数符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果が得られる。
【００７４】
また、この実施の形態２によれば、音声符号化パラメータについて音声再符号化処理を施した後に各会議端末に送信するか、または音声再符号化処理を施すことなく各会議端末にパススルーして送信するかを決定する１ビットの符号化モード情報を、８０ビットからなる符号化パラメータのフレームに含めるので、伝送レートに影響を与えることなく、特定の符号化パラメータをパススルーするか否かの情報を符号化パラメータのフレームに含めることができるという効果が得られる。
【００７５】
実施の形態３．
図１０はこの発明の実施の形態３における多地点制御装置の構成を示すブロック図であり、図１１は実施の形態３における音声符号化パラメータ制御部のブロック図である。図１に相当する図１０の部分には同一符号を付し、図２に相当する図１１の部分には同一符号を付し、原則としてその説明を省略する。図１０および図１１において、２１５は先着チャネル判定部で、発言のあったチャネルのうち音声検出部２０４が最初に検出したチャネルの検出結果に応答して音声符号化パラメータ制御部２１１のセレクタ２１２を制御するものである。
【００７６】
次に、動作について説明する。
発言者が競合した場合、先着チャネル判定部２１５は、先に発言のあったチャネルについて優先話者と判定し、その判定結果をセレクタ２１２に与える。セレクタ２１２は、その先着チャネルの符号化パラメータのみを音声再符号化処理部２１０へパススルーする。それ以降においては、音声再符号化処理部２１０は実施の形態１と同様に動作する。
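The first-come rule of paragraph [0076] amounts to picking, among the voiced channels, the one whose talk spurt began earliest. A minimal sketch, assuming the voice detector supplies an onset frame number per channel; the function name and the tie-break on channel identifier are illustrative choices, not part of the description.

```python
def first_come_channel(onset_frame, voiced):
    """onset_frame: {channel: frame number at which its current talk spurt began}
    voiced: set of channels currently judged voiced.
    Returns the voiced channel that started speaking first, or None."""
    candidates = [ch for ch in voiced if ch in onset_frame]
    if not candidates:
        return None
    # Earliest onset wins; equal onsets are resolved by channel identifier.
    return min(candidates, key=lambda ch: (onset_frame[ch], ch))
```

The selected channel is the only one whose parameters the selector 212 would pass through; later arrivals are carried only in the re-encoded summed signal.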
【００７７】
以上のように、この実施の形態３によれば、複数の会議端末に対して優先順位を設定し、復号音声信号が有音であると判定した会議端末が複数である場合にはその中で優先順位が最も高い１つの会議端末を指定し、その指定した会議端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信するので、複数の会議端末の話者が同時に発言した場合でも、会議が混乱するのを回避できるという効果が得られる。
【００７８】
また、この実施の形態３によれば、復号音声信号が有音であると判定した会議端末が複数である場合には、先に音声信号を受信した先着順に優先順位を設定するので、無用な咳払い、相槌やルールを逸脱した割込み発言によって、会議の進行が乱されることがないという効果が得られる。
【００７９】
実施の形態４．
図１２はこの発明の実施の形態４における多地点制御装置およびその他の構成を示すブロック図であり、図１３は実施の形態４における音声符号化パラメータ制御部のブロック図である。図１に相当する図１２の部分には同一符号を付し、図２に相当する図１３の部分には同一符号を付し、原則としてその説明を省略する。図１２において、２２４はＭＣＵ制御部であり、インターネットなどを通じて優先話者とする特定のチャネルを優先話者判定部２１６に登録する。図１３において、２１６は優先話者判定部で、電話会議の特定のチャネルを優先話者として予め登録しておき、復号音声信号と音声検出結果に応じて有音と判定されたチャネルが登録された特定のチャネルであった場合、そのチャネルの話者を優先話者として判定する。
【００８０】
次に、動作について説明する。
例えば、会議主催者は、会議設定時にインターネットなどによりＭＣＵ制御部２２４を介して、会議主催者のチャネルあるいは指名された司会進行役のチャネルなどを、優先話者としてＭＣＵ制御部２２４に登録しておく。次に、発言者が競合した場合において、登録されている優先話者、すなわち会議主催者のチャネルや指名された司会進行役のチャネルにおいて発言があったとき、優先話者判定部２１６は、復号音声信号、音声検出結果および登録されたデータから、優先話者のチャネルを検知し、音声符号化パラメータ制御部２１１のセレクタ２１２が該当するチャネルの符号化パラメータのみを音声再符号化処理部２１０へパススルーするように制御する。それ以降の動作については、実施の形態１の場合と同様である。
【００８１】
以上のように、この実施の形態４によれば、あらかじめ１つの特定の会議端末を優先的に指定し、復号音声信号が有音であると判定した複数の会議端末の中に特定の会議端末が含まれている場合には、その特定の会議端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各会議端末にパススルーして送信するので、会議主催者や指名された司会進行役の発言を検知して、発言者が競合した場合でも、円滑な会議進行が可能になるという効果が得られる。
【００８２】
実施の形態５．
図１４および図１５は、この発明の実施の形態５における符号化パラメータのフレーム構成を示す説明図である。なお、この実施の形態５において、多地点制御装置、多地点制御装置における音声再符号化処理部、および会議端末における音声復号処理部の構成は、それぞれ図１、図２および図５、並びに図７に示した実施の形態２における構成と同じである。また、多地点制御装置における音声符号化パラメータ制御部の構成は、図１３に示した実施の形態４における構成と同じである。
【００８３】
次に、動作について説明する。説明のために用いる音声符号化方式についても、実施の形態２と同様に、ＩＴＵ−Ｔ勧告Ｇ．７２９のＣＳ−ＡＣＥＬＰ方式に基づく。図１において、音声検出部２０４の検出結果により、図２における音声符号化パラメータ制御部２１１の発言者選択部２１４の選択が「１話者発言」、「２話者発言」、または「３話者以上発言」に決定した場合には、実施の形態２の場合と同様に、その決定に基づく発言者選択情報をセレクタ２１２に与えるとともに、図５における符号化レート制御部１４０に出力する。符号化レート制御部１４０は、この発言者選択情報に応じて、「モード０」または「モード１」を示す１ビットの符号化モード情報を生成して、音声再符号化処理部２１０に出力する。音声再符号化処理部２１０は、この符号化モード情報に基づいて、多重化部１０８から会議端末に出力する符号化パラメータのフレームを構成する。
【００８４】
すなわち、「モード１」においては、第２位の適応符号化インデックス（８ビット）を送信する必要があるので、その８ビット分だけ他の符号化パラメータを縮退する必要がある。したがって、図１４および図１５に示すように、「２話者発言」の場合には、代数符号帳インデックスを３４ビットから２８ビットに縮退し、利得符号帳インデックスを１４ビットから１２ビットに縮退する。そして、第１話者のＬＳＰ符号帳インデックスおよび適応符号帳インデックス、並びに、第２話者の第２位の適応符号帳インデックスからなる符号化パラメータをパススルーする。
【００８５】
なお、代数符号帳インデックスの縮退方式については、第１サブフレームについてはＩＴＵ−Ｔ勧告Ｇ．７２９　Ａｎｎｅｘ　Ｄ　第Ｄ．５．８節に記述されている方式（１１ビット量子化）を用い、第２サブフレームについてはＩＴＵ−Ｔ勧告Ｇ．７２９本体第３．８節に記述されている方式（１７ビット量子化）を用いて実現する。また、利得符号帳インデックスの縮退方式については、第１サブフレームおよび第２サブフレームともにＩＴＵ−Ｔ勧告Ｇ．７２９　Ａｎｎｅｘ　Ｄ　第Ｄ．５．９節に記述されている方式（６ビット量子化）を用いて実現する。
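The per-subframe figures in paragraphs [0084] and [0085] can be cross-checked against the 8 bits needed for the second adaptive codebook index. The sketch below uses only numbers quoted in the text (17-bit and 11-bit algebraic quantization, 7-bit and 6-bit gain quantization); the dictionary and function names are hypothetical.

```python
# bits per (first subframe, second subframe)
STANDARD = {"algebraic": (17, 17), "gain": (7, 7)}        # 34 and 14 bits per frame
EMBODIMENT_5 = {"algebraic": (11, 17), "gain": (6, 6)}    # 28 and 12 bits per frame

def bits_freed(standard, reduced):
    """Bits released per frame by replacing the standard quantizers with reduced ones."""
    return sum(sum(standard[k]) - sum(reduced[k]) for k in standard)
```

Here `bits_freed(STANDARD, EMBODIMENT_5)` is 6 + 2 = 8, which matches the width of the second speaker's pitch index in this embodiment, so the frame stays at 80 bits.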
【００８６】
ところで、電話会議システムにおいて使用されるチャネルは、主音声と副音声で構成されている。この場合、主音声は発言者の音声であり、副音声は、例えば発言者の発言に対する同時通訳などが適用される。そこで、図１３における優先話者判定部２１６において、主音声を第１話者とし副音声を第２話者として順位付けを設定しておく。図１４に示すように、主音声の符号化パラメータであるＬＳＰ符号帳インデックスおよび第１位の適応符号帳インデックスについては、パススルーされた量子化パラメータをそのまま伝送するようにし、副音声の符号化パラメータである第２位の適応符号帳インデックスについては、パススルーされた量子化パラメータを縮退させて伝送する。
【００８７】
以上のように、この実施の形態５によれば、主音声および副音声の会議端末から受信した復号音声信号が有音であると判定した場合には、主音声の会議端末の復号音声信号におけるピッチ周期情報およびスペクトル包絡情報をそれぞれ担う２つの音声符号化パラメータ、並びに、副音声の会議端末の復号音声信号におけるピッチ周期情報を担う音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、音声再符号化処理を施す音声符号化パラメータについては伝送レート制御機能に基づく縮退の量子化ビット制御を行った後に各会議端末に送信するので、主音声と副音声とが同時に発言した場合でも、副音声の符号化パラメータを縮退させることで、伝送レートを維持しつつ主音声の音声品質の劣化を極力抑えて伝送できるという効果が得られる。
【００８８】
また、この実施の形態５によれば、実施の形態２と同様に、音声品質の劣化に対して影響の少ない利得符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果が得られる。また、音声品質の劣化に対して影響の少ない代数符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果が得られる。また、伝送レートに影響を与えることなく、特定の符号化パラメータをパススルーするか否かの符号化モード情報を符号化パラメータのフレームに含めることができるという効果が得られる。
【００８９】
さらに、上記実施の形態２乃至実施の形態５によれば、実施の形態１と同様に、１対多または多対多の電話会議で想定される複数話者の同時発言や、高ノイズ環境下での発言、咳払い、相槌などの短区間の発言に対しても高品質な音声伝送を実現できるという効果が得られる。また、発言者の音声に固有な声帯振動数の繰り返し周期を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果が得られる。また、発言者の口腔や鼻腔等の声道を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果が得られる。
【００９０】
なお、上記各実施の形態においては、通信端末として会議端末（電話機）を例に採ってこの発明を説明したが、通信端末の態様は会議端末に限定されない。例えば、複数種類の音声信号を異なるチャネルで同時に伝送する放送システムや有線放送システムにおいて、その音声信号を受信する複数の受信機を通信端末として適用し、受信機側で特定の１つのチャネル（例えば、主音声のチャネルまたは副音声のチャネル）を指定する構成にしてもよい。
【００９１】
【発明の効果】
以上のように、この発明によれば、多地点制御装置の音声符号化伝送システムを、各通信端末から受信した符号化音声信号を復号して復号音声信号を生成し、復号音声信号が有音である通信端末を判定し、有音であると判定した通信端末の１つを指定し、その指定した通信端末の復号音声信号における複数種類の音声符号化パラメータのうち一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、指定した通信端末の復号音声信号における複数種類の音声符号化パラメータのうち他の音声符号化パラメータおよび他の通信端末の復号音声信号における複数種類の音声符号化パラメータについては音声再符号化処理を施した後に各通信端末に送信するように構成したので、１対多または多対多の電話会議で想定される複数話者の同時発言や、高ノイズ環境下での発言、咳払い、相槌などの短区間の発言に対しても高品質な音声伝送を実現できるという効果がある。
【００９２】
この発明によれば、多地点制御装置の音声符号化伝送システムを、１つの通信端末からの復号音声信号のみが有音である場合には１つの通信端末を指定し、その指定した通信端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信するように構成したので、１対多または多対多の電話会議において高品質な音声伝送を実現できるという効果がある。
【００９３】
この発明によれば、多地点制御装置の音声符号化伝送システムを、複数の通信端末に対して優先順位を設定し、復号音声信号が有音であると判定した通信端末が複数である場合にはその中で優先順位が最も高い１つの通信端末を指定し、その指定した通信端末の復号音声信号における一部の音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信するように構成したので、複数の会議端末の話者が同時に発言した場合でも、会議が混乱するのを回避できるという効果がある。
【００９４】
この発明によれば、多地点制御装置の音声符号化伝送システムを、復号音声信号が有音であると判定した通信端末が複数である場合には先に音声信号を受信した先着順に優先順位を設定するように構成したので、無用な咳払い、相槌やルールを逸脱した割込み発言によって、会議の進行が乱されることがないという効果がある。
【００９５】
この発明によれば、多地点制御装置の音声符号化伝送システムを、あらかじめ１つの特定の通信端末を優先的に指定して優先順位を設定するように構成したので、会議主催者や指名された司会進行役の発言を検知して、発言者が競合した場合でも、円滑な会議進行が可能になるという効果がある。
【００９６】
この発明によれば、多地点制御装置の音声符号化伝送システムにおける一部の音声符号化パラメータを、ピッチ周期情報を担うパラメータであるように構成したので、発言者の音声に固有な声帯振動数の繰り返し周期を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果がある。
【００９７】
この発明によれば、多地点制御装置の音声符号化伝送システムにおける一部の音声符号化パラメータを、スペクトル包絡情報を担うパラメータであるように構成したので、発言者の口腔や鼻腔等の声道を表現する情報をそのまま伝送して、高品質な音声伝送を実現できるという効果がある。
【００９８】
この発明によれば、多地点制御装置の音声符号化伝送システムを、複数の通信端末から受信した復号音声信号が有音であると判定した場合には、有音の通信端末の数に応じて音声符号化パラメータのフレームのビット配分を適応的に設定するように構成したので、伝送速度が一定あるいはチャネルあたりの伝送速度に制限を受ける伝送網に対しても適用できるという効果がある。
【００９９】
この発明によれば、多地点制御装置の音声符号化伝送システムを、複数の通信端末のうち優先順位が第１位の通信端末および第２位の通信端末から受信した復号音声信号が有音であると判定した場合には、第１位の通信端末の復号音声信号におけるピッチ周期情報およびスペクトル包絡情報をそれぞれ担う２つの音声符号化パラメータ並びに第２位の通信端末の復号音声信号におけるピッチ周期情報を担う音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、第１位および第２位の通信端末の復号音声信号において音声再符号化処理を施す音声符号化パラメータについては伝送レート制御機能に基づく縮退の量子化ビット制御を行った後に各通信端末に送信するように構成したので、２話者が同時に発言した場合でも、比較的良好な音声品質の伝送が可能になるという効果がある。
【０１００】
この発明によれば、多地点制御装置の音声符号化伝送システムを、複数の通信端末のうち主音声の通信端末および副音声の通信端末から受信した復号音声信号が有音であると判定した場合には、主音声の通信端末の復号音声信号におけるピッチ周期情報およびスペクトル包絡情報をそれぞれ担う２つの音声符号化パラメータ並びに副音声の通信端末の復号音声信号におけるピッチ周期情報を担う音声符号化パラメータについては音声再符号化処理を施すことなく各通信端末に送信し、主音声および副音声の通信端末の復号音声信号において音声再符号化処理を施す音声符号化パラメータについては伝送レート制御機能に基づく縮退の量子化ビット制御を行った後に各通信端末に送信するように構成したので、主音声と副音声とが同時に発言した場合でも、副音声の符号化パラメータを縮退させることで、伝送レートを維持しつつ主音声の音声品質の劣化を極力抑えて伝送できるという効果がある。
【０１０１】
この発明によれば、多地点制御装置の音声符号化伝送システムにおける縮退の量子化ビット制御が行われる符号化パラメータを、利得符号帳であるように構成したので、音声品質の劣化に対して影響の少ない利得符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果がある。
【０１０２】
この発明によれば、多地点制御装置の音声符号化伝送システムにおける縮退の量子化ビット制御が行われる符号化パラメータを、雑音符号帳であるように構成したので、音声品質の劣化に対して影響の少ない代数符号帳インデックスの符号化パラメータを縮退させて、２話者が同時に発言した場合でも、音声品質の劣化を抑制できるという効果がある。
【０１０３】
この発明によれば、多地点制御装置の音声符号化伝送システムを、音声符号化パラメータについて音声再符号化処理を施した後に各通信端末に送信するかまたは音声再符号化処理を施すことなく各通信端末に送信するかを決定する符号化モード情報を所定ビット数からなる符号化パラメータのフレームに含めるように構成したので、伝送レートに影響を与えることなく、特定の符号化パラメータをパススルーするか否かの情報を符号化パラメータのフレームに含めることができるという効果がある。
【図面の簡単な説明】
【図１】この発明の実施の形態１による音声符号化伝送システムに適用した多地点制御装置の構成を示すブロック図である。
【図２】同実施の形態１における音声符号化パラメータ制御部の構成を示すブロック図である。
【図３】同実施の形態１における音声復号処理部の構成を示すブロック図である。
【図４】同実施の形態１における音声再符号化処理部の構成を示すブロック図である。
【図５】同実施の形態２における音声再符号化処理部の構成を示すブロック図である。
【図６】同実施の形態２における会議端末の受信側の構成を示すブロック図である。
【図７】同実施の形態２における会議端末の音声復号処理部の構成を示すブロック図である。
【図８】同実施の形態２に係る音声符号化パラメータのフレーム構成を示す図である。
【図９】同実施の形態２に係る符号化パラメータの１フレームのビット割り当て例を示す説明図である。
【図１０】同実施の形態３における多地点制御装置の構成を示すブロック図である。
【図１１】同実施の形態３における音声符号化パラメータ制御部の構成を示すブロック図である。
【図１２】同実施の形態４に係る音声符号化伝送システムに適用する多地点制御装置の構成を示すブロック図である。
【図１３】同実施の形態４における音声符号化パラメータ制御部の構成を示すブロック図である。
【図１４】同実施の形態５に係る音声符号化パラメータのフレーム構成を示す図である。
【図１５】同実施の形態５に係る符号化パラメータの１フレームのビット割り当て例を示す説明図である。
【図１６】従来のＣＥＬＰ方式に基づく音声符号化装置の構成を示すブロック図である。
【図１７】従来の多地点制御装置の構成を示すブロック図である。
【図１８】従来の多地点制御装置の他の構成を示すブロック図である。
【図１９】従来のＣＥＬＰ系音声符号化装置の構成を示すブロック図である。
【符号の説明】
１０４　線形予測分析処理部、１０５　ＬＳＰ量子化処理部、１０６　ＬＳＰ量子化符号帳、１０８　多重化部、１０９　逆量子化処理部、１１０　適応符号帳、１１０ｂ　第２の適応符号帳、１１１　代数符号帳（雑音符号帳）、１１２，１１６　加算器、１１３ａ，１１３ｂ　利得制御増幅部、１１３ｃ　利得制御増幅部、１１４　利得量子化符号帳、１１５　合成フィルタ、１１７　聴覚重み付フィルタ、１１８　歪最小化部、１１９，１２０　切替スイッチ、１２１　ピッチプレフィルタ、１２４　ＬＰＣ・ＬＳＰ変換部、１２５　利得ＭＡ予測部、１２６　多重分離部、１２７，１２８　制御増幅部、１２９　加算器、１３０　切替スイッチ、１３１ａ　適応符号帳、１３１ｂ　利得復号部、１３２　復号利得ＭＡ予測部、１３３　代数符号復号部、１３４　ピッチプレフィルタ、１３５　ＬＳＰ復号部、１３６　ＬＳＰ内挿部、１３７　ＬＳＰ・ＬＰＣ変換部、１３８　合成フィルタ、１３９　ポストフィルタ、１４０　符号化レート制御部、２０１　交換機、２０２　回線インタフェース部、２０３　音声復号処理部、２０４　音声検出部、２０５　雑音抑圧処理部、２０６　音声加算部、２０７　分配処理部、２０８　自端末音声減算部、２０９　自動利得制御部、２１０　音声再符号化処理部、２１１　音声符号化パラメータ制御部、２１２　セレクタ、２１３　分配処理部、２１４　発言者選択部、２１５　先着チャネル判定部、２１６　優先話者判定部、２２４　ＭＣＵ制御部。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a multipoint control unit (hereinafter "MCU") applied to a telephone conference (video conference) system that uses speech signals encoded by a source coding scheme such as code-excited linear prediction (hereinafter "CELP"). In particular, it relates to a speech encoding transmission system of a multipoint control unit that re-encodes decoded speech signals in order to handle utterances from many speakers mixed together.
[0002]
[Prior art]
The MCU has been developed as a device that connects a plurality of communication terminals in various configurations via a communication network, receives and transmits the heterogeneous, differently encoded information (video, audio, data, and so on) handled by each terminal, performs switching and distribution suited to each kind of information, and delivers the processed information to the terminals. Typical applications are telephone conference and video conference systems. Compression coding is applied to the speech that the MCU transmits.
In recent years, techniques based on source coding schemes such as CELP have been put to practical use as highly efficient compression coding methods for telephone-band speech, mainly in fields such as digital mobile telephony, international communications, and intra-company communications. Among them, the CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction) scheme used in ITU-T Recommendation G.729 and the GSM-AMR (Global System for Mobile communications Adaptive Multi-Rate) scheme have been adopted as international or regional standards.
[0003]
Among speech sounds, voiced sounds in particular are produced when the resonance characteristics of the vocal tract, such as the oral and nasal cavities, are imposed on the wave generated by vibration of the vocal cords. CELP is by origin a coding scheme that models this human speech production mechanism. In it, the vocal cord wave is represented by a pitch period, which expresses its repetitive component, and a noise parameter, which expresses its fluctuating component. The vocal tract transfer characteristics as the sound passes through the throat, mouth, and nose, and the radiation characteristics of the lips, are approximated by linear prediction.
[0004]
A typical CELP scheme basically uses two codebooks, an adaptive codebook and a fixed codebook, together with an LSP (Line Spectral Pair) quantization codebook and a gain quantization codebook. The adaptive codebook expresses the periodic component of the drive source signal (excitation signal); it usually takes the form of a codebook that stores past drive source signals in memory and is updated every frame period. The noise codebook, on the other hand, expresses the non-periodic component that the adaptive codebook cannot express, and is often a fixed codebook holding a number of typical signal patterns. Because this type of noise codebook needs a large amount of memory to store the signal patterns, an improved form, the algebraic codebook, which approximates the non-periodic component with a small number of pulses, has recently become widespread. The gains of the periodic and non-periodic components of the drive source signal are vector-quantized using the gain quantization codebook. The linear prediction coefficients obtained by linear prediction analysis of the speech to be transmitted are converted into LSPs, which are vector-quantized using the LSP quantization codebook.
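The way the two codebooks combine into one drive source signal can be illustrated with a minimal sketch. It is not taken from G.729 or from this publication: the function name `build_excitation`, the sample values, and the representation of the algebraic codebook entry as a list of `(position, sign)` pulses are all assumptions made for illustration.

```python
def build_excitation(past_excitation, pitch_lag, pulses,
                     gain_pitch, gain_code, subframe_len):
    """Return one subframe: gain_pitch * adaptive + gain_code * algebraic.

    Hypothetical helper for illustration; real codecs use fractional lags,
    interpolation filters, and fixed-point arithmetic.
    """
    # Adaptive codebook vector: repeat the past excitation at the pitch lag.
    history = list(past_excitation)
    adaptive = []
    for _ in range(subframe_len):
        sample = history[len(history) - pitch_lag]
        adaptive.append(sample)
        history.append(sample)  # lets lags shorter than the subframe repeat

    # Algebraic codebook vector: a small number of signed unit pulses.
    algebraic = [0.0] * subframe_len
    for position, sign in pulses:
        algebraic[position] += sign

    return [gain_pitch * a + gain_code * c
            for a, c in zip(adaptive, algebraic)]


excitation = build_excitation(
    past_excitation=[1.0, 2.0, 3.0, 4.0],
    pitch_lag=2,
    pulses=[(1, 1.0), (3, -1.0)],
    gain_pitch=0.5,
    gain_code=2.0,
    subframe_len=4,
)
```

The periodic part is carried entirely by `pitch_lag` and the stored history, which is why the pitch period information is specific to one speaker.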
[0005]
FIG. 16 is a block diagram showing a configuration example of a conventional speech coding apparatus based on the CS-ACELP scheme. In the figure, 604 is a linear prediction analysis processing unit, 605 is an LSP (Line Spectral Pair) quantization processing unit, 606 is an LSP quantization codebook, 608 is a multiplexing unit, 609 is an inverse quantization processing unit, 610 is an adaptive codebook, 611 is an algebraic codebook, 612 and 616 are adders, 613 is a gain control amplifier, 614 is a gain quantization codebook, 615 is a synthesis filter, 617 is a perceptual weighting filter, 618 is a distortion minimizing section, and 624 is an LPC (Linear Predictive Coding) / LSP conversion unit.
[0006]
Next, the operation will be described.
The linear prediction analysis processing unit 604 obtains the LSP parameters from the speech input. The LSP parameters are input directly to the LSP quantization processing unit 605 and encoded with reference to the LSP quantization codebook 606. The encoded LSP parameters (codebook index) are sent to the inverse quantization processing unit 609 and to the multiplexing unit 608. The inverse quantization processing unit 609 calculates linear prediction coefficients from the LSP coefficients obtained by looking up the LSP quantization codebook 606 with the encoded LSP parameters (codebook index), and supplies them to the synthesis filter 615. In the processing block group consisting of the synthesis filter 615, the adaptive codebook 610, the algebraic codebook 611, the adders 612 and 616, the gain control amplifier 613, the gain quantization codebook 614, and the perceptual weighting filter 617, a plurality of speech waveforms are synthesized by varying the combination of entries from the adaptive codebook 610, the algebraic codebook 611, and the gain quantization codebook 614. The distortion minimizing section 618 calculates the perceptually weighted error power (squared error) between each synthesized speech waveform and the input signal waveform, and selects the combination of entries from the adaptive codebook 610, the algebraic codebook 611, and the gain quantization codebook 614 that minimizes the error power. In other words, speech coding is performed by the so-called AbS (Analysis by Synthesis) method. The coding parameters obtained in this way (the indices of the adaptive codebook 610, the algebraic codebook 611, the gain quantization codebook 614, and so on) and the parameters previously quantized by the LSP quantization processing unit 605 are multiplexed by the multiplexing unit 608 and sent to the decoder side.
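The AbS search described above can be sketched as an exhaustive loop: synthesize a candidate waveform for every combination of codebook entries and keep the combination with the smallest squared error. The sketch below is a simplification under stated assumptions: it omits the adaptive codebook and the perceptual weighting filter, uses the filter convention s[n] = e[n] - sum_k a_k s[n-k], and the names `synthesize` and `abs_search` as well as the toy codebooks are hypothetical.

```python
from itertools import product


def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] - sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * out[n - 1 - k]
        out.append(acc)
    return out


def abs_search(target, lpc, code_vectors, gains):
    """Return (code index, gain index) minimizing the squared error."""
    best = None
    for (ci, code), (gi, gain) in product(enumerate(code_vectors),
                                          enumerate(gains)):
        candidate = synthesize([gain * c for c in code], lpc)
        err = sum((t - s) ** 2 for t, s in zip(target, candidate))
        if best is None or err < best[0]:
            best = (err, ci, gi)
    return best[1], best[2]


# Toy data (assumed): a first-order filter and three pulse patterns.
lpc = [-0.5]
codes = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
gains = [0.5, 1.0, 2.0]
target = synthesize([2.0 * c for c in codes[2]], lpc)
chosen = abs_search(target, lpc, codes, gains)
```

Only the indices in `chosen` would be transmitted; the decoder rebuilds the waveform from the same codebooks, which is why encoder and decoder must operate as a matched pair.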
[0007]
Next, a case will be described in which the above-described encoding method is applied to a telephone conference device using an MCU, and one-to-many or many-to-many communication is performed.
FIG. 17 is a block diagram showing the configuration of a conventional MCU. In FIG. 17, 20 is a telephone constituting a conference terminal, 21 is an exchange, 22 is a line interface unit, 23 is a voice decoding processing unit, 24 is a voice detection unit, 25 is a noise suppression processing unit, 26 is a voice adding unit, 27 is a distribution processing unit, 28 is an own-terminal voice subtraction unit, 29 is an automatic gain control unit, and 30 is a voice re-encoding processing unit.
[0008]
Next, the operation will be described.
First, the coded voice received by the line interface unit 22 is decoded by the voice decoding processing unit 23 into a voice signal for each channel. The voice detection unit 24 detects, for each decoded voice signal, whether the signal from the corresponding conference terminal 20 is voiced or silent. The detection result is used to determine the terminals to be added in the voice adding unit 26, to set the noise suppression weights in the noise suppression processing unit 25, and to adjust the voice level automatically in the automatic gain control unit 29.
[0009]
As the number N of conference participants increases, the background noise level rises in proportion to N, and the resulting drop in the S/N ratio degrades speech quality. To reduce the noise entering from silent channels, the noise suppression processing unit 25 therefore determines weight coefficients for the voiced channels and the silent channels, based on the number N of conference participants and the voiced/silent detection result for each participant, and performs noise control processing that multiplies each decoded voice signal by its weight coefficient. The voice adding unit 26 refers to the detection result of the voice detection unit 24 and adds the decoded voice signals of the channels to be added after the noise control processing. The added decoded voice signal is redistributed to each channel by the distribution processing unit 27.
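The weighting and addition steps can be sketched as follows. The description does not give the actual weight values, so the choice below (1.0 for voiced channels, 1/N for silent channels) is purely an assumption for illustration, as is the function name `mix_with_noise_suppression`.

```python
def mix_with_noise_suppression(decoded, voiced_flags):
    """Weight each decoded channel by its voiced/silent state, then add.

    decoded: list of per-channel sample lists of equal length.
    voiced_flags: list of booleans from the voice detection stage.
    """
    n_participants = len(decoded)
    # Assumed weights: the source only says they depend on N and on the
    # voiced/silent detection result.
    weight_voiced = 1.0
    weight_silent = 1.0 / n_participants

    mixed = [0.0] * len(decoded[0])
    for samples, voiced in zip(decoded, voiced_flags):
        weight = weight_voiced if voiced else weight_silent
        for i, x in enumerate(samples):
            mixed[i] += weight * x
    return mixed


mixed = mix_with_noise_suppression(
    decoded=[[1.0, 2.0], [4.0, 4.0], [0.0, 8.0], [4.0, 0.0]],
    voiced_flags=[True, False, False, False],
)
```

Scaling the silent channels by 1/N keeps their summed noise contribution roughly constant as the conference grows, which is one plausible way to counter the N-proportional noise increase described above.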
[0010]
The own-terminal voice subtraction unit 28 subtracts the terminal's own signal from the added decoded voice signal in order to suppress the echo caused by the terminal's own voice being returned to it and to eliminate the resulting annoyance. In a multipoint telephone conference system, adding the voices of many participants may cause saturation distortion, so the automatic gain control unit 29 prevents saturation distortion and adjusts the individual and overall voice levels. The output signal of the automatic gain control unit 29 is encoded by the voice re-encoding processing unit 30 and output to the line.
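A minimal sketch of these two steps is given below. The subtraction follows the description directly; the gain control is reduced to a simple peak limiter, which is an assumption, since the description does not specify how the automatic gain control unit computes its gain. The names `mix_minus` and `apply_agc` are hypothetical.

```python
def mix_minus(mixed, own_signal):
    """Remove a terminal's own contribution from the added signal."""
    return [m - o for m, o in zip(mixed, own_signal)]


def apply_agc(samples, limit=32767.0):
    """Scale the block down only if its peak would exceed the limit.

    Assumed behaviour: a block-wise peak limiter standing in for the
    automatic gain control unit.
    """
    peak = max((abs(x) for x in samples), default=0.0)
    if peak <= limit:
        return list(samples)
    scale = limit / peak
    return [x * scale for x in samples]


to_terminal = mix_minus([3.0, 5.0], [1.0, 2.0])
limited = apply_agc([100.0, -200.0], limit=100.0)
```

Because the subtraction differs per terminal, each output channel carries a different signal, which is why the conventional MCU re-encodes every channel separately.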
[0011]
A first problem of the system using the multipoint control device shown in FIG. 17 is the degradation of voice quality caused by repeated decoding and re-encoding (hereinafter referred to as "tandem connection"). Because voice coding and decoding devices based on the CELP scheme use lossy coding, the decoded voice signal contains quantization error. When such devices are connected in tandem, further quantization error accumulates in the re-encoding process and the voice quality is degraded.
[0012]
As a method for solving this problem, there is, for example, the technique disclosed in Japanese Patent Application Laid-Open No. 2000-174909, "Conference terminal control device". FIG. 18 is a block diagram showing the configuration of that multipoint control device (MCU). In the figure, 2a to 2m are demultiplexers, 4 is a speaker detection circuit, 6 is a first selector, 8a to 8n are decoders (voice decoding processing units), 10a to 10n are attenuation circuits, 12 is a synthesis circuit, 14 is an encoder (voice re-encoding unit), 16 is a second selector, and 18 is a distribution circuit.
[0013]
Next, the operation will be described.
When the MCU receives voice information from each conference terminal, the demultiplexers 2a to 2m separate out the voiced/silent signals, and the speaker detection circuit 4 uses these signals to detect which conference terminals are sending compressed voice codes in a voiced state. The number of voiced conference terminals is also counted. Once the voiced/silent determination has been made, the result is sent to the first selector 6, which operates as follows according to the number of conference terminals in a voiced state.
[0014]
If there are two or more voiced conference terminals, the first selector 6 selects them, connects them one-to-one to the decoders 8a to 8n, and sends their compressed voice codes to those decoders. Each of the corresponding decoders 8a to 8n decodes the supplied compressed voice code to generate a voice signal. The generated voice signals are attenuated to a predetermined level by the attenuation circuits 10a to 10n and, after this level adjustment, are synthesized by the synthesis circuit 12. That is, all the voices from the conference terminals in a voiced state are combined. The synthesized voice signal is re-encoded by the encoder 14, sent to the second selector 16, and transmitted by the distribution circuit 18 to all conference terminals.
[0015]
On the other hand, if only one conference terminal is in a voiced state, this fact is reported to the first selector 6, which selects that conference terminal and connects it directly to the second selector 16. The compressed voice code of the voiced conference terminal connected to the second selector 16 is transmitted unchanged to all conference terminals via the distribution circuit 18.
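The routing decision of FIG. 18 can be summarized in a short sketch. The codec is replaced by caller-supplied `decode` and `encode` functions, the attenuation factor is an arbitrary placeholder, and the handling of the case where no terminal is voiced is an assumption, since the description covers only the one-speaker and multi-speaker cases. The name `route_frame` is hypothetical.

```python
def route_frame(coded_frames, voiced_flags, decode, encode, attenuation=0.5):
    """Return (mode, frame) following the FIG. 18 selector logic."""
    voiced = [i for i, v in enumerate(voiced_flags) if v]
    if not voiced:
        # Not specified in the description; assumed here for completeness.
        return "silence", None
    if len(voiced) == 1:
        # Single speaker: pass the compressed code through unchanged.
        return "pass_through", coded_frames[voiced[0]]
    # Two or more speakers: decode, attenuate, synthesize, re-encode.
    decoded = [[attenuation * x for x in decode(coded_frames[i])]
               for i in voiced]
    mixed = [sum(column) for column in zip(*decoded)]
    return "re_encoded", encode(mixed)


# Toy stand-ins for the decoder and encoder (assumed, not a real codec).
def toy_decode(frame):
    return [float(b) for b in frame]


def toy_encode(samples):
    return tuple(round(x) for x in samples)


frames = [(2, 4), (6, 8), (0, 0)]
single = route_frame(frames, [True, False, False], toy_decode, toy_encode)
double = route_frame(frames, [True, True, False], toy_decode, toy_encode)
```

The mode changes whenever the voiced count crosses one, so any false detection flips the listener between two different encoders.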
[0016]
With the MCU configured as shown in FIG. 18, all of the coded voice data is passed through when there is a single speaker, so tandem connection is avoided and high-quality voice can be provided even with a low-bit-rate voice coding scheme.
[0017]
A second problem of the system using the multipoint control device shown in FIG. 17 is a degradation factor peculiar to information source coding: when a plurality of speakers talk simultaneously, the degradation is pronounced and the speech becomes hard to understand.
In the conventional CELP-based speech coding apparatus described above, as shown in FIG. 16, coding is designed on the assumption that a single speaker is talking. That is, the vocal cord sound source (drive vector), vocal tract information (linear prediction coefficients), gain information, and so on are extracted in the coding process, quantized, and transmitted. Because only one set of parameters representing the speech features is available, mixed speech from a plurality of speakers cannot be coded with high accuracy. For example, in the case of mixed speech from a plurality of speakers, the vocal cord sound source contains pitch period information that differs from speaker to speaker, but there is no means of expressing more than one pitch period. The vocal tract information of mixed speech also has a more complex spectral structure than that of a single speaker, and since no quantization patterns (quantization tables) that represent it faithfully are provided, the quantization error tends to increase.
[0018]
As a means for solving this problem, there is, for example, the method disclosed in Japanese Patent Application Laid-Open No. H10-240299, "Speech Encoding and Decoding Device". FIG. 19 is a block diagram showing the configuration of the CELP speech encoding device disclosed in that publication. In the figure, 31 is a multi-speaker voice separation unit, 32_1 to 32_N are long-term predictors, 33_1 to 33_N are source codebooks, 34 is a reflection coefficient analysis unit, 35 is a throat approximation filter, 36_1 to 36_N and 37 are adders, 38 is a subtractor, and 39 is an error analyzer.
[0019]
Next, the operation will be described.
The multi-speaker voice separation unit 31 analyzes the periodic characteristics of the input voice signal, determines the number of speakers n (1 <n ≦ N), separates the voice of each speaker contained in the voice signal, and outputs the results as the source voices A_1 to A_n of the respective speakers. The number n of speakers obtained by the multi-speaker voice separation unit 31 is supplied to the reflection coefficient analysis unit 34. The reflection coefficient analysis unit 34 calculates the reflection coefficients r at an order that depends on the number n of speakers: for example, 10th order when there is one speaker, 15th order when there are two, and 20th order when there are more than two. The reflection coefficients r can be obtained, for example, by executing FLAT (fixed-point covariance lattice type algorithm) using the autocorrelation of the input speech. The obtained reflection coefficients r are given as the coefficients of the throat approximation filter 35.
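The order selection quoted above amounts to a simple lookup, sketched below. The function name `reflection_order` is hypothetical, and the values are the examples given in the description, not a general rule.

```python
def reflection_order(num_speakers):
    """Analysis order used for the reflection coefficients.

    Values follow the example in the description: 10th order for one
    speaker, 15th for two, 20th for more than two.
    """
    if num_speakers < 1:
        raise ValueError("at least one speaker is required")
    if num_speakers == 1:
        return 10
    if num_speakers == 2:
        return 15
    return 20


orders = [reflection_order(n) for n in (1, 2, 3, 5)]
```

A higher order lets the filter follow the more complex spectral envelope of mixed speech, at the cost of more coefficients to quantize and transmit.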
[0020]
Meanwhile, the source voices A_1 to A_n of the speakers separated by the multi-speaker voice separation unit 31 are supplied to the n long-term predictors 32_1 to 32_n, respectively. The long-term predictors 32_1 to 32_n extract the pitches L_1 to L_n of the source voices from the correlation of these source voices A_1 to A_n. The pitches L_1 to L_n and the outputs of the source codebooks 33_1 to 33_n are combined by the adders 36_1 to 36_n, respectively, so that the source voice of each speaker is decoded. The source voices of the plurality of speakers are added by the adder 37 and given the characteristics of the vocal tract by the throat approximation filter 35 to become a local decoded signal. The subtractor 38 takes the difference between the local decoded signal and the input voice, and the error analyzer 39 sequentially determines the indices I_1 to I_n of the source codebooks 33_1 to 33_n so that the error signal from the subtractor 38 is minimized.
[0021]
[Problems to be solved by the invention]
Because the conventional voice coded transmission system of the multipoint control device is configured as described above, using the method shown in FIG. 18 to avoid voice quality degradation due to accumulated quantization error has the following drawback. The speaker detection circuit 4 may react to short utterances by someone other than the speaker (coughing, brief interjections, and the like) or make erroneous determinations because of fluctuations in background noise, in which case the circuit switches frequently between pass-through and decoding/re-encoding. That is, in FIG. 18, the selector 6 switches between the output of the encoder 14 (the re-encoded voice signal) and the passed-through encoded voice signal according to the determination result of the speaker detection circuit 4. With a low-bit-rate voice coding scheme such as the CELP scheme, however, high-quality voice is normally obtained only when an encoder and a decoder operate continuously as a pair. In the configuration of FIG. 18, switching the selector 6 causes the encoder paired with the decoder in the listener's conference terminal to alternate frequently between the encoder in the sole speaker's conference terminal and the encoder in the MCU (encoder 14). As a result, the receiving terminal suffers from unnatural sound caused by discontinuities at each switch and from frequent fluctuations in sound quality, which make the speech hard to listen to.
[0022]
Further, because the conventional voice coded transmission system of the multipoint control device is configured as described above, using the method shown in FIG. 19 to avoid the pronounced degradation caused by simultaneous utterances of a plurality of speakers, which is peculiar to information source coding, requires additional pitch period information each time the number of speakers increases, so the bit rate rises in proportion to the number of speakers. This limits the applicable transmission networks to, for example, asynchronous transmission networks typified by ATM networks and IP networks.
[0023]
The present invention has been made to solve the above problems, and an object thereof is to obtain a voice coded transmission system of a multipoint control device capable of high-quality voice transmission even with simultaneous utterances by a plurality of speakers, utterances in a high-noise environment, and short utterances such as coughing and brief interjections, all of which are to be expected in one-to-many or many-to-many telephone conferences.
It is another object of the present invention to provide a voice coded transmission system of a multipoint control device that can be applied to a transmission network in which the transmission speed is constant or the transmission speed per channel is limited.
[0024]
[Means for Solving the Problems]
A voice coded transmission system of a multipoint control device according to the present invention is one in which the multipoint control device is connected to a plurality of communication terminals via a communication network, receives and transmits coded voice signals obtained by coding the voice handled by each communication terminal, performs predetermined processing according to the information content of these coded voice signals, and distributes the processed coded voice signals to the plurality of communication terminals. The system decodes the coded voice signal received from each communication terminal to generate a decoded voice signal, determines the communication terminals whose decoded voice signals are voiced, and designates one of the communication terminals determined to be voiced. Some of the plurality of types of voice coding parameters in the decoded voice signal of the designated communication terminal are transmitted to each communication terminal without voice re-encoding processing, while the remaining voice coding parameters of the designated communication terminal and the plurality of types of voice coding parameters in the decoded voice signals of the other communication terminals are transmitted to each communication terminal after voice re-encoding processing.
[0025]
The voice coded transmission system of the multipoint control apparatus according to the present invention designates one communication terminal when only the decoded voice signal from one communication terminal has sound, and decodes the voice of the specified communication terminal. Some speech coding parameters in a signal are transmitted to each communication terminal without performing speech recoding processing.
[0026]
The voice coded transmission system of the multipoint control device according to the present invention sets a priority order for the plurality of communication terminals and, when a plurality of communication terminals are determined to have voiced decoded voice signals, designates the one with the highest priority among them and transmits some of the voice coding parameters in the decoded voice signal of the designated communication terminal to each communication terminal without voice re-encoding processing.
[0027]
In the voice coded transmission system of the multipoint control device according to the present invention, when a plurality of communication terminals are determined to have voiced decoded voice signals, the priority is set in the order in which their voice signals arrived, earliest first.
[0028]
The voice coded transmission system of the multipoint control device according to the present invention is configured so that one specific communication terminal is preferentially designated in advance and the priority is set.
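The two priority rules above (earliest arrival first, or one terminal designated in advance) can be sketched as a single selection function. Combining them in one function is a choice made for illustration, since the text presents them as separate configurations; the name `designate_terminal`, the use of onset times, and the tie-break on terminal identifier are assumptions.

```python
def designate_terminal(voiced_onsets, preferred=None):
    """Pick the terminal whose parameters will be passed through.

    voiced_onsets: dict mapping terminal id -> time at which it became
        voiced (only terminals currently judged voiced are included).
    preferred: optional terminal id given priority in advance.
    """
    if not voiced_onsets:
        return None
    if preferred is not None and preferred in voiced_onsets:
        return preferred
    # Earliest onset wins; ties are broken by terminal id (assumed rule).
    return min(voiced_onsets, key=lambda t: (voiced_onsets[t], t))


onsets = {"B": 120, "A": 100, "C": 150}
first_come = designate_terminal(onsets)
chair_first = designate_terminal(onsets, preferred="C")
```

With the first-arrival rule the current speaker keeps the pass-through path while others join, whereas a pre-designated terminal, such as a chairperson's, takes it over whenever that terminal is voiced.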
[0029]
Some voice coding parameters in the voice coding transmission system of the multipoint control device according to the present invention are configured to be parameters carrying pitch period information.
[0030]
A part of speech encoding parameters in the speech encoding transmission system of the multipoint control device according to the present invention are configured to be parameters carrying spectrum envelope information.
[0031]
The voice coded transmission system of the multipoint control device according to the present invention is configured so that, when the decoded voice signals received from a plurality of communication terminals are determined to be voiced, the bit allocation of the voice coding parameter frame is set adaptively according to the number of voiced communication terminals.
[0032]
In the voice coded transmission system of the multipoint control device according to the present invention, when the decoded voice signals received from a first and a second communication terminal among the plurality of communication terminals are determined to be voiced, the two voice coding parameters carrying the pitch period information and the spectrum envelope information in the decoded voice signal of the first communication terminal, and the voice coding parameter carrying the pitch period information in the decoded voice signal of the second communication terminal, are transmitted to each communication terminal without voice re-encoding processing, while the voice coding parameters of the first and second communication terminals that are subjected to voice re-encoding processing are transmitted to each communication terminal after degenerate quantization bit control based on a transmission rate control function.
[0033]
In the voice coded transmission system of the multipoint control device according to the present invention, when the decoded voice signals received from a main voice communication terminal and a sub voice communication terminal among the plurality of communication terminals are determined to be voiced, the two voice coding parameters carrying the pitch period information and the spectrum envelope information in the decoded voice signal of the main voice communication terminal, and the voice coding parameter carrying the pitch period information in the decoded voice signal of the sub voice communication terminal, are transmitted to each communication terminal without voice re-encoding processing, while the voice coding parameters of the main voice and sub voice communication terminals that are subjected to voice re-encoding processing are transmitted to each communication terminal after degenerate quantization bit control based on a transmission rate control function.
[0034]
The encoding parameter for performing the degenerate quantization bit control in the speech encoding transmission system of the multipoint control device according to the present invention is configured to be a gain codebook.
[0035]
The encoding parameter for performing the degenerate quantization bit control in the speech encoding transmission system of the multipoint control apparatus according to the present invention is configured to be a noise codebook.
[0036]
The voice coded transmission system of the multipoint control device according to the present invention transmits, within a coding parameter frame having a predetermined number of bits, coding mode information that determines whether each voice coding parameter is transmitted to each communication terminal after voice re-encoding processing or without voice re-encoding processing.
[0037]
[Best Mode for Carrying Out the Invention]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a multipoint control device to which a voice coded transmission system according to Embodiment 1 of the present invention is applied. In FIG. 1, 201 is an exchange, 202 is a line interface unit, 203 is a voice decoding processing unit, 204 is a voice detection unit, 205 is a noise suppression processing unit, 206 is a voice adding unit, 207 is a distribution processing unit, 208 is an own-terminal voice subtraction unit, 209 is an automatic gain control unit, 210 is a voice re-encoding processing unit, and 211 is a voice coding parameter control unit.
[0038]
The basic functions of the configuration of FIG. 1 are as follows.
Coded voice from each conference terminal (communication terminal) is received by the line interface unit 202 via the exchange 201. The received coded voice is decoded by the voice decoding processing unit 203 into a voice signal for each channel. The voice detection unit 204 detects, for each decoded voice signal, whether the signal from the corresponding conference terminal is voiced or silent. The detection result is used to determine the terminals to be added in the voice adding unit 206, to set the noise suppression weights in the noise suppression processing unit 205, and to adjust the voice level automatically in the automatic gain control unit 209. In addition, the decoded voice signal produced by the voice decoding processing unit 203 and the voice parameters it extracts are input to the voice coding parameter control unit 211.
[0039]
As the number N of conference participants increases, noise increases and the communication quality deteriorates. To reduce the noise entering from silent channels, the noise suppression processing unit 205 therefore determines weight coefficients for the voiced channels and the silent channels, based on the number N of conference participants and the voiced/silent detection result for each participant, and performs noise control processing that multiplies each decoded voice signal by its weight coefficient. The voice adding unit 206 refers to the detection result of the voice detection unit 204 and adds the decoded voice signals of the channels to be added after the noise control processing. The added decoded voice signal is redistributed to each channel by the distribution processing unit 207.
[0040]
The own-terminal voice subtraction unit 208 subtracts the terminal's own signal from the added decoded voice signal in order to suppress the echo caused by the terminal's own voice being returned to it and to eliminate the resulting annoyance. In a multipoint telephone conference system, adding the voices of many participants may cause saturation distortion, so the automatic gain control unit 209 prevents saturation distortion and adjusts the individual and overall voice levels. The output signal of the automatic gain control unit 209 is encoded by the voice re-encoding processing unit 210 and output to the line.
[0041]
FIG. 2 is a block diagram showing a more detailed configuration of the speech coding parameter control unit 211 of FIG. 1. In the figure, 212 is a selector, 213 is a distribution processing unit, and 214 is a speaker selection unit.
The basic functions of the configuration of FIG. 2 are as follows.
The speaker selection unit 214 aggregates the detection results of the voice detection unit 204 for all channels and, when the speaker can be regarded as unique, outputs information on that speaker (such as the channel number) to the selector 212 and the voice re-encoding processing unit 210. In accordance with the selection result of the speaker selection unit 214, the selector 212 selects, from the voice coding parameters of each channel extracted by the voice decoding processing unit 203, the coding parameters to be passed through to the voice re-encoding processing unit 210. The distribution processing unit 213 redistributes the coding parameters selected by the selector 212 to the voice re-encoding processing unit 210 of each channel.
[0042]
FIG. 3 is a block diagram showing a more detailed configuration of the voice decoding processing unit 203 of FIG. 1. In the figure, 126 is a demultiplexing unit, 131a is an adaptive codebook, 131b is a gain decoding unit, 132 is a decoding gain MA prediction unit, 133 is an algebraic code decoding unit, 134 is a pitch prefilter, 135 is an LSP decoding unit, 136 is an LSP interpolation unit, 137 is an LSP/LPC conversion unit, 127 and 128 are control amplification units, 129 is an adder, 138 is a synthesis filter, and 139 is a post filter.
[0043]
FIG. 4 is a block diagram showing the configuration of the voice re-encoding processing unit 210 of FIG. 1, and corresponds to the speech coding apparatus of FIG. 16 described above. In the figure, 104 is a linear prediction analysis processing unit, 105 is an LSP (Line Spectral Pair) quantization processing unit, 106 is an LSP quantization codebook, 108 is a multiplexing unit, 109 is an inverse quantization processing unit, 110 is an adaptive codebook, 111 is an algebraic codebook, 112 and 116 are adders, 114 is a gain quantization codebook, 115 is a synthesis filter, 117 is a perceptual weighting filter, 118 is a distortion minimizing unit, 119, 120, and 122 are changeover switches, 113a and 113b are gain control amplification units, 121 is a pitch prefilter, 125 is a gain MA prediction unit, and 124 is an LPC (Linear Predictive Coding) / LSP conversion unit.
[0044]
Next, the operation will be described. Note that the description uses CS-ACELP of ITU-T Recommendation G.729 as the voice coding scheme.
The voice decoding processing unit 203 executes decoding processing based on the transmitted coded voice data. At the same time, it outputs to the voice coding parameter control unit 211 the parameters that represent features unique to the speaker's voice, namely the adaptive codebook index, which is pitch period information expressing the repetition period of the vocal cord vibration wave, and the LSP codebook index, which is spectrum envelope information expressing the vocal tract information. If the speaker is uniquely determined from the detection result of the voice detection unit 204, the adaptive codebook index and the LSP codebook index among the coding parameters received from the channel allocated to that speaker's terminal are passed through unchanged to the voice re-encoding processing unit 210.
[0045]
At this time, the voice re-encoding processing unit 210 connects the changeover switches 119, 120, and 122 to the contacts 119A, 120A, and 122A, and sends the passed-through coding parameters to the multiplexing unit 108 as they are, without performing re-encoding processing, that is, codebook search processing, on them. For the other parameters (the algebraic codebook index and the gain codebook index in FIG. 4), re-encoding is performed on the added decoded voice signal: the distortion minimizing unit 118 performs a least-squares error search to extract the optimum quantized values and sends them to the multiplexing unit 108. The multiplexing unit 108 multiplexes these parameters and outputs them to the line interface unit 202.
[0046]
As for the voice detection unit 204, suppose that, while speech is present on one channel, another channel interrupts. Even if the rising edge of voice is detected on the interrupting channel, the unit does not switch away from the currently voiced channel immediately. Instead, a predetermined waiting time delays the switch, which prevents switching on short, less important utterances such as back-channel responses and coughs. For a relatively long interruption this causes a switching delay, but the interruption is still reflected in the output of the voice addition unit 206 (the summed decoded speech signal) and is encoded by the speech re-encoding processing unit 210, so there is no risk of the beginning of the utterance being clipped. During the delay, however, the adaptive codebook index (pitch period information) and the LSP codebook index (spectral envelope information), which carry information important for decoding the speech signal, are partly missing for the interrupting speaker, so the quality of the interruption is degraded relative to that of the original speaker. In the running of a conference, though, interruptions are often of low importance, and the algebraic codebook index also carries some information on frequency components, albeit only a little. The interrupting speaker's voice therefore does not become abnormal, for example by being decoded into unpleasant sounds, and the degradation can be said to be relatively unnoticeable.
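The delayed switching described above can be illustrated with a small sketch. It is only one possible reading: the function name `track_speaker`, the representation of the detection result as a set of voiced channels per frame, the waiting time counted in frames, and the rule that the lowest channel number wins ties are all assumptions made for the example.

```python
def track_speaker(active_frames, wait_frames):
    """Return the selected channel for every frame.

    active_frames: list of sets, each holding the channels voiced in a frame.
    wait_frames:   number of consecutive frames an interrupting channel must
                   stay voiced before the selection switches to it.
    """
    current = None      # channel currently treated as the speaker
    challenger = None   # interrupting channel being timed
    held = 0            # consecutive frames the challenger has been voiced
    selected = []
    for active in active_frames:
        if current is None:
            if active:
                current = min(active)
        else:
            others = active - {current}
            if not others:
                # Interruption ended before the waiting time elapsed.
                challenger, held = None, 0
            else:
                candidate = challenger if challenger in others else min(others)
                if candidate == challenger:
                    held += 1
                else:
                    challenger, held = candidate, 1
                if held >= wait_frames:
                    current, challenger, held = challenger, None, 0
        selected.append(current)
    return selected
```

With `wait_frames=3`, a one-frame cough on channel 2 leaves channel 1 selected, while an interruption lasting three frames causes the switch on its third frame.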
[0047]
When the voice detection unit 204 cannot uniquely determine the speaker, the switches 119, 120, and 122 of the speech re-encoding processing unit 210 are connected to the contacts 119B, 120B, and 122B, respectively, and the pass-through operation is not performed. The speech re-encoding processing unit 210 therefore performs the prescribed re-encoding on the decoded speech of every channel with speaker input.
[0048]
As described above, according to the first embodiment, some of the plural types of speech encoding parameters in the decoded speech signal of the designated conference terminal are transmitted to each communication terminal without speech re-encoding, while the remaining speech encoding parameters of the designated conference terminal and the plural types of speech encoding parameters in the decoded speech signals of the other communication terminals are transmitted to each communication terminal after speech re-encoding. This yields the effect that high-quality voice transmission can be realized even for the simultaneous speech of multiple speakers expected in a one-to-many or many-to-many conference call, for speech in a high-noise environment, and for short utterances such as coughs and back-channel responses.
[0049]
That is, by passing through some of the speech encoding parameters, the principal speech encoding parameters of the main speaker (the speaker at the designated conference terminal) do not suffer the degradation caused by repeated decoding and re-encoding, so the main speaker can be reproduced with a sound quality close to that of full pass-through. Further, since the encoding parameters always go through the speech re-encoding processing unit, discontinuities in the audio are eliminated even when switching occurs frequently. In addition, since the quality is always kept constant, the unnatural impression caused by fluctuating voice quality is eliminated.
[0050]
Further, according to the first embodiment, the speech encoding parameters that are passed through without speech re-encoding include the parameter carrying pitch period information. Information expressing the repetition period of the vocal cord vibration unique to the speaker's voice is therefore transmitted as it is, and high-quality voice transmission can be realized.
[0051]
Also, according to the first embodiment, the speech encoding parameters that are passed through without speech re-encoding include the parameter carrying spectral envelope information. Information expressing the speaker's vocal tract, such as the oral and nasal cavities, is therefore transmitted as it is, and high-quality voice transmission can be realized.
[0052]
Embodiment 2.
In the second embodiment, the configuration corresponding to FIGS. 1 to 3 of the first embodiment is almost the same. In the second embodiment, however, the adaptive codebook indexes passed through from the speech decoding processing unit 203 to the speech encoding parameter control unit 211 are a first adaptive codebook index, taken from the encoding parameters of the speaker at the conference terminal with the first priority, and a second adaptive codebook index, taken from the encoding parameters of the speaker at the conference terminal with the second priority. FIG. 5 is a block diagram showing the configuration of the speech re-encoding processing unit according to Embodiment 2 of the present invention. In the figure, 130 is a changeover switch, 110b is a second adaptive codebook, 113c is a gain control amplification unit, and 140 is a coding rate control unit. The other components are the same as those of the speech re-encoding processing unit according to Embodiment 1 shown in FIG. 4; they are denoted by the same reference numerals, and their description is omitted in principle.
[0053]
FIG. 6 is a block diagram showing the configuration of a telephone or conference terminal connected to the exchange 201. In the figure, reference numeral 500 denotes a line interface unit, 501 a speech decoding processing unit, 502 a D/A converter, and 503 a loudspeaker.
[0054]
FIG. 7 is a block diagram showing a more detailed configuration of the speech decoding processing unit 501 in the conference terminal shown in FIG. 6. In the figure, 504 is a demultiplexing unit and 505 is an encoding mode decoding unit. The other components are the same as some of the components in FIG. 5; they are denoted by the same reference numerals, and their description is omitted in principle.
[0055]
FIG. 8 is an explanatory diagram showing an example of a frame configuration of encoding parameters in the case of one speaker, the case of two speakers, and the case of three speakers.
[0056]
Next, the operation will be described. As in the first embodiment, the speech coding scheme used in this description is the CS-ACELP scheme of ITU-T Recommendation G.729.
The switch 130 switches the input of the second adaptive codebook index from the speech encoding parameter control unit 211 on and off. The coding rate control unit 140 is provided between the speech encoding parameter control unit 211 and the speech re-encoding processing unit 210. Based on the speaker selection information from the speech encoding parameter control unit 211, it outputs a control signal that determines the rate of the encoding parameters to the algebraic codebook 111 and the gain quantization codebook 114, and it supplies coding mode information both as a switching control signal to the switches 119, 120, and 130 and as an output to the multiplexing unit 108. The demultiplexing unit 504 separates the speech encoding parameters multiplexed by the multiplexing unit 108 in FIG. The encoding mode decoding unit 505 decodes the coding mode information (the value of the last bit of the frame in FIG. 8) separated by the demultiplexing unit 504.
[0057]
Hereinafter, the overall operation will be described with reference to FIGS.
When there is only one speaker, the speaker selection unit 214 of FIG. 2 determines “one speaker” based on the detection result of the voice detection unit 204 of FIG. The speaker selection information indicating this determination is output to the selector 212 and forwarded to the speech re-encoding processing unit 210. The selector 212 selects the encoding parameters received from the channel corresponding to the conference terminal of that speaker and outputs them to the distribution processing unit 213. The distribution processing unit 213 passes the encoding parameters (the LSP codebook index and the first adaptive codebook index) unchanged to the speech re-encoding processing unit 210. In this case the switch 130 is connected to the 130B side, that is, the input of the second adaptive codebook index is turned off. The bit rate output from the coding rate control unit 140 is also determined. The other operations are exactly the same as in the first embodiment.
[0058]
When there is no speaker, or when three or more speakers speak at the same time, the speaker selection unit 214 determines “three or more speakers” based on the detection result of the voice detection unit 204, outputs speaker selection information indicating this determination to the selector 212, and forwards it to the speech re-encoding processing unit 210. In this case the selector 212 selects no encoding parameters. The speech re-encoding processing unit 210 performs re-encoding by tandem connection in accordance with the speaker selection information. That is, the switches 119, 120, and 130 are connected to the contacts 119B, 120B, and 130B, respectively, so that the LSP codebook index is not passed through and the inputs of the first and second adaptive codebook indexes are turned off. All encoding parameters are then re-encoded based on the signal after voice addition, and the optimum quantization values are searched for and sent to the multiplexing unit 108. If there is no speaker, however, no re-encoding is performed because there are no encoding parameters.
[0059]
On the other hand, when two speakers speak at the same time, the speaker selection unit 214 determines “two speakers”, outputs speaker selection information indicating this determination to the selector 212, and forwards it to the speech re-encoding processing unit 210. The selector 212 selects the encoding parameters received from the channels corresponding to the conference terminals of the two speakers and outputs them to the distribution processing unit 213. The distribution processing unit 213 passes the encoding parameters unchanged to the speech re-encoding processing unit 210. The speech re-encoding processing unit 210 passes through the LSP codebook index and the first adaptive codebook index of the first speaker, and also passes through the second adaptive codebook index (pitch period information) of the second speaker.
[0060]
That is, the switches 119, 120, and 130 are connected to the contacts 119A, 120A, and 130A, respectively, and the LSP codebook index and the first and second adaptive codebook indexes are sent to the multiplexing unit 108. The other encoding parameters, namely the algebraic codebook index and the gain codebook index, are re-encoded based on the signal after voice addition; the optimum quantization values are searched for and sent to the multiplexing unit 108. The multiplexing unit 108 multiplexes the encoding parameters and outputs them to the line interface unit 202.
[0061]
At the same time, in accordance with the speaker selection information from the speech encoding parameter control unit 211, the coding rate control unit 140 adjusts the numbers of bits allocated to the algebraic codebook index and the gain codebook index so that they fit the transmission rate, and outputs them to the algebraic codebook 111 and the gain quantization codebook 114, respectively. It also generates the coding mode information described below and outputs it to the switches 119, 120, and 130 and to the multiplexing unit 108.
[0062]
In this case, two coding modes are defined: “one speaker” and “three or more speakers” are assigned “mode 0”, and “two speakers” is assigned “mode 1”. The coding mode information thus takes the value “0” or “1”, so one bit is required to transmit it. FIGS. 8 and 9 show an example of the bit allocation of one frame (80 bits) of encoding parameters in each mode when the transmission rate is 8 kbit/s.
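The mode rule above, together with the switch settings given for the one-speaker, two-speaker, and three-or-more-speaker cases, can be summarized in a short sketch. The function names and the dictionary representation of the contacts are assumptions for illustration only.

```python
def coding_mode(num_speakers: int) -> int:
    """Mode 1 only for two simultaneous speakers, mode 0 otherwise."""
    return 1 if num_speakers == 2 else 0


def switch_contacts(num_speakers: int) -> dict:
    """Contact (A = pass through, B = off) chosen for switches 119, 120, 130."""
    if num_speakers == 1:
        # LSP and first adaptive index pass through; second adaptive input off.
        return {119: "A", 120: "A", 130: "B"}
    if num_speakers == 2:
        # The second adaptive codebook index is passed through as well.
        return {119: "A", 120: "A", 130: "A"}
    # No speaker, or three or more speakers: nothing is passed through.
    return {119: "B", 120: "B", 130: "B"}
```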
[0063]
In “mode 0”, the bit allocation is almost the same as that specified in ITU-T Recommendation G.729. That is, as shown in FIG. 8, in the case of “one speaker” the LSP codebook index and the adaptive codebook index are passed through, and in the case of “three or more speakers” none of the encoding parameters is passed through. However, since the transmission rate is 8 kbit/s, the standard scheme leaves no room for transmitting the coding mode information. Therefore, as shown in FIG. 9, the one bit that the standard scheme uses as a parity bit is diverted to carry the coding mode information.
[0064]
In “mode 1”, on the other hand, the second adaptive codebook index (13 bits) must be transmitted, so the bit allocation of the other encoding parameters must be reduced by 13 bits (this reduction is referred to as degeneration). Therefore, as shown in FIGS. 8 and 9, in the case of “two speakers” the algebraic codebook index is reduced from 34 bits to 22 bits and the gain codebook index from 14 bits to 13 bits. The encoding parameters passed through are the LSP codebook index and the adaptive codebook index of the first speaker and the second adaptive codebook index of the second speaker.
[0065]
The algebraic codebook index is degenerated using the method (11-bit quantization) described in section 5.8 of ITU-T Recommendation G.729 Annex D. The gain codebook index is degenerated using, for the first subframe, the method (7-bit quantization) described in section 3.9 of the main body of ITU-T Recommendation G.729 and, for the second subframe, the method (6-bit quantization) described in section 5.9 of ITU-T Recommendation G.729 Annex D.
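As a check on the arithmetic, the two allocations can be written out and summed. The algebraic, gain, second adaptive, and mode-bit widths come from the description above; the 18-bit LSP and 13-bit first adaptive codebook widths are taken from the standard G.729 frame and are an assumption here, since the figures are not reproduced. The dictionary and function names are illustrative.

```python
FRAMES_PER_SECOND = 100  # G.729 uses 10 ms frames

# Bits per frame; per-subframe widths are written as sums.
MODE0_BITS = {
    "lsp": 18,
    "adaptive_1": 13,
    "algebraic": 17 + 17,
    "gain": 7 + 7,
    "mode": 1,           # replaces the standard parity bit
}
MODE1_BITS = {
    "lsp": 18,
    "adaptive_1": 13,
    "adaptive_2": 13,    # second speaker's pitch information
    "algebraic": 11 + 11,
    "gain": 7 + 6,
    "mode": 1,
}


def total_bits(allocation: dict) -> int:
    """Number of bits in one frame for the given allocation."""
    return sum(allocation.values())


def bit_rate(allocation: dict) -> int:
    """Transmission rate in bit/s for the given allocation."""
    return total_bits(allocation) * FRAMES_PER_SECOND
```

Both allocations fill exactly 80 bits, so the rate stays at 8 kbit/s in either mode.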
[0066]
The multiplexing unit 108 multiplexes these encoding parameters and transmits them to each conference terminal via the line interface unit 202 and the exchange 201, in units of the frame shown in FIG. In the receiving section of the conference terminal shown in FIG. 6, the speech decoding processing unit 501 decodes the frame of encoding parameters received via the exchange 201 and the line interface unit 500. That is, in the speech decoding processing unit 501 shown in FIG. 7, the demultiplexing unit 504 separates one frame into the individual encoding parameters and outputs the coding mode information to the encoding mode decoding unit 505. Based on the coding mode information, the encoding mode decoding unit 505 supplies a switching control signal to the switch 130 and bit allocation information to the algebraic codebook 111 and the gain quantization codebook 114.
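The multiplexing and demultiplexing steps can be sketched as bit packing in which the mode bit is the last bit of the frame and tells the receiver which layout to use. The field order within the frame, the 18-bit LSP and 13-bit first adaptive widths (taken from standard G.729), and the names `LAYOUTS`, `pack_frame`, and `unpack_frame` are assumptions for illustration.

```python
# (field name, width in bits), in assumed transmission order, mode bit excluded.
LAYOUTS = {
    0: [("lsp", 18), ("adaptive_1", 13), ("algebraic", 34), ("gain", 14)],
    1: [("lsp", 18), ("adaptive_1", 13), ("adaptive_2", 13),
        ("algebraic", 22), ("gain", 13)],
}


def pack_frame(mode: int, fields: dict) -> int:
    """Pack the fields of one frame into an 80-bit integer, mode bit last."""
    word = 0
    for name, width in LAYOUTS[mode]:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return (word << 1) | mode


def unpack_frame(word: int):
    """Read the mode bit first, then split the frame using that mode's layout."""
    mode = word & 1
    rest = word >> 1
    fields = {}
    for name, width in reversed(LAYOUTS[mode]):
        fields[name] = rest & ((1 << width) - 1)
        rest >>= width
    return mode, fields
```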
[0067]
Accordingly, when a frame whose coding mode information is 0 is received, the switching control signal connects the switch 130 to the contact 130B side, turning off the input of the second adaptive codebook index. The algebraic codebook 111 decodes the algebraic codebook index using the decoding method described in section 4.1 of the main body of ITU-T Recommendation G.729, and the gain quantization codebook 114 likewise decodes the gain codebook index using the decoding method described in section 4.1 of the main body of ITU-T Recommendation G.729.
[0068]
When a frame whose coding mode information is 1 is received, on the other hand, the switching control signal connects the switch 130 to the contact 130A side, turning on the input of the second adaptive codebook index, and decoding starts. The algebraic codebook 111 decodes the algebraic codebook index of the first subframe using the decoding method described in section 4.1 of the main body of ITU-T Recommendation G.729, and that of the second subframe using the decoding method described in Chapter 6 of ITU-T Recommendation G.729 Annex D. Next, the adder 112 adds the pitch period component of the second speaker obtained by decoding the second adaptive codebook index, the pitch period component of the first speaker obtained by decoding the first adaptive codebook index, and the noise component obtained by decoding the algebraic codebook index, and outputs the sum to the synthesis filter 115 as the excitation signal. The synthesis filter 115 convolves the vocal tract information with the excitation signal to obtain the decoded speech.
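The summation at the adder 112 and the filtering at the synthesis filter 115 can be sketched in a few lines. This is a simplified illustration: applying one scalar gain per component (standing in for the gain control amplification units 113a to 113c), the all-pole filter form with A(z) = 1 + sum of a_k z^-k, and the function names are assumptions, and filter memory across subframes is ignored.

```python
def mix_excitation(pitch_1, pitch_2, noise, gain_1, gain_2, gain_c):
    """Sum the two pitch components and the noise component sample by sample."""
    return [gain_1 * p1 + gain_2 * p2 + gain_c * c
            for p1, p2, c in zip(pitch_1, pitch_2, noise)]


def synthesize(excitation, lpc):
    """All-pole synthesis: s[n] = e[n] - sum_k lpc[k-1] * s[n-k]."""
    speech = []
    for n, sample in enumerate(excitation):
        acc = sample
        for k, coeff in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= coeff * speech[n - k]
        speech.append(acc)
    return speech
```

For example, `synthesize(mix_excitation(p1, p2, c, g1, g2, gc), lpc)` would yield one block of decoded speech containing both speakers' pitch structure shaped by a single spectral envelope.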
[0069]
As described above, according to the second embodiment, when the decoded speech signals received from a plurality of conference terminals are determined to be voiced, the bit allocation of the encoding parameter frame is set adaptively according to the number of voiced conference terminals. The invention can therefore be applied to transmission networks in which the transmission rate is constant or the transmission rate per channel is limited.
[0070]
Further, according to the second embodiment, when the decoded speech signals received from the first speaker, at the conference terminal with the first priority, and the second speaker, at the conference terminal with the second priority, are both determined to be voiced, the following are passed through and transmitted to each conference terminal without speech re-encoding: the speech encoding parameters of the first speaker, consisting of the first adaptive codebook index (pitch period information) and the LSP codebook index (spectral envelope information), and the speech encoding parameter of the second speaker, consisting of the second adaptive codebook index (pitch period information). As a result, speech of relatively good quality can be transmitted even when two speakers speak at the same time.
[0071]
Further, according to the second embodiment, degenerate quantization bit control based on the transmission rate control function is applied to the speech encoding parameters that undergo speech re-encoding. Encoding parameters that have little influence on voice quality are degenerated, which makes room to transmit to each conference terminal the speech encoding parameter consisting of the second adaptive codebook index, the pitch period information in the decoded speech signal of the second speaker. Degradation of voice quality can thus be suppressed even when two speakers speak at the same time.
[0072]
Further, according to the second embodiment, the encoding parameters subjected to degenerate quantization bit control include the gain codebook index, which represents the excitation gain. Because degenerating the gain codebook index has little influence on voice quality, degradation of voice quality can be suppressed even when two speakers speak at the same time.
[0073]
Further, according to the second embodiment, the encoding parameters subjected to degenerate quantization bit control include the algebraic codebook index, which corresponds to the noise codebook. Because degenerating the algebraic codebook index has little influence on voice quality, degradation of voice quality can be suppressed even when two speakers speak at the same time.
[0074]
Further, according to the second embodiment, the 80-bit encoding parameter frame includes 1-bit coding mode information that indicates whether speech encoding parameters are transmitted to each conference terminal after speech re-encoding or are passed through to each conference terminal without speech re-encoding. Information on whether specific encoding parameters are passed through can therefore be included in the encoding parameter frame without affecting the transmission rate.
[0075]
Embodiment 3.
FIG. 10 is a block diagram illustrating the configuration of a multipoint control device according to Embodiment 3 of the present invention, and FIG. 11 is a block diagram of the speech encoding parameter control unit according to Embodiment 3. Portions of FIG. 10 that correspond to FIG. 1 and portions of FIG. 11 that correspond to FIG. 2 are denoted by the same reference numerals, and their description is omitted in principle. In FIGS. 10 and 11, reference numeral 215 denotes a first-arrival channel determination unit. Among the channels on which speech is present, it identifies the channel on which the voice detection unit 204 detected voice first, and it controls the selector 212 of the speech encoding parameter control unit 211 according to that result.
[0076]
Next, the operation will be described.
When utterances from several speakers compete, the first-arrival channel determination unit 215 determines the channel that began speaking first to be the priority speaker and supplies the result to the selector 212. The selector 212 passes through only the encoding parameters of that first channel to the speech re-encoding processing unit 210. Thereafter, the speech re-encoding processing unit 210 operates in the same way as in the first embodiment.
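A first-arrival rule of this kind can be sketched as picking the voiced channel with the earliest onset. The function name, the representation of onsets as frame indices per channel, and the tie-break by lowest channel number are assumptions; the text does not say how exact ties are resolved.

```python
from typing import Mapping, Optional


def first_arrival_channel(onset_frame: Mapping[int, int]) -> Optional[int]:
    """Return the currently voiced channel whose speech started first.

    onset_frame maps each voiced channel to the frame index at which the
    voice detector first reported speech on it.
    """
    if not onset_frame:
        return None
    # Earliest onset wins; the channel number breaks exact ties (assumption).
    return min(onset_frame, key=lambda ch: (onset_frame[ch], ch))
```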
[0077]
As described above, according to the third embodiment, priorities are set for the conference terminals. When the decoded speech signals of several conference terminals are determined to be voiced, the one conference terminal with the highest priority among them is designated, and some of the speech encoding parameters in the decoded speech signal of the designated conference terminal are transmitted to each communication terminal without speech re-encoding. This makes it possible to avoid confusion in the conference even when speakers at several conference terminals speak simultaneously.
[0078]
Further, according to the third embodiment, when the decoded speech signals of several conference terminals are determined to be voiced, priority is assigned in the order in which the speech signals arrived. The progress of the conference is therefore not disturbed by unnecessary coughs, back-channel responses, or interruptions that break the rules.
[0079]
Embodiment 4.
FIG. 12 is a block diagram showing a multipoint control device and related components according to Embodiment 4 of the present invention, and FIG. 13 is a block diagram of the speech encoding parameter control unit according to Embodiment 4. Portions of FIG. 12 that correspond to FIG. 1 and portions of FIG. 13 that correspond to FIG. 2 are denoted by the same reference numerals, and their description is omitted in principle. In FIG. 12, reference numeral 224 denotes an MCU control unit, which registers a specific channel as the priority speaker in the priority speaker determination unit 216 via the Internet or the like. In FIG. 13, reference numeral 216 denotes the priority speaker determination unit. A specific channel of the conference call is registered in it in advance as the priority speaker; when a channel determined to be voiced from the decoded speech signal and the voice detection result is that registered channel, the unit determines the speaker on that channel to be the priority speaker.
[0080]
Next, the operation will be described.
For example, when setting up the conference, the conference organizer registers in advance, through the MCU control unit 224 via the Internet or the like, the organizer's own channel or the channel of a designated moderator as the priority speaker. Then, when speakers compete and there is speech on the registered priority speaker's channel, that is, the channel of the conference organizer or of the designated moderator, the priority speaker determination unit 216 identifies the priority speaker's channel from the decoded speech signal, the voice detection result, and the registered data, and controls the selector 212 of the speech encoding parameter control unit 211 so that only the encoding parameters of that channel are passed through to the speech re-encoding processing unit 210. The subsequent operations are the same as in the first embodiment.
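The registered-priority rule can be sketched as a membership test. The function name is hypothetical, and returning `None` when the registered channel is silent is an assumption, since this passage does not describe what happens in that case.

```python
from typing import Iterable, Optional


def priority_passthrough_channel(voiced: Iterable[int],
                                 registered: int) -> Optional[int]:
    """Return the registered channel if it is among the voiced channels.

    voiced:     channels currently determined to be voiced.
    registered: channel registered in advance as the priority speaker.
    """
    return registered if registered in set(voiced) else None
```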
[0081]
As described above, according to the fourth embodiment, one specific conference terminal is designated in advance as having priority. When that terminal is among the conference terminals whose decoded speech signals are determined to be voiced, some of the speech encoding parameters in its decoded speech signal are passed through and transmitted to each conference terminal without speech re-encoding. Speech by the conference organizer or the designated moderator is thus detected, and the conference can proceed smoothly even when speakers compete.
[0082]
Embodiment 5.
FIGS. 14 and 15 are explanatory diagrams showing the frame configuration of the encoding parameters according to Embodiment 5 of the present invention. In the fifth embodiment, the configurations of the multipoint control device, of the speech re-encoding processing unit in the multipoint control device, and of the speech decoding processing unit in the conference terminal are the same as those in the second embodiment shown in FIGS. The configuration of the speech encoding parameter control unit in the multipoint control device is the same as that in Embodiment 4 shown in FIG.
[0083]
Next, the operation will be described. As in the second embodiment, the speech coding scheme used in this description is the CS-ACELP scheme of ITU-T Recommendation G.729. The number of speakers is determined, according to the detection result of the voice detection unit 204 in FIG. 1, by the speaker selection unit 214 of the speech encoding parameter control unit 211 in FIG. Whether the result is “one speaker”, “two speakers”, or “three or more speakers”, the speaker selection information based on the determination is, as in the second embodiment, supplied to the selector 212 and output to the coding rate control unit 140 in FIG. The coding rate control unit 140 generates 1-bit coding mode information indicating “mode 0” or “mode 1” according to the speaker selection information and outputs it to the speech re-encoding processing unit 210. Based on the coding mode information, the speech re-encoding processing unit 210 forms the frame of encoding parameters to be output from the multiplexing unit 108 to the conference terminals.
[0084]
That is, in “mode 1” the second adaptive codebook index (8 bits) must be transmitted, so the other encoding parameters must be degenerated by 8 bits. Therefore, as shown in FIGS. 14 and 15, in the case of “two speakers” the algebraic codebook index is reduced from 34 bits to 28 bits and the gain codebook index from 14 bits to 12 bits. The encoding parameters passed through are the LSP codebook index and the adaptive codebook index of the first speaker and the second adaptive codebook index of the second speaker.
[0085]
The algebraic codebook index is degenerated by using the method (11-bit quantization) described in section 5.8 of ITU-T Recommendation G.729 Annex D for one subframe and the method (17-bit quantization) described in section 3.8 of the main body of ITU-T Recommendation G.729 for the other. The gain codebook index is degenerated by using, for both the first and the second subframe, the method (6-bit quantization) described in section 5.9 of ITU-T Recommendation G.729 Annex D.
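The bit budget of this embodiment can be checked with a small calculation: the bits freed by degenerating the algebraic and gain codebook indexes should equal the 8 bits needed for the second adaptive codebook index. The names below are illustrative, and the split of the 28 algebraic bits into 17 + 11 follows the two quantization methods named above.

```python
STANDARD_BITS = {"algebraic": 17 + 17, "gain": 7 + 7}
DEGENERATED_BITS = {"algebraic": 17 + 11, "gain": 6 + 6}
SECOND_ADAPTIVE_BITS = 8


def freed_bits(base: dict, degenerated: dict) -> int:
    """Total number of bits released by degenerating the listed parameters."""
    return sum(base[name] - degenerated[name] for name in base)
```

Here `freed_bits(STANDARD_BITS, DEGENERATED_BITS)` gives 6 bits from the algebraic codebook index plus 2 bits from the gain codebook index, which matches `SECOND_ADAPTIVE_BITS`.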
[0086]
Incidentally, the channels used in a conference call system may consist of a main voice and a sub voice. Here the main voice is the voice of the speaker, and the sub voice is, for example, a simultaneous interpretation of the speaker's voice. The priority speaker determination unit 216 in FIG. 13 therefore sets the ranking so that the main voice is the first speaker and the sub voice is the second speaker. As shown in FIG. 14, for the LSP codebook index and the first adaptive codebook index, which are encoding parameters of the main voice, the passed-through quantized parameters are transmitted as they are; for the second adaptive codebook index of the sub voice, the passed-through quantized parameter is degenerated and then transmitted.
[0087]
As described above, according to the fifth embodiment, when the decoded speech signals received from the main-voice and sub-voice conference terminals are determined to be voiced, the two speech encoding parameters carrying the pitch period information and the spectral envelope information in the decoded speech signal of the main-voice conference terminal, and the speech encoding parameter carrying the pitch period information in the decoded speech signal of the sub-voice conference terminal, are transmitted to each communication terminal without speech re-encoding. The speech encoding parameters that undergo speech re-encoding are transmitted to each conference terminal after degenerate quantization bit control based on the transmission rate control function. Thus, even when the main voice and the sub voice speak at the same time, the encoding parameters of the sub voice are degenerated, and transmission is possible with minimal degradation of the main voice quality while the transmission rate is maintained.
[0088]
Further, according to the fifth embodiment, as in the second embodiment, degenerating the gain codebook index, which has little influence on voice quality, suppresses degradation of voice quality even when two speakers speak simultaneously. Likewise, degenerating the algebraic codebook index, which has little influence on voice quality, suppresses degradation of voice quality even when two speakers speak simultaneously. In addition, coding mode information indicating whether specific encoding parameters are passed through can be included in the encoding parameter frame without affecting the transmission rate.
[0089]
Further, according to the second to fifth embodiments, as in the first embodiment, high-quality voice transmission can be realized even for the simultaneous speech of multiple speakers expected in a one-to-many or many-to-many conference call, for speech in a high-noise environment, and for short utterances such as coughs and back-channel responses. Information expressing the repetition period of the vocal cord vibration unique to the speaker's voice is transmitted as it is, so that high-quality voice transmission can be realized. Information expressing the speaker's vocal tract, such as the oral and nasal cavities, is likewise transmitted as it is, so that high-quality voice transmission can be realized.
[0090]
Note that, in each of the above embodiments, the present invention has been described using a conference terminal (telephone) as an example of a communication terminal, but the form of the communication terminal is not limited to a conference terminal. For example, in a broadcasting system or a cable broadcasting system that simultaneously transmits a plurality of types of audio signals on different channels, a plurality of receivers that receive the audio signals may be used as the communication terminals, and one specific channel (for example, a main audio channel or a sub audio channel) may be designated.
[0091]
[Effects of the Invention]
As described above, according to the present invention, the speech encoding transmission system of the multipoint control device decodes the coded speech signal received from each communication terminal to generate a decoded speech signal, determines the communication terminals whose decoded speech signals contain sound, and designates one of the communication terminals so determined. Of the plurality of types of speech coding parameters in the decoded speech signal of the designated communication terminal, some are transmitted to each communication terminal without speech re-encoding processing, while the remaining speech coding parameters of the designated communication terminal and the plurality of types of speech coding parameters in the decoded speech signals of the other communication terminals are transmitted to each communication terminal after speech re-encoding processing. As a result, high-quality voice transmission can be provided even for the simultaneous utterances of a plurality of speakers expected in a one-to-many or many-to-many telephone conference, for utterances in a high-noise environment, and for short utterances such as throat clearing and back-channel feedback.
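The split between parameters that bypass re-encoding and those that do not can be pictured with a minimal Python sketch. It assumes that the pass-through set consists of the pitch period and spectrum envelope parameters (here keyed "pitch" and "lsp"); the class TerminalFrame, the function route_parameters, and all field names are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass

# Hypothetical keys for the parameters that bypass re-encoding.
PASS_THROUGH = {"pitch", "lsp"}


@dataclass
class TerminalFrame:
    terminal_id: int
    active: bool   # result of sound detection on the decoded signal
    params: dict   # coding parameters recovered while decoding


def route_parameters(frames, designated_id):
    """Split parameters into a pass-through set and a re-encode list."""
    passed = {}        # forwarded unchanged to every terminal
    to_reencode = []   # regenerated by the re-encoding stage
    for frame in frames:
        for name, value in frame.params.items():
            if frame.terminal_id == designated_id and name in PASS_THROUGH:
                passed[name] = value
            else:
                to_reencode.append((frame.terminal_id, name))
    return passed, to_reencode


frames = [
    TerminalFrame(1, True, {"pitch": 40, "lsp": 7, "gain": 3}),
    TerminalFrame(2, False, {"pitch": 55, "gain": 1}),
]
passed, rest = route_parameters(frames, designated_id=1)
```

Only the designated terminal's pitch and envelope values reach the output untouched; every other parameter is listed for the re-encoding stage.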
[0092]
According to the present invention, when only the decoded speech signal from one communication terminal contains sound, the speech encoding transmission system of the multipoint control device designates that communication terminal and transmits some of the speech coding parameters in the decoded speech signal of the designated communication terminal to each communication terminal without speech re-encoding processing, so that high-quality voice transmission can be realized in a one-to-many or many-to-many telephone conference.
[0093]
According to the present invention, the speech encoding transmission system of the multipoint control device sets priorities for the plurality of communication terminals and, when the decoded speech signals of a plurality of communication terminals are determined to contain sound, designates the one communication terminal having the highest priority among them and transmits some of the speech coding parameters in the decoded speech signal of the designated communication terminal to each communication terminal without speech re-encoding processing. Therefore, even when speakers at a plurality of conference terminals speak simultaneously, confusion in the conference can be prevented.
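As a rough illustration of the designation step, the following Python sketch picks the highest-priority terminal among those currently containing sound. The function name designate_by_priority and the convention that a smaller rank means a higher priority are assumptions made for this sketch.

```python
def designate_by_priority(active_ids, priority):
    """Return the active terminal with the highest priority, or None.

    priority maps terminal id -> rank; a smaller rank means a higher
    priority (an assumption of this sketch).
    """
    if not active_ids:
        return None
    return min(active_ids, key=lambda terminal: priority[terminal])


ranks = {1: 2, 2: 0, 3: 1}   # hypothetical ranks; terminal 2 ranks highest
chosen = designate_by_priority([3, 1, 2], ranks)
```

A pre-assigned rank of 0 for one terminal would model a terminal that always wins when it is active.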
[0094]
According to the present invention, when the decoded speech signals of a plurality of communication terminals are determined to contain sound, the speech encoding transmission system of the multipoint control device sets the priorities in order of arrival, with the terminal whose speech signal was received first ranked highest. Therefore, the progress of the conference is not disturbed by needless coughing, back-channel responses, or interruptions that deviate from the rules.
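First-arrival priority can be sketched as a small state machine that remembers the order in which terminals started to contain sound. The class FirstArrivalSelector is hypothetical, and breaking ties between terminals that start in the same frame by terminal id is an assumption of this sketch.

```python
class FirstArrivalSelector:
    """Keeps the terminal whose sound started first as the designated one."""

    def __init__(self):
        self.order = []  # active terminal ids in order of sound onset

    def update(self, active_ids):
        active = set(active_ids)
        # Drop terminals that fell silent while keeping arrival order.
        self.order = [t for t in self.order if t in active]
        # Append newly active terminals (ties broken by id, an assumption).
        for t in sorted(active - set(self.order)):
            self.order.append(t)
        return self.order[0] if self.order else None


selector = FirstArrivalSelector()
history = [selector.update(ids) for ids in ([2], [1, 2], [1], [])]
```

In this run terminal 2 stays designated while terminal 1 interjects, and terminal 1 takes over only after terminal 2 falls silent.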
[0095]
According to the present invention, the speech encoding transmission system of the multipoint control device is configured so that one specific communication terminal is designated in advance as having precedence when the priorities are set. Therefore, the chairperson's utterances are detected and the conference can proceed smoothly even when speakers compete.
[0096]
According to the present invention, the speech coding parameters transmitted without re-encoding in the speech encoding transmission system of the multipoint control device are parameters that carry pitch period information, so that information expressing the repetition period of the vocal cord vibration unique to the speaker's voice is transmitted as it is and high-quality voice transmission can be realized.
[0097]
According to the present invention, the speech coding parameters transmitted without re-encoding in the speech encoding transmission system of the multipoint control device are parameters that carry spectrum envelope information, so that information expressing the speaker's vocal tract, such as the oral cavity and nasal cavity, is transmitted as it is and high-quality voice transmission can be realized.
[0098]
According to the present invention, when the decoded speech signals received from a plurality of communication terminals are determined to contain sound, the speech encoding transmission system of the multipoint control device adaptively sets the bit allocation of the speech coding parameter frame according to the number of communication terminals containing sound. Therefore, the present invention can be applied to a transmission network in which the transmission rate is constant or the transmission rate per channel is limited.
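The idea of keeping the frame size fixed while changing how its bits are divided can be sketched in Python as follows. The one-talker field widths mirror those of an 80-bit ITU-T G.729 frame; the two-talker split is invented purely for illustration and is not taken from FIG. 9 or FIG. 15. The names FRAME_BITS, LAYOUTS, and bit_allocation are hypothetical.

```python
FRAME_BITS = 80  # fixed frame size (80 bits per 10 ms corresponds to 8 kbit/s)

# Hypothetical layouts keyed by the number of terminals containing sound.
LAYOUTS = {
    1: {"lsp": 18, "pitch_1": 14, "fixed_cb": 34, "gain": 14},
    2: {"lsp": 18, "pitch_1": 14, "pitch_2": 14, "fixed_cb": 24, "gain": 10},
}


def bit_allocation(num_active):
    """Pick the layout for this frame; counts are clamped to the range 1..2."""
    layout = LAYOUTS[min(max(num_active, 1), 2)]
    # Every layout must fill the frame exactly so the rate never changes.
    if sum(layout.values()) != FRAME_BITS:
        raise ValueError("layout does not match the fixed frame size")
    return layout


one_talker = bit_allocation(1)
two_talkers = bit_allocation(2)
```

The second talker's pitch field is paid for by shrinking the codebook and gain fields, so the total stays at 80 bits in both cases.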
[0099]
According to the present invention, when the decoded speech signals received from the communication terminal with the first priority and the communication terminal with the second priority among the plurality of communication terminals are determined to contain sound, the speech encoding transmission system of the multipoint control device transmits, without speech re-encoding processing, the two speech coding parameters that carry the pitch period information and the spectrum envelope information in the decoded speech signal of the first-priority communication terminal and the speech coding parameter that carries the pitch period information in the decoded speech signal of the second-priority communication terminal. The speech coding parameters obtained by speech re-encoding processing of the decoded speech signals of the first-priority and second-priority communication terminals are transmitted to each communication terminal after degenerate quantization bit control based on the transmission rate control function. Therefore, voice transmission of relatively good quality is possible even when two speakers speak simultaneously.
[0100]
According to the present invention, when the decoded speech signals received from the main-audio communication terminal and the sub-audio communication terminal among the plurality of communication terminals are determined to contain sound, the speech encoding transmission system of the multipoint control device transmits, without speech re-encoding processing, the two speech coding parameters that carry the pitch period information and the spectrum envelope information in the decoded speech signal of the main-audio communication terminal and the speech coding parameter that carries the pitch period information in the decoded speech signal of the sub-audio communication terminal. The speech coding parameters obtained by speech re-encoding processing of the decoded speech signals of the main-audio and sub-audio communication terminals are transmitted to each communication terminal after degenerate quantization bit control based on the transmission rate control function. Therefore, even when the main audio and the sub audio speak simultaneously, degenerating the coding parameters of the sub audio makes it possible to transmit while minimizing the degradation of the main audio's voice quality and maintaining the transmission rate.
[0101]
According to the present invention, the coding parameter subjected to degenerate quantization bit control in the speech encoding transmission system of the multipoint control device is the gain codebook. Because the coding parameter of the gain codebook index, which has little effect on voice quality degradation, is the one degenerated, degradation of the voice quality can be suppressed even when two speakers speak simultaneously.
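This paragraph does not restate how the degeneration itself is carried out. One way to picture it, assuming the reduced index simply addresses an evenly spaced subset of the original codebook and that the entries are scalars (both assumptions made only for illustration), is the following Python sketch. The names degenerate_codebook and quantize are hypothetical.

```python
def degenerate_codebook(codebook, bits):
    """Keep an evenly spaced subset of entries addressable with `bits` bits."""
    size = 1 << bits
    if size >= len(codebook):
        return list(codebook)
    step = len(codebook) / size
    return [codebook[int(i * step)] for i in range(size)]


def quantize(value, codebook):
    """Return the index of the nearest entry (scalar entries for simplicity)."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


full = [float(g) for g in range(8)]   # toy 3-bit gain table
small = degenerate_codebook(full, 2)  # degenerated to a 2-bit index
index = quantize(4.4, small)
```

The index shrinks from 3 bits to 2 bits at the cost of a coarser gain value, which is the trade-off the paragraph describes.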
[0102]
According to the present invention, the coding parameter subjected to degenerate quantization bit control in the speech encoding transmission system of the multipoint control device is the noise codebook. Because the coding parameter of the algebraic codebook index, which has little effect on voice quality degradation, is the one degenerated, degradation of the voice quality can be suppressed even when two speakers speak simultaneously.
[0103]
According to the present invention, the speech encoding transmission system of the multipoint control device includes, in the coding parameter frame consisting of a predetermined number of bits, coding mode information that determines whether a speech coding parameter is transmitted to each communication terminal after speech re-encoding processing or without speech re-encoding processing. Therefore, information indicating whether a specific coding parameter is passed through can be included in the coding parameter frame without affecting the transmission rate.
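Carrying the mode information inside a frame of fixed size can be sketched with simple bit packing in Python. The two-bit mode field, its position at the top of the frame, and the names MODE_BITS, pack_frame, and unpack_mode are assumptions of this sketch; the actual frame format is the one shown in the frame configuration figures.

```python
MODE_BITS = 2  # hypothetical width of the coding mode field


def pack_frame(mode, fields, frame_bits):
    """Pack the mode flag and (value, width) fields MSB-first into one integer.

    Raises ValueError unless the mode field plus all parameter fields
    fill exactly frame_bits, so the frame size never changes.
    """
    if not 0 <= mode < (1 << MODE_BITS):
        raise ValueError("mode does not fit in the mode field")
    word, used = mode, MODE_BITS
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        word = (word << width) | value
        used += width
    if used != frame_bits:
        raise ValueError("fields do not fill the frame exactly")
    return word


def unpack_mode(word, frame_bits):
    """Recover the mode flag from the top MODE_BITS of the frame."""
    return word >> (frame_bits - MODE_BITS)


word = pack_frame(2, [(5, 4), (1, 3)], frame_bits=9)
mode = unpack_mode(word, frame_bits=9)
```

A receiver would read the mode field first and then interpret the remaining bits according to the layout that mode selects.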
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a multipoint control device applied to a speech coded transmission system according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a configuration of a speech coding parameter control unit according to the first embodiment.
FIG. 3 is a block diagram illustrating a configuration of an audio decoding processing unit according to the first embodiment.
FIG. 4 is a block diagram showing a configuration of a speech re-encoding processing unit according to the first embodiment.
FIG. 5 is a block diagram showing a configuration of a speech re-encoding processing unit according to the second embodiment.
FIG. 6 is a block diagram showing a configuration on a receiving side of the conference terminal according to the second embodiment.
FIG. 7 is a block diagram showing a configuration of an audio decoding processing unit of the conference terminal according to the second embodiment.
FIG. 8 is a diagram showing a frame configuration of speech coding parameters according to Embodiment 2.
FIG. 9 is an explanatory diagram showing an example of bit assignment of one frame of an encoding parameter according to Embodiment 2.
FIG. 10 is a block diagram showing a configuration of a multipoint control device according to the third embodiment.
FIG. 11 is a block diagram showing a configuration of a speech coding parameter control unit according to the third embodiment.
FIG. 12 is a block diagram showing a configuration of a multipoint control device applied to the speech coded transmission system according to Embodiment 4.
FIG. 13 is a block diagram showing a configuration of a speech coding parameter control unit according to the fourth embodiment.
FIG. 14 is a diagram showing a frame configuration of speech coding parameters according to Embodiment 5.
FIG. 15 is an explanatory diagram showing an example of bit assignment of one frame of an encoding parameter according to the fifth embodiment.
FIG. 16 is a block diagram illustrating a configuration of a conventional speech encoding device based on the CELP scheme.
FIG. 17 is a block diagram illustrating a configuration of a conventional multipoint control device.
FIG. 18 is a block diagram showing another configuration of a conventional multipoint control device.
FIG. 19 is a block diagram illustrating a configuration of a conventional CELP-based speech encoding device.
[Explanation of symbols]
104 linear prediction analysis processing unit, 105 LSP quantization processing unit, 106 LSP quantization codebook, 108 multiplexing unit, 109 inverse quantization processing unit, 110 adaptive codebook, 110b second adaptive codebook, 111 algebraic codebook (noise codebook), 112, 116 adders, 113a, 113b gain control amplifiers, 113c gain control amplifier, 114 gain quantization codebook, 115 synthesis filter, 117 auditory weighting filter, 118 distortion minimizing unit, 119, 120 selector switches, 121 pitch pre-filter, 124 LPC/LSP converter, 125 gain MA predictor, 126 demultiplexer, 127, 128 control amplifiers, 129 adder, 130 switch, 131a adaptive codebook, 131b gain decoding section, 132 decoding gain MA prediction section, 133 algebraic code decoding section, 134 pitch pre-filter, 135 LSP decoding section, 136 LSP interpolation section, 137 LSP/LPC conversion section, 138 synthesis filter, 139 post filter, 140 coding rate control section, 201 switch, 202 line interface section, 203 voice decoding processing section, 204 voice detection section, 205 noise suppression processing section, 206 voice addition section, 207 distribution processing section, 208 own terminal voice subtraction section, 209 automatic gain control section, 210 voice recoding processing section, 211 voice coding parameter control section, 212 selector, 213 distribution processing unit, 214 speaker selection unit, 215 first-arrival channel determination unit, 216 priority speaker determination unit, 224 MCU control unit.

Claims

A speech encoding transmission system of a multipoint control device in which a plurality of communication terminals are connected via a communication network, coded speech signals obtained by coding the speech handled by each communication terminal are received and transmitted, predetermined processing is performed according to the information content of these coded speech signals, and the processed coded speech signals are distributed to the plurality of communication terminals, wherein the system:
decodes the coded speech signal received from each communication terminal to generate a decoded speech signal;
determines the communication terminals whose decoded speech signals contain sound;
designates one of the communication terminals determined to contain sound;
transmits, of the plurality of types of speech coding parameters in the decoded speech signal of the designated communication terminal, some of the speech coding parameters to each communication terminal without performing speech re-encoding processing; and
transmits the other speech coding parameters among the plurality of types of speech coding parameters in the decoded speech signal of the designated communication terminal, and the plurality of types of speech coding parameters in the decoded speech signals of the other communication terminals, to each communication terminal after performing speech re-encoding processing.

The speech encoding transmission system of a multipoint control device according to claim 1, wherein, when only the decoded speech signal from one communication terminal contains sound, that communication terminal is designated and some of the speech coding parameters in the decoded speech signal of the designated communication terminal are transmitted to each communication terminal without performing speech re-encoding processing.

The speech encoding transmission system of a multipoint control device according to claim 1, wherein priorities are set for the plurality of communication terminals and, when the decoded speech signals of a plurality of communication terminals are determined to contain sound, the one communication terminal having the highest priority among them is designated and some of the speech coding parameters in the decoded speech signal of the designated communication terminal are transmitted to each communication terminal without performing speech re-encoding processing.

The speech encoding transmission system of a multipoint control device according to claim 3, wherein, when the decoded speech signals of a plurality of communication terminals are determined to contain sound, the priorities are set in order of arrival, with the terminal whose speech signal was received first ranked highest.

The speech encoding transmission system of a multipoint control device according to claim 3, wherein the priorities are set with one specific communication terminal designated in advance as having precedence.

The speech encoding transmission system of a multipoint control device according to any one of claims 1 to 5, wherein the speech coding parameters transmitted without speech re-encoding processing are parameters carrying pitch period information.

The speech encoding transmission system of a multipoint control device according to claim 1, wherein the speech coding parameters transmitted without speech re-encoding processing are parameters carrying spectrum envelope information.

The speech encoding transmission system of a multipoint control device according to any one of claims 1 to 7, wherein, when the decoded speech signals received from a plurality of communication terminals are determined to contain sound, the bit allocation of the speech coding parameter frame is adaptively set according to the number of communication terminals containing sound.

The speech encoding transmission system of a multipoint control device according to claim 8, wherein, when the decoded speech signals received from the communication terminal with the first priority and the communication terminal with the second priority are determined to contain sound, the two speech coding parameters respectively carrying the pitch period information and the spectrum envelope information in the decoded speech signal of the first-priority communication terminal and the speech coding parameter carrying the pitch period information in the decoded speech signal of the second-priority communication terminal are transmitted to each communication terminal without performing speech re-encoding processing, and the speech coding parameters obtained by performing speech re-encoding processing on the decoded speech signals of the first-priority and second-priority communication terminals are transmitted to each communication terminal after performing degenerate quantization bit control based on a transmission rate control function.

The speech encoding transmission system of a multipoint control device according to claim 8, wherein, when the decoded speech signals received from the main-audio communication terminal and the sub-audio communication terminal among the plurality of communication terminals are determined to contain sound, the two speech coding parameters respectively carrying the pitch period information and the spectrum envelope information in the decoded speech signal of the main-audio communication terminal and the speech coding parameter carrying the pitch period information in the decoded speech signal of the sub-audio communication terminal are transmitted to each communication terminal without performing speech re-encoding processing, and the speech coding parameters obtained by performing speech re-encoding processing on the decoded speech signals of the main-audio and sub-audio communication terminals are transmitted to each communication terminal after performing degenerate quantization bit control based on a transmission rate control function.

The speech encoding transmission system of a multipoint control device according to claim 9, wherein the coding parameter subjected to the degenerate quantization bit control is a gain codebook.

The speech encoding transmission system of a multipoint control device according to claim 9, wherein the coding parameter subjected to the degenerate quantization bit control is a noise codebook.

The speech encoding transmission system of a multipoint control device according to any one of claims 1 to 12, wherein coding mode information, which determines whether a speech coding parameter is transmitted to each communication terminal after performing speech re-encoding processing or without performing speech re-encoding processing, is included in a coding parameter frame consisting of a predetermined number of bits.