JP5931707B2

JP5931707B2 - Video conferencing system

Info

Publication number: JP5931707B2
Application number: JP2012264245A
Authority: JP
Inventors: 達也加古; 小林　和則; 和則小林; 仲大室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-12-03
Filing date: 2012-12-03
Publication date: 2016-06-08
Anticipated expiration: 2032-12-03
Also published as: JP2014110546A

Description

The present invention relates to a video conferencing technique that enhances a specific sound and suppresses noise in acoustic signals picked up by a plurality of freely arranged clients.

Teleconferencing and video conferencing devices use speech enhancement devices to emphasize the voice of a specific speaker, or only the sound from a specific direction, and to suppress noise. A speech enhancement device uses a microphone array whose microphones pick up sound with synchronized sampling frequencies, and obtains the differences in arrival time of the sound reaching the microphones. For example, Non-Patent Document 1 describes a beamforming technique that estimates the direction of a sound source from the arrival time differences and enhances sound from a specific direction by aligning the phases and summing. Non-Patent Document 2 describes a technique that distinguishes speakers by sound source direction and enhances the voice of a specific speaker by designing a filter that maximizes the power ratio between that speaker's voice and the other noise.

[Non-Patent Document 1] Yusuke Hioka, Kazunori Kobayashi, Ken’ichi Furuya, and Akitoshi Kataoka, “Enhancement of Sounds in a Specific Directional Area using Power Spectra Estimated from Multiple Beamforming Outputs”, HSCMA, pp.49-52, 2008.
[Non-Patent Document 2] 小笠原基, 石塚健太郎, 荒木章子, 藤本雅清, 中谷智広, 大塚和弘, “SN比最大化ビームフォーマを用いたオンライン会議音声強調” (Online conference speech enhancement using an SN-ratio-maximizing beamformer), Proceedings of the 2009 Spring Meeting of the Acoustical Society of Japan, pp.695-698, 2009.

Speech enhancement devices used in conventional video conferencing classify acoustic signal sections on the basis of sound source direction using a microphone array. This approach requires that the sampling frequencies of the recording microphones be synchronized and that the relative positions of the microphones be known. A dedicated video conferencing terminal equipped with a microphone array was therefore required.

In view of this, an object of the present invention is to provide a video conferencing technique that can enhance a specific sound and suppress noise using acoustic signals acquired by freely arranged general-purpose terminal devices, without using a dedicated video conferencing terminal.

To solve the above problem, the video conference system of the present invention includes a server and a plurality of clients, each client including sound collection means and imaging means.

Each client includes an audio acquisition unit that transmits the acoustic signal picked up by the sound collection means to the server, a video acquisition unit that transmits the video signal captured by the imaging means to the server, an audio reproduction unit that reproduces the distribution acoustic signal received from the server, and a video display unit that displays the distribution video signal received from the server.

The server includes an audio processing unit that takes as input the multi-channel acoustic signals received from the plurality of clients and generates a transmission acoustic signal by predetermined audio processing; a video selection unit that determines a transmission video signal arbitrarily selected from the plurality of video signals received from the plurality of clients; an audio distribution unit that transmits, to the plurality of clients, a distribution acoustic signal generated by predetermined audio processing with multi-channel acoustic signals as input; and a video distribution unit that transmits, to the plurality of clients, a distribution video signal arbitrarily selected from a plurality of video signals.

The audio processing unit includes: a feature sequence acquisition unit that takes the multi-channel acoustic signals as input and obtains, for each channel, a feature value in which the magnitude of the acoustic signal in speech sections is normalized by the magnitude of the acoustic signal in non-speech sections; a classification unit that clusters the feature sequences, each consisting of the feature values obtained for the plurality of channels, and determines the signal section class to which each feature sequence belongs; a spectrum calculation unit that transforms the acoustic signal into the frequency domain in each of a plurality of time sections to obtain a plurality of amplitude spectra and phase spectra; an enhancement processing unit that processes the plurality of amplitude spectra so as to enhance the amplitude spectra corresponding to feature sequences belonging to an enhanced signal section class, which is one of the signal section classes, and obtains a plurality of processed amplitude spectra; and a phase assignment unit that assigns the phase spectra to the processed amplitude spectra to obtain complex spectra.
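The normalization performed by the feature sequence acquisition unit can be illustrated with a minimal sketch. It assumes that the "magnitude" of a section is its mean power and that the speech and non-speech sections have already been identified; the function names `mean_power` and `sn_feature_vector` and the gain values are hypothetical and do not come from the patent. The sketch shows only why such a ratio does not depend on microphone sensitivity, not the full processing.

```python
import math
import random

def mean_power(samples):
    """Mean squared amplitude of a block of samples (an assumed 'magnitude')."""
    return sum(s * s for s in samples) / len(samples)

def sn_feature_vector(speech, nonspeech):
    """One feature per channel: speech-section power / non-speech-section power."""
    return [mean_power(s) / mean_power(n) for s, n in zip(speech, nonspeech)]

random.seed(0)
K = 3  # number of channels (one per client)
speech = [[random.gauss(0.0, 1.0) for _ in range(1600)] for _ in range(K)]
noise = [[random.gauss(0.0, 0.1) for _ in range(1600)] for _ in range(K)]
gains = [0.5, 1.0, 4.0]  # hypothetical per-microphone sensitivities

base = sn_feature_vector(speech, noise)
scaled = sn_feature_vector(
    [[g * s for s in ch] for g, ch in zip(gains, speech)],
    [[g * n for n in ch] for g, ch in zip(gains, noise)],
)
# A gain g multiplies both powers by g**2, so each ratio is unchanged.
print(all(math.isclose(a, b) for a, b in zip(base, scaled)))
```

Because the per-channel gain cancels in the ratio, the resulting vector reflects how strongly each client observed the speech relative to its own noise floor, which is what makes it usable for clustering across uncalibrated devices.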

According to the video conferencing technique of the present invention, the voice of a specific speaker can be enhanced and extracted, and noise suppressed, regardless of the arrangement of the clients and differences in microphone sensitivity. In addition, by acquiring video from the cameras mounted on the clients as well as audio, the video of the user who is mainly speaking can be distributed to each client to hold a video conference.

FIG. 1 is a diagram illustrating the functional configuration of the video conference system.
FIG. 2 is a diagram illustrating the functional configuration of the server and the clients.
FIG. 3 is a diagram illustrating the functional configuration of the audio processing unit.
FIG. 4 is a diagram illustrating the configuration of a client.
FIG. 5 is a diagram illustrating the processing flow of the server and the clients.
FIG. 6 is a diagram illustrating the configuration of the video selection unit.
FIG. 7 is a diagram illustrating the processing flow of the audio processing unit.

The present invention is a video conference system in which multi-channel acoustic signals acquired from a plurality of clients, whose sound collection means have various sampling frequencies and microphone sensitivities, are processed so that participants can listen to audio in which the voice of a specific speaker is enhanced and noise is suppressed. Furthermore, by sending video signals from the imaging means of the clients to the server, the present invention includes an interface that uses the plurality of captured videos to show which client's sound collection means is currently picking up the main sound, and a technique for selecting the video to be distributed for the video conference.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and redundant description is omitted.

[Embodiment]
This embodiment is a hands-free video conference system that uses acoustic signals acquired from a plurality of freely arranged smartphones with differing sampling frequencies and microphone sensitivities.

<Configuration of the video conference system>
A configuration example of the video conference system 10 of this embodiment is described with reference to FIG. 1. The video conference system 10 includes a server 1 and K (> 1) clients 3_1, ..., 3_K. The server 1 and the clients 3_1, ..., 3_K are connected to a network 5. The network 5 only needs to be configured so that the connected devices can communicate with one another, and can be, for example, the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network). The physical medium of the network 5 may be wired or wireless; a wireless LAN, a mobile phone line such as 3G or LTE, Bluetooth (registered trademark), or the like may be used.

A video conference is realized by at least two video conference systems 10 working together. As shown in FIG. 1, when a video conference is held between a video conference system 10_A and a video conference system 10_B, the space in which the clients 3_1, ..., 3_K of the video conference system 10_A are placed and the space in which the clients 3_1, ..., 3_K of the video conference system 10_B are placed are different spaces. The server 1 of the video conference system 10_A and the server 1 of the video conference system 10_B are both connected to a network 6. The network 6 only needs to be configured so that the connected devices can communicate with one another, and can be, for example, the Internet, a WAN (Wide Area Network), the public switched telephone network, or a dedicated line. The number of clients 3 in the video conference system 10_A and the number of clients 3 in the video conference system 10_B may be the same or different. This embodiment is described using the case of two video conference systems 10 as an example, but three or more video conference systems 10 may be used so that three or more spaces participate in one video conference.

A configuration example of the server 1 and the clients 3_1, ..., 3_K of this embodiment is described with reference to FIG. 2. The server 1 includes K audio reception units 11_1, ..., 11_K, K video reception units 12_1, ..., 12_K, an audio processing unit 13, an audio transmission unit 14, a video selection unit 15, a video transmission unit 16, a terminal management unit 17, an audio acquisition unit 21, an audio distribution unit 22, a video acquisition unit 23, a video distribution unit 24, and a command transmission unit 25. The server 1 of this embodiment is a special device configured by loading a predetermined program into a known computer having, for example, a CPU (central processing unit) and RAM (random access memory). Data input to the server 1 and processed data are stored in a memory (not shown) and read out by the processing units as needed.

Each of the clients 3_1, ..., 3_K includes an audio acquisition unit 31, a video acquisition unit 32, an audio reproduction unit 33, a video display unit 34, a command reception unit 35, a setting unit 36, a control unit 37, sound collection means 301 such as a microphone, imaging means 302 such as a camera, reproduction means 303 such as a speaker, and display means 304 such as a liquid crystal display. The positions of the clients 3_1, ..., 3_K and their positions relative to one another may be unknown or known. The clients 3_1, ..., 3_K operate independently of one another. The microphone sensitivities of the sound collection means 301 may differ from client to client or may be the same, and likewise the sampling frequencies of the sound collection means 301 may differ or may be the same. The clients 3_1, ..., 3_K may be any devices that have sound collection means, imaging means, reproduction means, and display means. Specific examples are general-purpose terminal devices such as smartphones, mobile phone terminals, and laptop computers.

A configuration example of the audio processing unit 13 is described in more detail with reference to FIG. 3. The audio processing unit 13 includes an input unit 101, a sampling frequency conversion unit 102, a signal synchronization unit 103, a frame division unit 104, a VAD determination unit 105, a non-speech power storage unit 106, an S/N vector generation unit 107 (feature sequence acquisition unit), a vector classification unit 108 (classification unit), a spectrum calculation unit 109, an amplitude spectrum storage unit 110, a phase spectrum storage unit 111, a filter coefficient calculation unit 112 (enhancement processing unit), a filter coefficient storage unit 113, a filtering unit 114 (enhancement processing unit), a phase assignment unit 115, and a time domain conversion unit 116.

A configuration example of the clients 3_1, ..., 3_K is described more concretely with reference to FIG. 4. In this example the clients 3_1, ..., 3_K are smartphones. The client 3_k (1 ≤ k ≤ K) has a built-in microphone 301 near the lower end of the body as the sound collection means, a camera 302 on the front of the body near the upper end as the imaging means, a speaker 303 on the front of the body as the reproduction means, and a rectangular liquid crystal display 304 covering a wide area of the front of the body as the display means. The liquid crystal display is preferably a touch panel with a position input function, but the client may instead have another pointing device such as a numeric keypad or a directional pad. In this embodiment, the liquid crystal display 304 is divided into a plurality of areas, each of which can display different information. The description uses an example in which it is divided into display area 1 (3041), display area 2 (3042), an operation area (3043), and a frame border (3044).

<Processing of the video conference system>
An operation example of the video conference system 10 of this embodiment is described with reference to FIG. 5.

The setting unit 36 of the client 3_k (1 ≤ k ≤ K) holds the setting information that the client 3_k needs in order to operate, and provides it to the other units as needed. The setting information includes, for example, the IP address of the server 1 and the port numbers of the audio reception units 11_1, ..., 11_K and video reception units 12_1, ..., 12_K of the server 1. If the client 3_k is a smartphone, it may also include the choice of whether to use the front camera or the rear camera as the imaging means 302. The setting information can be entered on a setting screen presented to the user through the display means 304. The setting screen can be opened, for example, by displaying a "menu" button in the operation area (3043) and launching the screen when the user selects it from the menu.

The control unit 37 of the client 3_k transmits control signals instructing the units of the client 3_k to start or stop processing. The user of the client 3_k selects the control signal to be sent by operating a button displayed in the operation area (3043) of the client 3_k. Specifically, immediately after startup the client 3_k displays a "start" button in the operation area (3043); when the user presses it, a control signal instructing a start is transmitted to the audio acquisition unit 31, the video acquisition unit 32, the audio reproduction unit 33, and the video display unit 34. At the same time, the "start" button is redrawn as a "stop" button. On receiving the start control signal, the audio acquisition unit 31, the video acquisition unit 32, the audio reproduction unit 33, and the video display unit 34 each begin processing. When the user later presses the "stop" button, a control signal instructing a stop is transmitted to the same four units, which then stop their processing. The following description assumes that the audio acquisition unit 31, the video acquisition unit 32, the audio reproduction unit 33, and the video display unit 34 have already received the start control signal from the control unit 37 and have started processing.
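The start/stop behaviour of the control unit 37 amounts to a two-state toggle. The following is a minimal sketch under assumed names: the classes `Unit` and `ControlUnit`, the method names, and the string-valued control signals are hypothetical stand-ins and are not specified in the patent.

```python
class Unit:
    """Stand-in for an acquisition, reproduction, or display unit."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def handle(self, signal):
        # "start" begins processing; any other signal ("stop") halts it.
        self.active = (signal == "start")


class ControlUnit:
    """Sends start/stop control signals and tracks the button label."""

    def __init__(self, units):
        self.units = units
        self.running = False

    @property
    def button_label(self):
        # The button offers the opposite of the current state.
        return "stop" if self.running else "start"

    def press_button(self):
        signal = self.button_label
        for unit in self.units:
            unit.handle(signal)
        self.running = (signal == "start")


units = [Unit(n) for n in ("audio_acquisition", "video_acquisition",
                           "audio_reproduction", "video_display")]
control = ControlUnit(units)
control.press_button()  # user presses "start"
```

After the first press all four units are active and the button reads "stop"; a second press returns everything to the initial state.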

The terminal management unit 17 of the server 1 manages the number of clients 3_1, ..., 3_K that are currently communicating. When the client 3_k (1 ≤ k ≤ K) starts communicating with the audio reception unit 11_k or the video reception unit 12_k, the terminal management unit 17 is notified of the start of communication and adds the IP address of the connecting client 3_k to the set of managed terminals. If the terminal cannot be added, for example because the number of connectable clients has already been exceeded, the terminal management unit 17 notifies the audio reception units 11_1, ..., 11_K and the video reception units 12_1, ..., 12_K that the terminal cannot be managed. The terminal management unit 17 therefore always holds the current IP addresses of the clients 3_1, ..., 3_K with which the server 1 is communicating.
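The bookkeeping done by the terminal management unit 17 can be sketched as follows. The class name `TerminalManager`, its methods, and the `max_clients` parameter are hypothetical; the patent does not say how the limit is configured or how the "unmanageable" notification is delivered, so a boolean return value stands in for it here. The example addresses come from a range reserved for documentation.

```python
class TerminalManager:
    """Tracks the IP addresses of clients currently communicating with the server."""

    def __init__(self, max_clients):
        self.max_clients = max_clients
        self._addresses = []  # kept in order of first connection

    def register(self, ip):
        """Return True if the client is managed, False if it cannot be added."""
        if ip in self._addresses:
            # The audio and video receivers both report the same client.
            return True
        if len(self._addresses) >= self.max_clients:
            return False
        self._addresses.append(ip)
        return True

    def count(self):
        return len(self._addresses)

    def addresses(self):
        return list(self._addresses)


manager = TerminalManager(max_clients=2)
first = manager.register("192.0.2.1")      # audio receiver reports client 1
again = manager.register("192.0.2.1")      # video receiver reports client 1
second = manager.register("192.0.2.2")
rejected = manager.register("192.0.2.3")   # over the limit
```

A receiver that gets `False` back would discard the packets from that sender, as described for the audio and video reception units.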

The audio acquisition unit 31 of the client 3_k A/D-converts the sound observed with the sound collection means 301 at that client's own sampling frequency and acquires a digital acoustic signal x_k(i_k) at a plurality of sample points (step S31). Here, i_k is an integer index representing a sample point in the time domain; that is, x_k(i_k) is the digital acoustic signal at the sample point with index i_k. The digital acoustic signal x_k(i_k) may be in any format, but in this embodiment it is monaural PCM (pulse code modulation) with a sampling frequency of 16 kHz and 16-bit quantization.

The processing sequence that handles the digital acoustic signal x_k(i_k) obtained by the audio acquisition unit 31 of the client 3_k is called channel k. That is, channel k handles the digital acoustic signal x_k(i_k) and the values derived from it. In this embodiment there are K channels, k = 1, ..., K.

The acquired digital acoustic signal x_k(i_k) is transmitted to the server 1_A. Any known communication protocol can be used for transmission; this embodiment uses RTP (Real-time Transport Protocol). The destination port number is the port number set in advance by the setting unit 36. Specifically, ports can be assigned in a regular sequence: port 6100 of the server 1 for the client 3_1, port 6102 for the client 3_2, and port (6100 + 2(k-1)) for the client 3_k. The assignment is not limited to this scheme, and any port numbers may be used as long as they do not collide on the server 1. The amount of data carried in each RTP packet can be set freely; for example, it may be 20 milliseconds of audio.
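The port rule and the packet size implied by these figures can be checked with a short calculation. This is a sketch only: the function and constant names are hypothetical, and the payload size assumes the 16 kHz, 16-bit monaural PCM format of this embodiment with no RTP header counted.

```python
AUDIO_BASE_PORT = 6100   # port for client 3_1 in this embodiment
SAMPLE_RATE_HZ = 16_000  # 16 kHz sampling
BYTES_PER_SAMPLE = 2     # 16-bit quantization, monaural
PACKET_MS = 20           # audio duration carried per packet


def audio_port(k):
    """Destination port on the server for client 3_k (k starts at 1)."""
    if k < 1:
        raise ValueError("client index k starts at 1")
    return AUDIO_BASE_PORT + 2 * (k - 1)


def payload_bytes(sample_rate_hz, bytes_per_sample, packet_ms):
    """PCM bytes in one packet, excluding any protocol headers."""
    samples = sample_rate_hz * packet_ms // 1000
    return samples * bytes_per_sample


ports = [audio_port(k) for k in (1, 2, 3)]
size = payload_bytes(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, PACKET_MS)
print(ports, size)
```

With these values each packet carries 320 samples, or 640 bytes of PCM data. The stride of two between ports matches the common RTP convention of using an even port for RTP and the next odd port for RTCP, although the patent does not state that this is the reason.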

The audio reception unit 11_k of the server 1 receives the digital acoustic signal x_k(i_k) transmitted from the client 3_k (step S11). At the start of communication, the audio reception unit 11_k notifies the terminal management unit 17 of the sender's IP address. If the terminal management unit 17 responds that the terminal cannot be managed, the audio reception unit 11_k discards the received packets and ends processing. The K-channel digital acoustic signals x_1(i_1), ..., x_K(i_K) received by the audio reception units 11_1, ..., 11_K are input to the audio processing unit 13. The audio reception unit 11_k may buffer the digital acoustic signal x_k(i_k) as needed before passing it to the audio processing unit 13.

The video acquisition unit 32 of the client 3_k uses the imaging means 302 to acquire a digital video signal v_k(i_k) in a specified format (step S32). Here, i_k is an integer index representing a sample point in the time domain; that is, v_k(i_k) is the digital video signal at the sample point with index i_k. The digital video signal v_k(i_k) may be in any format, but in this embodiment still images of a predetermined size are captured periodically. More specifically, a JPEG-compressed still image of QVGA size (320 × 240 pixels) is captured every 100 milliseconds.

The acquired digital video signal v_k(i_k) is transmitted to the server 1. Any known communication protocol can be used for transmission; this embodiment uses UDP (User Datagram Protocol). The destination port number is the port number set in advance by the setting unit 36. Specifically, ports can be assigned in a regular sequence: port 6200 of the server 1 for the client 3_1, port 6202 for the client 3_2, and port (6200 + 2(k-1)) for the client 3_k. The assignment is not limited to this scheme, and any port numbers may be used as long as they do not collide on the server 1. A sequence number is attached to each UDP packet that is sent. The size of a captured image should preferably stay below 64 KB, the upper limit on packet size in UDP, so that a single image is not split across multiple UDP packets.
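One way to attach a sequence number and enforce the single-datagram constraint is sketched below. The 4-byte big-endian header layout and the function names are assumptions made for illustration; the patent does not specify a packet format. The limit used is the maximum UDP payload over IPv4, 65,507 bytes (65,535 minus a 20-byte IP header and an 8-byte UDP header), which is slightly below the 64 KB figure quoted above.

```python
import struct

VIDEO_BASE_PORT = 6200        # port for client 3_1 in this embodiment
MAX_UDP_PAYLOAD = 65_507      # IPv4: 65,535 - 20 (IP header) - 8 (UDP header)
HEADER = struct.Struct("!I")  # hypothetical header: 32-bit sequence number


def video_port(k):
    """Destination port on the server for client 3_k (k starts at 1)."""
    return VIDEO_BASE_PORT + 2 * (k - 1)


def build_packet(seq, jpeg_bytes):
    """Prefix one JPEG image with its sequence number; reject oversized images."""
    packet = HEADER.pack(seq & 0xFFFFFFFF) + jpeg_bytes
    if len(packet) > MAX_UDP_PAYLOAD:
        raise ValueError("image does not fit in a single UDP datagram")
    return packet


def parse_packet(packet):
    """Split a received datagram into (sequence number, JPEG bytes)."""
    (seq,) = HEADER.unpack_from(packet)
    return seq, packet[HEADER.size:]


image = b"\xff\xd8" + b"placeholder image data" + b"\xff\xd9"
packet = build_packet(7, image)
seq, recovered = parse_packet(packet)
```

Because UDP does not guarantee ordering or delivery, the receiver can use the sequence number to drop images that arrive late or out of order rather than display them.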

The video reception unit 12_k of the server 1 receives the digital video signal v_k(i_k) transmitted from the client 3_k (step S12). At the start of communication, the video reception unit 12_k notifies the terminal management unit 17 of the sender's IP address. If the terminal management unit 17 responds that the terminal cannot be managed, the video reception unit 12_k discards the received packets and ends processing. The K-channel digital video signals v_1(i_1), ..., v_K(i_K) received by the video reception units 12_1, ..., 12_K are input to the video selection unit 15. In this embodiment, each time one JPEG-compressed still image is obtained from a received UDP packet, it is input to the video selection unit 15.

The audio processing unit 13 of the server 1 applies predetermined audio processing to the K-channel digital acoustic signals x_1(i_1), ..., x_K(i_K) to generate a transmission acoustic signal y_c(n) in which a specific voice is enhanced and noise is suppressed (step S13). Here, n is an integer index representing a sample point, and c is a classification label number identifying the signal section class whose sound is to be enhanced (the enhanced signal section class). The classification label number and the details of the "predetermined audio processing" performed by the audio processing unit 13 are described later. The transmission acoustic signal y_c(n) is input to the audio transmission unit 14.

The audio transmission unit 14 of the server 1 transmits the input transmission acoustic signal y_c(n) to the server 1 of each other video conference system 10 participating in the video conference (step S14). When the video conference is held among three or more video conference systems 10, the same transmission acoustic signal y_c(n) is transmitted to each of the other servers 1.

The video selection unit 15 of the server 1 determines a transmission video signal w(n) arbitrarily selected from the K digital video signals v_1(i_1), ..., v_K(i_K) received from the K clients 3_1, ..., 3_K (step S15). Here, n is an index representing a sample point. The transmission video signal w(n) is input to the video transmission unit 16. The transmission video signal w(n) is selected by a user operation either on a display unit of the server 1, such as a display (not shown), or on the display unit 304 of one of the clients 3_1, ..., 3_K.

An example of the operation for selecting the transmission video signal w(n) is described with reference to FIG. 6, which shows the case of operating on the display unit of the server 1. The video selection unit 15 obtains from the terminal management unit 17 the number of clients 3₁,…,3_K currently in communication and generates an operation screen in which the digital video signals v₁(i₁),…,v_K(i_K), reduced to icon-sized images, are embedded according to the number of terminals. The "Server IP number" field of the operation screen shows the IP address of the server 1. The "Received images" field shows the reduced images of the digital video signals v₁(i₁),…,v_K(i_K) received from the clients 3₁,…,3_K side by side, and the IP address of the corresponding client 3₁,…,3_K is shown near each image. When the user selects one of the images in the "Received images" field, the transmission video signal w(n) is determined. When only one client is connected to the server 1, that client may be selected automatically. The selected transmission video signal w(n) is shown in the "Transmitted image" field.

When the transmission video signal w(n) is selected by a user operation on the display unit 304 of a client 3_k, an operation screen similar to the example in FIG. 6 may be displayed in display area 2 (3042) shown in FIG. 4, and the video selected by the user may be notified to the video selection unit 15.

The video transmission unit 16 of the server 1 transmits the input transmission video signal w(n) to the server 1 of each other video conference system 10 participating in the video conference (step S16). When the video conference is held among three or more video conference systems 10, the same transmission video signal w(n) may be transmitted to each of those servers 1.

The audio acquisition unit 21 of the server 1 receives the distribution acoustic signal y'_c(n) transmitted by the audio transmission unit 14 of the server 1 of another video conference system 10 participating in the video conference (step S21). The received distribution acoustic signal y'_c(n) is input to the audio distribution unit 22. When the video conference is held among three or more video conference systems 10, the distribution acoustic signals received from the other video conference systems 10 may be multiplexed into a single distribution acoustic signal y'_c(n).

The audio distribution unit 22 of the server 1 transmits the distribution acoustic signal y'_c(n) to the K clients 3₁,…,3_K (step S22). Any known communication protocol can be used; this embodiment uses RTP (Real-time Transport Protocol). The destination IP addresses are obtained from the terminal management unit 17, and the port number is set in advance. Any port number can be used as long as it does not conflict with another port in use on the client 3_k; this embodiment uses port 6004. The length of the data sent in each RTP packet can be set freely, for example in units of 20 milliseconds.
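To give a feel for the 20-millisecond data length, the following sketch splits PCM samples into consecutive 20 ms payloads. The 16 kHz rate is borrowed from the example "specific sampling frequency" given later in the description; the function name `split_into_payloads` is hypothetical, and RTP header construction and the actual socket transmission are deliberately left out.

```python
def split_into_payloads(samples, sample_rate_hz=16000, frame_ms=20):
    """Split a list of PCM samples into consecutive payloads of frame_ms each.

    Assumption: mono samples; the last payload may be shorter than the others.
    """
    per_payload = sample_rate_hz * frame_ms // 1000  # 320 samples at 16 kHz and 20 ms
    return [samples[i:i + per_payload] for i in range(0, len(samples), per_payload)]


# Example: 1000 samples give three full 20 ms payloads and one short remainder.
payloads = split_into_payloads(list(range(1000)))
print([len(p) for p in payloads])
```

With 16-bit samples, each full payload in this sketch would carry 640 bytes of audio.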

The client 3_k receives the distribution acoustic signal y'_c(n) with the audio playback unit 33 and plays it back using the playback means 303 (step S33). The audio playback unit 33 may buffer the distribution acoustic signal y'_c(n) as needed during playback.

The video acquisition unit 23 of the server 1 receives the distribution video signal w'(n) transmitted by the video transmission unit 16 of the server 1 of another video conference system 10 participating in the video conference (step S23). The received distribution video signal w'(n) is input to the video distribution unit 24. When the video conference is held among three or more video conference systems 10, the distribution video signals received from the other video conference systems 10 may be combined into a single video signal, for example by frame division, and used as the distribution video signal w'(n).

The video distribution unit 24 of the server 1 transmits the distribution video signal w'(n) to the K clients 3₁,…,3_K (step S24). Any known communication protocol can be used; this embodiment uses UDP (User Datagram Protocol). The destination IP addresses are obtained from the terminal management unit 17, and the port number is set in advance. Any port number can be used as long as it does not conflict with another port in use on the client 3_k; this embodiment uses port 6001. A sequence number is attached to each UDP packet that is sent.
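The description does not say how the sequence number is encoded. As one possible illustration, the sketch below prepends a 4-byte big-endian counter to each payload and strips it again on the receiving side. The header layout and both function names are assumptions made for this example only.

```python
import struct

# Assumed layout (not specified in the description): 4-byte big-endian sequence number.
HEADER = struct.Struct("!I")


def add_sequence_number(seq, payload):
    """Return a datagram consisting of the sequence number followed by the payload."""
    return HEADER.pack(seq & 0xFFFFFFFF) + payload


def split_sequence_number(datagram):
    """Return (sequence number, payload) for a datagram built by add_sequence_number."""
    (seq,) = HEADER.unpack_from(datagram)
    return seq, datagram[HEADER.size:]


datagram = add_sequence_number(7, b"jpeg-bytes")
print(split_sequence_number(datagram))
```

A receiver can use such a number to detect lost or reordered packets before reassembling a JPEG image.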

The client 3_k receives the distribution video signal w'(n) with the video display unit 34 and reproduces it using the display means 304 (step S34). In this embodiment, each time one JPEG-compressed still image has been obtained from the received UDP packets, that image is shown on the display means 304.

The command transmission unit 25 of the server 1 transmits to the K clients 3₁,…,3_K a request signal based on the transmission video signal w(n) selected by the video selection unit 15 (step S25). The request signal is, for example, information indicating which of the K digital video signals v₁(i₁),…,v_K(i_K) received from the clients 3₁,…,3_K is currently selected. Specifically, the client 3_k that is capturing the video signal selected as the transmission video signal w(n) is sent a request signal instructing it to draw the frame (3044) surrounding the display means 304 in a first color (for example, red), and each client 3_k capturing a video signal that was not selected is sent a request signal instructing it to draw the frame (3044) in a second color (for example, blue). Concretely, the request signal is a character string in which the RGB value of the color for the frame (3044) (for example, "#ff0000" for red or "#0000ff" for blue) is set according to a predetermined text format. Any known communication protocol can be used; this embodiment uses UDP (User Datagram Protocol). The destination IP addresses are obtained from the terminal management unit 17, and the port number is set in advance. Any port number can be used as long as it does not conflict with another port in use on the client 3_k; this embodiment uses port 6000. Because UDP packets can be lost, it is desirable to retransmit the same request signal periodically, for example every 20 milliseconds.

The command reception unit 35 of the client 3_k displays information based on the transmission video signal w(n) on the display means 304 according to the request signal received from the server 1 (step S35). For example, when the video signal v_k(i_k) captured by the client 3_k has been selected as the transmission video signal w(n), the frame (3044) surrounding the display means 304 is drawn in the first color (for example, red); when it has not been selected, the frame (3044) is drawn in the second color (for example, blue). If the request signal does not match the predetermined text format, or if it contains an invalid value such as an RGB value exceeding 255, the request signal is discarded and processing ends. With this configuration, the user of the client 3_k can tell whether the video captured by that client (in most cases, the user's own face) is being displayed on the clients of the other video conference systems 10 participating in the video conference.
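The validation described here can be sketched as follows. The patent does not disclose the actual text format, so the format used below (the keyword `FRAME` followed by three decimal components) and the function name are purely hypothetical; only the two rejection rules, format mismatch and a component above 255, come from the description.

```python
def parse_frame_color_request(text):
    """Parse a request such as "FRAME 255 0 0" and return "#ff0000".

    Returns None when the request must be discarded, i.e. when it does not
    match the (assumed) text format or a component exceeds 255.
    """
    parts = text.strip().split()
    if len(parts) != 4 or parts[0] != "FRAME":
        return None  # does not match the expected text format
    if not all(p.isascii() and p.isdigit() for p in parts[1:]):
        return None  # components must be plain non-negative decimal integers
    r, g, b = (int(p) for p in parts[1:])
    if max(r, g, b) > 255:
        return None  # invalid RGB value
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


for request in ("FRAME 255 0 0", "FRAME 0 0 255", "FRAME 256 0 0", "hello"):
    print(request, "->", parse_frame_color_request(request))
```

Because the server repeats the same request periodically, silently discarding a malformed packet is harmless: the next valid packet restores the correct frame color.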

<Processing details of the audio processing unit>
An operation example of the audio processing unit 13 is described in more detail with reference to FIG. 7.

The K-channel input digital acoustic signals x₁(i₁),…,x_K(i_K) received by the K audio reception units 11₁,…,11_K are input to the input unit 101 (step S101). The input digital acoustic signals x_k(i_k) of the channels k=1,…,K are then passed to the sampling frequency conversion unit 102. Because the signals of different channels k are obtained by different clients 3_k, their sampling frequencies may differ. The sampling frequency conversion unit 102 brings the input digital acoustic signals x_k(i_k) of all channels k=1,…,K to one common sampling frequency. In other words, it converts the sampling frequency of each input digital acoustic signal x_k(i_k) and obtains, for each channel k=1,…,K, a converted digital acoustic signal cx_k(i_k) at a specific sampling frequency. The "specific sampling frequency" may be the sampling frequency of any one of the clients 3₁,…,3_K or some other sampling frequency; one example is 16 kHz. The conversion is based on the nominal sampling frequency of each client 3_k: the unit converts a signal sampled at the nominal sampling frequency of the client 3_k into a signal sampled at the specific sampling frequency. Such sampling frequency conversion is well known. The sampling frequency conversion unit 102 outputs the converted digital acoustic signal cx_k(i_k) of each channel k (step S102).
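The description leaves the conversion method open ("well known"). As a minimal sketch of converting from a nominal rate to a common target rate, the code below uses plain linear interpolation. This is a stand-in chosen for brevity, not the method of the patent; a practical converter would normally apply a low-pass (anti-aliasing) filter, for instance in a polyphase resampler. The function name is hypothetical.

```python
def resample_linear(samples, nominal_rate_hz, target_rate_hz):
    """Convert samples from nominal_rate_hz to target_rate_hz by linear interpolation.

    Simplification: no anti-aliasing filter is applied when the rate is reduced.
    """
    if len(samples) < 2:
        return list(samples)
    step = nominal_rate_hz / target_rate_hz  # input samples advanced per output sample
    n_out = int((len(samples) - 1) / step) + 1
    out = []
    for n in range(n_out):
        pos = n * step
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


# A client nominally sampling at 8 kHz is brought to the common 16 kHz rate.
print(resample_linear([0, 2, 4, 6], 8000, 16000))
```

Note that the conversion ratio depends only on the nominal rate, which is exactly why a residual timing error remains when a client's real clock deviates from its nominal value.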

The signal synchronization unit 103 receives the converted digital acoustic signals cx₁(i₁),…,cx_K(i_K) of the channels k=1,…,K. It synchronizes them across the channels and outputs the resulting digital acoustic signals sx₁(i₁),…,sx_K(i_K) of the channels k=1,…,K (step S103). The details are as follows.

The clients 3_k differ from unit to unit. Even when the nominal sampling frequency of a client 3_k is f_k, the client may actually perform A/D conversion at a sampling frequency of f_k/α_k, where α_k is a positive parameter representing the deviation between the actual and the nominal sampling frequency of the client 3_k. If x_k'(i_k) denotes the digital acoustic signal obtained by A/D-converting an acoustic signal at the sampling frequency f_k, then A/D-converting the same acoustic signal at f_k/α_k yields x_k'(i_k×α_k), where "×" is the multiplication operator. A deviation in sampling frequency therefore appears as a timing offset in the time domain of the digital acoustic signal.
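A short calculation shows how quickly this timing offset grows. Sample i of the deviating client corresponds to position i×α_k on the nominal time grid, so the offset at index i is i×α_k − i. The value α_k = 1.0001 (a 100 ppm deviation) and the 16 kHz nominal rate used below are illustrative assumptions, not figures from the patent.

```python
def drift_in_samples(index, alpha):
    """Offset, in nominal-rate samples, between x'(index) and x'(index * alpha)."""
    return index * alpha - index


nominal_rate_hz = 16000   # assumed nominal sampling frequency
alpha = 1.0001            # assumed deviation parameter (100 ppm)
seconds = 10

offset = drift_in_samples(nominal_rate_hz * seconds, alpha)
print("offset after", seconds, "s:", round(offset, 3), "samples")
print("offset in ms:", round(1000.0 * offset / nominal_rate_hz, 3))
```

Even a deviation this small accumulates to many samples within seconds, which is why the synchronization is repeated on successive segments rather than done once.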

The sampling frequency conversion unit 102 performs the conversion based on the nominal sampling frequency f_k of each client 3_k. That is, with F denoting the "specific sampling frequency" common to all channels k=1,…,K, the unit multiplies the sampling frequency of each channel k by F/f_k. Consequently, if the actual sampling frequency of a client 3_k is f_k/α_k, the sampling frequency of the converted digital acoustic signal cx_k(i_k) of channel k becomes F×α_k. This unit-dependent frequency deviation appears as a timing offset, in the time domain, between the converted digital acoustic signals cx_k(i_k) of the channels k=1,…,K.

To reduce the timing offsets that unit-to-unit differences cause in the converted digital acoustic signals cx_k(i_k), the signal synchronization unit 103 synchronizes the time-domain converted digital acoustic signals cx₁(i₁),…,cx_K(i_K) across the channels k=1,…,K. For example, it shifts the converted digital acoustic signals relative to one another along the time axis (in the sample-point direction) so that the cross-correlation between channels is maximized, and thereby obtains the synchronized digital acoustic signals sx₁(i₁),…,sx_K(i_K).

For example, the signal synchronization unit 103 first extracts from the converted digital acoustic signal cx_k(i_k) of each channel k a sample sequence cx_k(1),…,cx_k(I) long enough (for example, 3 seconds) to contain a sufficiently characteristic change in waveform, such as a spoken word (step S1031), where I is a positive integer. Next, it takes the sample sequence cx_k'(1),…,cx_k'(I) of one channel k'∈{1,…,K} as the reference sample sequence (step S1032). Then, for each channel k"∈{1,…,K} other than k' (k"≠k'), it searches a predetermined range for the delay τ_k" that maximizes the cross-correlation Σ_n{cx_k"(n)×cx_k'(n)} between the time-shifted sample sequence cx_k"(1+τ_k"),…,cx_k"(I+τ_k") and the reference sample sequence cx_k'(1),…,cx_k'(I), and sets sx_k"(i_k")=cx_k"(i_k"+τ_k") and sx_k'(i_k')=cx_k'(i_k') (step S1033). The signal synchronization unit 103 then shifts the range from which the sample sequence cx_k(1),…,cx_k(I) is extracted (for example, by the number of sample points corresponding to 1 second) and repeats steps S1031 to S1033, so that the synchronized digital acoustic signals sx₁(i₁),…,sx_K(i_K) are obtained and output for all sample points.
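Steps S1032 and S1033 can be sketched for one reference channel and one other channel as below. The function names, the symmetric search range of ±`max_delay` samples, and the zero fill at the segment edges are assumptions for this illustration; the description only requires that the delay be searched within a predetermined range.

```python
def best_delay(reference, other, max_delay):
    """Return tau in [-max_delay, max_delay] maximizing sum_n other[n + tau] * reference[n]."""
    best_tau, best_score = 0, float("-inf")
    for tau in range(-max_delay, max_delay + 1):
        score = 0.0
        for n, ref in enumerate(reference):
            m = n + tau
            if 0 <= m < len(other):  # samples outside the segment contribute nothing
                score += other[m] * ref
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau


def align(other, tau):
    """Apply sx(i) = cx(i + tau); positions outside the segment are filled with 0."""
    return [other[i + tau] if 0 <= i + tau < len(other) else 0.0
            for i in range(len(other))]


reference = [0, 0, 1, 3, 1, 0, 0, 0]   # channel k' (reference)
delayed = [0, 0, 0, 0, 1, 3, 1, 0]     # channel k'': same pulse, two samples later
tau = best_delay(reference, delayed, max_delay=3)
print(tau, align(delayed, tau))
```

For 3-second segments at 16 kHz this brute-force search is slow; an FFT-based cross-correlation would be the usual choice, but it is omitted here to keep the example dependency-free.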

The frame division unit 104 receives the synchronized digital acoustic signals sx₁(i₁),…,sx_K(i_K) and divides the digital acoustic signal sx_k(i_k) of each channel k into frames, each a predetermined time section (step S104). The frame length of L points and the shift width of m points between successive frames can be chosen freely, where L and m are positive integers; for example, the frame length is 2048 points and the shift width is 256 points. For each channel k, the frame division unit 104 cuts out and outputs a section of the digital acoustic signal sx_k(i_k) of the frame length, then moves the cut-out section by the shift width and repeats. As a result, the digital acoustic signal of each frame is output for each channel k. Below, the digital acoustic signal belonging to the r-th frame r of channel k is written sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}).
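The frame division can be sketched in a few lines. The defaults follow the example values in the text (L = 2048, m = 256); the function name is hypothetical, and dropping a trailing section shorter than L is an assumption, since the description does not say how the end of the signal is handled.

```python
def split_into_frames(signal, frame_len=2048, shift=256):
    """Cut overlapping frames of frame_len samples, advancing by shift samples each time.

    Assumption: a trailing section shorter than frame_len is dropped.
    """
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, shift)]


channel = list(range(2048 + 512))
frames = split_into_frames(channel)
print(len(frames), [f[0] for f in frames])
```

With these values consecutive frames overlap by 2048 − 256 = 1792 samples, so each sample is covered by up to eight frames.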

The VAD determination unit 105 receives the digital acoustic signals sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}) belonging to each frame r of each channel k. Using these signals, it determines whether each frame r of each channel k is a speech section or a non-speech section (step S105), for example with a well-known technique such as the one described in Reference 1.
[Reference 1] Jongseo Sohn, Nam Soo Kim, Wonyong Sung, "A Statistical Model-Based Voice Activity Detection," IEEE Signal Processing Letters, Vol. 6, No. 1, 1999.

Based on these determinations, the VAD determination unit 105 assigns to each frame r a label θ_r indicating whether the frame is a speech section or a non-speech section. For example, when the number of channels in which frame r was determined to be a speech section is greater than or equal to the number of channels in which it was determined to be a non-speech section, the unit determines that frame r is a speech section and assigns a label θ_r indicating a speech section. Conversely, when the former number is less than the latter, it determines that frame r is a non-speech section and assigns a label θ_r indicating a non-speech section. Alternatively, the label θ_r may be taken from the determination result of the channel, among k=1,…,K, whose digital acoustic signal sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}) has the largest average power or average S/N ratio. An example of a label indicating a speech section is θ_r=1, and an example of a label indicating a non-speech section is θ_r=0. The VAD determination unit 105 outputs each label θ_r.
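The majority rule in this paragraph, including its tie-breaking toward speech, can be written down directly. The function name and the use of a list of booleans for the per-channel decisions are assumptions of this sketch; the per-channel VAD itself (Reference 1) is not reproduced.

```python
def frame_label(channel_is_speech):
    """Combine per-channel VAD decisions for one frame into the label theta_r.

    Returns 1 (speech) when at least as many channels voted speech as
    non-speech, otherwise 0 (non-speech).
    """
    speech = sum(1 for decision in channel_is_speech if decision)
    non_speech = len(channel_is_speech) - speech
    return 1 if speech >= non_speech else 0


print(frame_label([True, True, False]))   # majority speech
print(frame_label([True, False]))         # tie counts as speech
print(frame_label([True, False, False]))  # majority non-speech
```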

The S/N vector generation unit 107 receives the digital acoustic signals sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}) of each frame r of each channel k and the labels θ_r. For each channel k, it obtains a feature value in which the magnitude of the digital acoustic signal in a speech section is normalized by the magnitude of the digital acoustic signal in a non-speech section, and it outputs an S/N vector (feature sequence) whose elements are the feature values obtained for the channels k=1,…,K (step S107). An example of the "feature value" is a value representing the ratio of the magnitude of the digital acoustic signal in the speech section to that in the non-speech section. Examples of the "magnitude of the digital acoustic signal" are its power or absolute value, the mean of its power or of its absolute value, the sum of its power or of its absolute value, and sign-inverted values or other function values of these. Examples of the "feature value representing the ratio" are the ratio itself, its reciprocal, and other function values of it. The example below uses the mean power of the digital acoustic signal as the "magnitude" and the ratio itself as the "feature value".

The S/N vector generation unit 107 executes the following processing.

[Step S1071]
The S/N vector generation unit 107 initializes r to 1.

[Step S1072]
The S/N vector generation unit 107 determines whether the label θ_r represents a speech section or a non-speech section.

[Step S1073]
When the label θ_r represents a non-speech section, the S/N vector generation unit 107 calculates, for all channels k=1,…,K, the average power P_N(k,r) of the digital acoustic signal sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}) belonging to frame r (see equation (1)), and stores the average power vector P_N(r)=(P_N(1,r),…,P_N(K,r)), whose k-th element is P_N(k,r), in the non-speech power storage unit 106.

[Step S1074]
When the label θ_r represents a speech section, the S/N vector generation unit 107 retrieves from the non-speech power storage unit 106 the average power vector P_N(r')=(P_N(1,r'),…,P_N(K,r')) of a non-speech frame r'. The frame r' is preferably close to the frame r being processed; for example, the unit retrieves the average power vector P_N(r') of the non-speech frame r' nearest to frame r. The non-speech power storage unit 106 also holds an initial value of the average power vector, for example a vector whose K elements are a constant (for example, 1). When no average power vector of a non-speech section has been obtained yet, the S/N vector generation unit 107 retrieves this initial value and uses it as P_N(r')=(P_N(1,r'),…,P_N(K,r')).

Furthermore, for all channels k=1,…,K, the S/N vector generation unit 107 divides the average power of the digital acoustic signal sx_k(i_{k,r,0}),…,sx_k(i_{k,r,L-1}) belonging to the speech frame r by P_N(k,r') to obtain the normalized average power P_V(k,r) (see equation (2)).

Dividing by P_N(k,r') normalizes the average power of the digital acoustic signal of each channel k and removes the effect of differences in microphone sensitivity between channels. The S/N vector generation unit 107 outputs the S/N vector P_V(r)=(P_V(1,r),…,P_V(K,r)), whose k-th element is the normalized average power P_V(k,r).

[Step S1075]
When unprocessed digital acoustic signals remain, the S/N vector generation unit 107 increments r by 1 and processing returns to step S1072. When none remain, the S/N vector generation unit 107 ends its processing.
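Steps S1071 to S1075 can be sketched as a single loop. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the natural reading: the average power is the mean of the squared samples of a frame, and the normalized average power is that value divided by the stored non-speech power of the same channel. The function names, the 0-based frame index, the use of the most recent preceding non-speech frame as "nearest", and the small `floor` guarding against division by zero are all assumptions of this example.

```python
def average_power(frame):
    """Mean of squared samples (assumed form of equation (1))."""
    return sum(s * s for s in frame) / len(frame)


def sn_vectors(frames, labels, initial_noise_power=1.0, floor=1e-12):
    """Return {r: S/N vector} for every speech frame r.

    frames[r][k] is the list of samples of channel k in frame r (r is 0-based
    here); labels[r] is 1 for a speech frame and 0 for a non-speech frame.
    """
    num_channels = len(frames[0])
    # Initial value of the average power vector, used until a non-speech frame is seen.
    noise = [initial_noise_power] * num_channels
    result = {}
    for r, (frame, label) in enumerate(zip(frames, labels)):
        powers = [average_power(channel) for channel in frame]
        if label == 0:
            noise = powers  # step S1073: store the non-speech power vector
        else:
            # Step S1074: normalize by the stored non-speech power (assumed equation (2)).
            result[r] = [p / max(n, floor) for p, n in zip(powers, noise)]
    return result


frames = [
    [[1, 1], [2, 2]],   # frame 0, non-speech: channel powers 1 and 4
    [[2, 2], [2, 2]],   # frame 1, speech: channel powers 4 and 4
]
print(sn_vectors(frames, labels=[0, 1]))
```

In the example both channels record the same speech power, yet the S/N vector differs per channel because the second channel's noise floor is higher; this per-channel pattern is what the later clustering uses to tell speakers apart.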

As described above, the non-speech power storage unit 106 stores the initial value of the average power vector and the average power vectors P_N(r) obtained by the S/N vector generation unit 107.

The vector classification unit 108 receives a plurality of S/N vectors P_V(r) (feature sequences consisting of the feature values obtained for the channels). It clusters the input S/N vectors P_V(r) and determines the signal section classification (cluster) to which each S/N vector P_V(r) belongs (step S108). The unit may run the clustering each time a batch of S/N vectors P_V(r) is input (for example, the S/N vectors of a section corresponding to 5 seconds), adding the newly input vectors to the clustering targets, or each time a single S/N vector P_V(r) is input, adding that vector to the clustering targets. An example of the clustering is online clustering, a form of unsupervised learning, such as leader-follower clustering (see, for example, Reference 2). Cosine similarity can be used as the distance measure for the clustering. The distance function based on cosine similarity can be defined as follows.

Here, CL is the label of a cluster and takes a value (for example, an integer of 1 or more) other than the value of the label θ_r that represents a non-speech section (for example, 0). P_CL is the centroid vector of the cluster CL, and d(CL) is the distance between the centroid vector P_CL of the cluster CL and the input S/N vector P_V(r). The label CL obtained by clustering with cosine similarity as the distance function represents the signal section classification to which the input S/N vector P_V(r) belongs. The vector classification unit 108 updates the label θ_r by assigning to it the label CL obtained for the input S/N vector P_V(r). As a result, the label θ_r of a frame r in a speech section takes the value of the label CL, and the label θ_r of a frame r in a non-speech section takes the value representing a non-speech section. The vector classification unit 108 outputs the label θ_r of each frame r.
[Reference 2] Richard O. Duda, Peter E. Hart, David G. Stork, “Pattern Classification,” Wiley-Interscience, 2000.
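The clustering step can be illustrated with a minimal sketch. It assumes that the distance d(CL) is one minus the cosine similarity and uses a simplified leader-follower update; the function names, the threshold, and the learning rate are hypothetical and are not taken from the specification.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed form of d(CL))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def leader_follower(vectors, threshold=0.1, rate=0.1):
    """Assign each S/N vector a cluster label CL = 1, 2, ... in arrival order."""
    centroids = []  # centroid vectors P_CL; list index is CL - 1
    labels = []
    for p in vectors:
        dists = [cosine_distance(c, p) for c in centroids]
        if dists and min(dists) <= threshold:
            best = dists.index(min(dists))
            # Move the winning centroid a small step toward the new vector.
            centroids[best] = [c + rate * (x - c) for c, x in zip(centroids[best], p)]
        else:
            # No centroid is close enough: the vector founds a new cluster.
            centroids.append(list(p))
            best = len(centroids) - 1
        labels.append(best + 1)  # 0 is reserved for non-speech frames
    return labels, centroids

# Hypothetical S/N vectors for K = 2 channels: two talkers at different positions.
snr_vectors = [[9.0, 1.0], [8.0, 1.2], [1.0, 7.0], [18.0, 2.1], [0.9, 8.0]]
labels, centroids = leader_follower(snr_vectors)
print(labels)  # [1, 1, 2, 1, 2]
```

Note that the fourth vector is roughly twice as long as the first yet lands in the same cluster, because the cosine-based distance depends only on the direction of the S/N vector.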

The spectrum calculation unit 109 receives as input the digital acoustic signals sx_k(i_{k,r,0}), …, sx_k(i_{k,r,L-1}) belonging to each frame r of each channel k, as divided by the frame division unit 104. Here, the K-dimensional column vector whose elements are the digital acoustic signals sx_k(i_{k,r,j}) of the channels k in frame r is written x(j,r) = [sx_1(i_{1,r,j}), …, sx_K(i_{K,r,j})]^T, where [η]^T denotes the transpose of [η]. The K-dimensional column vector whose elements are the values obtained by converting the elements of the K-dimensional vectors x(0,r), …, x(L-1,r) belonging to frame r into the frequency domain is written X(f,r). That is, with X(k,f,r) denoting the value obtained by converting the digital acoustic signals sx_k(i_{k,r,0}), …, sx_k(i_{k,r,L-1}) belonging to frame r into the frequency domain, the K-dimensional column vector having X(k,f,r) as its k-th element is the spectrum vector X(f,r) = [X(1,f,r), …, X(K,f,r)]^T, where f is an index representing a discrete frequency. An example of the frequency-domain conversion is a discrete Fourier transform such as the FFT (Fast Fourier Transform). The K-dimensional column vector having the amplitude spectrum A(k,f,r) of X(k,f,r) as its k-th element is the amplitude spectrum vector A(f,r) = [A(1,f,r), …, A(K,f,r)]^T. Likewise, the K-dimensional column vector having the phase spectrum φ(k,f,r) of X(k,f,r) as its k-th element is the phase spectrum vector φ(f,r) = [φ(1,f,r), …, φ(K,f,r)]^T. The spectrum calculation unit 109 converts x(j,r) = [sx_1(i_{1,r,j}), …, sx_K(i_{K,r,j})]^T into the frequency domain and, for each frame r, obtains and outputs the amplitude spectrum vector A(f,r) consisting of the K amplitude spectra A(k,f,r) and the phase spectrum vector φ(f,r) consisting of the K phase spectra φ(k,f,r) (step S109).
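As a rough illustration of step S109, the sketch below converts one frame of each channel with a naive discrete Fourier transform and splits the result into amplitude and phase. The function names and the sample values are hypothetical; a practical implementation would use an FFT and would typically apply a window first.

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform of one frame of L real samples."""
    L = len(frame)
    return [sum(frame[j] * cmath.exp(-2j * cmath.pi * f * j / L) for j in range(L))
            for f in range(L)]

def spectra(frames_per_channel):
    """frames_per_channel[k] holds the L samples of channel k for one frame r.

    Returns A and phi indexed as A[f][k] and phi[f][k], so that A[f] plays the
    role of the amplitude spectrum vector A(f, r).
    """
    X = [dft(frame) for frame in frames_per_channel]  # X[k][f]
    K, L = len(X), len(frames_per_channel[0])
    A = [[abs(X[k][f]) for k in range(K)] for f in range(L)]
    phi = [[cmath.phase(X[k][f]) for k in range(K)] for f in range(L)]
    return A, phi

# Hypothetical frame r with K = 2 channels and L = 4 samples:
# channel 2 picks up the same waveform at half the amplitude.
frame_r = [[1.0, 0.0, -1.0, 0.0],
           [0.5, 0.0, -0.5, 0.0]]
A, phi = spectra(frame_r)
print(A[1])  # amplitude spectrum vector at frequency index f = 1
```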

The amplitude spectrum vector A(f,r) is stored in the amplitude spectrum storage unit 110, and the phase spectrum vector φ(f,r) is stored in the phase spectrum storage unit 111.

The filter coefficient calculation unit 112 receives as input the label θ_r of each frame r output from the vector classification unit 108 and the amplitude spectrum vector A(f,r) read from the amplitude spectrum storage unit 110. Among the values the label θ_r can take (classification label numbers), let c denote a classification label number representing a signal section classification whose sound is to be enhanced (an enhancement signal section classification). Only one classification label number c may be set, or a plurality of classification label numbers c may be set. For example, the classification label number c may be chosen arbitrarily; it may be determined by taking as the enhancement signal section classifications the one or more signal section classifications selected in descending order of the average or total norm of the S/N vectors P_V(r) belonging to them; or it may be determined by taking as the enhancement signal section classifications those signal section classifications for which the average or total norm of the S/N vectors P_V(r) belonging to them exceeds a threshold. θ_r = c indicates that frame r is classified into an enhancement signal section classification.
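One of the selection rules described above, ranking the classifications by the average norm of their S/N vectors, might be sketched as follows. The function name and the `top_n` parameter are hypothetical, and the Euclidean norm is assumed.

```python
import math
from collections import defaultdict

def select_enhancement_labels(snr_vectors, labels, top_n=1):
    """Return the top_n classification label numbers ranked by the mean
    Euclidean norm of their S/N vectors. Label 0 (non-speech) is skipped."""
    norms = defaultdict(list)
    for p, theta in zip(snr_vectors, labels):
        if theta != 0:
            norms[theta].append(math.sqrt(sum(x * x for x in p)))
    ranked = sorted(norms, key=lambda c: sum(norms[c]) / len(norms[c]), reverse=True)
    return ranked[:top_n]

# Hypothetical S/N vectors and labels theta_r for five frames.
snr_vectors = [[9.0, 1.0], [0.1, 0.1], [1.0, 3.0], [8.0, 1.2], [1.1, 2.9]]
labels = [1, 0, 2, 1, 2]
print(select_enhancement_labels(snr_vectors, labels))  # [1]
```

Replacing the mean with a sum, or the `top_n` cut with a threshold test, gives the other variants mentioned in the paragraph.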

The filter coefficient calculation unit 112 calculates filter coefficients for filtering that enhances the amplitude spectra A(k,f,r) corresponding to the S/N vectors P_V(r) belonging to the enhancement signal section classification (step S112). The S/N-ratio-maximizing beamformer disclosed in Reference 3 below uses the complex spectrum as it is and takes the eigenvector for the maximum eigenvalue as the filter coefficients. In contrast, the filter coefficient calculation unit 112 of this embodiment constructs the S/N-ratio-maximizing beamformer from the amplitude spectrum vector A(f,r). That is, the filter coefficient calculation unit 112 solves the generalized eigenvalue problem of equation (4) below and takes the eigenvector corresponding to the maximum eigenvalue γ(f) as the filter coefficients w_c(f) that enhance the sound of each classification label number c.

Here, E[ρ]_{θ_r=c} denotes the matrix made up of the expected values of the elements of the matrix ρ over the section consisting of the frames r for which θ_r = c, and E[ρ]_{θ_r≠c} denotes the matrix made up of the expected values of the elements of the matrix ρ over the section consisting of the frames r for which θ_r ≠ c. The section used to obtain equations (5) and (6) corresponds to, for example, 10 seconds or more. The filter coefficient w_c(f) is the K-dimensional row vector [w_c(f,1), …, w_c(f,K)] whose k-th element is the coefficient w_c(f,k) corresponding to channel k. The filter coefficient calculation unit 112 obtains and outputs the filter coefficient w_c(f) for each index f and each classification label number c. In addition, within the section used to obtain equations (5) and (6), the filter coefficient calculation unit 112 obtains, as the maximum channel number k_{c,r}, the channel corresponding to the largest element of the S/N vector P_V(r) of each frame r for which θ_r = c. The filter coefficient calculation unit 112 associates the filter coefficient w_c(f) and the maximum channel number k_{c,r} with each classification label number c and stores them in the filter coefficient storage unit 113. To follow speaker movement and changes in noise, the filter coefficient calculation unit 112 periodically (for example, every minute) updates the section used to obtain equations (5) and (6), obtains each filter coefficient w_c(f) and maximum channel number k_{c,r} again, and updates the filter coefficients w_c(f) and maximum channel numbers k_{c,r} stored in the filter coefficient storage unit 113.
[Reference 3] H. L. Van Trees, ed., “Optimum Array Processing,” Wiley, 2002.
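The following sketch shows one way to compute such a filter for a single frequency index f. Equations (4) to (6) are not reproduced in this text, so the sketch assumes the usual S/N-ratio-maximizing form R_S w = γ R_N w, with R_S and R_N taken as the averages of A(f,r)A(f,r)^T over the frames with θ_r = c and θ_r ≠ c respectively. The function names, the power-iteration solver, and the data are all hypothetical.

```python
def covariance(vectors):
    """Average outer product E[A A^T] over a list of K-dimensional vectors."""
    K = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) / len(vectors) for j in range(K)]
            for i in range(K)]

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def max_snr_filter(target_frames, other_frames, iterations=200):
    """Power iteration on R_N^{-1} R_S: converges to the eigenvector of the
    largest generalized eigenvalue of R_S w = gamma R_N w (assumed form)."""
    R_S, R_N = covariance(target_frames), covariance(other_frames)
    K = len(R_S)
    w = [1.0] * K
    for _ in range(iterations):
        v = [sum(R_S[i][j] * w[j] for j in range(K)) for i in range(K)]
        w = solve(R_N, v)
        scale = max(abs(x) for x in w)
        w = [x / scale for x in w]
    return w

def snr_gain(w, R_S, R_N):
    """Ratio (w R_S w^T) / (w R_N w^T) that the filter is meant to maximize."""
    quad = lambda R: sum(w[i] * R[i][j] * w[j] for i in range(len(w)) for j in range(len(w)))
    return quad(R_S) / quad(R_N)

# Hypothetical amplitude vectors A(f, r) at one frequency for K = 2 channels.
target = [[2.0, 0.5], [1.8, 0.6], [2.2, 0.4]]  # frames with theta_r = c
other = [[0.5, 1.5], [0.4, 1.7], [0.6, 1.4]]   # frames with theta_r != c
w = max_snr_filter(target, other)
print(w, snr_gain(w, covariance(target), covariance(other)))
```

In this toy data the resulting filter gives a much larger power ratio than simply picking either channel alone, which is the point of solving the eigenvalue problem rather than selecting the nearest microphone.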

The filtering unit 114 receives as input the filter coefficient w_c(f) read from the filter coefficient storage unit 113 and the amplitude spectrum vector A(f,r) read from the amplitude spectrum storage unit 110. The filtering unit 114 filters the plurality of amplitude spectra A(1,f,r), …, A(K,f,r) that make up the amplitude spectrum vector A(f,r) with the filter coefficient w_c(f) = [w_c(f,1), …, w_c(f,K)], and obtains and outputs the processed amplitude spectrum A_c'(f,r) (step S114). For example, the filtering unit 114 obtains the inner product of the filter coefficient w_c(f) and the amplitude spectrum vector A(f,r) as the processed amplitude spectrum A_c'(f,r), as in equation (7) below.
A_c'(f,r) = w_c(f)A(f,r)    (7)
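Equation (7) is a plain inner product per frequency index and frame. A minimal sketch, with hypothetical coefficient and amplitude values for K = 2 channels:

```python
def apply_filter(w_c, A):
    """Equation (7): inner product of the filter w_c(f) and the amplitude vector A(f, r)."""
    return sum(w * a for w, a in zip(w_c, A))

w_c = [1.0, -0.3]   # hypothetical w_c(f, 1), w_c(f, 2)
A_fr = [2.0, 0.5]   # hypothetical A(1, f, r), A(2, f, r)
A_processed = apply_filter(w_c, A_fr)
print(A_processed)
```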

Through steps S112 and S114 above, the plurality of amplitude spectra A(1,f,r), …, A(K,f,r) undergo processing that enhances the amplitude spectra corresponding to the S/N vectors P_V(r) belonging to the enhancement signal section classification, and a plurality of processed amplitude spectra A_c'(f,r) are obtained.

The phase assigning unit 115 assigns the corresponding phase spectrum to the processed amplitude spectrum A_c'(f,r) and obtains and outputs a complex spectrum (step S115). In this embodiment, the phase assigning unit 115 reads from the filter coefficient storage unit 113 the maximum channel number k_{c,r} corresponding to each frame r and each classification label number c. The phase assigning unit 115 reads the phase spectra φ(k,f,r) of all channels k from the phase spectrum storage unit 111 and selects from them the phase spectrum φ(k_{c,r},f,r) corresponding to the maximum channel number k_{c,r}. The phase assigning unit 115 also receives as input the processed amplitude spectrum A_c'(f,r) output from the filtering unit 114. The phase assigning unit 115 assigns the phase spectrum φ(k_{c,r},f,r) to the processed amplitude spectrum A_c'(f,r) as in equation (8) below, and obtains and outputs the complex spectrum Y_c(f,r).
Y_c(f,r) = A_c'(f,r) exp(iφ(k_{c,r},f,r))    (8)
where i is the imaginary unit and exp is the exponential function.
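The selection of the maximum channel and the phase assignment of equation (8) can be sketched as below. The function names and values are hypothetical, and channel indices are 0-based in the code while the text numbers channels from 1.

```python
import cmath

def max_channel(snr_vector):
    """0-based index of the channel with the largest element of P_V(r)."""
    return max(range(len(snr_vector)), key=lambda k: snr_vector[k])

def attach_phase(amplitude, phase):
    """Equation (8): Y = A' * exp(i * phi)."""
    return amplitude * cmath.exp(1j * phase)

snr_vector = [9.0, 1.0]        # hypothetical P_V(r) for K = 2 channels
phi_fr = [cmath.pi / 2, 0.3]   # hypothetical phi(1, f, r), phi(2, f, r)
k_max = max_channel(snr_vector)
Y = attach_phase(1.85, phi_fr[k_max])
print(k_max, Y)
```

Borrowing the phase of the channel with the highest S/N value is a reasonable default because that channel is likely the one closest to the talker.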

The time domain conversion unit 116 receives the complex spectrum Y_c(f,r) as input and converts it into the time domain to obtain the enhanced acoustic signal y_c(n,r) (n = 0, …, L-1), where n is an index representing a sample point. An inverse Fourier transform, for example, can be used for the conversion into the time domain. The time domain conversion unit 116 then combines the enhanced acoustic signals y_c(n,r) (n = 0, …, L-1) by the overlap-add method to obtain and output a time-domain acoustic signal waveform. When there are a plurality of classification label numbers c, the time domain conversion unit 116 outputs a plurality of acoustic signal waveforms, one for each classification label number c. Alternatively, it may output the sample-by-sample sum of the acoustic signal waveforms corresponding to the classification label numbers c.
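A minimal sketch of the inverse transform followed by overlap-add is shown below. It assumes a naive inverse DFT and a fixed hop between frames, and it omits the analysis and synthesis windows a practical system would use; the function names and the hop size are hypothetical.

```python
import cmath

def idft(spectrum):
    """Naive inverse DFT; returns the real part of each of the L samples."""
    L = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * n / L) for f in range(L)).real / L
            for n in range(L)]

def overlap_add(frames, hop):
    """Sum frames of length L placed every `hop` samples."""
    L = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + L)
    for r, frame in enumerate(frames):
        for n, y in enumerate(frame):
            out[r * hop + n] += y
    return out

# Two hypothetical complex spectra Y_c(f, r), r = 0, 1, each a constant frame.
Y_frames = [[4.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]]
y_frames = [idft(Y) for Y in Y_frames]
waveform = overlap_add(y_frames, hop=2)
print(waveform)  # [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
```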

The audio processing unit 13 of this embodiment clusters a plurality of S/N vectors obtained by normalizing the magnitude of the digital acoustic signal in speech sections by the magnitude of the digital acoustic signal in non-speech sections. Signal section classification based on sound source position can therefore be performed on digital acoustic signals collected by a plurality of freely arranged terminal devices, such as smartphones and laptop computers, whose sound collection means differ in sensitivity.
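Why the normalization removes sensitivity differences can be seen in a small numerical sketch. It assumes the feature is the per-channel ratio of speech power to non-speech power and models microphone sensitivity as a constant gain per channel; the function name and all values are hypothetical.

```python
def snr_vector(speech_power, noise_power):
    """Per-channel ratio of speech-frame power to average non-speech power."""
    return [s / n for s, n in zip(speech_power, noise_power)]

# Hypothetical powers observed on K = 2 channels.
speech = [8.0, 2.0]
noise = [1.0, 0.5]

# Same scene, but the second device's microphone has 4x the power gain.
gains = [1.0, 4.0]
raw = snr_vector(speech, noise)
scaled = snr_vector([g * s for g, s in zip(gains, speech)],
                    [g * n for g, n in zip(gains, noise)])
print(raw, scaled)  # [8.0, 4.0] [8.0, 4.0]
```

The gain multiplies both numerator and denominator on each channel, so it cancels and the S/N vector depends on the acoustic scene rather than on the device.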

The audio processing unit 13 of this embodiment also uses cosine similarity as the distance measure for clustering in order to focus on the attenuation of sound pressure between the sound source and the sound collection means 301. Furthermore, in this embodiment the sampling frequency conversion unit 102 performs sampling frequency conversion to correct sampling frequency differences between channels, and the signal synchronization unit 103 synchronizes the channels to suppress the influence of individual differences among the clients 3_1, …, 3_K. Signal section classification can therefore be performed accurately even when the nominal sampling frequencies of the sound collection means 301 differ between channels or when the sampling frequencies vary from device to device.

Because the section classification result described above separates target sound sections from other sound source sections, it can be used as information for designing a filter that suppresses noise and enhances the target sound. In this embodiment, a specific target sound can therefore be enhanced from digital acoustic signals obtained by a plurality of freely arranged terminal devices, such as smartphones, mobile phone terminals, and laptop computers, that differ in sampling frequency and microphone sensitivity.

<Modifications>
The present invention is not limited to the embodiment described above. For example, if the nominal sampling frequencies of the sound collection means 301 of all channels k = 1, …, K are identical, the processing of the sampling frequency conversion unit 102 may be omitted. In this case, the "input digital acoustic signal" may be input to the signal synchronization unit 103 as is, as the "converted digital acoustic signal", and the sampling frequency conversion unit 102 need not be provided.

Furthermore, if the nominal sampling frequencies of the sound collection means 301 of all channels k = 1, …, K are identical and the influence of their individual differences is small, the processing of both the sampling frequency conversion unit 102 and the signal synchronization unit 103 may be omitted. In this case, the "input digital acoustic signal" may be input to the frame division unit 104 as is, as the "digital acoustic signal", and the sampling frequency conversion unit 102 and the signal synchronization unit 103 need not be provided.

In the embodiment, the phase assigning unit 115 assigns the phase spectrum φ(k_{c,r},f,r) corresponding to the maximum channel number k_{c,r} to the processed amplitude spectrum A_c'(f,r). However, the phase spectrum φ(k,f,r) of another channel may be assigned to the processed amplitude spectrum A_c'(f,r) instead.

The present invention is not limited to the embodiment described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the embodiment need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

<Effect of the invention>
As described above, the video conference technology of the present invention makes it possible to enhance and extract the voice of a specific speaker and to suppress noise, even for acoustic signals collected by a plurality of freely arranged clients whose sound collection means, such as microphones, differ in sampling frequency and sensitivity. In addition, by acquiring not only audio but also video from photographing means such as cameras mounted on the clients, the video of the user who is mainly speaking can be distributed to each client to carry out a video call.

1 server
3 client
5 network
10 video conference system
11 audio receiving unit
12 video receiving unit
13 audio processing unit
14 audio transmission unit
15 video selection unit
16 video transmission unit
17 terminal management unit
21 audio acquisition unit
22 audio distribution unit
23 video acquisition unit
24 video distribution unit
25 command transmission unit
31 audio acquisition unit
32 video acquisition unit
33 audio reproduction unit
34 video display unit
35 command receiving unit
36 setting unit
37 control unit
301 sound collection means
302 reproduction means
303 photographing means
304 display means

Claims

A video conference system including a server and a plurality of clients each including sound collection means and photographing means,
The client
A client-side voice acquisition unit that transmits an acoustic signal collected by the sound collection means to the server;
A video acquisition unit that transmits a video signal captured by the photographing means to the server;
An audio reproduction unit for reproducing a distribution acoustic signal received from the server;
A video display unit for displaying a distribution video signal received from the server;
Including
The server
An audio processing unit that takes as input acoustic signals of a plurality of channels received from the plurality of clients and generates a transmission acoustic signal by predetermined audio processing;
An audio transmission unit for transmitting the transmission acoustic signal to a server included in another video conference system;
A video selection unit for determining a transmission video signal arbitrarily selected from a plurality of video signals received from the plurality of clients;
A server-side audio acquisition unit that receives, from the server included in the other video conference system, the distribution acoustic signal generated by the predetermined audio processing from acoustic signals of a plurality of channels;
An audio distribution unit for transmitting the distribution acoustic signal to the plurality of clients;
A video distribution unit for transmitting the distribution video signal arbitrarily selected from a plurality of video signals to the plurality of clients;
Including
The audio processing unit
A feature quantity sequence acquisition unit that receives the acoustic signals of the plurality of channels and obtains a feature quantity obtained by normalizing the magnitude of the acoustic signal in the voice section for each channel by the magnitude of the acoustic signal in the non-voice section;
A clustering unit configured to cluster feature quantity sequences made up of the feature quantities obtained for the plurality of channels, and to determine a signal section classification to which the feature quantity sequence belongs;
A spectrum calculation unit that converts the acoustic signal into a frequency domain in each of a plurality of time intervals, and obtains a plurality of amplitude spectra and phase spectra;
An enhancement processing unit that obtains a plurality of processed amplitude spectra by applying, to the plurality of amplitude spectra, processing that emphasizes the amplitude spectrum corresponding to the feature quantity sequence belonging to an enhancement signal section classification, which is one of the signal section classifications;
A phase adding unit that obtains a complex spectrum by adding the phase spectrum to the processed amplitude spectrum;
A time domain conversion unit that converts the complex spectrum into the time domain to obtain the transmission acoustic signal;
Including video conferencing system.

The video conferencing system according to claim 1,
The server includes a command transmission unit that transmits a request signal for displaying predetermined information to the plurality of clients based on the transmission video signal,
The video conference system, wherein the client includes a command receiving unit that displays predetermined information based on the transmission video signal according to the request signal.

The video conferencing system according to claim 2,
The video conference system, wherein the command receiving unit displays information indicating which client acquired the transmission video signal.

The video conference system according to any one of claims 1 to 3,
The video conference system, wherein the client is a smartphone.

The video conference system according to any one of claims 1 to 4,
The enhancement processing unit
A filter coefficient calculation unit for calculating a filter coefficient for filtering that emphasizes an amplitude spectrum corresponding to a feature amount sequence belonging to the enhancement signal section classification;
A filtering unit that filters the plurality of amplitude spectra with the filter coefficient to obtain the processed amplitude spectrum;
Including video conferencing system.

The video conference system according to any one of claims 1 to 5,
The video conference system, wherein the feature amount represents a ratio of a magnitude of the acoustic signal in the voice section to a magnitude of the acoustic signal in the non-voice section.

The video conference system according to any one of claims 1 to 6,
The audio processing unit
A sampling frequency conversion unit that performs sampling frequency conversion on the acoustic signals of the plurality of channels to obtain converted acoustic signals of a specific sampling frequency; and
A signal synchronization unit that synchronizes the converted acoustic signal between channels and obtains the acoustic signal;
Including video conferencing system.