JP2016092679A

JP2016092679A - Voice processing unit, program and method

Info

Publication number: JP2016092679A
Application number: JP2014227163A
Authority: JP
Inventors: 高詩石黒; Takashi Ishiguro
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2014-11-07
Filing date: 2014-11-07
Publication date: 2016-05-23
Anticipated expiration: 2034-11-07
Also published as: JP6476768B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing unit capable of mixing a larger number of voice channels in software at low cost. SOLUTION: A voice processing unit includes a plurality of voice processing means, each having a plurality of sound signal reception means for receiving reception sound data based on the reception sound of each time series, first synthesis means for generating first synthesized voice data by synthesizing the reception sound data received by the plurality of sound signal reception means, buffer means for holding the first synthesized voice data, second synthesis means for generating second synthesized voice data by synthesizing the first synthesized voice data generated in each of the other voice processing means, and third synthesis means for generating, for each transmission destination, transmission sound data by synthesizing the second synthesized voice data with the voice data obtained by removing the reception sound data related to that transmission destination from the reception sound data received by the plurality of sound signal reception means. SELECTED DRAWING: Figure 1

Description

The present invention relates to an audio processing apparatus, program, and method. It can be applied, for example, to the audio mixing processing of a conference server (for example, an apparatus such as an MCU (Multipoint Control Unit)) that constitutes a conference system providing a conference environment by connecting multiple sites over a network.

Conventionally, in a conference system that provides a conference environment by connecting multiple sites over a network, dedicated hardware is usually used to mix the audio exchanged among many sites.

In an apparatus that performs the audio mixing processing of a conventional conference system, using a high-compression coding scheme such as ITU-T G.729 increases the processing load of the decoder and encoder, which limits the number of channels that can be mixed. Furthermore, in recent years, from the standpoint of reducing the cost and easing the maintenance of network equipment, there has been demand to implement the mixing function in software on a general-purpose server, without dedicated hardware.

The technique described in Patent Document 1 is a conventional technique for addressing this problem.

Patent Document 1 describes reducing the load of the mixing computation and realizing multipoint audio mixing by using a plurality of communication blocks, each of which mixes a plurality of input/output audio data, together with a central mixer.

Japanese Patent Publication No. 2008-211291

In general, a general-purpose server is equipped with a plurality of CPUs, each composed of a plurality of CPU cores. However, when the central mixer of the apparatus in Patent Document 1 is implemented in software on a general-purpose server, the processing capacity of a single CPU core becomes the performance limit of the media processing thread. For example, on a 12-core server (2 CPUs × 6 cores), the number of channels that can be mixed is limited by 1/12 of the total CPU processing capacity.

Therefore, an audio processing apparatus, program, and method capable of mixing a larger number of audio channels in software at low cost are desired.

The audio processing apparatus according to a first aspect of the present invention (1) comprises a plurality of audio processing means, wherein (1-1) each of the audio processing means has: (1-2) a plurality of sound signal receiving means for receiving received sound data based on the received sound of each time series; (1-3) first synthesis means for generating first synthesized audio data by synthesizing the received sound data received by the plurality of sound signal receiving means; (1-4) buffer means for holding the first synthesized audio data; (1-5) second synthesis means for generating second synthesized audio data by synthesizing the first synthesized audio data generated by each of the other audio processing means; and (1-6) third synthesis means for generating, for each transmission destination, transmission sound data by synthesizing the second synthesized audio data with the audio data obtained by excluding the received sound data of that transmission destination from the received sound data received by the plurality of sound signal receiving means.

The audio processing program according to a second aspect of the present invention (1) causes a computer to function as a plurality of audio processing means, wherein (2) each of the audio processing means has: (2-1) a plurality of sound signal receiving means for receiving received sound data based on the received sound of each time series; (2-2) first synthesis means for generating first synthesized audio data by synthesizing the received sound data received by the plurality of sound signal receiving means; (2-3) buffer means for holding the first synthesized audio data; (2-4) second synthesis means for generating second synthesized audio data by synthesizing the first synthesized audio data generated by each of the other audio processing means; and (2-5) third synthesis means for generating, for each transmission destination, transmission sound data by synthesizing the second synthesized audio data with the audio data obtained by excluding the received sound data of that transmission destination from the received sound data received by the plurality of sound signal receiving means.

A third aspect of the present invention is an audio processing method executed by an audio processing apparatus, wherein: (1) the apparatus comprises a plurality of audio processing means; (2) each audio processing means comprises a plurality of sound signal receiving means, first synthesis means, buffer means, second synthesis means, and third synthesis means; (3) each sound signal receiving means receives received sound data based on the received sound of each time series; (4) the first synthesis means generates first synthesized audio data by synthesizing the received sound data received by the plurality of sound signal receiving means; (5) the buffer means holds the first synthesized audio data; (6) the second synthesis means generates second synthesized audio data by synthesizing the first synthesized audio data generated by each of the other audio processing means; and (7) the third synthesis means generates, for each transmission destination, transmission sound data by synthesizing the second synthesized audio data with the audio data obtained by excluding the received sound data of that transmission destination from the received sound data received by the plurality of sound signal receiving means.
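
The three synthesis stages recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `add_frames` and `mix_one_period` are hypothetical, frames are plain lists of integer samples, the buffer means is omitted (every stage is computed within a single processing period), and no clipping is applied.

```python
from typing import Dict, List

Frame = List[int]  # one processing unit of linear audio samples (assumed representation)


def add_frames(frames: List[Frame], length: int) -> Frame:
    """Sample-wise sum of equally long frames (no clipping, for clarity)."""
    total = [0] * length
    for frame in frames:
        for i, sample in enumerate(frame):
            total[i] += sample
    return total


def mix_one_period(groups: List[Dict[str, Frame]], length: int) -> Dict[str, Frame]:
    """groups[k] maps a terminal id to the frame received by processing means k."""
    # First synthesis: each processing means sums the frames it received itself.
    first = [add_frames(list(group.values()), length) for group in groups]
    out: Dict[str, Frame] = {}
    for k, group in enumerate(groups):
        # Second synthesis: sum the first-synthesis results of all OTHER means.
        second = add_frames([f for j, f in enumerate(first) if j != k], length)
        for dest in group:
            # Third synthesis: local frames except the destination's own, plus 'second'.
            local = [f for term, f in group.items() if term != dest]
            out[dest] = add_frames(local + [second], length)
    return out
```

With two processing means handling terminals `a`, `b` and `c`, `d`, terminal `a` receives the sum of `b`, `c`, and `d`, so no participant hears its own voice.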

According to the present invention, it is possible to provide an audio processing apparatus capable of mixing a larger number of audio channels in software at low cost.

FIG. 1 is a block diagram showing a configuration example of the media processing threads that run on the multipoint audio mixing apparatus according to the first embodiment.
FIG. 2 is a block diagram showing the hardware configuration and connection configuration of the multipoint audio mixing apparatus according to the first embodiment.
FIG. 3 is a block diagram showing an example of the configuration of a terminal connected to the multipoint audio mixing apparatus according to the first embodiment.
FIG. 4 is a block diagram showing an example of the internal configuration of the audio reception process according to the first embodiment.
FIG. 5 is a block diagram showing an example of the internal configuration of the audio transmission process according to the first embodiment.
FIG. 6 is an explanatory diagram showing a configuration example of the circular buffer according to the first embodiment.
FIG. 7 is an explanatory diagram showing an example of the internal configuration of the mixer according to the first embodiment.
FIG. 8 is a timing chart showing an example of the operation of the multipoint audio mixing apparatus (operation of the media processing threads) according to the first embodiment.
FIG. 9 is a timing chart showing an example of the operation of the multipoint audio mixing apparatus (operation of the media processing threads) according to the second embodiment.

(A) First Embodiment
Hereinafter, a first embodiment of the audio processing apparatus, program, and method according to the present invention will be described in detail with reference to the drawings. The following describes an example in which the audio processing apparatus and audio processing program of the present invention are applied to a multipoint audio mixing apparatus.

(A-1) Configuration of the First Embodiment
FIG. 2 is a block diagram showing an example of the hardware configuration and connection configuration of the multipoint audio mixing apparatus 10 of this embodiment.

As shown in FIG. 2, the multipoint audio mixing apparatus 10 is connected via a network 40 to N × M terminals 20-1_1 to 20-M_N (terminals 20-1_1, 20-1_2, ..., 20-1_N, 20-2_1, 20-2_2, ..., 20-2_N, ..., 20-M_1, 20-M_2, ..., 20-M_N). N and M are arbitrary integers of 2 or more. N × M is the maximum number of terminals 20 that can be connected to the multipoint audio mixing apparatus 10; the number of terminals 20 actually connected may be N × M or fewer.

An IP network, for example, can be used as the network 40, but the network connection configuration between the multipoint audio mixing apparatus 10 and each terminal 20 is not limited. In this embodiment, it is assumed that audio (conference audio) data can be transmitted and received in real time by IP communication between the multipoint audio mixing apparatus 10 and each terminal 20.

FIG. 3 is a block diagram showing an example of the internal configuration of the terminal 20.

Each terminal 20 functions as a conference terminal (telephone terminal). The specific configuration of each terminal 20 is not limited to that of FIG. 3; for example, an IP telephone or a PC on which a softphone application is installed can be used.

In this embodiment, the following description assumes that every terminal 20 has the configuration shown in the block diagram of FIG. 3, although the specific configuration of each terminal 20 is not limited to it. For example, an IP telephone or a computer functioning as a softphone (for example, a PC, smartphone, or tablet PC on which a softphone application is installed) can be used.

The terminal 20 shown in FIG. 3 has a call processing unit 21, a communication unit 22 serving as a network interface, a microphone 23 that captures the user's voice, and a speaker 24 that outputs audio to the user. The microphone 23 and the speaker 24 function as the handset of the terminal 20; for example, the handset of a telephone or speakerphone, or a headset, can be used.

The call processing unit 21 performs call-related processing such as processing of audio data and audio signals and call control processing. When the terminal 20 is a general-purpose computer such as a PC, the call processing unit 21 corresponds to the softphone application.

The call processing unit 21 establishes a telephone call with the multipoint audio mixing apparatus 10 and transmits and receives audio data in real time. The protocols used for call control and audio communication between the terminal 20 and the multipoint audio mixing apparatus 10 are not limited; it is assumed here that call control and audio communication are possible using protocols such as SIP (Session Initiation Protocol) and RTP (Real-time Transport Protocol).

The call processing unit 21 applies a predetermined encoding process (codec) to the audio signal captured by the microphone 23 and transmits the resulting audio data to the multipoint audio mixing apparatus 10. The call processing unit 21 also decodes the audio data received from the multipoint audio mixing apparatus 10 into an audio signal and outputs it from the speaker 24. The coding scheme (codec) for the sound signals exchanged between the multipoint audio mixing apparatus 10 and the terminal 20 is not limited; for example, coding schemes such as ITU-T G.711 and G.729 can be used.

Next, the hardware configuration of the multipoint audio mixing apparatus 10 will be described.

The multipoint audio mixing apparatus 10 includes a data processing unit 11 and a communication unit 12.

The multipoint audio mixing apparatus 10 can be configured, for example, by installing the audio processing program of the embodiment on a general-purpose computer (server apparatus) such as a PC or workstation.

The communication unit 12 is a communication interface for connecting to the network 40.

The data processing unit 11 functions as data processing means (a computer) that performs, in software, the audio processing involved in telephone (conference) communication with the terminals 20, such as transmission and reception of audio data and audio mixing.

The data processing unit 11 includes a processor 111 and a memory 112 so that it can function as data processing means (a computer).

Although FIG. 2 shows the processor 111 as a single block, it may physically consist of a plurality of processors. The processor 111 may also be a processor having a plurality of cores (a multi-core processor).

Likewise, although FIG. 2 shows the memory 112 as a single block, its specific configuration is not limited. For example, it may combine several types of memory media, such as fast volatile memory (for example, SRAM or DRAM) and non-volatile memory (for example, flash memory or an HDD). In this embodiment, the description assumes that the audio processing program of the embodiment is installed on the computer constituted by the data processing unit 11.

Next, the functional configuration (thread configuration) obtained when the audio processing program of the embodiment is installed in the data processing unit 11 will be described with reference to FIG. 1.

As shown in FIG. 1, the data processing unit 11 (the audio processing program of the embodiment) of the multipoint audio mixing apparatus 10 uses M media processing threads 30 (30-1 to 30-M), which serve as the audio processing means, to realize audio mixing for N × M channels (the channels that transmit audio data to and receive audio data from the N × M terminals 20).

Each media processing thread 30 is activated periodically, for example every 10 ms, and in each periodic run processes audio data using one period's worth (10 ms in this embodiment) as one processing unit. Each media processing thread 30 is assumed to be capable of processing N channels (that is, the audio data streams transmitted and received by N terminals 20). The means by which the data processing unit 11 creates and manages the media processing threads 30 (30-1 to 30-M), such as the middleware or programming-language environment used for thread creation and management, is not limited, and various configurations can be used. The activation interval and the duration of the audio processing unit of the media processing threads 30 are likewise not limited to 10 ms.
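
As a rough sizing aid, the following sketch computes how many samples one processing unit contains and how many channels M threads cover in total. The 8 kHz sampling rate is an assumption (the rate used by narrowband telephony codecs such as G.711 and G.729), as the text does not state one; the function names and the example values of N and M are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumption: narrowband telephony sampling rate
PERIOD_MS = 10         # processing unit given as an example in the text


def samples_per_period(sample_rate_hz: int = SAMPLE_RATE_HZ,
                       period_ms: int = PERIOD_MS) -> int:
    """Number of linear audio samples handled per channel in one periodic run."""
    return sample_rate_hz * period_ms // 1000


def total_channels(n_per_thread: int, m_threads: int) -> int:
    """Maximum number of channels: N channels on each of M media processing threads."""
    return n_per_thread * m_threads


print(samples_per_period())     # samples per 10 ms frame at 8 kHz
print(total_channels(100, 12))  # e.g. N = 100 channels on each of M = 12 threads
```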

Each media processing thread 30 has N audio reception processes 31. In FIG. 1, the media processing threads 30-1 to 30-M have audio reception processes 31-1_1 to 31-1_N, 31-2_1 to 31-2_N, ..., 31-M_1 to 31-M_N, respectively. The media processing threads 30-1 to 30-M also have audio transmission processes 36-1_1 to 36-1_N, 36-2_1 to 36-2_N, ..., 36-M_1 to 36-M_N, respectively. The audio reception processes 31-1_1 to 31-M_N receive and process the (encoded) audio data supplied from the terminals 20-1_1 to 20-M_N, respectively. The audio transmission processes 36-1_1 to 36-M_N transmit the mixed (and encoded) audio data to the terminals 20-1_1 to 20-M_N, respectively.

FIG. 4 is a block diagram showing the internal configuration of each audio reception process 31.

As shown in FIG. 4, each audio reception process 31 has an RTP reception 311, a jitter buffer 312, and a decoder 313. The RTP reception 311 performs reception processing on the RTP packets that have arrived by the time of the periodic run and places each RTP payload (encoded audio data) in the jitter buffer 312. Then 10 ms (one predetermined processing unit) of encoded audio data is taken out of the jitter buffer 312 and decoded into linear audio data by the decoder 313, which outputs 10 ms of linear audio data.
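
The hand-off from RTP reception to the decoder can be sketched as a byte queue that releases exactly one processing unit per periodic run. This is a simplified illustration, not the jitter buffer 312 itself: the `JitterBuffer` class and its methods are hypothetical, payloads are assumed to arrive in order (no sequence-number reordering or loss handling), and the 80-byte frame size assumes G.711 (8000 bytes per second, hence 80 bytes per 10 ms).

```python
from typing import Optional


class JitterBuffer:
    """Queue of encoded audio bytes that yields fixed-size frames (hypothetical sketch)."""

    def __init__(self, frame_bytes: int) -> None:
        self.frame_bytes = frame_bytes  # encoded bytes in one processing unit
        self._queue = bytearray()

    def push(self, payload: bytes) -> None:
        """Store the payload of an RTP packet that arrived before this run."""
        self._queue.extend(payload)

    def pop_frame(self) -> Optional[bytes]:
        """Take out one processing unit, or None if not enough data has arrived."""
        if len(self._queue) < self.frame_bytes:
            return None
        frame = bytes(self._queue[:self.frame_bytes])
        del self._queue[:self.frame_bytes]
        return frame


jb = JitterBuffer(frame_bytes=80)   # assumption: G.711, 10 ms per frame
jb.push(bytes(160))                 # one 20 ms packet holds two frames
first_frame = jb.pop_frame()        # handed to the decoder in this run
second_frame = jb.pop_frame()       # handed to the decoder in the next run
```

A frame returned by `pop_frame` would then be passed to whatever decoder the codec requires; returning `None` leaves the underrun policy (for example, substituting silence) to the caller.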

FIG. 5 is a block diagram showing the internal configuration of each audio transmission process 36.

As shown in FIG. 5, each audio transmission process 36 has an encoder 361 and an RTP transmission 362. The encoder 361 supplies 10 ms (one predetermined processing unit) of encoded audio data to the RTP transmission 362. Once encoded audio data covering one RTP packetization period has accumulated, the RTP transmission 362 generates an RTP packet and transmits it. In this embodiment, the RTP packetization period is taken to be 20 ms as an example, but it is not limited to 20 ms.
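
The accumulation step in the RTP transmission can be sketched as follows. The `Packetizer` class is hypothetical and returns only the concatenated payload; building the RTP header and sending the packet are out of scope. With a 10 ms processing unit and a 20 ms packetization period, two frames make up one packet.

```python
from typing import List, Optional


class Packetizer:
    """Collects encoded frames until one packetization period is filled (sketch)."""

    def __init__(self, frames_per_packet: int = 2) -> None:
        # 20 ms packetization period / 10 ms processing unit = 2 frames
        self.frames_per_packet = frames_per_packet
        self._pending: List[bytes] = []

    def push(self, encoded_frame: bytes) -> Optional[bytes]:
        """Add one encoded frame; return a full payload when ready, else None."""
        self._pending.append(encoded_frame)
        if len(self._pending) < self.frames_per_packet:
            return None
        payload = b"".join(self._pending)
        self._pending.clear()
        return payload


packetizer = Packetizer()
nothing_yet = packetizer.push(b"\x01" * 80)  # first 10 ms: keep accumulating
payload = packetizer.push(b"\x02" * 80)      # second 10 ms: 20 ms payload is ready
```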

Each media processing thread 30 (30-1 to 30-M) also has an addition process 32 (32-1 to 32-M), a circular buffer 33 (33-1 to 33-M), an addition process 34 (34-1 to 34-M), and a mixer 35 (35-1 to 35-M). For example, the media processing thread 30-1 has the addition process 32-1, the circular buffer 33-1, the addition process 34-1, and the mixer 35-1.

In the following, the media processing threads 30-1 to 30-M are called the 1st to Mth media processing threads 30, respectively. An arbitrary mth media processing thread 30 (m is any integer from 1 to M) is written as media processing thread 30-m. Similarly, the (m−1)th media processing thread 30 is called media processing thread 30-(m−1), and the (m+1)th is called media processing thread 30-(m+1). Accordingly, the media processing thread 30-m has the audio reception processes 31-m_1 to 31-m_N, the addition process 32-m, the circular buffer 33-m, the addition process 34-m, the mixer 35-m, and the audio transmission processes 36-m_1 to 36-m_N.

The order of the media processing threads 30-1 to 30-M is treated as cyclic. For example, the media processing threads 30 adjacent in order to the media processing thread 30-1 are the media processing threads 30-2 and 30-M. Accordingly, the notation "media processing threads 30-(m+1) to 30-(m+M−1)" denotes all of the media processing threads 30-1 to 30-M except the media processing thread 30-m. For example, when m = 3, the media processing threads 30-(m+1) to 30-(m+M−1) are the media processing threads 30-1 to 30-2 and 30-4 to 30-M.
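
The cyclic numbering can be expressed with modular arithmetic on 1-based thread numbers. The helper names `other_threads` and `neighbors` are hypothetical and used only for illustration.

```python
from typing import List, Tuple


def other_threads(m: int, M: int) -> List[int]:
    """Thread numbers (m+1) .. (m+M-1), wrapped cyclically into the range 1..M."""
    return [((m - 1 + k) % M) + 1 for k in range(1, M)]


def neighbors(m: int, M: int) -> Tuple[int, int]:
    """Cyclically adjacent threads: (previous, next)."""
    return ((m - 2) % M) + 1, (m % M) + 1


print(other_threads(3, 5))  # every thread except 3, starting from 4 and wrapping around
print(neighbors(1, 5))      # thread 1 is adjacent to thread M and thread 2
```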

The internal configuration of each media processing thread 30 is described below, taking the mth media processing thread 30-m as the example.

The addition process 32-m adds (synthesizes) the unit-time audio data (for example, 10 ms of linear audio data) output from each of the audio reception processes 31-m_1 to 31-m_N and writes the resulting 10 ms of audio data to the circular buffer 33-m. Various audio data addition (synthesis) techniques can be used by the addition process 32 to add the audio of a plurality of audio data and generate the added audio data, so a detailed description is omitted.
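
Since the text leaves the addition technique open, the following is only one possible choice: a sample-wise sum of 16-bit linear PCM frames with saturation at the int16 limits. The sample format, the saturation policy, and the name `add_saturating` are all assumptions made for illustration.

```python
from typing import List

INT16_MIN, INT16_MAX = -32768, 32767  # assumption: 16-bit linear PCM samples


def add_saturating(frames: List[List[int]]) -> List[int]:
    """Sum equally long frames sample by sample, clamping to the int16 range."""
    return [max(INT16_MIN, min(INT16_MAX, sum(samples)))
            for samples in zip(*frames)]


mixed = add_saturating([
    [1000, 30000, -30000],
    [2000, 30000, -30000],
])
print(mixed)  # the second and third samples are clamped instead of wrapping around
```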

FIG. 6 is a block diagram showing the internal configuration of the circular buffer 33-m.

The circular buffer 33 has Z buffer planes 331-1 to 331-Z, a read position pointer group 332, write plane selection means 333, read plane selection means 334, and read position pointer selection means 335. The circular buffer 33-m also holds a write position pointer WP(m). Each buffer plane 331 is a buffer that can hold one unit of the audio data processed by the media processing thread 30 (in this embodiment, 10 ms of linear audio data). In the following, the pointer values (address values) corresponding to the buffer planes 331-1 to 331-Z are 1 to Z, respectively.

The write position pointer WP(m) holds the pointer value at which audio data is to be written (a pointer value indicating one of the buffer planes 331-1 to 331-Z). The write plane selection means 333 first updates (increments) the write position pointer WP(m) and then writes 10 ms of linear audio data to the buffer plane 331 corresponding to WP(m). The pointer value of WP(m) is incremented by 1 at each write; once it has reached Z, it becomes 1 at the next write. In other words, the write plane selection means 333 writes audio data to the buffer planes 331-1 to 331-Z cyclically.
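
The increment-then-write behaviour with 1-based pointer values can be sketched as below. The names `next_pointer` and `WriteSide` are hypothetical, and the initial pointer value is an assumption: the text does not give one, so the sketch starts at Z so that the first write lands on plane 1.

```python
from typing import Dict, List, Optional


def next_pointer(ptr: int, Z: int) -> int:
    """Increment a 1-based plane pointer, wrapping from Z back to 1."""
    return 1 if ptr >= Z else ptr + 1


class WriteSide:
    """Write side of a circular buffer with Z planes (hypothetical sketch)."""

    def __init__(self, Z: int) -> None:
        self.Z = Z
        self.planes: Dict[int, Optional[List[int]]] = {i: None for i in range(1, Z + 1)}
        self.wp = Z  # assumption: chosen so the first write goes to plane 1

    def write(self, frame: List[int]) -> int:
        """Update the write pointer first, then store the frame; return the plane used."""
        self.wp = next_pointer(self.wp, self.Z)
        self.planes[self.wp] = frame
        return self.wp


writer = WriteSide(Z=3)
used = [writer.write([i]) for i in range(4)]
print(used)  # planes are used in order and the fourth write wraps to plane 1
```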

The read position pointer group 332 contains a read position pointer RP for each of the media processing threads 30-(m+1) to 30-(m+M−1), that is, for every media processing thread other than the buffer's own media processing thread 30-m (M−1 read position pointers RP in total). For example, in FIG. 6, the read position pointer group 332 of the media processing thread 30-m contains the read position pointers RP(1), RP(2), ..., RP(m−1), RP(m+1), ..., RP(M). Each read position pointer RP holds the pointer value (a value indicating one of the buffer planes 331-1 to 331-Z) of the audio data to be read by the corresponding media processing thread 30.

The read position pointer selection means 335 selects the read position pointer RP corresponding to one of the media processing threads 30, updates (increments) the pointer value of that read position pointer RP, and then supplies the pointer value to the read plane selection means 334. The read plane selection means 334 reads the audio data from the buffer plane 331 corresponding to the supplied pointer value and outputs it to the media processing thread 30 selected by the read position pointer selection means 335.

Like the write plane selection means 333, the read position pointer selection means 335 operates cyclically: it increments the pointer value of the read position pointer RP by 1 at each read, and once the value has reached Z, it sets the value to 1 at the next read.

For example, for the media processing thread 30-(m-1), the read position pointer RP(m-1) is updated, and then one unit (10 ms) of audio data (linear audio data) is read from the buffer surface corresponding to the read position pointer RP(m-1). The read position pointer RP(m-1) is incremented by 1 at each read trigger and, once it has reached Z, is set to 1 at the next read trigger, cycling through the buffer surfaces. The same read processing is performed for the media processing threads 30-1 to 30-(m-2) and 30-(m+1) to 30-M.
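The cyclic pointer handling described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `CircularBuffer`, its method names, and the convention that a pointer value of 0 means "not yet used" are hypothetical. Pointer values otherwise run from 1 to Z, as in the description.

```python
class CircularBuffer:
    """Z buffer surfaces, one write pointer WP, and one read pointer RP per reader.

    Hypothetical sketch: pointer values run 1..Z as in the description,
    and 0 is used here (an assumption) to mean "not yet used".
    """

    def __init__(self, num_surfaces, reader_ids):
        self.z = num_surfaces
        self.surfaces = [None] * num_surfaces      # surfaces[0] is surface 1
        self.wp = 0                                # write position pointer WP
        self.rp = {rid: 0 for rid in reader_ids}   # read position pointer group

    @staticmethod
    def _advance(pointer, z):
        # 1, 2, ..., Z, then back to 1 at the next trigger
        return 1 if pointer >= z else pointer + 1

    def write(self, frame):
        """Write trigger: advance WP, then store one frame in that surface."""
        self.wp = self._advance(self.wp, self.z)
        self.surfaces[self.wp - 1] = frame

    def read(self, reader_id):
        """Read trigger: advance the reader's RP, then return that surface."""
        self.rp[reader_id] = self._advance(self.rp[reader_id], self.z)
        return self.surfaces[self.rp[reader_id] - 1]
```

Each reader advances only its own pointer, so one reader falling behind does not affect the position seen by the others. This sketch has no protection against a reader overtaking the writer.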

The addition process 34 requests and obtains audio data from the other media processing threads 30 (their circular buffers 33), namely the audio data held in the buffer surface 331 corresponding to the pointer value of the read position pointer RP, adds (synthesizes) the obtained audio data, and supplies the result to the mixer 35. That is, the addition process 34-m reads 10 ms of linear audio data from each circular buffer 33 other than that of the media processing thread 30-m (circular buffers 33-1 to 33-(m-1) and 33-(m+1) to 33-M), adds (synthesizes) them, and passes the resulting 10 ms of linear audio data to the mixer 35-m. Various audio data addition (synthesis) techniques can be applied to the process by which the addition process 34 adds the audio of multiple pieces of audio data and generates the added audio data, so a detailed description is omitted.
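As one possible illustration of such an addition, the sketch below sums frames sample by sample. The description leaves the addition technique open, so everything here is an assumption: the function name `add_frames` is hypothetical, frames are assumed to be equally long lists of integer samples, and the result is assumed to be clipped to the 16-bit range.

```python
def add_frames(frames, lo=-32768, hi=32767):
    """Sample-wise sum of equally long frames, clipped to [lo, hi].

    Assumption: frames are lists of integer samples of equal length;
    the 16-bit clipping range is illustrative only.
    """
    if not frames:
        return []
    length = len(frames[0])
    return [max(lo, min(hi, sum(f[i] for f in frames))) for i in range(length)]
```

For example, `add_frames([[1, 2], [3, 4], [5, 6]])` adds three two-sample frames into one frame.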

Next, the internal configuration of the mixer 35 will be described.

FIG. 7 is an explanatory diagram showing the internal configuration of the mixer 35-m of an arbitrary media processing thread 30-m.

The mixer 35-m has mixer units 351-1 to 351-N, which synthesize the audio data to be supplied to the audio transmission processes 36-m_1 to 36-m_N, respectively.

The mixer unit 351-n generates audio data by synthesizing (mixing) the audio data supplied from the audio reception processes 31-m_(n+1) to 31-m_(n+N-1) with the audio data supplied from the addition process 34-m, and supplies it to the corresponding audio transmission process 36. Specifically, as shown in FIG. 7, for example, the mixer unit 351-1 supplies synthesized audio data to the audio transmission process 36-m_1. The mixer unit 351-1 therefore synthesizes (mixes) the audio data supplied from the audio reception processes 31-m_2 to 31-m_N with the audio data supplied from the addition process 34-m. Various audio data synthesis techniques can be applied to the process by which the mixer unit 351 synthesizes the audio of multiple pieces of audio data and generates the synthesized audio data, so a detailed description is omitted.

As a result, the terminal 20-1_1, which is the destination of the audio transmission process 36-1_1, is sent audio data obtained by synthesizing (mixing) the audio data of all the terminals 20-1_2 to 20-M_N, that is, everything other than the audio data transmitted from that terminal itself (the audio data supplied to the audio reception process 31-1_1). The same applies to all the other terminals 20-1_2 to 20-M_N. In this way, the multipoint audio mixing apparatus 10 can send to each of the M × N terminals 20-1_1 to 20-M_N audio data synthesized (mixed) with that destination terminal's own audio excluded.
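The staged mixing that yields this result can be sketched as follows. This is a simplified, single-threaded illustration and not the apparatus itself: the function names `add` and `mix_all` are hypothetical, frames are assumed to be equally long lists of integer samples, and no clipping or circular buffering is modelled.

```python
def add(frames, length):
    """Sample-wise sum of frames (a sum over no frames gives silence)."""
    return [sum(f[i] for f in frames) for i in range(length)]


def mix_all(inputs):
    """inputs[m][n] is one frame from terminal n handled by thread m.

    Returns outputs[m][n], the frame sent back to that terminal.
    """
    length = len(inputs[0][0])
    # Stage 1 (addition process 32): each thread sums its own N inputs.
    first = [add(thread, length) for thread in inputs]
    outputs = []
    for m, thread in enumerate(inputs):
        # Stage 2 (addition process 34): sum the first-stage results of the other threads.
        second = add([first[k] for k in range(len(inputs)) if k != m], length)
        # Stage 3 (mixer unit 351-n): local inputs except terminal n, plus stage 2.
        outputs.append([
            add([thread[j] for j in range(len(thread)) if j != n] + [second], length)
            for n in range(len(thread))
        ])
    return outputs
```

With two threads of two terminals each and one-sample frames `[[[1], [2]], [[10], [20]]]`, every terminal receives the total of 33 minus its own sample, which is the "all terminals except the destination" mix described above.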

(A-2) Operation of the First Embodiment
Next, the operation of the multipoint audio mixing apparatus 10 of the first embodiment having the above configuration (the audio processing method of the embodiment) will be described.

FIG. 8 is a timing chart showing an example of the write and read timing of the circular buffer 33-m in this embodiment.

In the example shown in FIG. 8, the number of buffer surfaces 331 of the media processing thread 30 is eight (that is, Z = 8).

FIG. 8(a) shows the periodic processing timing of the media processing thread 30-m, the write timing of the circular buffer 33-m, and the change timing and pointer value of the write position pointer WP(m). FIG. 8(b) shows the periodic processing timing of the media processing thread 30-(m-1), the read timing of the circular buffer 33-(m-1), and the change timing and pointer value of the read position pointer RP(m-1). FIG. 8(c) shows the periodic processing timing of the media processing thread 30-(m+1), the read timing of the circular buffer 33-(m+1), and the change timing and pointer value of the read position pointer RP(m+1). The write position pointer WP shown in FIG. 8(a) belongs to the media processing thread 30-m. The read position pointer RP(m-1) shown in FIG. 8(b) and the read position pointer RP(m+1) shown in FIG. 8(c) both belong to the read position pointer group 332 of the media processing thread 30-m.

FIG. 8 illustrates the data transfer from the media processing thread 30-m to the media processing thread 30-(m-1) and from the media processing thread 30-m to the media processing thread 30-(m+1). Data transfer from the media processing thread 30-m to the media processing threads 30-1 to 30-(m-2) and 30-(m+2) to 30-M works in the same way, so a detailed description is omitted.

Here, assume that the media processing thread 30-m is activated periodically, for example every 10 ms, with no disturbance of the activation cycle. A write trigger for the circular buffer 33-m then occurs in the media processing thread 30-m every 10 ms. At the first write trigger, the write position pointer WP(m) is set to 1 and 10 ms of linear audio data is written to the buffer surface 331-1. At each subsequent write trigger, the write position pointer WP(m) is incremented and 10 ms of linear audio data is written to the corresponding buffer surface 331. At the write trigger after the write position pointer WP(m) has reached 8, the number of buffer surfaces, WP(m) is set to 1 and 10 ms of linear audio data is written to the buffer surface 331-1.

Assume also that the media processing thread 30-(m-1), like the media processing thread 30-m, is activated every 10 ms with no disturbance of the activation cycle. A read trigger for the circular buffer 33-m then occurs in the media processing thread 30-(m-1) every 10 ms.

The media processing thread 30-(m-1) starts read processing, for example, after data has been written to the buffer surface 331-1 of the media processing thread 30-m.

When a read trigger of the media processing thread 30-(m-1) occurs after the media processing thread 30-m has written 10 ms of linear audio data to the buffer surface 331-1, this is the first read trigger of the media processing thread 30-(m-1). The media processing thread 30-m then sets the read position pointer RP(m-1) to 1, and the audio data (10 ms of linear audio data) held in the buffer surface 331-1 of the media processing thread 30-m is read into the media processing thread 30-(m-1). At each subsequent read trigger from the media processing thread 30-(m-1), the media processing thread 30-m increments the read position pointer RP(m-1) and supplies the audio data (10 ms of linear audio data) from the corresponding buffer surface 331. At the read trigger after the read position pointer RP(m-1) has reached 8 (= Z), the number of buffer surfaces 331, RP(m-1) is set to 1.

Here, the media processing thread 30-(m+1), like the media processing threads 30-m and 30-(m-1), is assumed to be activated every 10 ms. As shown in FIG. 8, however, assume that its periodic processing takes extra time in the cycle in which it reads the audio data of the buffer surface 331-5 of the media processing thread 30-m. Assume further that, as shown in FIG. 8, the activation cycle of the media processing thread 30-(m+1) is disturbed in the cycles in which it reads the buffer surfaces 331-5, 331-6, 331-7, and 331-8. In software (thread) processing in the data processing unit 11, the periodic processing time can become longer than usual, for example while waiting for data to be written to a hard disk, and the activation cycle can be disturbed.

In this case, in the cycles in which the media processing thread 30-(m+1) reads the buffer surfaces 331-5, 331-6, 331-7, and 331-8, the disturbed activation cycle delays the RTP transmission processing and causes jitter, but the desired mixing process can continue.

(A-3) Effects of the First Embodiment
According to the first embodiment, the following effects can be achieved.

As described above, the multipoint audio mixing apparatus 10 can send to each of the M × N terminals 20-1_1 to 20-M_N audio data synthesized (mixed) with the audio data of that destination terminal 20 excluded.

In the data processing unit 11, for example, the media processing threads 30 can run in parallel on different CPU cores. Whereas mixing N channels on one CPU core was previously the performance limit, N × M channels can now be mixed. The multipoint audio mixing apparatus 10 of the present invention can therefore process N × M channels of audio data in software on a general-purpose OS. In particular, because each media processing thread 30 has an addition process 32, a circular buffer 33, an addition process 34, and a mixer 35, it collects the audio data synthesized by the other media processing threads 30 and synthesizes it in stages. The multipoint audio mixing apparatus 10 thus needs no component that centrally processes the audio data of the multiple media processing threads 30, and mixes N × M channels using only audio data processing distributed across those threads.

(B) Second Embodiment
A second embodiment of the audio processing apparatus and mixing program according to the present invention is described in detail below with reference to the drawings. The following describes an example in which the audio processing apparatus and mixing program of the present invention are applied to a multipoint audio mixing apparatus.

(B-1) Configuration of the Second Embodiment
The configuration of the multipoint audio mixing apparatus 10 of the second embodiment can also be illustrated using FIGS. 1 to 7 described above. Only the differences from the first embodiment are described below.

The operation timing of the media processing threads 30 may drift relative to one another because of disturbances of the activation cycle and the like. In that case, for some media processing threads 30, a read position pointer RP may overtake the write position pointer WP, so that data that is incorrect in time-series terms is processed. The second embodiment therefore performs processing that absorbs the operation timing differences between the media processing threads 30 caused by activation cycle disturbances.

Specifically, in the second embodiment, for example, the media processing threads 30 other than one or more media processing threads 30-m serving as an arbitrary reference start read processing only after a predetermined number of buffer surfaces 331 of the media processing thread 30-m have been written (in this embodiment, as an example, after data has been written to the buffer surface 331-2). In addition, the second embodiment adds processing that, at each audio data read trigger of a media processing thread 30 (a trigger to read from the circular buffer 33 of another media processing thread 30), keeps that thread's read position pointer RP from overtaking the write position pointer WP(m) (for example, by not incrementing the pointer when it would overtake).

(B-2) Operation of the Second Embodiment
Next, the operation of the multipoint audio mixing apparatus 10 of the second embodiment having the above configuration (the audio processing method of the embodiment) will be described.

FIG. 9 is a timing chart showing an example of the write and read timing of the circular buffer 33-m in the second embodiment.

In the example shown in FIG. 9, the number of buffer surfaces 331 of the media processing thread 30 is eight (that is, Z = 8).

FIG. 9(a) shows the periodic processing timing of the media processing thread 30-m, the write timing of the circular buffer 33-m, and the change timing and pointer value of the write position pointer WP(m). FIG. 9(b) shows the periodic processing timing of the media processing thread 30-(m-1), the read timing of the circular buffer 33-(m-1), and the change timing and pointer value of the read position pointer RP(m-1). FIG. 9(c) shows the periodic processing timing of the media processing thread 30-(m+1), the read timing of the circular buffer 33-(m+1), and the change timing and pointer value of the read position pointer RP(m+1). The write position pointer WP shown in FIG. 9(a) belongs to the media processing thread 30-m. The read position pointer RP(m-1) shown in FIG. 9(b) and the read position pointer RP(m+1) shown in FIG. 9(c) both belong to the read position pointer group 332 of the media processing thread 30-m.

FIG. 9 illustrates the data transfer from the media processing thread 30-m to the media processing thread 30-(m-1) and from the media processing thread 30-m to the media processing thread 30-(m+1). Data transfer from the media processing thread 30-m to the media processing threads 30-1 to 30-(m-2) and 30-(m+2) to 30-M works in the same way, so a detailed description is omitted.

The media processing thread 30-m is activated, for example, every 10 ms, but assume that its periodic processing takes extra time in the cycle in which it writes the buffer surface 331-4, and that the activation cycle is disturbed in the cycles in which it writes the buffer surfaces 331-5, 331-6, 331-7, 331-8, and 331-1. In software processing, the periodic processing time can become longer than usual, for example while waiting for data to be written to a hard disk, and the activation cycle can be disturbed.

Assume that the media processing thread 30-(m-1), like the media processing thread 30-m, is activated every 10 ms and that its activation cycle is not disturbed. A read trigger for the circular buffer 33-m then occurs in the media processing thread 30-(m-1) every 10 ms.

In this embodiment, the media processing threads 30 other than the media processing thread 30-m start read processing after data has been written to the buffer surface 331-2 of the media processing thread 30-m. The media processing thread 30-(m-1) therefore starts read processing after data has been written to the buffer surface 331-2 of the media processing thread 30-m. As shown in FIG. 9, when a read trigger of the media processing thread 30-(m-1) occurs after the media processing thread 30-m has written 10 ms of audio data (linear audio data) to the buffer surface 331-2, this is the first read trigger. At this point, the media processing thread 30-m sets the read position pointer RP(m-1) to 1 and reads out 10 ms of audio data (linear audio data) from the buffer surface 331-1. At each subsequent read trigger, the media processing thread 30-m increments the read position pointer RP(m-1) and reads out 10 ms of audio data (linear audio data) from the corresponding buffer surface 331.

As shown in FIG. 9, at the first read trigger at which the read position pointer RP(m-1) is 5 (a read trigger of the media processing thread 30-m), the write position pointer WP(m) of the media processing thread 30-m is 5. At this point, the latest audio data has been written to the buffer surface 331-5 of the media processing thread 30-m, and no audio data has yet been written to the buffer surface 331-6. At this read trigger, the media processing thread 30-(m-1) therefore holds the read position pointer RP(m-1) at its previous value (does not increment it) and reads the audio data of the buffer surface 331-5. Because the media processing thread 30-(m-1) then reads the audio data of the buffer surface 331-5 twice in a row, audio continuity is lost momentarily at this trigger, but it is maintained from then on.
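The delayed start and the hold-instead-of-overtake rule can be traced with a small simulation. This is a hedged sketch, not the embodiment's code: the names `next_read_pointer` and `simulate` are hypothetical, each tick is assumed to run the writer first (if it is not stalled) and then the reader, and a pointer value of 0 is assumed to mean "not yet used".

```python
def next_read_pointer(rp, wp, z):
    """Advance RP cyclically unless it already sits on the surface WP wrote last."""
    if rp == wp:
        return rp                      # hold: the next surface is not written yet
    return 1 if rp >= z else rp + 1    # 1, 2, ..., Z, then back to 1


def simulate(writer_ticks, z=8, start_after=2):
    """Return the surface read at each read trigger.

    writer_ticks[i] is True when the writer manages to write in tick i.
    The reader starts once surface `start_after` has been written.
    """
    wp, rp, started, trace = 0, 0, False, []
    for writes in writer_ticks:
        if writes:
            wp = 1 if wp >= z else wp + 1
        if not started and wp >= start_after:
            started = True
        if started:
            rp = next_read_pointer(rp, wp, z)
            trace.append(rp)
    return trace
```

If the writer runs for five ticks, stalls for two, and then resumes, `simulate([True] * 5 + [False] * 2 + [True])` gives `[1, 2, 3, 4, 5, 5, 6]`: surface 5 is read twice while the writer is stalled, and normal reading resumes afterwards.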

Instead of reading the same audio data twice in a row, the media processing thread 30-(m-1) may generate pseudo audio (dummy audio) from past audio by performing processing similar to the packet loss concealment performed in a decoder. Audio continuity can then be maintained at this trigger as well. At the read trigger after the read position pointer RP(m-1) has reached 8, the number of buffer surfaces, the read position pointer RP(m-1) is set to 1 and 10 ms of audio data (linear audio data) is read from the buffer surface 331-1 of the media processing thread 30-m.
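The description does not specify how the dummy audio is generated. Purely as a placeholder for that step, the sketch below repeats the previous frame at reduced amplitude. This is a deliberately crude stand-in and an assumption of this example, not the concealment method of any particular decoder; the function name `dummy_frame` and the default gain of 0.5 are hypothetical.

```python
def dummy_frame(last_frame, gain=0.5):
    """Build a substitute frame from the previous one by attenuating it.

    Assumption: samples are integers; int() truncates toward zero.
    """
    return [int(sample * gain) for sample in last_frame]
```

A reader that would otherwise have to repeat a frame could pass the last frame it read to `dummy_frame` and mix the result instead.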

(B-3) Effects of the Second Embodiment
According to the second embodiment, the following effects can be achieved in addition to the effects of the first embodiment.

In the multipoint audio mixing apparatus 10 of the second embodiment, the other media processing threads 30 start read processing after the media processing thread 30 serving as an arbitrary reference has written data to multiple buffer surfaces 331. In addition, the multipoint audio mixing apparatus 10 of the second embodiment adds processing that keeps the read position pointer RP from overtaking the write position pointer WP(m) at each read trigger of a media processing thread 30 (a trigger to read audio data from another media processing thread 30), so the continuity of the audio in the read data can be maintained.

(C) Other Embodiments
The present invention is not limited to the embodiments described above, and modified embodiments such as the following are also possible.

(C-1) The second embodiment described processing that keeps the read position pointer RP from overtaking the write position pointer WP(m). However, the difference in buffer surfaces between the write position pointer WP(m) and the read position pointer RP corresponds to a transmission delay of (difference in buffer surfaces) × (activation cycle), and it is undesirable for this transmission delay to grow too large. Each media processing thread 30 may therefore perform delay recovery processing for the transmission delay. For example, when "write position pointer WP(m) − read position pointer RP ≥ X" holds (for example, X = 4), each media processing thread 30 may increment the read position pointer RP by two before reading data, thereby recovering from the transmission delay. Furthermore, when "write position pointer WP(m) − read position pointer RP ≥ X" holds, each media processing thread 30 may increment the read position pointer RP by two only in silent sections that contain no audio data.
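The delay recovery rule can be sketched as follows. The function names `advance` and `recover` are hypothetical, and two points are assumptions of this example rather than statements of the description: the pointer difference is computed modulo Z so that it stays meaningful after a pointer wraps from Z back to 1, and both pointers are assumed to already lie in the range 1 to Z.

```python
def advance(pointer, steps, z):
    """Move a 1..Z pointer forward cyclically by `steps`."""
    return (pointer - 1 + steps) % z + 1


def recover(rp, wp, z=8, x=4, silent=True):
    """Return the new RP for one read trigger.

    RP skips one surface (advances by two) when it lags WP by at least x
    surfaces, optionally only while the audio is silent.
    """
    behind = (wp - rp) % z             # lag in buffer surfaces (assumed modulo Z)
    steps = 2 if (behind >= x and silent) else 1
    return advance(rp, steps, z)
```

With Z = 8, X = 4, and a 10 ms activation cycle, a reader at surface 1 while the writer is at surface 5 lags by four surfaces (40 ms), so `recover(1, 5)` returns 3 and the lag shrinks by one surface at that trigger.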

(C-2) Each of the above embodiments described an example in which the audio processing apparatus of the present invention is applied to a multipoint audio mixing apparatus, but the invention can be applied to various apparatuses that receive and mix audio data. Each of the above embodiments also described an example in which packets of encoded audio data are received and mixed, but the encoding scheme, segmentation format, and so on of the received audio data are of course not limited.

DESCRIPTION OF SYMBOLS 10 ... multipoint audio mixing apparatus, 11 ... data processing unit, 111 ... processor, 112 ... memory, 12 ... communication unit, 40 ... network, 20, 20-1_1 to 20-M_N ... terminal, 21 ... call processing unit, 22 ... communication unit, 23 ... microphone, 24 ... speaker, 30, 30-1 to 30-M ... media processing thread, 31, 31-1_1 to 31-M_N ... audio reception process, 311 ... RTP reception, 312 ... jitter buffer, 313 ... decoder, 32, 32-1 to 32-M ... addition process, 33, 33-1 to 33-M ... circular buffer, 34, 34-1 to 34-M ... addition process, 35, 35-1 to 35-M ... mixer, 36, 36-1_1 to 36-M_N ... audio transmission process, 361 ... encoder, 362 ... RTP transmission, 331, 331-1 to 331-Z ... buffer surface, 332 ... read position pointer group, 333 ... writing surface selection means, 334 ... reading surface selection means, 335 ... read position pointer selection means, 351, 351-1 to 351-N ... mixer unit.

Claims

A voice processing apparatus comprising a plurality of voice processing means,
wherein each of the voice processing means comprises:
a plurality of sound signal receiving means for receiving reception sound data based on reception sound for each time series;
first synthesizing means for synthesizing the reception sound data received by the plurality of sound signal receiving means to generate first synthesized voice data;
buffer means for holding the first synthesized voice data;
second synthesizing means for synthesizing the first synthesized voice data generated by each of the other voice processing means to generate second synthesized voice data; and
third synthesizing means for generating, for each transmission destination, transmission sound data by synthesizing the second synthesized voice data with voice data obtained by excluding the reception sound data related to that transmission destination from the reception sound data received by the plurality of sound signal receiving means.

The voice processing apparatus according to claim 1, wherein the buffer means stores the first synthesized voice data in time-series order in a circular buffer in which a plurality of buffer units, each holding the first synthesized voice data of one time series, are arranged cyclically.

3. The buffer means starts outputting the first synthesized voice data held in the circular buffer after holding a plurality of first synthesized voice data in the circular buffer. The voice processing apparatus according to 1.

4. The voice processing apparatus according to claim 2, wherein the buffer means has a read pointer for managing the read position of the circular buffer and a write pointer for managing the write position of the circular buffer, and controls the position of the read pointer so that it does not pass the position of the write pointer.

The voice processing apparatus according to claim 4, wherein, when reading the first synthesized voice data corresponding to the read pointer, the buffer means does not increment the position of the read pointer if the position of the read pointer and the position of the write pointer are the same.

6. The voice processing apparatus according to claim 4, wherein, when reading the first synthesized voice data corresponding to the read pointer, the buffer means increments the position of the read pointer by 2 if the difference between the position of the write pointer and the position of the read pointer is equal to or greater than a threshold value.
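
A minimal sketch of this catch-up behavior follows. It assumes that the difference between the pointers is measured around the ring as `(write_pos - read_pos) % num_units`, and it caps the step at that difference so that the read pointer never passes the write pointer; both points are assumptions of the sketch, and `advance_read_pos` is a hypothetical name.

```python
def advance_read_pos(read_pos: int, write_pos: int,
                     num_units: int, threshold: int) -> int:
    """Return the new read position after one read."""
    lag = (write_pos - read_pos) % num_units  # units the reader is behind
    step = 2 if lag >= threshold else 1       # skip one unit when far behind
    step = min(step, lag)                     # never pass the write pointer
    return (read_pos + step) % num_units
```

Skipping one unit discards one time series of first synthesized voice data and shortens the delay between writing and reading.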

7. The voice processing apparatus according to claim 1, wherein each of the voice processing means is implemented as a thread on a computer.
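
As a rough illustration of voice processing means running as threads, the following sketch lets each thread mix its own reception sound data, publish the result, and then mix the results of the other threads. For simplicity a `threading.Barrier` stands in for the buffer means, which is a substitution made for this sketch only; the function name `run_units` and the data layout are likewise hypothetical.

```python
import threading
from typing import Dict, List


def run_units(received_per_unit: Dict[str, List[List[int]]],
              length: int) -> Dict[str, List[int]]:
    """Run one thread per unit; return each unit's second synthesized data."""
    first: Dict[str, List[int]] = {}
    second: Dict[str, List[int]] = {}
    lock = threading.Lock()
    barrier = threading.Barrier(len(received_per_unit))

    def worker(name: str, frames: List[List[int]]) -> None:
        # First synthesis: mix the frames received by this unit.
        mixed = [sum(f[i] for f in frames) for i in range(length)]
        with lock:
            first[name] = mixed
        barrier.wait()  # every unit has now published its first mix
        # Second synthesis: mix the first mixes of all other units.
        others = [v for k, v in first.items() if k != name]
        result = [sum(f[i] for f in others) for i in range(length)]
        with lock:
            second[name] = result

    threads = [threading.Thread(target=worker, args=(name, frames))
               for name, frames in received_per_unit.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return second
```

With a buffer between threads instead of a barrier, each thread could proceed at its own pace and read whatever first synthesized voice data the other threads have already held.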

8. A voice processing program that causes a computer to function as a plurality of voice processing means,
Each of the voice processing means comprising:
A plurality of sound signal receiving means for receiving reception sound data based on reception sound for each time series;
First synthesizing means for synthesizing the reception sound data received by the plurality of sound signal receiving means to generate first synthesized voice data;
Buffer means for holding the first synthesized voice data;
Second synthesizing means for synthesizing the first synthesized voice data generated by each of the other voice processing means to generate second synthesized voice data; and
Third synthesizing means for generating, for each transmission destination, transmission sound data by synthesizing the second synthesized voice data with voice data obtained by excluding the reception sound data related to that transmission destination from the reception sound data received by the plurality of sound signal receiving means.

9. A voice processing method executed by a voice processing apparatus, wherein
The voice processing apparatus comprises a plurality of voice processing means,
Each of the voice processing means comprises a plurality of sound signal receiving means, first synthesizing means, buffer means, second synthesizing means, and third synthesizing means,
Each of the sound signal receiving means receives reception sound data based on reception sound for each time series,
The first synthesizing means synthesizes the reception sound data received by the plurality of sound signal receiving means to generate first synthesized voice data,
The buffer means holds the first synthesized voice data,
The second synthesizing means synthesizes the first synthesized voice data generated by each of the other voice processing means to generate second synthesized voice data, and
The third synthesizing means generates, for each transmission destination, transmission sound data by synthesizing the second synthesized voice data with voice data obtained by excluding the reception sound data related to that transmission destination from the reception sound data received by the plurality of sound signal receiving means.