JP4709734B2

JP4709734B2 - Speaker selection device, speaker selection method, speaker selection program, and recording medium recording the same

Info

Publication number: JP4709734B2
Application number: JP2006325954A
Authority: JP
Inventors: 仲大室; 祐介日和▲崎▼; 岳至森; 章俊片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-12-01
Filing date: 2006-12-01
Publication date: 2011-06-22
Anticipated expiration: 2026-12-01
Also published as: JP2008141505A

Description

The present invention relates to a speaker selection technique, and more specifically to speaker selection in voice mixing.

Multipoint voice communication (for example, multipoint audio conferencing) using voice packets is increasingly carried out by connecting multiple points through a packet communication network, which is a digital communication network. Here and below, "multipoint" means three or more points.

FIG. 1 outlines an example configuration in which voice packets sent from three points A, B, and C are mixed by a voice mixing device and the mixed voice packets are sent to a point D.
The voice packet transmission unit (900A) at point A converts the input voice signal at point A into voice packets A and sends them to the voice mixing device (970) via the packet communication network (950). (In this document, "voice signal" refers collectively to discretized acoustic signals such as speech signals and music signals.) Similarly, the voice packet transmission unit (900B) at point B converts the input voice signal at point B into voice packets B and sends them to the voice mixing device (970) via the packet communication network (950), and the voice packet transmission unit (900C) at point C converts the input voice signal at point C into voice packets C and sends them to the voice mixing device (970) via the packet communication network (950). The voice mixing device (970) creates voice packets for one point from voice packets A, B, and C and sends them to point D via the packet communication network (950). The voice packet receiving unit (900D) at point D converts the received voice packets into an output voice signal.
For simplicity, FIG. 1 shows an example with three input points and one output point, but there may be any number of input points. In ordinary multipoint voice communication, the input points and output points are the same: if the inputs are points A, B, and C, outputs are also needed for the three points A, B, and C. Point D should therefore be read as standing for any one of A, B, and C (the same applies hereinafter).

Conventionally, ITU-T (International Telecommunication Union Telecommunication Standardization Sector) G.711 has been used as the voice coding method in most such multipoint voice communication applications.
The reason is that the voice mixing device must first decode the voice codes contained in the voice packets sent from each point, mix them at the PCM (Pulse Code Modulation) signal level, and then encode the result again to generate voice packets for each point. With coding methods other than G.711, this places a heavy load on the voice mixing device.
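The decode, mix, and re-encode cycle described above can be illustrated with a minimal Python sketch of the mixing step alone. It assumes the frames have already been decoded to 16-bit PCM and that each point receives the sum of the other points' signals; the function name, the exclusion of a point's own signal, and the clipping are illustrative assumptions, not details taken from this document.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def mix_pcm_for_each_point(decoded):
    """Mix decoded PCM frames so that each point gets the sum of the others.

    decoded: dict mapping a point name to a list of 16-bit PCM samples
    (all lists have the same length). Decoding and re-encoding are omitted.
    """
    frame_len = len(next(iter(decoded.values())))
    # Sum of all points, computed once per sample.
    total = [sum(frame[i] for frame in decoded.values()) for i in range(frame_len)]
    mixed = {}
    for point, own in decoded.items():
        # Assumption: a point does not hear its own signal in its mix.
        mixed[point] = [
            max(INT16_MIN, min(INT16_MAX, total[i] - own[i]))
            for i in range(frame_len)
        ]
    return mixed
```

Every output point needs its own mix and therefore its own encoding pass, which is why the cost of the codec matters at the mixing device.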

Under these circumstances, prior to the present invention, the present inventors realized a multipoint mixing method that uses a wideband voice coding method with better sound quality than G.711 without placing a heavy load on the voice mixing device (see Patent Document 1).
FIGS. 2 and 3 show an example of the multipoint mixing method disclosed in Patent Document 1. FIG. 2 shows an example functional configuration of the voice packet transmission unit for one point in FIG. 1. FIG. 3 shows only point A as the transmitting side and point D as the receiving side; points B and C are omitted. The processing at points B and C is the same as at point A.

In the multipoint mixing method disclosed in Patent Document 1, a voice waveform encoding unit (901) converts the input voice signal, divided into units of a predetermined time length called frames (for example, about 10 ms to 20 ms), into voice codes and outputs them. A voice section detection unit (902) outputs, for each frame, a determination result (hereinafter referred to as the VAD flag) indicating whether the frame belongs to a voice section or a non-voice section. A voice information acquisition unit (904) acquires voice information, which is an acoustic feature of the input voice signal. The voice information can be, for example, the power of the input voice signal or its pitch correlation value. To obtain it, the power may be computed as the sum of squares of the waveform of the input voice signal, or the pitch correlation value of the waveform may be computed. A packet construction unit (903) then incorporates the VAD flag and the voice information into a voice packet together with the voice code and sends the packet to the packet communication network (950).
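As a rough illustration of the two kinds of voice information mentioned above, the following Python sketch computes the frame power as a sum of squares and a pitch correlation value. The text does not define the pitch correlation, so the normalized autocorrelation maximized over a lag range is an assumption, as are the function names and the `min_lag` and `max_lag` parameters.

```python
import math

def frame_power(frame):
    """Power of one frame as the sum of squared samples."""
    return sum(x * x for x in frame)

def pitch_correlation(frame, min_lag, max_lag):
    """Largest normalized autocorrelation over the candidate pitch lags.

    Assumption: this stands in for the 'pitch correlation value';
    max_lag must be smaller than len(frame).
    """
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a = frame[lag:]    # frame shifted by the lag
        b = frame[:-lag]   # unshifted frame, same length as a
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0:
            best = max(best, num / den)
    return best
```

A strongly periodic frame yields a value close to 1 at its pitch lag, while noise-like frames stay near 0.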

In the voice mixing device (970), which receives voice packets via the packet communication network (950), a packet decomposing unit (971) extracts the voice code, the VAD flag, and the voice information from each voice packet.
A voice code decomposition unit (972) then splits the voice code into a low-frequency code and a high-frequency code; the low-frequency code is sent to the low-frequency mixing unit (973) and the high-frequency code to the high-frequency switching unit (974). The high-frequency code is one kind of enhancement layer code, and "high-frequency code" may be read as any enhancement layer code (the same applies hereinafter).

The low-frequency mixing unit (973) receives and decodes the low-frequency codes from each point, generates mixed audio for each output point, converts it into a low-frequency code, and sends this low-frequency code to the voice code combining unit (975).

The speaker selection unit (976) refers to the VAD flags and voice information from each point, automatically determines from moment to moment which point is the main speaker, and outputs a speaker index (hereinafter referred to as the speaker number) that indicates the input point of the speaker.

The high-frequency switching unit (974) receives the high-frequency codes from each point, uses the speaker number to select the high-frequency code of one point for each output point, and sends it to the voice code combining unit (975).

The voice code combining unit (975) combines the low-frequency code and the high-frequency code to generate a voice code and sends it to the packet construction unit (977). The packet construction unit (977) generates a voice packet from the received voice code and outputs it.
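The switching and combining steps for one frame can be sketched in Python as follows. The `VoicePacket` fields, the function name, and the representation of the combined voice code as a simple concatenation of bytes are all hypothetical; the low-frequency mixing itself is assumed to have been done elsewhere.

```python
from dataclasses import dataclass

@dataclass
class VoicePacket:
    low_code: bytes    # low-frequency code from one point
    high_code: bytes   # high-frequency (enhancement layer) code
    vad_flag: bool     # True if the frame is in a voice section
    voice_info: float  # e.g. power or pitch correlation value

def combine_for_output(packets, speaker_number, mixed_low_code):
    """Build the output voice code for one frame.

    packets: dict mapping a point name to its VoicePacket.
    speaker_number: point chosen by the speaker selection unit.
    mixed_low_code: result of the low-frequency mixing (computed elsewhere).
    """
    # High-frequency switching: take the code of the selected point only.
    high = packets[speaker_number].high_code
    # Combining: concatenation is an illustrative assumption.
    return mixed_low_code + high
```

Only the low-frequency layer is decoded and re-encoded; the high-frequency layer is passed through from a single point, which is what keeps the load on the mixing device low.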

Patent Document 1 describes, as the speaker selection method of the speaker selection unit (976), selecting the high-frequency code (enhancement layer information) of the point that has the highest average power per sample and has been determined to be in a voice section. It also states that, because switching speakers too frequently may produce audible artifacts or unnatural playback, the speaker should not be switched for at least a certain period (the switching limit time; for example, 40 ms to 200 ms).
Patent Document 1: JP 2005-229259 A
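The prior-art selection rule described above might look roughly like the following Python sketch. It is not taken from Patent Document 1: the function signature, the frame-count representation of the switching limit time, and the choice to keep the previous speaker when no point is in a voice section are assumptions made for illustration.

```python
def select_speaker_prior_art(vad, avg_power, prev_speaker,
                             frames_since_switch, hold_frames):
    """Return (speaker, frames_since_switch) for the current frame.

    vad: dict point -> True if the frame is in a voice section.
    avg_power: dict point -> average power per sample.
    hold_frames: switching limit time expressed in frames (assumption).
    """
    # Within the switching limit time, never change the speaker.
    if prev_speaker is not None and frames_since_switch < hold_frames:
        return prev_speaker, frames_since_switch + 1
    candidates = [p for p in vad if vad[p]]
    if not candidates:
        # Assumption: keep the previous speaker when nobody is speaking.
        return prev_speaker, frames_since_switch + 1
    best = max(candidates, key=lambda p: avg_power[p])
    if best != prev_speaker:
        return best, 0
    return best, frames_since_switch + 1
```

With 20 ms frames, `hold_frames=5` corresponds to a 100 ms switching limit time. Because the rule compares raw power across points, a point with a louder microphone or persistent background noise can win the comparison, which is the weakness discussed next.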

In ordinary human conversation, about 30% to 50% of the time consists of non-voice sections. Nevertheless, voice sections cannot always be determined correctly. When the input voice signal on the transmitting side contains human voices as background noise, or when, in two-way communication, audio from another point is played through a loudspeaker and picked up by the microphone as echo, the background noise or echo may be judged to be a voice section even though no one at that point is actually speaking. As a result, the speaker selection unit cannot correctly determine which of the speakers (points) is the one speaking (the speaking point). The automatic determination of the main speaker may be wrong, or the voice code may be switched erroneously in mid-utterance, and there is a concern that call quality and sound quality after mixing will degrade.

One speaker selection method addresses this concern by using each point's voice information together with its VAD flag, which indicates whether the point has been determined to be in a voice section, and taking as the speaker (speaking point) the point with the greatest voice power among the speakers (points) determined to be in a voice section. With this method, however, the speaker (speaking point) cannot be determined accurately when the input voice level fluctuates or differs from point to point, so it cannot necessarily be said that the concern is fully dispelled.

Likewise, with a speaker selection method that does not switch the speaker (speaking point) for a fixed period, the speaker (speaking point) cannot be determined accurately when the actual speaker (speaking point) changes within a short time, so again it cannot necessarily be said that the concern is fully dispelled.

Accordingly, an object of the present invention is to provide a speaker selection device, method, and program, and a recording medium for the program, that reduce the degradation of call quality and sound quality after mixing, at least when the input voice level fluctuates or differs from point to point.

To solve the above problem, the present invention provides speaker selection for voice mixing, in which speaker selection and mixing are performed on voice packets sent from each of three or more points (hereinafter referred to as input points). The speaker selection can proceed as follows. Using the determination result (hereinafter referred to as the VAD flag) that indicates whether the voice code of one frame (hereinafter referred to as the current frame) contained in a voice packet sent from one input point belongs to a voice section or a non-voice section, a continuation state index, which is an index indicating the utterance continuation state of that input point, is updated. The acoustic features (hereinafter referred to as voice information) of the per-frame voice codes contained in the past and current voice packets sent from one input point are accumulated in voice information storage means, and one item of voice information (hereinafter referred to as the selected voice information) is selected from the accumulated voice information. The selection then uses the continuation state index of each input point, the voice information corresponding to the current frame contained in the voice packet sent from each input point, the speaker index indicating the input point of the speaker one frame earlier, and the selected voice information.
When the continuation state index of the input point indicated by the speaker index shows that the utterance is continuing, the voice information of each input point other than the one indicated by the speaker index whose continuation state index shows that the utterance is continuing (hereinafter referred to as an other utterance-continuing point) is compared with the selected voice information of the input point indicated by the speaker index. If the voice information of an other utterance-continuing point is more speaker-eligible than the selected voice information, the voice information of the input points is compared to determine the input point of the speaker for the current frame. If the selected voice information is more speaker-eligible than the voice information of the other utterance-continuing points, the input point indicated by the speaker index is determined to be the input point of the speaker for the current frame. When the continuation state index of the input point indicated by the speaker index shows that the utterance has ended, the voice information of the input points is compared to determine the input point of the speaker for the current frame. The updating of the continuation state index, the accumulation of voice information, and the selection of the selected voice information are performed per input point.
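The decision described in this paragraph can be summarized in a short Python sketch for one frame. Several points are assumptions rather than statements from the text: "more speaker-eligible" is taken to mean a larger voice information value, "comparing the voice information of the input points" is taken to mean picking the point with the largest value, ties keep the previous speaker, and all names are hypothetical.

```python
def select_speaker(prev_speaker, continuing, voice_info, selected_info):
    """Choose the speaker's input point for the current frame.

    prev_speaker: input point chosen one frame earlier (speaker index).
    continuing: dict point -> True if its continuation state index
        shows that the utterance is continuing.
    voice_info: dict point -> voice information of the current frame.
    selected_info: selected voice information of prev_speaker.
    """
    def search():
        # Assumption: the point with the largest voice information wins.
        return max(voice_info, key=voice_info.get)

    # The previous speaker's utterance has ended: search again.
    if not continuing[prev_speaker]:
        return search()
    # Other points whose utterance is continuing may challenge the speaker.
    others = [p for p in voice_info if p != prev_speaker and continuing[p]]
    if any(voice_info[p] > selected_info for p in others):
        return search()
    # Otherwise the previous speaker remains the speaker.
    return prev_speaker
```

The challenge is made against the selected voice information of the previous speaker rather than against its current-frame value, so a short dip in the speaker's level does not by itself trigger a switch.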

Thus, even when the continuation state index of the point that was judged to be the speaking point up to one frame earlier shows that the utterance is continuing, the voice information of each other point whose continuation state index shows that the utterance is continuing is compared with the selected voice information, which is selected from the voice information of the point judged to be the speaking point up to one frame earlier. If the voice information of another point is more speaker-eligible than the selected voice information, the voice information of the points is compared to determine a new speaker point. Speaker selection can therefore be made appropriate even in situations where the VAD flag corresponding to the first speaker keeps indicating a voice section, for example because the input voice signal contains background noise. Moreover, because voice power is not used as a direct index in speaker selection, the selection is not directly affected by differences or fluctuations in the input voice level on the transmitting side.

In such a configuration, the following can also be done, using the continuation state index of each input point, the voice information corresponding to the current frame contained in the voice packet sent from each input point, a first speaker index indicating the input point of the first speaker (the main speaker one frame earlier), a second speaker index indicating the input point of the second speaker (the main speaker one frame earlier when the first speaker is excluded), and the selected voice information.
Suppose that the continuation state index of the input point indicated by the first speaker index shows that the utterance is continuing (hereinafter referred to as the first case), and that, when the voice information of each input point other than the one indicated by the first speaker index whose continuation state index shows that the utterance is continuing (hereinafter referred to as an other utterance-continuing point) is compared with the selected voice information of the input point indicated by the first speaker index, the selected voice information is more speaker-eligible than the voice information of the other utterance-continuing points. Then the input point indicated by the first speaker index is determined to be the input point of the first speaker in the current frame. If the continuation state index of the input point indicated by the second speaker index shows that the utterance is continuing, the input point indicated by the second speaker index is determined to be the input point of the second speaker in the current frame; if it shows that the utterance has ended, the voice information of the input points other than the one indicated by the first speaker index is compared to determine the input point of the second speaker in the current frame.
Suppose instead that, in the first case, the comparison shows the voice information of an other utterance-continuing point to be more speaker-eligible than the selected voice information, or that the continuation state index of the input point indicated by the first speaker index shows that the utterance has ended. Then, if the continuation state index of the input point indicated by the second speaker index shows that the utterance is continuing, the input point indicated by the second speaker index is determined to be the input point of the first speaker in the current frame, and the voice information of the input points other than the one indicated by the second speaker index is compared to determine the input point of the second speaker in the current frame. If the continuation state index of the input point indicated by the second speaker index shows that the utterance has ended, the voice information of all input points is compared to determine the input point of the first speaker in the current frame, and the voice information of the input points other than the first speaker thus determined is compared to determine the input point of the second speaker in the current frame.
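The branching for the first and second speakers can be laid out as a Python sketch. As before, it assumes that a larger voice information value is "more speaker-eligible", that a search picks the point with the largest value, and that ties keep the current first speaker; the function and variable names are hypothetical.

```python
def select_two_speakers(first, second, continuing, voice_info, selected_info):
    """Return (first_speaker, second_speaker) for the current frame.

    first, second: first and second speaker indices one frame earlier.
    continuing: dict point -> True if its utterance is continuing.
    voice_info: dict point -> voice information of the current frame.
    selected_info: selected voice information of the point `first`.
    """
    def search(exclude=()):
        # Assumption: the point with the largest voice information wins.
        candidates = [p for p in voice_info if p not in exclude]
        return max(candidates, key=voice_info.get)

    keep_first = False
    if continuing[first]:  # the "first case"
        others = [p for p in voice_info if p != first and continuing[p]]
        keep_first = all(voice_info[p] <= selected_info for p in others)

    if keep_first:
        new_first = first
        # The second speaker stays only while its utterance continues.
        new_second = second if continuing[second] else search(exclude=(first,))
    elif continuing[second]:
        # The second speaker is promoted to first speaker.
        new_first = second
        new_second = search(exclude=(second,))
    else:
        # Both are searched again from the current voice information.
        new_first = search()
        new_second = search(exclude=(new_first,))
    return new_first, new_second
```

Promoting the continuing second speaker before falling back to a full search avoids an unnecessary switch when two people are already talking and only the first one stops.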

Even when multiple speakers are selected, as in voice packet communication of real, natural conversation in which several people take turns speaking at irregular intervals, performing speaker selection with the continuation state index for the first and second speakers means that the selection is not directly affected by differences or fluctuations in the input voice level on the transmitting side. In addition, speaker selection can be made appropriate in situations where the VAD flag corresponding to the first speaker keeps indicating a voice section, for example because the input voice signal contains background noise.

The voice information may be the voice power or the pitch correlation value, and the selected voice information may be the maximum voice power or the maximum pitch correlation value among the voice information accumulated in the voice information storage means.
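A minimal Python sketch of per-point accumulation and maximum selection follows. The class name, its methods, and in particular the fixed history length are assumptions; this section does not state how many past frames are retained.

```python
from collections import deque

class VoiceInfoStore:
    """Keeps recent voice information per input point."""

    def __init__(self, history_frames):
        # Assumption: only the most recent `history_frames` values are kept.
        self._history_frames = history_frames
        self._history = {}

    def add(self, point, info):
        """Accumulate the voice information of the current frame."""
        buf = self._history.setdefault(
            point, deque(maxlen=self._history_frames))
        buf.append(info)

    def selected(self, point):
        """Selected voice information: the maximum accumulated value."""
        return max(self._history[point])
```

For example, after adding the values 1, 9, 2, 3 for one point with `history_frames=3`, the stored values are 9, 2, 3 and the selected voice information is 9.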

The continuation state index may be updated as follows. The continuation state index is an integer; a value at or above (or strictly above) a predetermined threshold indicates that the utterance is continuing, and a value below (or at or below) the threshold indicates that the utterance has ended. When the VAD flag indicates a voice section, the continuation state index is updated by adding a predetermined positive value to its current value. When the VAD flag indicates a non-voice section, it is updated by subtracting from its current value a predetermined positive value, which may be the same as or different from the value added.
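The update rule can be written in a few lines of Python. The increment, decrement, and threshold values below are placeholders chosen for illustration, and the "at or above" reading of the threshold is one of the two variants the text allows.

```python
def update_state(state, vad_is_voice, up=2, down=1):
    """Update the integer continuation state index from one VAD flag.

    up and down are the predetermined positive values (placeholders here).
    """
    return state + up if vad_is_voice else state - down

def is_continuing(state, threshold=0):
    """True if the index shows that the utterance is continuing.

    Uses the 'at or above the threshold' variant.
    """
    return state >= threshold
```

Starting from 0, three voice frames raise the index to 6; it then takes seven consecutive non-voice frames to bring it below the threshold of 0, so brief pauses do not end the utterance.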

In this way, the numerical continuation state index is raised and lowered by the VAD flag from moment to moment, and its value determines whether it indicates that the utterance is continuing or has ended. If the utterance is continuing, the current speaker remains the speaker; if it has ended, a search for the speaker is performed using the voice information. Speaker switching is therefore smooth even when the actual speaker (speaking point) changes within a short time, and degradation of call quality and sound quality after mixing can be reduced.

An upper limit and a lower limit may also be set for the continuation state index, so that the index is not increased beyond the upper limit and is not decreased below the lower limit.
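The effect of the limits can be seen in the following Python sketch. Saturating the index at the limits is one reading of the text (the alternative would be to skip an update that would cross a limit), and the numeric values are placeholders.

```python
def update_state_clamped(state, vad_is_voice,
                         up=2, down=1, lower=-4, upper=10):
    """Update the continuation state index, saturating at the limits.

    Assumption: the index is clamped to [lower, upper] rather than
    skipping an update that would cross a limit.
    """
    if vad_is_voice:
        return min(state + up, upper)
    return max(state - down, lower)

def frames_until_ended(state, threshold=0):
    """Count non-voice frames needed for the index to fall below threshold."""
    frames = 0
    while state >= threshold:
        state = update_state_clamped(state, False)
        frames += 1
    return frames
```

Without an upper limit, 100 voice frames would push the index to 200 and the point would be treated as speaking for 201 further non-voice frames. With the upper limit of 10, the index falls below the threshold after 11 non-voice frames regardless of how long the preceding utterance was.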

A speaker selection program that causes a computer to function as the speaker selection device of the present invention allows the computer to operate as the speaker selection device. A computer-readable program recording medium on which this speaker selection program is recorded makes it possible, for example, to have other computers function as the speaker selection device and to distribute the speaker selection program.

According to the present invention, the continuation state index updated by the VAD flag is used to determine whether it indicates that the utterance is continuing or has ended. If the utterance is continuing, the current speaker remains the speaker; if it has ended, a search for the speaker is performed using the voice information. Because voice power is thus not a direct index in speaker selection, the selection is not directly affected by differences or fluctuations in the input voice level on the transmitting side, and speaker selection that reduces the degradation of call quality and sound quality after mixing can be performed.

In addition, even when the continuation state index of the point that was judged to be the speaking point up to one frame earlier shows that the utterance is continuing, whether the first speaker needs to be searched for again is decided from the voice information of each point (a monitoring function). Speaker selection can therefore be made appropriate even in situations where the VAD flag corresponding to the first speaker keeps indicating a voice section, for example because the input voice signal contains background noise. Also, the numerical continuation state index is raised and lowered by the VAD flag from moment to moment, and its value determines whether it indicates that the utterance is continuing or has ended; if the utterance is continuing, the current speaker remains the speaker, and if it has ended, a search for the speaker is performed using the voice information. Speaker switching is therefore smooth even when the actual speaker (speaking point) changes within a short time (a buffering function). Thus, even when the input voice signal contains background noise or the actual speaker (speaking point) changes within a short time, the monitoring and buffering functions provide robustness against these situations, and speaker selection that reduces the degradation of call quality and sound quality after mixing can be performed.

Furthermore, these effects are obtained with a small amount of computation, namely updating the continuation state index and determining whether it indicates that the utterance is continuing or has ended, so applying the speaker selection of the present invention is unlikely to degrade the quality of voice mixing.

<< First Embodiment >>
A first embodiment of the speaker selection device and method of the present invention will now be described.
The speaker selection device (100) of the first embodiment generally exists not as an independent unit but as an entity constituting, for example, a voice mixing device that performs voice mixing. More precisely, the speaker selection device (100) is not an entity that can easily be separated from the voice mixing device of which it is a part; it can also be regarded as a one-sided characterization of the voice mixing device itself that focuses on some of its functions. In short, the speaker selection device (100) is generally the voice mixing device itself. Specifically, the speaker selection device (100) can be realized by implementing its functions on a digital signal processor or a dedicated LSI.
This is not meant, however, to exclude the device existing as a stand-alone entity, or being an entity that constitutes a voice mixing device while being easily separable from it. For example, if speaker selection itself is the goal, nothing prevents the speaker selection device (100) from being realized as a stand-alone entity.
Here, the voice mixing device is assumed to be realized by a computer, such as a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer; the same applies when the speaker selection device (100) is realized as a stand-alone entity.

In the following, the speaker selection device is described as an entity constituting a speaker mixing device (500) realized by a computer (general-purpose machine). To reflect that it is a functional entity of the speaker mixing device (500), the speaker selection device is referred to as the speaker selection unit. To make the features of the speaker selection of the present invention clear, the description focuses on the functions and processing related to speaker selection.

FIG. 4 shows a hardware configuration example of the speaker mixing device (500).
As illustrated in FIG. 4, the speaker mixing device (500) includes an input unit (11) to which a keyboard, a pointing device, and the like can be connected; an output unit (12) to which a liquid crystal display, a CRT (Cathode Ray Tube) display, and the like can be connected; a communication unit (13) to which a communication device (for example, a communication cable, a LAN card, a router, or a modem) capable of communicating with the outside of the speaker mixing device (500) can be connected; a DSP (Digital Signal Processor) (14) [a CPU (Central Processing Unit) may be used instead, and a cache memory, registers (19), and the like may be provided]; a RAM (15) and a ROM (16) as memory; an external storage device (17) such as a hard disk, an optical disk, or a semiconductor memory; and a bus (18) that connects the input unit (11), the output unit (12), the communication unit (13), the DSP (14), the RAM (15), the ROM (16), and the external storage device (17) so that data can be exchanged among them. If necessary, the speaker selection device (1) may also be provided with a device (drive) capable of reading from and writing to a storage medium such as a CD-ROM (Compact Disc Read Only Memory) or a DVD (Digital Versatile Disc).

The external storage device (17) of the speaker mixing device (500) stores a program for speaker selection and the general programs required for voice mixing processing [the programs need not be stored in the external storage device; for example, they may be stored in the ROM, which is a read-only storage device]. Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

More specifically, the external storage device (17) [or the ROM or the like] of the speaker mixing device (500) stores: a program for updating the utterance continuation state of each point; a program for selecting the first speaker and the second speaker by examining the utterance continuation state of the speaker at the point that is mainly speaking among all points (the first speaker) and the utterance continuation state of the speaker at the point that is mainly speaking among the points other than the first speaker's point (the second speaker); and the general programs required for voice mixing processing. A control program for controlling the processing based on these programs may also be stored as appropriate.

The data required for the speaker selection processing are the VAD flag attached to the voice packet of each speaker point and the voice information associated with that flag. These data can be obtained from the voice packet by the packet decomposition unit. As already described, the voice information may be, for example, the power of the speech represented by the speech code or its pitch correlation value. As for how the voice information is obtained, when power is used as the voice information, for example, a code indicating the power may be referred to if the speech code contains one; if it does not, the speech code may be decoded once and the power obtained by computing the sum of squares of the speech waveform within the frame.
The data obtained from the voice packet are stored, for example, in a suitable volatile memory. The VAD flag and the voice information associated with it generally correspond to the speech code contained in the voice packet.
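As a minimal sketch of the power calculation described above, the following Python function computes the sum of squares of the samples in one decoded frame. The function name `frame_power` and the representation of a frame as a list of samples are assumptions made for illustration; decoding of the speech code itself is not shown.

```python
def frame_power(samples):
    """Return the power of one decoded frame as the sum of squared samples.

    `samples` is assumed to be a sequence of numbers holding the decoded
    speech waveform of a single frame (hypothetical representation).
    """
    return sum(s * s for s in samples)


# Example: a short, made-up frame of four samples.
example_frame = [0.5, -0.5, 0.25, 0.0]
print(frame_power(example_frame))
```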

In the speaker mixing device (500) according to the first embodiment, the above programs and the data required for their processing (VAD flags and voice information) are read into the RAM (15) as necessary and are interpreted, executed, and processed by the DSP (14). As a result, the DSP (14) realizes predetermined functions (speaker selection unit, packet decomposition unit, speech code decomposition unit, low-frequency mixing unit, high-frequency switching unit, speech code combining unit, and packet construction unit), and voice mixing that includes the speaker selection of the present invention is thereby realized. The speaker selection unit of the present invention, which is the core of the speaker mixing device (500), includes a continuation state update unit and a speaker switching determination unit.

Next, the flow of the speaker selection processing in the voice mixing device (500) is described step by step with reference to the drawings.

First, FIG. 5 shows a functional configuration example of the speaker selection unit (100) of the first embodiment. As before, the first embodiment takes the case of three input points as an example; in practice, the number of points may be any number as long as there is more than one.
In the drawing, the reference numerals of the functional components corresponding to each point are distinguished by the suffixes A, B, and C. In the following description, unless otherwise noted, the functional component corresponding to one point is described as representative. For example, the continuation state update units (101A), (101B), and (101C) in the drawing are represented by the continuation state update unit (101) unless otherwise noted (the same applies hereinafter).

The speaker selection unit (100) includes at least as many continuation state update units (101) as there are input points, and one speaker switching determination unit (102).
The speaker selection unit (100) receives the VAD flag and the voice information contained in the voice packet of each point (step S301). For example, the VAD flag is assumed to take the value 0 for a non-speech section and 1 for a speech section. As already described, the voice information may be the power of the speech or its pitch correlation value.
The VAD flag is input to the continuation state update unit (101) for the corresponding point in the speaker selection unit (100), and the voice information is input to the speaker switching determination unit (102) in the speaker selection unit (100).

The continuation state update unit (101) receives the time series of VAD flags obtained from the voice packets that arrive sequentially at the voice mixing device (500) from each point, updates the current value of the continuation state index, and sends the updated continuation state index to the speaker switching determination unit (102) (step S302).
The continuation state index indicates the utterance continuation state of each point and can be defined according to a predetermined rule. In this embodiment, the continuation state index is defined as an integer: a positive integer indicates that the utterance is continuing, and 0 or a negative integer indicates that the utterance has ended.

A specific example of how the continuation state index is updated is as follows.
The initial value of the continuation state index is 0. If the most recently input VAD flag is 1, 1 is added to the current continuation state index; if it is 0, 1 is subtracted from the current continuation state index. The amounts added and subtracted may differ, for example 2 for each addition and 1 for each subtraction. The amount added may also vary with the current value of the index: for example, M is added when the index is N or greater and L is added when it is less than N (N, M, and L are each a predetermined integer of 1 or more). The amount subtracted may likewise vary with the current value of the index.
Although the continuation state index has been defined as an integer for which a positive value indicates that the utterance is continuing and 0 or a negative value indicates that the utterance has ended, depending on the addition and subtraction method described above, a value at or above a predetermined threshold (or greater than the threshold) may instead be taken to indicate that the utterance is continuing, and a value below the threshold (or at or below the threshold) to indicate that the utterance has ended.

It is preferable to determine an upper limit for the continuation state index in advance [more precisely, an upper limit greater than the above threshold] and not to add beyond it. It is likewise preferable to set the lower limit of the continuation state index to the above threshold or to a predetermined value smaller than the threshold [in the above example, 0 or a predetermined negative value] and not to subtract below it.
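The update rule and the limits described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the function names, the default step sizes of 1, the upper limit of 10, the lower limit of -1, and the threshold of 0 are hypothetical example values.

```python
def update_index(index, vad_flag, upper=10, lower=-1, inc=1, dec=1):
    """Return the continuation state index after one frame.

    vad_flag == 1 (speech section): add `inc`, but never exceed `upper`.
    vad_flag == 0 (non-speech section): subtract `dec`, but never go below `lower`.
    All default values are assumptions chosen for illustration.
    """
    if vad_flag == 1:
        return min(index + inc, upper)
    return max(index - dec, lower)


def is_continuing(index, threshold=0):
    """A value above the threshold means the utterance is continuing."""
    return index > threshold


# Feed a short, made-up VAD flag sequence for one point, starting from 0.
index = 0
for flag in [1, 1, 1, 0, 0, 0, 0, 0]:
    index = update_index(index, flag)
    print(flag, index, is_continuing(index))
```

Separate `inc` and `dec` parameters are kept so that the asymmetric variant mentioned in the text (for example, adding 2 and subtracting 1) can be tried without changing the function.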

If the upper limit of the continuation state index is set large, the time from the actual end of an utterance until the high-frequency switching unit (504), described later, switches the high-frequency point becomes longer. Conversely, if the upper limit is set small, the time from the actual end of an utterance until the high-frequency point is switched can be shortened. Switching too frequently degrades the sound quality after mixing, while too long a switching time produces an unnatural sound at the switch, so the upper limit is preferably set to a value corresponding to, for example, the switching limit time of 40 ms to 200 ms described in Patent Document 1 above.
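One way to relate the upper limit to a switching limit time is sketched below. It assumes a frame length of 20 ms, a subtraction step of 1 per non-speech frame, and a threshold of 0; none of these values is specified in this description, and the function name is hypothetical. Under those assumptions, an index sitting at the upper limit when the utterance ends needs as many non-speech frames as the upper limit itself to fall to the threshold.

```python
import math


def upper_limit_for_hold(hold_ms, frame_ms=20, dec=1, threshold=0):
    """Return an upper limit that yields roughly `hold_ms` of hold time.

    Assumption: after the utterance ends, the index starts at the upper limit
    and drops by `dec` every frame of `frame_ms` milliseconds until it reaches
    `threshold`, at which point the utterance counts as ended.
    """
    frames = math.ceil(hold_ms / frame_ms)
    return threshold + frames * dec


# The 40 ms to 200 ms range, evaluated at the assumed 20 ms frame length.
for hold_ms in (40, 200):
    print(hold_ms, upper_limit_for_hold(hold_ms))
```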

The same holds for the lower limit. If the lower limit of the continuation state index is set small, for example to -3, the time from the actual start of an utterance until the high-frequency point is switched becomes longer. Conversely, if the lower limit is set large, for example to -1 or 0, that time can be shortened. To prevent a sudden, transient sound (for example, noise produced when an object falls behind a speaker who is not speaking) from being judged as a speech section and triggering speaker switching, the lower limit is preferably set to, for example, -1 [more precisely, a value slightly smaller than the above threshold]. In other words, a lower limit set slightly below the threshold serves to buffer speaker switching caused by such sudden, transient sounds.
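The buffering effect of the lower limit can be illustrated with a small simulation. The sketch assumes steps of 1, a threshold of 0, an upper limit of 10, and a point that has been silent long enough for its index to sit at the lower limit; the function name and these values are hypothetical.

```python
def frames_until_continuing(vad_flags, lower, upper=10, threshold=0):
    """Return the 1-based frame at which the index first exceeds the threshold.

    The index starts at `lower` (a point that has been silent for a while).
    Returns None if the index never exceeds the threshold.
    """
    index = lower
    for n, flag in enumerate(vad_flags, start=1):
        if flag == 1:
            index = min(index + 1, upper)
        else:
            index = max(index - 1, lower)
        if index > threshold:
            return n
    return None


transient = [1, 0, 0]      # one noisy frame followed by silence
speech = [1, 1, 1, 1]      # a sustained utterance

print(frames_until_continuing(transient, lower=-1))  # transient is absorbed
print(frames_until_continuing(transient, lower=0))   # transient counts as speech
print(frames_until_continuing(speech, lower=-3))     # slower reaction to real speech
```

With a lower limit of -1 the single noisy frame only raises the index to 0, which does not exceed the threshold, whereas with a lower limit of 0 the same frame already marks the point as speaking.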

The speaker switching determination unit (102) receives the continuation state index of each point output by the continuation state update unit (101) and the voice information of each point input to the speaker selection unit (100), and outputs a speaker index indicating the point of the first speaker (hereinafter, the first speaker number) and a speaker index indicating the point of the second speaker (hereinafter, the second speaker number) (step S303).

FIG. 6-2 shows a flowchart of the speaker selection processing in the speaker switching determination unit (102) (details of step S303). Each of the following processes is performed by the speaker switching determination unit (102).
Here, the first speaker number and the second speaker number are numbers assigned to the input points; for example, point A is 0, point B is 1, and point C is 2.
The initial values of the first speaker number and the second speaker number may be those of any points, provided that they are not the same value. For example, the initial value of the first speaker number is 0 and that of the second speaker number is 1.

<Step S1>
First, the first speaker number in the immediately preceding frame is acquired [one frame corresponds to the speech code contained in a voice packet]. (The first speaker number of the preceding frame is stored as appropriate in a memory such as the RAM. If there is no preceding frame, the initial value of the speaker number is used.) Let this be T1(k-1), where the argument k denotes the current frame number and k-1 denotes the frame number one frame earlier.

<Step S2>
From the continuation state index of the point indicated by the first speaker number T1(k-1), the utterance continuation state of the point regarded as the first speaker is determined. For example, the utterance is judged to be continuing if the continuation state index is positive and to have ended if it is 0 or negative.

<Step S3>
Based on the determination in step S2, step S4 is performed if the point indicated by the first speaker number T1(k-1) is continuing its utterance, and step S5 is performed if its utterance has ended.

<Step S4>
If the point indicated by the first speaker number T1(k-1) is judged to be continuing its utterance, the re-search flag of the first speaker is reset to 0. A re-search flag value of 0 indicates that no re-search is needed.

<Step S5>
If the utterance at the point indicated by the first speaker number T1(k-1) is judged to have ended, the re-search flag of the first speaker is set to 1. A re-search flag value of 1 indicates that a re-search is needed.

<Step S6>
Next, as in step S1, the second speaker number of the preceding frame is acquired. (The second speaker number of the preceding frame is stored as appropriate in a memory such as the RAM. If there is no preceding frame, the initial value of the speaker number is used.) Let this be T2(k-1).

<Step S7>
From the continuation state index of the point indicated by the second speaker number T2(k-1), the utterance continuation state of the point regarded as the second speaker is determined. As in step S2, the utterance is judged to be continuing if the continuation state index is positive and to have ended if it is 0 or negative. The criterion may, however, differ between the first speaker's point and the second speaker's point.

<Step S8>
Based on the determination in step S7, step S9 is performed if the point indicated by the second speaker number T2(k-1) is continuing its utterance, and step S10 is performed if its utterance has ended.

<Step S9>
If the point indicated by the second speaker number T2(k-1) is judged to be continuing its utterance, the re-search flag of the second speaker is reset to 0. A re-search flag value of 0 indicates that no re-search is needed.

<Step S10>
If the utterance at the point indicated by the second speaker number T2(k-1) is judged to have ended, the re-search flag of the second speaker is set to 1. A re-search flag value of 1 indicates that a re-search is needed.

<Step S11>
If the first speaker's re-search flag is 0, step S12 is performed; if it is 1, step S15 is performed.

<Step S12>
If the first speaker's re-search flag is 0, the first speaker number T1(k-1) of the preceding frame is carried over and set as the first speaker number T1(k) of the current frame.

<Step S13>
Next, the second speaker's re-search flag is checked. If it is 0, step S14 is performed; if it is 1, step S18 is performed.

<Step S14>
If the second speaker's re-search flag is 0, the second speaker number T2(k-1) of the preceding frame is set as the second speaker number T2(k) of the current frame.

<Step S15>
If the first speaker's re-search flag is 1 in step S11, the processing branches on the second speaker's re-search flag: step S16 is performed if it is 0, and step S17 is performed if it is 1.

<Step S16>
If the second speaker's re-search flag is 0, the second speaker number T2(k-1) of the preceding frame is set as the first speaker number T1(k) of the current frame. Step S18 is then performed.

<Step S17>
If the second speaker's re-search flag is 1, the first speaker's point is searched for. For example, among the points whose continuation state index indicates that the utterance is continuing, the point with the largest voice information (for example, power or pitch correlation value) may be taken as the first speaker's point. Step S18 is then performed.

<Step S18>
If the second speaker's re-search flag is 1 in step S13, or when either step S16 or step S17 has finished, the second speaker's point is searched for among the points other than the point indicated by the first speaker number T1(k) of the current frame. The specific search procedure may be the same as the search for the first speaker's point in step S17.
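The search used in steps S17 and S18 can be sketched as a single function. The function name, the list-based data layout, the threshold of 0, and the choice to return `None` when no point is continuing its utterance are assumptions for illustration; the description does not state what happens when no candidate exists.

```python
def search_speaker(indices, voice_info, exclude=None, threshold=0):
    """Pick the point with the largest voice information among speaking points.

    indices    -- continuation state index per point (list position = point number)
    voice_info -- voice information per point, e.g. power (same ordering)
    exclude    -- point number to skip (the current first speaker in step S18)
    Returns the selected point number, or None if no point qualifies
    (the None fallback is an assumption, not taken from the description).
    """
    candidates = [
        point
        for point in range(len(indices))
        if point != exclude and indices[point] > threshold
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda point: voice_info[point])


# Points A, B, C are numbered 0, 1, 2; the values are made up.
indices = [3, 0, 2]
power = [0.1, 0.9, 0.5]
first = search_speaker(indices, power)                   # step S17
second = search_speaker(indices, power, exclude=first)   # step S18
print(first, second)
```

Point B has the largest power in this example but is skipped because its continuation state index shows that its utterance has ended.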

Thus, in the first embodiment, if the first speaker was continuing to speak as of the preceding frame, the first speaker of the preceding frame remains the first speaker of the current frame. In this case, if the second speaker was also continuing to speak as of the preceding frame, the second speaker of the preceding frame remains the second speaker of the current frame; if the second speaker's utterance had ended, the second speaker of the current frame is searched for among the speakers at the points other than the first speaker's point. If, on the other hand, the first speaker's utterance had ended as of the preceding frame, the processing depends on the utterance continuation state of the second speaker as of the preceding frame. If the second speaker of the preceding frame was continuing to speak, that speaker becomes the first speaker of the current frame, and the second speaker of the current frame is searched for among the speakers at the points other than the point now regarded as the first speaker. If the second speaker's utterance had also ended, the first speaker of the current frame is first searched for among the speakers at all points, and the second speaker of the current frame is then searched for among the speakers at the points other than the point now regarded as the first speaker.
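The per-frame decision flow of steps S1 to S18 can be condensed into the following sketch. It is written under stated assumptions: the names `select_speakers` and `_search` are hypothetical, a threshold of 0 is used for both speakers, and when a search finds no point that is continuing its utterance the previous speaker number is kept (or the previous first speaker, if keeping the previous second speaker would make both numbers equal). That fallback is not specified in the description.

```python
def _search(indices, voice_info, exclude, fallback, threshold):
    """Largest voice information among speaking points, else `fallback`."""
    candidates = [
        p for p in range(len(indices))
        if p != exclude and indices[p] > threshold
    ]
    if not candidates:
        return fallback  # assumption: keep a previous number when nobody speaks
    return max(candidates, key=lambda p: voice_info[p])


def select_speakers(prev_first, prev_second, indices, voice_info, threshold=0):
    """Return (T1(k), T2(k)) from (T1(k-1), T2(k-1)) and the current indices."""
    research_first = indices[prev_first] <= threshold     # steps S2 to S5
    research_second = indices[prev_second] <= threshold   # steps S7 to S10

    if not research_first:                                 # step S11
        first = prev_first                                 # step S12
        if not research_second:                            # step S13
            return first, prev_second                      # step S14
    elif not research_second:                              # step S15
        first = prev_second                                # step S16
    else:
        first = _search(indices, voice_info, None, prev_first, threshold)  # step S17

    # Step S18: search the second speaker among points other than `first`.
    fallback = prev_second if prev_second != first else prev_first
    second = _search(indices, voice_info, first, fallback, threshold)
    return first, second


# Points A, B, C are 0, 1, 2; the first speaker (A) has stopped, B is still speaking.
print(select_speakers(0, 1, indices=[0, 2, 4], voice_info=[0.0, 0.3, 0.8]))
```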

Next, the voice mixing processing in the voice mixing device (500) to which the speaker selection unit (100) of the first embodiment is applied is described with reference to FIG. 7, using an example with three input points and three output points.

The packet decomposition unit (501) extracts the speech code, the VAD flag, and the voice information from each voice packet. The speech code is input to the speech code decomposition unit (502). The VAD flag and the voice information are input to the speaker selection unit (100).

The speech code decomposition unit (502) decomposes the speech code into a low-frequency code and a high-frequency code; the low-frequency code is sent to the low-frequency mixing unit (503) and the high-frequency code to the high-frequency switching unit (504). As already described, the high-frequency code is one kind of enhancement layer code, and "high-frequency code" may be read as any enhancement layer code.

The low-frequency mixing unit (503) includes a low-frequency decoding unit (5031), a mixing unit (5032), and a low-frequency encoding unit (5033). First, the low-frequency decoding unit (5031) receives the low-frequency code from each point and decodes it to obtain a low-frequency signal. Next, the mixing unit (5032) adds the low-frequency signals of all points to obtain a total low-frequency signal and, for each point, removes that point's own low-frequency signal from the total to generate the low-frequency signal destined for that point. For example, the low-frequency encoding unit for point A (5033A) encodes the low-frequency signal destined for point A to generate a low-frequency code (for point A) and sends it to the speech code combining unit (505A). The same applies to the low-frequency encoding unit for point B (5033B) and the low-frequency encoding unit for point C (5033C).
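The operation of the mixing unit (5032) can be sketched as follows: sum the decoded low-frequency signals of all points, then subtract each point's own signal to obtain the signal destined for that point. The function name and the dictionary of equal-length sample lists are assumptions for illustration; decoding and re-encoding are not shown.

```python
def mix_low_band(signals):
    """Return, for each point, the sum of all other points' low-frequency signals.

    signals -- dict mapping a point label to its decoded low-frequency samples
               (all sample lists are assumed to have the same length)
    """
    # Total low-frequency signal: sample-wise sum over all points.
    total = [sum(frame) for frame in zip(*signals.values())]
    # Signal for each point: total minus that point's own contribution.
    return {
        point: [t - s for t, s in zip(total, own)]
        for point, own in signals.items()
    }


decoded = {"A": [1, 2], "B": [10, 20], "C": [100, 200]}  # made-up samples
for point, mixed in mix_low_band(decoded).items():
    print(point, mixed)
```

Computing the total once and subtracting each point's own signal avoids re-summing the other points separately for every destination.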

As already described, the speaker selection unit (100) refers to the VAD flag and the voice information from each point and outputs the first speaker number and the second speaker number. The first speaker number and the second speaker number are input to the high-frequency switching unit (504).

The high-frequency switching unit (504) receives the high-frequency code from each point together with the first speaker number and the second speaker number, switches the high-frequency codes accordingly, and outputs them to the speech code combining units (505A), (505B), and (505C).
This is described concretely for the case where the first speaker number indicates point A and the second speaker number indicates point B. For the first speaker's voice, the first switching unit (5041) selects the high-frequency code of point A, the second switching unit (5042) selects as output destinations the points other than the one indicated by the first speaker number (that is, points B and C), and the high-frequency code of point A is sent to the speech code combining units (505B) and (505C). For the second speaker's voice, the first switching unit (5041) selects the high-frequency code of point B, the second switching unit (5042) selects as the output destination point A, which is the point indicated by the first speaker number, and the high-frequency code of point B is sent to the speech code combining unit (505A).
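The routing performed by the high-frequency switching unit (504) in this example can be sketched as a mapping from each destination point to the high-frequency code it receives. The function name and the use of strings as stand-ins for high-frequency codes are assumptions; extending the rule to more than three points in this way is likewise an assumption.

```python
def route_high_band(high_codes, first, second):
    """Return a dict mapping each destination point to a high-frequency code.

    high_codes -- dict mapping a point label to that point's high-frequency code
    first      -- label of the first speaker's point
    second     -- label of the second speaker's point
    The first speaker's point receives the second speaker's code; every other
    point receives the first speaker's code.
    """
    routed = {}
    for destination in high_codes:
        source = second if destination == first else first
        routed[destination] = high_codes[source]
    return routed


codes = {"A": "high_A", "B": "high_B", "C": "high_C"}  # placeholder codes
print(route_high_band(codes, first="A", second="B"))
```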

The speech code combining unit (505) combines the low-frequency code and the high-frequency code to generate a speech code and sends it to the packet construction unit (507). The packet construction unit (507) generates a voice packet from the received speech code and outputs it to the destination point.

<< Modification of First Embodiment >>
Next, a modification of the first embodiment is described.
In the first embodiment, the speaker selection unit (100) outputs the first speaker number and the second speaker number. The modification, in contrast, is a simplified form that outputs only the first speaker number. Only the parts that differ from the first embodiment are described here.

FIG. 8 shows a flowchart of the speaker selection processing in the modification of the first embodiment (corresponding to step S303 above). Each of the following processes is performed by the speaker switching determination unit (102).
Steps S1 to S5 are the same as in the first embodiment. In the modification, step S4 or step S5 is followed by step S11 of the first embodiment. Steps S6 to S10 of the first embodiment are not needed.
In the modification, if the first speaker's re-search flag is 0 in step S11, step S12 is performed, and the processing ends without going through steps S13 and S14. If the first speaker's re-search flag is 1 in step S11, step S17 is performed, and the processing ends without going through step S18. Steps S15 and S16 of the first embodiment are not needed.

Note that, as is apparent from the flowchart shown in FIG. 8, steps S4, S5, and S11 are effectively unnecessary. Step S3 may therefore be changed so that, based on the continuation-state determination made in step S2, step S12 is performed if the point indicated by the first speaker number T1(k-1) is still speaking, and step S17 is performed if its utterance has ended.

As this modification of the first embodiment shows most plainly, the basic feature of the present invention is as follows. The continuation state index is updated successively from the time series of VAD flags, and it is determined whether the index represents an ongoing utterance or the end of an utterance. If the utterance is ongoing, the first speaker remains the first speaker; if the utterance has ended, a search for the first speaker is performed using the voice information.
The first embodiment described above can be regarded as applying this basic feature to the combination of a first speaker and a second speaker.
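The basic feature described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the step sizes, threshold, and bounds are hypothetical values, the function names are invented for this example, and "searching for the first speaker using the voice information" is assumed here to mean picking the point with the largest voice information (for example, voice power).

```python
# Hypothetical tuning values; the description does not fix them here.
STEP_UP = 1      # added on a frame whose VAD flag indicates a voice section
STEP_DOWN = 1    # subtracted on a frame whose VAD flag indicates non-voice
THRESHOLD = 3    # index >= THRESHOLD is treated as "utterance ongoing"
LOWER, UPPER = 0, 10


def update_continuation(index: int, vad_is_voice: bool) -> int:
    """Update the continuation state index from one VAD flag.

    An update that would leave the [LOWER, UPPER] range is skipped.
    """
    candidate = index + STEP_UP if vad_is_voice else index - STEP_DOWN
    return candidate if LOWER <= candidate <= UPPER else index


def is_continuing(index: int) -> bool:
    """True if the index represents an ongoing utterance."""
    return index >= THRESHOLD


def select_first_speaker(prev_speaker, indices, voice_info):
    """Keep the previous first speaker while it is still speaking;
    otherwise search all points (assumed: largest voice information)."""
    if is_continuing(indices[prev_speaker]):
        return prev_speaker
    return max(voice_info, key=voice_info.get)
```

In this sketch the previous first speaker is retained purely on the basis of its continuation state index, so short pauses inside an utterance do not by themselves trigger a switch.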

The voice mixing process in the voice mixing device (500) that uses the speaker selection unit (100) of the modification of the first embodiment is described next. Only the parts that differ from the first embodiment are described.

As already described, the speaker selection unit (100) refers to the VAD flag and the voice information from each point and outputs only the first speaker number. The first speaker number is input to the high frequency switching unit (504).

The high frequency switching unit (504) receives the high frequency code from each point and, on receiving the first speaker number, switches the high frequency codes and outputs them to the speech code combining units (505A), (505B), and (505C).
Take the case where the first speaker number indicates point A. For the voice of the first speaker, the first switching unit (5041) selects the high frequency code of point A, and the second switching unit (5042) selects as output destinations the points other than the one indicated by the first speaker number, that is, point B and point C. The high frequency code of point A is thus sent to the speech code combining unit (505B) and the speech code combining unit (505C). The speech code combining unit (505A), which corresponds to point A indicated by the first speaker number, may simply be sent a code whose high frequency band is silent.
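The routing just described can be illustrated with a short Python sketch. The function name, the dictionary representation, and the `SILENT_HIGH_BAND` placeholder are assumptions made for this example; the actual form of a silent high frequency code depends on the codec and is not specified here.

```python
# Hypothetical placeholder for "a code whose high frequency band is silent".
SILENT_HIGH_BAND = b""


def route_high_band(first_speaker, high_band_codes):
    """Return a mapping: destination point -> high frequency code to send.

    high_band_codes maps each point to the high frequency code it sent.
    """
    # Role of the first switching unit (5041): pick the first speaker's code.
    selected = high_band_codes[first_speaker]
    # Role of the second switching unit (5042): send it to every other point,
    # and send a silent high band back to the first speaker's own point.
    return {
        dest: SILENT_HIGH_BAND if dest == first_speaker else selected
        for dest in high_band_codes
    }
```

With points A, B, and C and A as the first speaker, the combining units for B and C would receive A's high frequency code, while the one for A receives the silent placeholder.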

<< Second Embodiment >>
A second embodiment of the present invention is described below.
FIG. 9 shows an example of the functional configuration of the speaker selection unit (200) in the second embodiment. As noted earlier, the example assumes three input points, as in the first embodiment; in practice, any number of points may be used as long as there is a plurality of points.

In the second embodiment, a comparison is made even when the continuation state index of the point that was determined to be the speaking point up to one frame earlier indicates an ongoing utterance. Among the other points, the voice information of each point whose continuation state index indicates an ongoing utterance is compared with the selected voice information, which is selected from the voice information of the point that was determined to be the speaking point up to one frame earlier. If the voice information of another point is more eligible for the speaker than the selected voice information, the voice information of all points is compared to determine a new speaker point. The second embodiment addresses situations in which the VAD flag corresponding to the first speaker keeps indicating a voice section, for example because the input voice signal contains background noise. It makes speaker selection more appropriate by judging, from the voice information of each point, whether the first speaker needs to be searched for again.

The second embodiment adds a buffer (103) and a voice information search unit (104) to the functional configuration of the speaker selection unit (100) of the first embodiment. For brevity, the parts identical to FIG. 5 are not described again, and only the differences from the first embodiment are described.

The voice information input to the speaker selection unit (200) is sent to the speaker switching determination unit (102) and is also sent to the buffer (103), where it is stored (step S3021). The buffer (103) is a storage means that can hold the voice information of past frames for a predetermined number of frames. The buffer (103) can be implemented with the RAM (15), the register (19), or the like, and may be, for example, a shift buffer.
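As a rough illustration of such a buffer, the following Python sketch keeps the voice information of the most recent frames and discards the oldest entry once the predetermined number of frames is exceeded. The class name and its methods are hypothetical; `collections.deque` with `maxlen` is used here simply as one convenient way to obtain shift-buffer behavior.

```python
from collections import deque


class VoiceInfoBuffer:
    """Holds the voice information of the last `num_frames` frames."""

    def __init__(self, num_frames: int):
        # A deque with maxlen drops the oldest item when a new one is
        # appended to a full buffer, which mimics a shift buffer.
        self._frames = deque(maxlen=num_frames)

    def push(self, voice_info: float) -> None:
        """Store the voice information of the newest frame."""
        self._frames.append(voice_info)

    def contents(self) -> list:
        """Return the stored values, oldest first."""
        return list(self._frames)
```

One such buffer would be kept per input point, since the voice information of each point is stored separately.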

The voice information search unit (104) searches the buffer (103), selects one piece of voice information (the selected voice information) from the voice information within a predetermined past period, and sends it to the speaker switching determination unit (102) (step S3022). The selection method can be chosen to suit the type of voice information: the piece that most strongly indicates speaker eligibility among the voice information stored in the buffer (103) is selected. For example, if the voice information is the voice power or a pitch correlation value, its maximum value can be used as the selected voice information. It is sufficient for this selection to be performed only by the voice information search unit (104) corresponding to the point that was the first speaker one frame earlier.
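For the case where the voice information is a voice power or pitch correlation value, the selection can be sketched as taking a maximum over the buffered values. The function name and the optional `window` parameter (the number of most recent frames to consider) are assumptions introduced for this example.

```python
def select_voice_info(buffered, window=None):
    """Pick the selected voice information from buffered values.

    buffered: voice information values, oldest first.
    window:   if given, only the `window` most recent frames are searched.
    """
    if window is not None:
        if window <= 0:
            raise ValueError("window must be positive")
        buffered = buffered[-window:]
    if not buffered:
        raise ValueError("no voice information is buffered")
    # Assumed eligibility measure: a larger value is more speaker-eligible.
    return max(buffered)
```

Because the maximum over a past window is used, a brief drop in power within an utterance does not immediately lower the selected voice information.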

The speaker switching determination unit (102) receives the continuation state index and the voice information of each point, together with the selected voice information of the first speaker, and outputs the first speaker number and the second speaker number (step S303').

FIG. 10-2 shows a flowchart of the speaker selection process in the speaker switching determination unit (102) of the second embodiment (details of step S303'). Each of the processes below is carried out by the speaker switching determination unit (102).
The flowchart of the speaker selection process in the second embodiment is that of the first embodiment with steps S31 and S32 added. As before, the parts identical to the speaker selection process in the first embodiment are not described again.

<Step S3>
Based on the continuation-state determination made in step S2, step S31 is performed if the point indicated by the first speaker number T1(k-1) is still speaking, and step S5 is performed if its utterance has ended.

<Step S31>
When the point indicated by the first speaker number T1(k-1) is determined to be still speaking, it is determined from the selected voice information of that point and the voice information of the other points whether a forced disconnection is necessary. Here, forced disconnection means forcibly setting the re-search flag to 1 (that is, re-search required) even though the continuation state index indicates an ongoing utterance.
For example, if the voice information is the voice power, a forced disconnection is judged necessary when all of the following hold: the selected voice information of the point indicated by T1(k-1) (for example, the maximum voice power) is smaller than a predetermined threshold; that selected voice information is smaller than the voice information of at least one other point; and the continuation state index of a point whose voice information is larger than that selected voice information indicates an ongoing utterance. Otherwise, a forced disconnection is judged unnecessary.
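The voice-power example of the forced disconnection test can be written as a small predicate. This is a sketch under assumptions: the function and parameter names are invented here, and `continuing` is taken to be a mapping from each point to a boolean saying whether its continuation state index indicates an ongoing utterance.

```python
def needs_forced_disconnection(prev_speaker, selected_info, voice_info,
                               continuing, power_threshold):
    """Decide whether the previous first speaker should be forcibly dropped.

    selected_info: selected voice information of the previous first speaker
                   (for example, its maximum voice power in the buffer).
    voice_info:    current-frame voice information for every point.
    continuing:    point -> True if its utterance is judged to be ongoing.
    """
    # Condition 1: the previous speaker's selected value is below the threshold.
    if selected_info >= power_threshold:
        return False
    # Conditions 2 and 3: some other point is louder than the selected value
    # and that point is itself judged to be speaking.
    return any(
        point != prev_speaker and value > selected_info and continuing[point]
        for point, value in voice_info.items()
    )
```

The threshold check means a first speaker with clearly strong voice information is never forcibly dropped, even if another point is momentarily louder.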

<Step S32>
If a forced disconnection is unnecessary, step S4 is performed; if a forced disconnection is necessary, step S5 is performed.

<Step S4>
If a forced disconnection is unnecessary, the re-search flag of the first speaker is reset to 0.

<Step S5>
If the continuation determination in step S3 found that the utterance has ended, or if the forced disconnection determination in step S32 found that a forced disconnection is necessary, the re-search flag of the first speaker is set to 1.

The processing from step S6 onward is the same as in the speaker selection process of the first embodiment.

<< Modification of Second Embodiment >>
Next, a modification of the second embodiment will be described.
In the second embodiment, the speaker selection unit (100) outputs both the first speaker number and the second speaker number. The modification of the second embodiment, by contrast, is a simplified form that outputs only the first speaker number. In other words, the modification of the second embodiment is analogous to the modification of the first embodiment. Only the parts that differ from the second embodiment are described.

FIG. 11 shows a flowchart of the speaker selection process in the modification of the second embodiment (corresponding to step S303' above). Each of the processes below is carried out by the speaker switching determination unit (102).
Steps S1 to S5, including steps S31 and S32, are the same as in the second embodiment. In the modification of the second embodiment, step S11 of the second embodiment is performed after step S4 or step S5. Steps S6 to S10 of the second embodiment are not needed.
In the modification of the second embodiment, if the re-search flag of the first speaker is 0 after step S11, step S12 is performed, and the process then ends without going through steps S13 and S14. If the re-search flag of the first speaker is 1 after step S11, step S17 is performed, and the process then ends without going through step S18. Steps S15 and S16 of the second embodiment are likewise not needed.

Note that, as is apparent from the flowchart shown in FIG. 11, steps S4, S5, and S11 are effectively unnecessary. Accordingly, step S12 may be performed when step S32 determines that a forced disconnection is unnecessary, and step S17 may be performed when the continuation determination in step S3 finds that the utterance has ended or when the forced disconnection determination in step S32 finds that a forced disconnection is necessary.
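The simplified flow of this modification can be summarized in a short Python sketch. Several points are assumptions made for illustration: step S12 is taken to mean "keep the previous first speaker", step S17 is taken to mean "pick the point with the largest voice information", the forced disconnection test follows the voice-power example given for step S31, and all names are hypothetical.

```python
def choose_first_speaker(prev_speaker, continuing, voice_info,
                         selected_info, power_threshold):
    """Return the first speaker for the current frame.

    continuing:    point -> True if its utterance is judged to be ongoing.
    voice_info:    current-frame voice information for every point.
    selected_info: selected voice information of the previous first speaker.
    """
    if continuing[prev_speaker]:                      # step S3
        forced = selected_info < power_threshold and any(   # steps S31, S32
            point != prev_speaker and continuing[point] and value > selected_info
            for point, value in voice_info.items()
        )
        if not forced:
            return prev_speaker                       # assumed step S12
    # Utterance ended or forced disconnection: search again (assumed step S17).
    return max(voice_info, key=voice_info.get)
```

Compared with the modification of the first embodiment, the only addition is the forced disconnection branch, which lets a louder, actively speaking point take over from a first speaker whose VAD flag is held on by background noise.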

The voice mixing process in the voice mixing device (500) that uses the speaker selection unit (100) of the modification of the second embodiment is the same as in the modification of the first embodiment, so its description is omitted.

In the embodiments above, the speaker selection device was described as a functional entity (the speaker selection unit) of the voice mixing device. Even if the speaker selection device is implemented as a standalone entity, for example on a computer (general-purpose machine), its functional configuration and the hardware configuration of the computer that realizes it do not differ substantially from those described in the above embodiments.

The speaker selection device and method of the present invention are not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the present invention. The processes described for the speaker selection device and method need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required.

When the processing functions of the speaker selection device are realized by a computer, the processing content of the functions that the speaker selection device should have is described by a program. Executing this program on the computer realizes the processing functions of the speaker selection device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), or a CD-R (Recordable) / RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute a process according to it, or it may execute a process according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

The speaker selection device may be configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

The present invention enables more accurate speaker selection even when the voice contained in the voice packets sent from each point includes background noise, when the input voice level fluctuates, when the input voice level differs from point to point, or when the actual speaker (speaking point) changes within a short time. As a result, a stable mixed voice with little degradation in sound quality after mixing can be obtained, so the present invention is useful for multipoint voice communication using voice packet communication.

A diagram showing an example system configuration for multipoint voice communication.
A diagram showing an example functional configuration of the voice packet transmission unit.
A diagram showing an example functional configuration of the voice mixing device.
A diagram showing an example hardware configuration of the voice mixing device.
A diagram showing an example functional configuration of the speaker selection device of the first embodiment.
A diagram showing the flow of the speaker selection process in the first embodiment.
A diagram showing the detailed processing flow of the speaker switching determination unit in the speaker selection process of the first embodiment.
A diagram showing an example functional configuration of a voice mixing device to which the speaker selection device (speaker selection unit) of the first embodiment is applied.
A diagram showing the detailed processing flow of the speaker switching determination unit in the speaker selection process of the modification of the first embodiment.
A diagram showing an example functional configuration of the speaker selection device of the second embodiment.
A diagram showing the flow of the speaker selection process in the second embodiment.
A diagram showing the detailed processing flow of the speaker switching determination unit in the speaker selection process of the second embodiment.
A diagram showing the detailed processing flow of the speaker switching determination unit in the speaker selection process of the modification of the second embodiment.

Explanation of symbols

100 Speaker selection unit
101 Continuation state update unit
102 Speaker switching determination unit
103 Buffer
104 Voice information maximum value search unit
200 Speaker selection unit
500 Voice mixing device
501A, 501B, 501C Packet decomposition unit
502A, 502B, 502C Voice code decomposition unit
503 Low frequency mixing unit
5031A, 5031B, 5031C Low frequency decoding unit
5032 Mixing unit
5033A, 5033B, 5033C Low frequency encoding unit
504 High frequency switching unit
505A, 505B, 505C Speech code combining unit
507A, 507B, 507C Packet construction unit

Claims

A speaker selection device for voice mixing, which selects speakers and mixes voice packets sent from three or more points (hereinafter referred to as input points), comprising:
corresponding to each input point,
continuation state updating means for updating a continuation state index, which is an index indicating the utterance continuation state of the input point, using a determination result (hereinafter referred to as a VAD flag) indicating whether the voice code of one frame (hereinafter referred to as the current frame) included in a voice packet sent from one input point is a voice section or a non-voice section,
voice information storage means for storing the acoustic feature quantity (hereinafter referred to as voice information) of the voice code of each frame included in the past and current voice packets sent from one input point, and
voice information search means for selecting one piece of voice information (hereinafter referred to as selected voice information) from the voice information stored in the voice information storage means; and
speaker switching judgment means which takes as input the continuation state index of each input point, the voice information corresponding to the current frame included in each voice packet sent from each input point, a speaker index indicating the input point of the speaker one frame before, and the selected voice information, and which,
when the continuation state index of the input point indicated by the speaker index indicates that the utterance is continuing,
compares the voice information of an input point that is other than the input point indicated by the speaker index and whose continuation state index indicates that the utterance is continuing (hereinafter referred to as an utterance continuation other point) with the selected voice information of the input point indicated by the speaker index,
compares the voice information of each input point to determine the input point of the speaker of the current frame if the voice information of the utterance continuation other point is more eligible for the speaker than the selected voice information, and
determines the input point indicated by the speaker index as the input point of the speaker of the current frame if the selected voice information is more eligible for the speaker than the voice information of the utterance continuation other point, and,
when the continuation state index of the input point indicated by the speaker index indicates the end of the utterance,
compares the voice information of each input point to determine the input point of the speaker of the current frame.

The speaker selection device according to claim 1, wherein the speaker switching judgment means
takes as input the continuation state index of each input point, the voice information corresponding to the current frame included in each voice packet sent from each input point, a first speaker index indicating the input point of the first speaker, who is the main speaker one frame before, a second speaker index indicating the input point of the second speaker, who is the main speaker other than the first speaker one frame before, and the selected voice information, and,
when the continuation state index of the input point indicated by the first speaker index indicates that the utterance is continuing (hereinafter referred to as the first case), and a comparison of the voice information of an input point that is other than the input point indicated by the first speaker index and whose continuation state index indicates that the utterance is continuing (hereinafter referred to as an utterance continuation other point) with the selected voice information of the input point indicated by the first speaker index shows that the selected voice information is more eligible for the speaker than the voice information of the utterance continuation other point,
determines the input point indicated by the first speaker index as the input point of the first speaker in the current frame,
determines the input point indicated by the second speaker index as the input point of the second speaker in the current frame if the continuation state index of the input point indicated by the second speaker index indicates that the utterance is continuing, and
compares the voice information of each input point other than the input point indicated by the first speaker index to determine the input point of the second speaker in the current frame if the continuation state index of the input point indicated by the second speaker index indicates the end of the utterance, and,
when, in the first case, the comparison of the voice information of the utterance continuation other point with the selected voice information of the input point indicated by the first speaker index shows that the voice information of the utterance continuation other point is more eligible for the speaker than the selected voice information, or when the continuation state index of the input point indicated by the first speaker index indicates the end of the utterance,
determines the input point indicated by the second speaker index as the input point of the first speaker in the current frame, and compares the voice information of each input point other than the input point indicated by the second speaker index to determine the input point of the second speaker in the current frame, if the continuation state index of the input point indicated by the second speaker index indicates that the utterance is continuing, and
compares the voice information of each input point to determine the input point of the first speaker in the current frame, and compares the voice information of each input point other than the determined input point of the first speaker to determine the input point of the second speaker in the current frame, if the continuation state index of the input point indicated by the second speaker index indicates the end of the utterance.

The speaker selection device according to claim 1 or claim 2, wherein the voice information is a voice power or a pitch correlation value, and
the selected voice information is the maximum value of the voice power, or the maximum value of the pitch correlation value, among the voice information stored in the voice information storage means.

The speaker selection device according to any one of claims 1 to 3, wherein the continuation state index is an integer value, a continuation state index that is greater than or equal to a predetermined threshold represents continuation of the utterance, and a continuation state index that is less than the threshold represents the end of the utterance, and
the continuation state updating means updates the continuation state index by adding a predetermined positive value to the current continuation state index when the VAD flag represents a voice section, and updates the continuation state index by subtracting from the current continuation state index a predetermined positive value, which is the same as or different from the former predetermined positive value, when the VAD flag represents a non-voice section.

The speaker selection device according to claim 4, wherein an upper limit and a lower limit are set for the continuation state index, and
the continuation state updating means does not perform the addition update of the continuation state index if the update would exceed the upper limit, and does not perform the subtraction update of the continuation state index if the update would fall below the lower limit.

A speaker selection method in voice mixing in which a speaker is selected and mixed with respect to a voice packet transmitted from three or more points (hereinafter referred to as input points),
Corresponding to each input point,
A continuation state update step of updating a continuation state index, which is an index indicating the utterance continuation state of the input point, using a determination result (hereinafter referred to as a VAD flag) indicating whether the voice code of one frame (hereinafter referred to as the current frame) included in a voice packet sent from one input point is a voice section or a non-voice section;
A voice information storage step in which voice information storage means stores an acoustic feature quantity (hereinafter referred to as voice information) of a voice code of a frame included in a voice packet sent from one input point;
A voice information search step in which voice information search means selects one piece of voice information (hereinafter referred to as selected voice information) from the voice information stored in the voice information storage means;
A speaker switching determination step in which speaker switching determination means takes as input the continuation state index of each input point, the voice information corresponding to the current frame included in each voice packet transmitted from each input point, a speaker index indicating the input point of the speaker one frame before, and the selected voice information, and
When the continuation state index of the input point indicated by the speaker index indicates that the utterance is continuing,
Compares the voice information of an input point that is other than the input point indicated by the speaker index and whose continuation state index indicates that the utterance is continuing (hereinafter referred to as an utterance continuation other point) with the selected voice information of the input point indicated by the speaker index,
If the voice information of the utterance continuation other point is more eligible for the speaker than the selected voice information, determines the input point of the speaker of the current frame by comparing the voice information of each input point,
If the selected voice information is more eligible for the speaker than the voice information of the utterance continuation other point, determines the input point of the speaker of the current frame to be the input point indicated by the speaker index, and
When the continuation state index of the input point indicated by the speaker index indicates the end of utterance,
Determines the input point of the speaker of the current frame by comparing the voice information of each input point.
A speaker selection method comprising the above steps.
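The speaker switching determination step of the method above can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: the function name `select_speaker` and its parameters are hypothetical, "more eligible for the speaker" is assumed to mean a larger voice information value (for example, higher voice power), and the previous speaker is assumed to be kept when no other input point is continuing.

```python
from typing import Sequence


def select_speaker(
    continuing: Sequence[bool],   # continuation state of each input point
    voice_info: Sequence[float],  # current-frame voice information of each input point
    prev_speaker: int,            # speaker index from one frame before
    selected_info: float,         # selected voice information of the previous speaker
) -> int:
    """Return the input point chosen as the speaker of the current frame."""
    points = range(len(voice_info))

    def most_eligible() -> int:
        # Assumption: the largest voice information value is the most eligible.
        return max(points, key=lambda i: voice_info[i])

    if continuing[prev_speaker]:
        # Other input points whose utterance is also continuing.
        others = [i for i in points if i != prev_speaker and continuing[i]]
        if any(voice_info[i] > selected_info for i in others):
            # A competing point beats the previous speaker's selected value:
            # re-decide by comparing the voice information of every input point.
            return most_eligible()
        # Otherwise keep the previous speaker (also when there is no competitor).
        return prev_speaker

    # Previous speaker has finished: compare every input point.
    return most_eligible()
```

Comparing competitors against the selected voice information rather than against the previous speaker's current-frame value means a momentary dip in the current speaker's level does not by itself hand the selection to another point.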

The above speaker switching determination step is a step in which
The speaker switching determination means takes as input the continuation state index of each input point, the voice information corresponding to the current frame included in each voice packet transmitted from each input point, a first speaker index indicating the input point of the first speaker, who was the main speaker one frame before, a second speaker index indicating the input point of the second speaker, who was the main speaker other than the first speaker one frame before, and the selected voice information, and
When the continuation state index of the input point indicated by the first speaker index indicates that the utterance is continuing (hereinafter referred to as the first case), and comparing the voice information of an input point that is other than the input point indicated by the first speaker index and whose continuation state index indicates that the utterance is continuing (hereinafter referred to as an utterance continuation other point) with the selected voice information of the input point indicated by the first speaker index shows that the selected voice information is more eligible for the speaker than the voice information of the utterance continuation other point,
Determines the input point of the first speaker in the current frame to be the input point indicated by the first speaker index,
If the continuation state index of the input point indicated by the second speaker index indicates that the utterance is continuing, determines the input point of the second speaker in the current frame to be the input point indicated by the second speaker index, and
If the continuation state index of the input point indicated by the second speaker index indicates the end of utterance, determines the input point of the second speaker in the current frame by comparing the voice information of each input point other than the input point indicated by the first speaker index;
And, in the first case, when comparing the voice information of the utterance continuation other point with the selected voice information of the input point indicated by the first speaker index shows that the voice information of the utterance continuation other point is more eligible for the speaker than the selected voice information, or when the continuation state index of the input point indicated by the first speaker index indicates the end of utterance,
If the continuation state index of the input point indicated by the second speaker index indicates that the utterance is continuing, determines the input point of the first speaker in the current frame to be the input point indicated by the second speaker index, and determines the input point of the second speaker in the current frame by comparing the voice information of each input point other than the input point indicated by the second speaker index, and
If the continuation state index of the input point indicated by the second speaker index indicates the end of utterance, determines the input point of the first speaker in the current frame by comparing the voice information of each input point, and then determines the input point of the second speaker in the current frame by comparing the voice information of each input point other than the determined input point of the first speaker. The speaker selection method according to claim 6.
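The two-speaker branching of this claim is easier to follow as a decision tree. The Python sketch below is illustrative only: `select_two_speakers` and its parameter names are hypothetical, and, as an assumption, a larger voice information value is treated as "more eligible for the speaker".

```python
from typing import Optional, Sequence, Tuple


def select_two_speakers(
    continuing: Sequence[bool],   # continuation state of each input point
    voice_info: Sequence[float],  # current-frame voice information of each input point
    first_prev: int,              # first speaker index from one frame before
    second_prev: int,             # second speaker index from one frame before
    selected_info: float,         # selected voice information of the previous first speaker
) -> Tuple[int, int]:
    """Return (first speaker, second speaker) input points for the current frame."""
    points = list(range(len(voice_info)))

    def best(exclude: Optional[int] = None) -> int:
        # Assumption: the largest voice information value is the most eligible.
        return max((i for i in points if i != exclude), key=lambda i: voice_info[i])

    # The first speaker holds the floor if still continuing and no other
    # continuing point exceeds the first speaker's selected voice information.
    first_holds = continuing[first_prev] and not any(
        voice_info[i] > selected_info
        for i in points
        if i != first_prev and continuing[i]
    )

    if first_holds:
        first = first_prev
        second = second_prev if continuing[second_prev] else best(exclude=first_prev)
    elif continuing[second_prev]:
        # The previous second speaker is promoted to first speaker.
        first = second_prev
        second = best(exclude=second_prev)
    else:
        # Neither previous speaker holds: decide both from scratch.
        first = best()
        second = best(exclude=first)
    return first, second
```

For example, with four input points, voice information `[2, 3, 8, 1]`, previous speakers 0 and 1 both continuing, and selected voice information 5, the previous pair is kept; lowering the selected voice information to 2.5 promotes input point 1 to first speaker and selects input point 2 as second speaker.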

The voice information is assumed to be a voice power or pitch correlation value,
The selected voice information is the maximum value of the voice power, or the maximum value of the pitch correlation, among the voice information stored in the voice information storage means. The speaker selection method according to claim 6 or claim 7.
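A minimal Python sketch of the voice information storage and search described here follows. The class `VoiceInfoStore`, the fixed-length history, and the default of five frames are assumptions made for illustration; the claims do not state how many frames are stored.

```python
from collections import deque


class VoiceInfoStore:
    """Keeps recent voice information (e.g. voice power) for one input point."""

    def __init__(self, history_frames: int = 5) -> None:
        # Assumed: only the most recent `history_frames` values are retained.
        self._history: deque = deque(maxlen=history_frames)

    def store(self, value: float) -> None:
        """Store the voice information of the current frame."""
        self._history.append(value)

    def selected(self) -> float:
        """Return the selected voice information: the maximum stored value.

        With nothing stored yet, negative infinity is returned so that any
        competing input point compares as more eligible (an assumption).
        """
        return max(self._history, default=float("-inf"))
```

Usage: after storing 4.0, 9.0, and 2.0 in a store created with `history_frames=3`, `selected()` returns 9.0; once two further frames push the 9.0 out of the history, the maximum is taken over the remaining values only.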

The continuation state index is an integer value; a continuation state index greater than or equal to a predetermined threshold represents continuation of utterance, and a continuation state index less than the threshold represents the end of utterance,
The continuation state update step updates the continuation state index by adding a predetermined positive value to the current continuation state index when the VAD flag represents a voice section, and, when the VAD flag represents a non-voice section, updates the continuation state index by subtracting from the current continuation state index a predetermined positive value that is the same as or different from the above predetermined positive value. The speaker selection method according to any one of claims 6 to 8.

An upper limit and a lower limit are set for the continuation state index,
The continuation state update step does not perform the addition update of the continuation state index when the result would exceed the upper limit, and does not perform the subtraction update of the continuation state index when the result would fall below the lower limit. The speaker selection method according to claim 9.

A speaker selection program for causing a computer to function as the speaker selection device according to any one of claims 1 to 5.

A computer-readable recording medium on which the speaker selection program according to claim 11 is recorded.