JP2006039108A - Prescribed speaker speech output device and prescribed speaker determination program

Info

Publication number: JP2006039108A
Application number: JP2004217299A
Authority: JP
Inventors: Shoe Sato; 庄衛佐藤; Toru Imai; 亨今井
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-07-26
Filing date: 2004-07-26
Publication date: 2006-02-09
Anticipated expiration: 2024-07-26
Also published as: JP4510539B2

Abstract

PROBLEM TO BE SOLVED: To provide a prescribed speaker speech output device and a prescribed speaker determination program that can accurately extract only the speech data of a prescribed speaker from speech data containing a crosstalk component, with a small amount of computation.

SOLUTION: A crosstalk speech recognizing device 1 is equipped with a speech data input means 2 for inputting speech data (x) and (y) from microphones Mx and My provided for speakers X and Y respectively, a frame extracting means 3 for extracting frames from the speech data (x) and (y), speech data power calculation parts 4a and 4b which calculate the power levels of the frames, a cross-correlation coefficient calculating means 5 for calculating cross-correlation coefficients between the frames of the two speech data (x) and (y), a speaker speech determination means 7 for determining whether the frames are the speech data of the speakers X and Y based upon the frame power levels and the cross-correlation coefficients, and an attenuator 8 which outputs a frame determined to be the speech data of the speaker X or Y.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a technique in which, when a plurality of speakers each speak into their own microphone, only the voice data of the speaker corresponding to a given microphone is output from the voice data that microphone produces, which contains both that speaker's voice data and the voice data of the other speakers.

Conventionally, speech recognition for automatically adding subtitles to broadcast programs has been put into practical use (see, for example, Non-Patent Document 1). In this technique, starting from a news program manuscript prepared in advance as text data (an electronic manuscript), the voice of the announcer reading a partially revised version of that manuscript is recognized and the electronic manuscript is corrected accordingly, thereby generating subtitles. The recognition rate can be improved by, for example, performing speech recognition with acoustic models (models of phoneme features) that depend on the speaker, such as gender-dependent models.

When a plurality of speakers speak in turn, as in a dialogue, the microphone placed near each speaker picks up not only the voice of the nearby speaker (hereinafter, the specific speaker) but also the voices of the other speakers (crosstalk components), so the voice data output from the microphone contains the voices of several speakers. There is a technique for extracting only the specific speaker's voice from such voice data (see, for example, Non-Patent Document 2). In this technique, voice data input from the microphone is judged to be a crosstalk component when its input power is small and to be the specific speaker's voice data when its power is large; attenuating the crosstalk components then leaves only the voice of the target speaker.

As another method of extracting only the voice of a specific speaker, a technique has been disclosed that calculates the crosstalk component using transfer characteristics estimated from cross-correlation coefficients and cancels that component (see Non-Patent Document 3).
Non-Patent Document 1: Toru Imai (今井亨) and three others, "Speech recognition system for automatic captioning of news programs", Spoken Language Information Processing Technical Report, October 17, 1998, 23-11, p. 59-64
Non-Patent Document 2: DPR-522: BSS Audio Manual, p. 18-25
Non-Patent Document 3: Masaaki Umayahara (馬屋原将明) and two others, "Crosstalk-resistant noise canceller based on a nonlinear recursive least squares method", IEICE Transactions, February 2002, A Vol. J85-A, No. 2, p. 162-169

However, to use a language model tailored to a speaker during speech recognition, the voice data must contain only that speaker's voice; when the voice data of other speakers is also included, the recognition rate drops. Moreover, when a microphone is installed for each of several speakers and each microphone's voice data is recognized, the voices of the other speakers are recognized in addition to that of the specific speaker corresponding to the microphone, so duplicate recognition results are output.

Furthermore, in the method that extracts only the specific speaker's voice data based on the power of the voice data, when the speakers differ in relative loudness, the amplifiers provided for the microphones are set to different gains in order to even out that difference. This gain difference can invert the power ratio between the specific speaker's voice data and the crosstalk component, so that the crosstalk component becomes stronger than the specific speaker's voice and false detection occurs. The method that estimates transfer characteristics, for its part, requires a relatively large amount of computation.

The present invention has been made to solve these problems of the prior art. Its object is to provide a specific speaker voice output device and a specific speaker determination program that can extract only the voice data of a specific speaker from voice data containing crosstalk components, accurately and with a small amount of computation.

To solve the above problem, the specific speaker voice output device according to claim 1 receives voice data from microphones provided for each speaker and, from at least one of the voice data, outputs the voice data of the speaker corresponding to the microphone that produced it. The device comprises voice data input means, frame extraction means, power calculation means, cross-correlation coefficient calculation means, speaker voice determination means, and voice data output means.

With this configuration, the specific speaker voice output device uses the voice data input means to receive, from the microphone provided for each speaker, voice data converted from the speakers' voices, and uses the frame extraction means to extract frames of a predetermined data length from each of the input voice data. The power calculation means then calculates the power of each frame output from the frame extraction means. The cross-correlation coefficient calculation means takes a frame of one of the voice data as the target frame and, for each of the other voice data, calculates cross-correlation coefficients indicating the correlation between the target frame and the frame of that other voice data as the latter's time axis is shifted, relative to the target frame's time axis, in steps of a predetermined time width.

Here, a microphone is provided for each speaker, and the voice uttered by a speaker arrives first at the microphone provided for that speaker, which is the one closest to that speaker. It arrives at each of the other microphones later, with a time difference that corresponds to the difference in distance from the speaker to the respective microphones.

Therefore, when the target frame contains a speaker's voice data input from the microphone corresponding to that speaker, the cross-correlation coefficient between the target frame and another frame becomes large when the time axis of the other frame is advanced, relative to the time axis of the target frame, by that time difference. Conversely, when the target frame is not a frame of voice data input from the microphone corresponding to that speaker, its cross-correlation coefficient with the frame of voice data input from the speaker's own microphone becomes large when the time axis of that frame is delayed, relative to the time axis of the target frame, by that time difference.
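The dependence on the direction of the shift can be illustrated with a small sketch. It is not taken from the patent: the function name `cross_corr`, the normalisation by the energy of the overlapping samples, the 3-sample delay, and the 0.5 gain are all assumptions chosen for illustration.

```python
import math
import random


def cross_corr(target, other, shift):
    """Correlation between target[t] and other[t + shift].

    shift > 0 advances the other frame's time axis (a lead coefficient);
    shift < 0 delays it (a lag coefficient). Normalising by the energy of
    the overlapping samples is an assumption, not the patent's formula.
    """
    n = len(target)
    pairs = [(target[t], other[t + shift]) for t in range(n) if 0 <= t + shift < n]
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
    return num / den if den > 0 else 0.0


# Hypothetical test signal: noise reaching microphone Mx first and
# microphone My three samples later at half the amplitude.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(400)]
delay = 3
y = [0.0] * delay + [0.5 * v for v in x[:-delay]]

lead = cross_corr(x, y, +delay)   # other frame advanced by the delay
lag = cross_corr(x, y, -delay)    # other frame delayed by the same amount
print(lead > lag)
```

With `x` as the target frame the lead coefficient dominates; calling `cross_corr(y, x, -delay)` instead, with the crosstalk frame as the target, makes the lag coefficient the large one, which corresponds to the second case described above.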

The specific speaker voice output device then uses the speaker voice determination means to determine whether the voice corresponding to the target frame's voice data is the voice data of the specific speaker. The determination is based on the frame power of each voice data calculated by the power calculation means and on two groups of the cross-correlation coefficients calculated by the cross-correlation coefficient calculation means: lead cross-correlation coefficients, obtained by shifting the time axis of the other voice data's frame earlier, relative to the target frame's time axis, in steps of the predetermined time width, and lag cross-correlation coefficients, obtained by shifting it later in the same steps.

When the power of the target frame is greater than that of the frames of the other voice data, the speaker's voice was louder at the microphone that produced the target frame than at the other microphones, so the speaker voice determination means can determine that the voice in the target frame is that of the specific speaker. In addition, by determining from the lead and lag cross-correlation coefficients whether the voice in the target frame was input before or after the voice in the other frames, the speaker voice determination means can determine whether the target frame is the voice data of the specific speaker, that is, the speaker corresponding to the microphone that produced the target frame's voice data. The voice data output means then outputs the target frames that the speaker voice determination means has determined to be the voice data of the specific speaker.

In this way, the specific speaker voice output device extracts frames from each of the voice data input from the plurality of microphones, determines for each frame of at least one of the voice data whether it is the voice data of the specific speaker, and can output the voice data of the specific speaker alone.

The specific speaker voice output device according to claim 2 is the device according to claim 1 in which the speaker voice determination means determines, for each of the other voice data, that the target frame is the voice data of the specific speaker when the difference between the sum of the lead cross-correlation coefficients and the sum of the lag cross-correlation coefficients exceeds a threshold.

Accordingly, when the difference between the sum of the lead cross-correlation coefficients and the sum of the lag cross-correlation coefficients exceeds the threshold, the specific speaker voice output device judges that the voice in the target frame reached the microphone corresponding to that speaker before it reached the microphones that produced the other voice data, and can therefore determine that the target frame is the voice data of the specific speaker.
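The claim 2 test reduces to a single comparison, sketched below under stated assumptions: the function name `is_specific_speaker`, the coefficient values, and the threshold of 0.5 are hypothetical, and the text here does not say how results for several other microphones would be combined.

```python
def is_specific_speaker(lead_coeffs, lag_coeffs, threshold):
    """Claim-2 style test against one other microphone.

    True when the sum of the lead coefficients minus the sum of the lag
    coefficients exceeds the threshold.
    """
    return sum(lead_coeffs) - sum(lag_coeffs) > threshold


# Hypothetical coefficients for shifts of 1 to 4 steps.
lead = [0.10, 0.15, 0.90, 0.20]   # peak where the other frame is advanced
lag = [0.05, 0.10, 0.05, 0.10]

print(is_specific_speaker(lead, lag, threshold=0.5))   # target mic heard the voice first
print(is_specific_speaker(lag, lead, threshold=0.5))   # roles swapped: crosstalk frame
```

Summing over all shifts, rather than locating the single peak, means the test needs no estimate of the exact delay between the microphones.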

Furthermore, the specific speaker determination program according to claim 3 causes a computer to function as voice data input means, frame extraction means, power calculation means, cross-correlation coefficient calculation means, speaker voice determination means, and voice data output means, in order to receive voice data from microphones provided for each speaker and to output, from at least one of the voice data, the voice data of the speaker corresponding to the microphone that produced it.

With this configuration, the specific speaker determination program uses the voice data input means to receive voice data from the microphone provided for each speaker, and uses the frame extraction means to extract frames of a predetermined data length from each of the voice data so input. The power calculation means then calculates the power of each frame output from the frame extraction means. The cross-correlation coefficient calculation means takes a frame of one of the voice data as the target frame and, for each of the other voice data, calculates cross-correlation coefficients indicating the correlation between the target frame and the frame of that other voice data as the latter's time axis is shifted, relative to the target frame's time axis, in steps of a predetermined time width.

Furthermore, the speaker voice determination means determines whether the target frame is the voice data of the specific speaker, that is, the speaker corresponding to the microphone that produced the target frame's voice data. The determination is based on the frame power of each voice data calculated by the power calculation means and on two groups of the cross-correlation coefficients calculated by the cross-correlation coefficient calculation means: lead cross-correlation coefficients, obtained by shifting the time axis of the other voice data's frame earlier, relative to the target frame's time axis, in steps of the predetermined time width, and lag cross-correlation coefficients, obtained by shifting it later in the same steps. The voice data output means then outputs the target frames that the speaker voice determination means has determined to be the voice data of the specific speaker.

In this way, the specific speaker determination program extracts frames from each of the voice data input from the plurality of microphones, determines for each frame of at least one of the voice data whether it is the voice data of the specific speaker, and can output the voice data of the specific speaker alone.

The specific speaker voice output device and the specific speaker determination program according to the present invention provide the following advantageous effects.

According to the invention of claim 1 or claim 3, only the voice data of the specific speaker can be output from voice data that contains crosstalk components. Therefore, when subtitles are generated by recognizing the audio of a talk show or similar program, for example, the same utterance is prevented from being recognized more than once because one speaker's voice entered several microphones. In addition, recognizing the voice data with an acoustic model matched to the specific speaker yields a high recognition rate.

Because the determination of whether the target frame is the voice data of the specific speaker is based on both the frame power and the cross-correlation coefficients, it can be made with high accuracy. Moreover, the invention does not calculate the crosstalk component contained in the voice data; it only determines, frame by frame, whether the frame is a crosstalk component or the voice of the specific speaker, and removes the crosstalk on that basis. The complex computation needed to calculate the crosstalk component is therefore unnecessary, which reduces the amount of computation and improves processing speed.

According to the invention of claim 2, whether a speaker's voice reached the microphone corresponding to the target frame or another microphone first is determined from the difference between the sum of the lead cross-correlation coefficients and the sum of the lag cross-correlation coefficients, so it is easy to determine whether the target frame is the voice data of the specific speaker.

Embodiments of the present invention are described below with reference to the drawings. Here the invention is applied to speech recognition of programs in which several speakers speak in turn, such as dialogues, and is configured as a crosstalk speech recognition device.

[Configuration of the crosstalk speech recognition device (specific speaker voice output device)]
The configuration of the crosstalk speech recognition device 1, which is an embodiment of the present invention, is described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the crosstalk speech recognition device according to the present invention. The crosstalk speech recognition device 1 receives, from microphones Mx and My provided for speaker X and speaker Y respectively, voice data x (x(t)) and y (y(t)) converted from the voices of speakers X and Y that entered those microphones, and outputs a speech recognition result for the voice of speaker X alone and a speech recognition result for the voice of speaker Y alone. Here, t denotes time measured from a predetermined starting instant on the time axis of the instants at which the sound corresponding to the voice data entered microphones Mx and My (hereinafter, the time axis). The crosstalk speech recognition device 1 comprises voice data input means 2, frame extraction means 3, frame power calculation means 4, cross-correlation coefficient calculation means 5, smoothing processing means 6, speaker voice determination means 7, an attenuator 8, storage means 9, speech recognition means 10, and speech recognition result output means 11.

The crosstalk speech recognition device 1 is connected externally to microphones Mx and My, which convert the voices of speakers X and Y into voice data x and y; to fader units FUx and FUy, which attenuate the voice data x and y from microphones Mx and My by a desired attenuation rate set by an operator such as the speaker; and to amplifiers Ax and Ay, which amplify the voice data x and y from fader units FUx and FUy by a desired gain and output the amplified data to the crosstalk speech recognition device 1. The device receives the time-series voice data x and y after A/D conversion by an A/D (analog-to-digital) converter, not shown.

Microphones Mx and My correspond to speakers X and Y respectively. Microphone Mx is placed closer to speaker X than microphone My is, and microphone My is placed closer to speaker Y than microphone Mx is. When speakers X and Y speak in turn, microphone Mx receives speaker X's voice H(XX) and speaker Y's voice H(YX) alternately, and microphone My receives speaker X's voice H(XY) and speaker Y's voice H(YY) alternately.

The voice data input means 2 receives a plurality of voice data from outside. Here, it receives voice data x and y from amplifiers Ax and Ay. The voice data input means 2 comprises a voice data input unit 2a and a voice data input unit 2b.

The voice data input unit 2a receives voice data x from amplifier Ax. Voice data x contains the voice data of speaker X's voice H(XX) (speaker X's voice data) and the voice data of speaker Y's voice H(YX) (a crosstalk component). The voice data x received here is output to the voice data frame extraction unit 3a of the frame extraction means 3.

The voice data input unit 2b receives voice data y from amplifier Ay. Voice data y contains the voice data of speaker X's voice H(XY) (a crosstalk component) and the voice data of speaker Y's voice H(YY) (speaker Y's voice data). The voice data y received here is output to the voice data frame extraction unit 3b of the frame extraction means 3.

The frame extraction means 3 extracts frames of a predetermined data length from each of the voice data x and y input from the voice data input means 2. Here, it comprises a voice data frame extraction unit 3a and a voice data frame extraction unit 3b. The frame data length may be any length for which the product of the frame's duration on the time axis and the speed of sound exceeds the distance between speaker X and speaker Y. Here, frames of 400 sample points at 16 kHz sampling (25 ms) are extracted.
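The frame-length condition can be checked with simple arithmetic, as in the sketch below. The speed of sound of 340 m/s, the helper names, and the example speaker distances are assumptions; only the 400-point frame and the 16 kHz sampling rate come from the text.

```python
SPEED_OF_SOUND_M_S = 340.0  # assumed approximate value for air at room temperature


def frame_duration_s(frame_len, sample_rate_hz):
    """Duration of a frame of frame_len samples, in seconds."""
    return frame_len / sample_rate_hz


def frame_long_enough(frame_len, sample_rate_hz, speaker_distance_m,
                      speed_m_s=SPEED_OF_SOUND_M_S):
    """True when frame duration times sound speed exceeds the speaker distance."""
    return frame_duration_s(frame_len, sample_rate_hz) * speed_m_s > speaker_distance_m


duration = frame_duration_s(400, 16_000)   # 0.025 s
reach = duration * SPEED_OF_SOUND_M_S      # about 8.5 m
print(duration, reach, frame_long_enough(400, 16_000, 2.0))
```

Under these assumptions a 25 ms frame spans roughly 8.5 m of sound travel, so it comfortably covers speakers seated a few metres apart; the delayed copy of an utterance then still falls inside the same frame.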

The voice data frame extraction unit 3a extracts frames of the predetermined data length from the voice data x input from the voice data input unit 2a. The extracted frames are output to the voice data power calculation unit 4a of the frame power calculation means 4, to the cross-correlation coefficient calculation means 5, and to the attenuator 8a.

The voice data frame extraction unit 3b extracts frames of the predetermined data length from the voice data y input from the voice data input unit 2b. The extracted frames are output to the voice data power calculation unit 4b of the frame power calculation means 4, to the cross-correlation coefficient calculation means 5, and to the attenuator 8b.

The frame power calculation means 4 calculates the power (frame power) of each frame input from the frame extraction means 3 and, based on this frame power, determines whether fader units FUx and FUy were passing the voice data input from microphones Mx and My on to amplifiers Ax and Ay. The frame power calculation means 4 comprises a voice data power calculation unit 4a, a voice data power calculation unit 4b, and an FU state determination unit 4c.

The voice data power calculation unit (power calculation means) 4a calculates the frame power of the frames extracted from voice data x by the voice data frame extraction unit 3a. Likewise, the voice data power calculation unit (power calculation means) 4b calculates the frame power of the frames extracted from voice data y by the voice data frame extraction unit 3b. The calculated frame powers are output to the FU state determination unit 4c and to the smoothing processing means 6.

Here, the voice data power calculation units 4a and 4b of the frame power calculation means 4 calculate the sum of the squared amplitudes of the points in a frame as that frame's power. The frame power P(l, x) of voice data x calculated by the voice data power calculation unit 4a and the frame power P(l, y) of voice data y calculated by the voice data power calculation unit 4b are given by equation (1). The frame extraction means 3 is taken to extract, from voice data x(t) and y(t), frames of time width N at intervals of shift width M on the time axis, and l is the frame number assigned to the frames of each of voice data x and y in time order.
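Equation (1) itself is not reproduced in this text. A form consistent with the description (a sum of squared amplitudes over a frame of width N, with frames taken every M samples) would be the following; the assumption that frame l starts at t = lM, with l counted from 0, is ours and is not confirmed by the source.

```latex
P(l, x) = \sum_{t = lM}^{lM + N - 1} x(t)^{2},
\qquad
P(l, y) = \sum_{t = lM}^{lM + N - 1} y(t)^{2}
```

With the 400-point frames mentioned earlier, N = 400; the shift width M is not specified in this part of the description.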

Based on the frame powers P(l, x) and P(l, y) of voice data x and y input from the voice data power calculation units 4a and 4b, the FU state determination unit 4c either sets the attenuation rate of the attenuator 8 (8a, 8b), described later, or outputs to the cross-correlation coefficient calculation means 5, described later, a command to calculate the cross-correlation coefficients of the frame. To do so, the FU state determination unit 4c determines from the frame powers P(l, x) and P(l, y) whether fader units FUx and FUy are ON or OFF, that is, whether they are passing the voice data x and y input from microphones Mx and My unchanged, via amplifiers Ax and Ay, to the crosstalk speech recognition device 1 or are attenuating it.

The frame power P_FU-OFF observed when the fader units FUx and FUy are OFF is sufficiently smaller than the background noise level P_sil (the sound level when the speakers X and Y are not speaking) observed when the fader units are ON. Therefore, using a threshold Th_FU that satisfies P_FU-OFF < Th_FU < P_sil, the FU state determination unit 4c can determine that the fader unit FUx or FUy is OFF when the corresponding frame power P(l, x) or P(l, y) is smaller than Th_FU, and ON when it is larger than Th_FU.

When either one of the fader units FUx and FUy is OFF, only the speaker (X or Y) corresponding to the fader unit that is ON is speaking, so the frames of the audio data x and y contain no crosstalk component. When both fader units are OFF, neither speaker X nor speaker Y is speaking, so the frames again contain no crosstalk component. In these cases there is no need to process the frame with the cross-correlation coefficient calculation means 5, the smoothing processing means 6, and the speaker voice determination means 7 described later in order to determine whether it contains a crosstalk component.

Therefore, when the FU state determination unit 4c determines that one or both of the fader units FUx and FUy are OFF, it sets the attenuation rate of the attenuator 8 (8a, 8b) to zero and does not output to the cross-correlation coefficient calculation means 5 a command to calculate the cross-correlation coefficient of the frame. The crosstalk speech recognition apparatus 1 thus skips the processing of that frame by the cross-correlation coefficient calculation means 5, the smoothing processing means 6, and the speaker voice determination means 7, which reduces the amount of computation and improves the processing speed.

When the FU state determination unit 4c determines that both fader units FUx and FUy are ON, it outputs a command to process the frame to the cross-correlation coefficient calculation means 5.
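The fader-state gating described above can be sketched as a small decision function. This is a minimal sketch, not the patented implementation: the names `fader_is_on` and `fu_action`, the return labels, and the numeric values are assumptions, and a power exactly equal to the threshold is treated as OFF because the description does not specify that case.

```python
def fader_is_on(power, th_fu):
    """A fader unit is judged ON when the frame power exceeds Th_FU."""
    # Assumption: a power exactly equal to Th_FU counts as OFF.
    return power > th_fu


def fu_action(p_x, p_y, th_fu):
    """Decide how to handle one pair of frames from their frame powers.

    Returns "correlate" when both fader units are ON (the cross-correlation
    stage must run), and "bypass" otherwise (attenuation rate set to zero and
    the cross-correlation, smoothing, and speaker determination stages skipped).
    """
    if fader_is_on(p_x, th_fu) and fader_is_on(p_y, th_fu):
        return "correlate"
    return "bypass"


# Hypothetical values: Th_FU = 1.0 lies between P_FU-OFF and P_sil.
print(fu_action(5.0, 4.0, th_fu=1.0))
print(fu_action(5.0, 0.1, th_fu=1.0))
```

Only frames that reach the "correlate" branch incur the cost of the cross-correlation, which is where the saving in computation comes from.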

In accordance with the command input from the FU state determination unit 4c, the cross-correlation coefficient calculation means 5 calculates the cross-correlation coefficient of the frames of the audio data x and y input from the audio data frame extraction units 3a and 3b of the frame extraction means 3. The calculated cross-correlation coefficient is output to the smoothing processing means 6.

A cross-correlation coefficient is obtained by shifting the time axis of one of two time-series functions by a predetermined time width and multiplying the two functions together; it takes a relatively large value when the correlation between the two functions is high and a relatively small value when the correlation is low. Here, as shown in equation (2) below, the cross-correlation coefficient calculation means 5 calculates, for each frame of the audio data x(t), the cross-correlation coefficient C(τ, l) obtained by shifting the time axis of the frame of the audio data y(t) in steps of a predetermined time width τ. Note that σx(t) and σy(t) are the standard deviations of the audio data x and y in the frame, and each frame is extracted by the frame extraction means 3 from the audio data x(t) and y(t) with time width N at intervals of shift width M on the time axis.
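As an illustration of a cross-correlation coefficient normalized by the standard deviations of the two frames, the following sketch may help. It is an assumption-laden sketch rather than a reproduction of equation (2): the function name `cross_correlation`, the mean removal, the division by N, and the sign convention (a positive τ reads y at later samples, which aligns the frames when y lags x) are all choices made here for illustration.

```python
import math


def cross_correlation(x, y, start, n, tau):
    """Normalized cross-correlation of x[start:start+n] with y shifted by tau."""
    if start + tau < 0 or start + tau + n > len(y):
        raise ValueError("shifted frame falls outside y")
    xs = x[start:start + n]
    ys = y[start + tau:start + tau + n]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in xs) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in ys) / n)
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # a constant frame carries no correlation information
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)


# Hypothetical signals: y is x delayed by two samples, as if speaker X's voice
# reached microphone My later than microphone Mx.
x = [0, 0, 0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
y = [0, 0] + x[:-2]
c_lead = cross_correlation(x, y, start=2, n=6, tau=2)
c_lag = cross_correlation(x, y, start=2, n=6, tau=-2)
print(c_lead, c_lag)
```

In this example the coefficient is larger for the shift that compensates the delay of y, which is the property the speaker determination relies on.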

The smoothing processing means 6 smooths the frame powers P(l, x) and P(l, y) input from the audio data power calculation units 4a and 4b of the frame power calculation means 4, and the cross-correlation coefficient C(τ, l). Here, as shown in equations (3) and (4), for each frame of the audio data x and y, the smoothing processing means 6 performs smoothing by calculating the averages P'(l, x) and P'(l, y) of the frame powers of a predetermined number (n_p) of frames centered on that frame, and the average C'(τ, l) of the cross-correlation coefficients of a predetermined number (n_c) of frames centered on that frame. This prevents unnecessary switching of the determination result in the speaker voice determination means 7 described later, caused by short pauses in speech such as breaths or by noise such as paper noise. The averaged frame powers P'(l, x) and P'(l, y) and the averaged cross-correlation coefficient C'(τ, l) are output to the speaker voice determination means 7.
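The smoothing step amounts to a moving average centered on each frame. A minimal sketch follows, assuming an odd window width and assuming that the window is simply truncated at the start and end of the sequence (the description does not say how the edges are handled); the name `smooth` is hypothetical.

```python
def smooth(values, width):
    """Centered moving average over `width` frames (width assumed odd)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        # Assumption: near the edges the window is truncated, not padded.
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


# Hypothetical frame powers with a one-frame burst (for example, paper noise).
powers = [1, 1, 10, 1, 1]
print(smooth(powers, width=3))
```

The same function can be applied to the sequence of C(τ, l) values for each shift τ, with n_c as the width in place of n_p.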

Based on the averaged frame powers P'(l, x) and P'(l, y) and the averaged cross-correlation coefficient C'(τ, l), the speaker voice determination means 7 determines the speaker corresponding to each frame and sets the attenuation rate of the attenuator 8 described later.

The voice of the speaker (X or Y) who is speaking is input at a relatively high volume to the microphone (Mx or My) closest to that speaker, and at a low volume to the other microphone (My or Mx). Therefore, of the frames of audio data converted at the same time, the speaker voice determination means 7 can determine that the speaker corresponding to the microphone (Mx or My) that output the frame with the larger averaged frame power P'(l, x) or P'(l, y) is the speaker who uttered the sound in that frame.

In addition, the voice of the speaker (X or Y) who is speaking reaches the microphone (Mx or My) closest to that speaker relatively early and reaches the other microphone (My or Mx) with a delay. Therefore, when the averaged cross-correlation coefficient C'(τ, l) of a frame of audio data (x or y) becomes large as the time axis of the frame of the other audio data (y or x) is shifted in the advancing direction relative to the frame being determined, the speaker voice determination means 7 can determine that the speaker (X or Y) corresponding to the microphone (Mx or My) that output the frame is the speaker who uttered the sound in that frame.

Accordingly, the speaker voice determination means 7 here determines that the speaker (X or Y) corresponding to the microphone (Mx or My) that output a frame is the speaker who uttered the sound in that frame when either of the following conditions holds. The first is that the logarithmic power ratio R(l), obtained by subtracting the logarithm of the averaged frame power of the frame of the other audio data (P'(l, y) or P'(l, x)) from the logarithm of the averaged frame power of the frame being determined (P'(l, x) or P'(l, y)), is larger than a threshold Th_R (0 < Th_R). The second is that the cross-correlation difference D(l) is larger than a threshold Th_D (0 < Th_D), where D(l) is the sum of the averaged leading cross-correlation coefficients minus the sum of the averaged lagging cross-correlation coefficients. The leading cross-correlation coefficients are those obtained by shifting the time axis of the frame of the other audio data (y or x) in the advancing direction relative to the frame being determined, and the lagging cross-correlation coefficients are those obtained by shifting it in the delaying direction. The logarithmic power ratio R_x(l) and cross-correlation difference D_x(l) for determining a frame of the audio data x, and the logarithmic power ratio R_y(l) and cross-correlation difference D_y(l) for determining a frame of the audio data y, are given by equations (5) and (6) below.
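The two quantities used for the determination can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the natural logarithm is used because the base is not specified, the frame powers are assumed positive (both fader units are ON at this stage), and positive shifts τ are taken to be the "leading" direction.

```python
import math


def log_power_ratio(p_own, p_other):
    """R(l): log of the frame's own smoothed power minus log of the other's."""
    return math.log(p_own) - math.log(p_other)


def cross_correlation_difference(c_smoothed):
    """D(l): sum of leading coefficients minus sum of lagging coefficients.

    c_smoothed maps each shift tau to the smoothed coefficient C'(tau, l).
    Assumption: tau > 0 is the leading direction and tau < 0 the lagging one.
    """
    lead = sum(c for tau, c in c_smoothed.items() if tau > 0)
    lag = sum(c for tau, c in c_smoothed.items() if tau < 0)
    return lead - lag


# Hypothetical smoothed values for one frame pair.
r_x = log_power_ratio(8.0, 2.0)  # R_x(l); R_y(l) is its negative
d_x = cross_correlation_difference({-2: 0.1, -1: 0.2, 1: 0.6, 2: 0.5})
print(r_x, d_x)
```

Because swapping the roles of x and y only flips the sign, R_y(l) = -R_x(l) and D_y(l) = -D_x(l), so one computation per frame pair serves both channels.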

When R_x(l) ≥ Th_R or D_x(l) ≥ Th_D, the speaker voice determination means 7 determines that the frame of the audio data x is audio data of the speaker X and that the frame of the audio data y is a crosstalk component. It then sets the attenuation rate of the attenuator 8a described later sufficiently small and the attenuation rate of the attenuator 8b sufficiently large.

Likewise, when R_y(l) ≥ Th_R or D_y(l) ≥ Th_D, the speaker voice determination means 7 determines that the frame of the audio data y is audio data of the speaker Y and that the frame of the audio data x is a crosstalk component. It then sets the attenuation rate of the attenuator 8b described later sufficiently small (for example, zero) and the attenuation rate of the attenuator 8a sufficiently large.

Here, when the determinations based on the logarithmic power ratio and the cross-correlation difference contradict each other, that is, when both the logarithmic power ratio R_x(l) and the cross-correlation difference D_y(l) exceed the thresholds Th_R and Th_D, or when both the logarithmic power ratio R_y(l) and the cross-correlation difference D_x(l) exceed the thresholds Th_R and Th_D, the speaker voice determination means 7 adopts the determination result of the immediately preceding frame. It also adopts the determination result of the immediately preceding frame when none of R_x(l), R_y(l), D_x(l), and D_y(l) exceeds its threshold Th_R or Th_D. This prevents the speaker from switching frequently and yields a stable detection result.

Furthermore, the speaker voice determination means 7 here sets a minimum number of continuous frames, which is the minimum number of frames that are successively determined to be audio data of the same speaker. After the determination result changes, the same determination result is maintained for at least this minimum number of frames, which prevents the speaker from switching frequently and yields a stable detection result.
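The determination rules, including the fallback to the previous result and the minimum number of continuous frames, can be sketched as below. This is a sketch under assumptions, not the claimed implementation: the names `decide` and `track`, the initial decision, and the example thresholds are hypothetical, and R_y(l) and D_y(l) are taken as the negatives of R_x(l) and D_x(l).

```python
def decide(r_x, d_x, th_r, th_d, previous):
    """Return "X", "Y", or the previous decision for one frame pair."""
    r_y, d_y = -r_x, -d_x
    x_vote = r_x >= th_r or d_x >= th_d
    y_vote = r_y >= th_r or d_y >= th_d
    if x_vote and not y_vote:
        return "X"
    if y_vote and not x_vote:
        return "Y"
    # Contradictory evidence or no evidence: keep the previous decision.
    return previous


def track(frames, th_r, th_d, min_frames, initial):
    """Apply decide() frame by frame, holding each decision for min_frames."""
    current, held = initial, min_frames  # a switch is allowed at the first frame
    out = []
    for r_x, d_x in frames:
        candidate = decide(r_x, d_x, th_r, th_d, current)
        if candidate != current and held >= min_frames:
            current, held = candidate, 0
        held += 1
        out.append(current)
    return out


# Hypothetical (R_x, D_x) pairs: X speaks for one frame, then Y takes over.
frames = [(2.0, 0.0), (-2.0, 0.0), (-2.0, 0.0), (-2.0, 0.0), (0.0, 0.0)]
result = track(frames, th_r=1.0, th_d=1.0, min_frames=3, initial="Y")
print(result)
```

In the example the switch to Y is deferred until the decision for X has been held for three frames, and the final frame, which gives no evidence either way, inherits the preceding decision.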

With reference to FIG. 2, the method by which the speaker voice determination means 7 determines, based on the logarithmic power ratio R_x(l) and the cross-correlation difference D_x(l), whether a frame is audio data of the speaker or a crosstalk component will now be described. FIG. 2 is an explanatory diagram of the method of determining the speaker by the speaker voice determination means: (a) is a graph showing the speakers' utterance sections and the change of the logarithmic power ratio over time, (b) is a graph showing the change of the cross-correlation difference over time, and (c) is a diagram showing the speaker determination results of the speaker voice determination means.

Suppose a male speaker (speaker X) and a female speaker (speaker Y) speak alternately, and the male speaker's voice is loud while the female speaker's voice is quiet. In that case the amplifier Ay amplifies the sound input from the microphone My by a larger gain, so, as shown in FIG. 2(a), the logarithmic power ratio R_x(l) in the male speaker's utterance sections can be relatively small compared with the logarithmic power ratio R_y(l) (R_y(l) = -R_x(l)) in the female speaker's utterance sections. If the speaker voice determination means 7 then determined the speaker from the logarithmic power ratio R(l) alone, R_x(l) would not exceed the threshold Th_R in the male speaker's utterance sections (for example, between 5 and 12 seconds), and erroneous determinations would occur.

As shown in FIG. 2(b), in the male speaker's utterance sections, where the logarithmic power ratio R_x(l) was insufficient, the cross-correlation difference D_x(l) exceeded the threshold Th_D, and in the female speaker's utterance sections the cross-correlation difference D_y(l) (D_y(l) = -D_x(l)) exceeded the threshold Th_D. By determining the speaker based not only on the logarithmic power ratio R(l) but also on the cross-correlation difference D(l), speaker determination results close to the actual utterance sections of the male and female speakers can be obtained, as shown in FIG. 2(c).

Returning to FIG. 1, the description continues. The attenuator (audio data output means) 8 attenuates the frames of the audio data x and y input from the frame extraction means 3 at the attenuation rate set by the FU state determination unit 4c of the frame power calculation means 4 or by the speaker voice determination means 7. The attenuated audio data is output to the speech recognition means 10. Here, the attenuator 8a attenuates the frames of the audio data x input from the audio data frame extraction unit 3a and outputs them to the speech recognition means 10a, and the attenuator 8b attenuates the frames of the audio data y input from the audio data frame extraction unit 3b and outputs them to the speech recognition means 10b. As a result, the attenuator 8a can output only the audio data of the speaker X to the speech recognition means 10a, and the attenuator 8b can output only the audio data of the speaker Y to the speech recognition means 10b.
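How the attenuators might act on a frame pair once a speaker has been determined can be sketched as follows. The description does not define the attenuation rate numerically, so this sketch assumes a rate between 0 and 1 applied as a gain of (1 - rate), with 0 passing the frame unchanged and 1 muting it; the names `attenuate` and `rates_for` are hypothetical.

```python
def attenuate(frame, rate):
    """Scale a frame by (1 - rate): rate 0 passes it unchanged, rate 1 mutes it."""
    return [s * (1.0 - rate) for s in frame]


def rates_for(decision):
    """Return (rate for attenuator 8a, rate for attenuator 8b) for a decision."""
    # Assumption: "sufficiently small" is modeled as 0.0, "sufficiently large" as 1.0.
    return (0.0, 1.0) if decision == "X" else (1.0, 0.0)


# Hypothetical frame pair in which speaker X is talking and y holds crosstalk.
frame_x = [0.5, -0.4]
frame_y = [0.1, -0.1]
rate_a, rate_b = rates_for("X")
out_x = attenuate(frame_x, rate_a)
out_y = attenuate(frame_y, rate_b)
print(out_x, out_y)
```

Under this model, the frame judged to be crosstalk is silenced before it reaches the corresponding speech recognition means, while the speaker's own frame passes through unchanged.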

The storage means 9 stores the acoustic models required for speech recognition by the speech recognition means 10 described later, and is a general storage device such as a semiconductor memory or a hard disk. Here, the storage means 9a stores an X acoustic model, which is an acoustic model corresponding to the speaker X, and the storage means 9b stores a Y acoustic model, which is an acoustic model corresponding to the speaker Y.

The speech recognition means 10 performs speech recognition on the audio data input from the attenuator 8 based on the X acoustic model or the Y acoustic model stored in the storage means 9. Here, the speech recognition means 10a recognizes the audio data input from the attenuator 8a based on the X acoustic model stored in the storage means 9a, and the speech recognition means 10b recognizes the audio data input from the attenuator 8b based on the Y acoustic model stored in the storage means 9b. The speaker X speech recognition result produced by the speech recognition means 10a is output to the speech recognition result output unit 11a, and the speaker Y speech recognition result produced by the speech recognition means 10b is output to the speech recognition result output unit 11b.

In this way, the speech recognition means 10a recognizes the audio data that the speaker voice determination means 7 has determined to be audio data of the speaker X based on the X acoustic model corresponding to the speaker X, and the speech recognition means 10b recognizes the audio data determined to be audio data of the speaker Y based on the Y acoustic model corresponding to the speaker Y. A higher speech recognition rate can therefore be obtained than when recognition is based on a speaker-independent acoustic model, or when audio data containing a crosstalk component is recognized based on an acoustic model corresponding to a specific speaker (the X acoustic model or the Y acoustic model).

The speech recognition result output means 11 outputs the speech recognition results input from the speech recognition means 10. Here, the speech recognition result output means 11 includes a speech recognition result output unit 11a and a speech recognition result output unit 11b.

The speech recognition result output unit 11a outputs the speaker X speech recognition result input from the speech recognition means 10a to the outside. The speech recognition result output unit 11b outputs the speaker Y speech recognition result input from the speech recognition means 10b to the outside.

With the crosstalk speech recognition apparatus 1 configured as described above, the apparatus can attenuate the crosstalk components contained in the audio data x and y input from the microphones Mx and My provided for the respective speakers X and Y, extract only the audio data of the speaker X from the audio data x input from the microphone Mx, and extract only the audio data of the speaker Y from the audio data y input from the microphone My. By recognizing each set of audio data based on the acoustic model corresponding to the respective speaker, speech recognition can be performed with a high recognition rate.

Moreover, the crosstalk speech recognition apparatus 1 of the present invention does not calculate the crosstalk component and remove it from the input audio data. Instead, it determines for each frame whether the frame is a crosstalk component and removes the crosstalk component by attenuating the frames so determined. The apparatus therefore does not need to perform the complicated computation of calculating the crosstalk component, which reduces the amount of computation and improves the processing speed.

The crosstalk speech recognition apparatus 1 can also be realized by implementing each means as a functional program on a computer, and the functional programs can be combined and run as a specific speaker determination program.

Although two sets of audio data x and y are input here from the microphones Mx and My corresponding to the two speakers X and Y, and the attenuator 8a outputs the audio data of the speaker X and the attenuator 8b outputs the audio data of the speaker Y to the speech recognition means 10a and 10b, the crosstalk speech recognition apparatus 1 of the present invention may instead attenuate the crosstalk component in only one of the audio data (x or y) and output the audio data of only one speaker.

Furthermore, the crosstalk speech recognition apparatus 1 of the present invention may receive three or more sets of audio data from microphones corresponding to three or more speakers. In this case, the cross-correlation coefficient calculation means 5 calculates the cross-correlation coefficient between the audio data input from the microphone corresponding to the specific speaker and each of the other audio data, and the speaker voice determination means 7 calculates a cross-correlation difference from the cross-correlation coefficient with each of the other audio data and can determine that the frame is audio data of the specific speaker when all of the cross-correlation differences exceed the threshold Th_D.
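For the case of three or more speakers, the "all cross-correlation differences exceed the threshold" rule can be sketched in a few lines. The function name `is_target_speaker` and the example values are assumptions; each element of `differences` stands for the cross-correlation difference between the specific speaker's channel and one other channel.

```python
def is_target_speaker(differences, th_d):
    """True when every pairwise cross-correlation difference exceeds Th_D."""
    if not differences:
        raise ValueError("at least one other channel is required")
    return all(d > th_d for d in differences)


# Hypothetical differences of one channel against two other channels.
print(is_target_speaker([0.8, 0.6], th_d=0.5))
print(is_target_speaker([0.8, 0.2], th_d=0.5))
```

With k channels this needs k - 1 cross-correlation differences per frame for the specific speaker, so the cost grows linearly with the number of other microphones.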

Also, although frames determined to be crosstalk components by the speaker voice determination means 7 are attenuated by the attenuator 8 here, the crosstalk speech recognition apparatus 1 may, for example, include in place of the attenuator 8 a switch means (not shown) that switches the output to one of the frames of the audio data x and y input from the frame extraction means 3, and this switch means may switch so as to output the frame determined by the speaker voice determination means 7 to be audio data of a speaker (X or Y).

[Operation of the crosstalk speech recognition apparatus]
Next, with reference to FIGS. 3 and 4 (and FIG. 1 as appropriate), the operation in which the crosstalk speech recognition apparatus 1 of the present invention receives the audio data converted by the microphones Mx and My, removes the crosstalk components from the audio data, and performs speech recognition on the audio data of each of the speakers X and Y will be described. FIG. 3 is a flowchart showing the operation of the crosstalk speech recognition apparatus of the present invention. FIG. 4 is a flowchart showing the operation (speaker determination and attenuation rate setting operation) in which the crosstalk speech recognition apparatus of the present invention determines, for each frame, whether the frame is audio data of the speaker X or Y or a crosstalk component, and sets the attenuation rate for attenuating the crosstalk component.

The crosstalk speech recognition apparatus 1 receives the audio data x converted by the microphone Mx through the audio data input unit 2a of the audio data input means 2, and the audio data y converted by the microphone My through the audio data input unit 2b (step S11; audio data input step). The apparatus then extracts frames from each of the audio data x and y input in step S11 using the frame extraction means 3 (step S12; frame extraction step).

Next, through the speaker determination and attenuation rate setting operation described later, the crosstalk speech recognition apparatus 1 uses the frame power calculation means 4, the cross-correlation coefficient calculation means 5, the smoothing processing means 6, and the speaker voice determination means 7 to determine, for each frame extracted in step S12, whether the frame is audio data of the speaker X or Y corresponding to the microphone Mx or My that output it or a crosstalk component, and sets the attenuation rates of the attenuators 8a and 8b so as to attenuate the crosstalk component (step S13).

The crosstalk speech recognition apparatus 1 then uses the attenuator 8a to attenuate each frame of the audio data x at the attenuation rate set in step S13 and outputs the audio data of the speaker X to the speech recognition means 10a, and uses the attenuator 8b to attenuate each frame of the audio data y at the attenuation rate set in step S13 and outputs the audio data of the speaker Y to the speech recognition means 10b (step S14; audio data output step).

Next, the crosstalk speech recognition apparatus 1 uses the speech recognition means 10a and 10b to recognize the audio data whose crosstalk components were attenuated in step S14, based on the X acoustic model and the Y acoustic model stored in the storage means 9a and 9b (step S15). The apparatus then outputs the speaker X speech recognition result produced by the speech recognition means 10a in step S15 through the speech recognition result output unit 11a of the speech recognition result output means 11, outputs the speaker Y speech recognition result produced by the speech recognition means 10b in step S15 through the speech recognition result output unit 11b (step S16), and ends the operation.

(Speaker determination and attenuation rate setting operation)
Next, with reference to FIG. 4 (and FIG. 1 as appropriate), the speaker determination and attenuation rate setting operation (step S13 in FIG. 3) will be described, in which the crosstalk speech recognition apparatus 1 determines for each frame of the audio data x and y whether the frame is audio data of the speaker X or Y or a crosstalk component, and sets the attenuation rates of the attenuators 8a and 8b so as to attenuate the crosstalk component. The description here covers the operation for one pair of frames of the audio data x and y taken from the same section on the time axis.

First, the crosstalk speech recognition apparatus 1 uses the audio data power calculation units 4a and 4b of the frame power calculation means 4 to calculate the frame powers P(l, x) and P(l, y) of the frames (frame number l) of the audio data x and y extracted in step S12 of FIG. 3 (step S31; power calculation step).

Next, the crosstalk speech recognition apparatus 1 uses the FU state determination unit 4c to determine whether the fader unit FUx is ON based on the frame power P(l, x) calculated in step S31 (step S32). The FU state determination unit 4c determines that the fader unit FUx is ON when the frame power P(l, x) is larger than the threshold Th_FU.

If the fader unit FUx is ON (Yes in step S32), the crosstalk speech recognition apparatus 1 uses the FU state determination unit 4c to determine whether the fader unit FUy is ON (step S33). The FU state determination unit 4c determines that the fader unit FUy is ON when the frame power P(l, y) calculated in step S31 is greater than the threshold Th_FU.
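Steps S31 to S33 can be sketched as follows. This is only an illustrative sketch: this passage does not give the formula for the frame power, so the mean of the squared samples is assumed here, and the function names, the sample values, and the value of `TH_FU` are hypothetical.

```python
def frame_power(frame):
    """Assumed power definition: mean of the squared samples in one frame."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in frame) / len(frame)


def fader_is_on(power, th_fu):
    """Steps S32/S33: the fader unit counts as ON when the power exceeds Th_FU."""
    return power > th_fu


TH_FU = 1e-4                            # hypothetical threshold value
frame_x = [0.1, -0.2, 0.15, -0.05]      # hypothetical samples from microphone Mx
frame_y = [0.0, 0.001, -0.001, 0.0]     # hypothetical samples from microphone My

p_x = frame_power(frame_x)              # P(l, x)
p_y = frame_power(frame_y)              # P(l, y)

# Only when both fader units are ON does processing continue to step S34.
both_on = fader_is_on(p_x, TH_FU) and fader_is_on(p_y, TH_FU)
```

With these sample values the second channel stays below the threshold, so the frame pair would take the branch to step S47 rather than step S34.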

If the fader unit FUy is also ON (Yes in step S33), the crosstalk speech recognition apparatus 1 uses the cross-correlation coefficient calculation means 5 to calculate the cross-correlation coefficients C(τ, l) of the frames of the speech data x and y extracted in step S12 of FIG. 3, with the time axis of one of the frames shifted in steps of a predetermined time width τ (step S34; cross-correlation coefficient calculation step).
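The calculation in step S34 might look like the sketch below. It assumes a normalized cross-correlation over the overlapping samples and an integer shift measured in samples; neither the normalization nor the sign convention of the shift is stated in this passage, and the function names and test signal are hypothetical.

```python
import math


def cross_correlation(x, y, lag):
    """Normalized correlation between x[n] and y[n + lag] over the overlap."""
    pairs = [(x[i], y[i + lag]) for i in range(len(x)) if 0 <= i + lag < len(y)]
    num = sum(a * b for a, b in pairs)
    energy_x = sum(a * a for a, _ in pairs)
    energy_y = sum(b * b for _, b in pairs)
    denom = math.sqrt(energy_x * energy_y)
    return num / denom if denom > 0 else 0.0


def correlation_by_lag(x, y, max_lag):
    """C(tau, l) for tau = -max_lag .. +max_lag, returned as {lag: coefficient}."""
    return {lag: cross_correlation(x, y, lag) for lag in range(-max_lag, max_lag + 1)}


# Hypothetical frames: y is x delayed by one sample, as if speaker X's voice
# reached microphone My slightly later than microphone Mx.
x = [0, 1, 0, -1, 0, 1, 0, -1]
y = [0, 0, 1, 0, -1, 0, 1, 0]

c = correlation_by_lag(x, y, max_lag=2)
best_lag = max(c, key=c.get)   # lag with the largest coefficient
```

Under this sign convention, a peak at a positive lag means the second channel trails the first, which is the cue the later steps use to tell direct speech from crosstalk.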

The crosstalk speech recognition apparatus 1 then uses the smoothing means 6 to smooth the frame powers P(l, x) and P(l, y) calculated in step S31 and the cross-correlation coefficients C(τ, l) calculated in step S34 (step S35). Here, the smoothing means 6 smooths the frame powers by calculating the averages P'(l, x) and P'(l, y) of the frame powers P(l, x) and P(l, y) over a predetermined number n_p of frames, and smooths the cross-correlation coefficients by calculating the average C'(τ, l) of the cross-correlation coefficients C(τ, l) over a predetermined number n_c of frames.
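The averaging in step S35 can be illustrated with a simple moving average. The sketch assumes that the window covers the most recent frames (the passage only says that a predetermined number of frames is averaged); the class name and the sample values are hypothetical.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent n values (assumed trailing window)."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("window length must be at least 1")
        self.window = deque(maxlen=n)   # oldest value drops out automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


# Smoothing the frame power of one channel with n_p = 3 (hypothetical value).
smoother = MovingAverage(3)
raw_powers = [1.0, 2.0, 6.0, 4.0]                         # P(l, x) for four frames
smoothed = [smoother.update(p) for p in raw_powers]       # P'(l, x)
```

For the cross-correlation coefficients, one such smoother per shift τ with window length n_c would produce C'(τ, l) in the same way.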

Furthermore, the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to calculate the log power ratios R_x(l) and R_y(l), each of which is the difference between the logarithms of the average frame powers P'(l, x) and P'(l, y) of the speech data x and y calculated in step S35, and the cross-correlation differences D_x(l) and D_y(l), each of which is the difference between the sum of the averaged lead cross-correlation coefficients and the sum of the averaged delay cross-correlation coefficients (step S36).
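A minimal sketch of step S36 for the two-channel case follows. Several details are assumptions rather than statements from this passage: the base-10 logarithm, the small floor that avoids log(0), the convention that positive shifts are the "lead" side for channel x, and the symmetry R_y = -R_x and D_y = -D_x. All names and sample values are hypothetical.

```python
import math


def log_power_ratios(p_x_avg, p_y_avg, floor=1e-12):
    """R_x = log P'(l, x) - log P'(l, y); R_y is assumed to be its negative."""
    r_x = math.log10(max(p_x_avg, floor)) - math.log10(max(p_y_avg, floor))
    return r_x, -r_x


def correlation_differences(c_avg):
    """D_x = (sum of lead coefficients) - (sum of delay coefficients).

    c_avg maps each shift to its averaged coefficient C'(tau, l). Positive
    shifts are treated as the lead side for channel x (an assumption).
    """
    lead = sum(v for lag, v in c_avg.items() if lag > 0)
    delay = sum(v for lag, v in c_avg.items() if lag < 0)
    d_x = lead - delay
    return d_x, -d_x


r_x, r_y = log_power_ratios(0.01, 0.0001)
d_x, d_y = correlation_differences({-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.8, 2: 0.4})
```

Here channel x is 100 times more powerful than channel y, so R_x comes out near 2, and the correlation mass sits on the positive shifts, so D_x is positive as well.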

The crosstalk speech recognition apparatus 1 then uses the speaker speech determination means 7 to determine whether the log power ratio R_x(l) calculated in step S36 is greater than or equal to the threshold Th_R, or the cross-correlation difference D_x(l) is greater than or equal to the threshold Th_D (step S37). If the log power ratio R_x(l) is greater than or equal to Th_R, or the cross-correlation difference D_x(l) is greater than or equal to Th_D (Yes in step S37), the apparatus uses the speaker speech determination means 7 to determine whether the log power ratio R_y(l) calculated in step S36 is greater than or equal to Th_R, or the cross-correlation difference D_y(l) is greater than or equal to Th_D (step S38).

If the log power ratio R_y(l) is greater than or equal to the threshold Th_R, or the cross-correlation difference D_y(l) is greater than or equal to the threshold Th_D (Yes in step S38), the process proceeds directly to step S46. If the log power ratio R_y(l) is less than Th_R and the cross-correlation difference D_y(l) is less than Th_D (No in step S38), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to determine that the frame of the speech data x is speech data of speaker X, and checks whether this determination result is the same as the speaker determination result for the immediately preceding frame (step S39). If it is not the same (No in step S39), the apparatus checks whether frames with the same determination result have continued for more than the minimum number of sustained frames up to the immediately preceding frame (step S40).

If the minimum number of sustained frames has not been exceeded (No in step S40), the process proceeds to step S46. If the speaker determination result in step S38 is the same as that of the immediately preceding frame (Yes in step S39), or if the same determination result has continued for more than the minimum number of sustained frames (Yes in step S40), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to set the attenuation rate of the attenuator 8a, that is, the attenuation rate of the speech data x, to zero and the attenuation rate of the attenuator 8b, that is, the attenuation rate of the speech data y, to a sufficiently large value (step S41), and ends the operation.

If the log power ratio R_x(l) is less than the threshold Th_R and the cross-correlation difference D_x(l) is less than the threshold Th_D (No in step S37), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to determine whether the log power ratio R_y(l) calculated in step S36 is greater than or equal to Th_R, or the cross-correlation difference D_y(l) is greater than or equal to Th_D (step S42).

If the log power ratio R_y(l) is greater than or equal to the threshold Th_R, or the cross-correlation difference D_y(l) is greater than or equal to the threshold Th_D (Yes in step S42), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to determine that the frame of the speech data y is speech data of speaker Y, and checks whether this determination result is the same as the speaker determination result for the immediately preceding frame (step S43). If it is not the same (No in step S43), the apparatus checks whether frames with the same determination result have continued for more than the minimum number of sustained frames up to the immediately preceding frame (step S44).

If the minimum number of sustained frames has not been exceeded (No in step S44), the process proceeds to step S46. If the speaker determination result in step S42 is the same as that of the immediately preceding frame (Yes in step S43), or if the same determination result has continued for more than the minimum number of sustained frames (Yes in step S44), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to set the attenuation rate of the attenuator 8a, that is, the attenuation rate of the speech data x, to a sufficiently large value and the attenuation rate of the attenuator 8b, that is, the attenuation rate of the speech data y, to zero (step S45), and ends the operation.

If the log power ratio R_y(l) is less than the threshold Th_R and the cross-correlation difference D_y(l) is less than the threshold Th_D (No in step S42), the crosstalk speech recognition apparatus 1 uses the speaker speech determination means 7 to set the attenuation rates of the attenuators 8a and 8b, that is, the attenuation rates of the speech data x and y, to the same values as for the immediately preceding frame, based on the speaker determination result for that frame (step S46), and ends the operation.

On the other hand, if the FU state determination unit 4c determines, based on the frame powers P(l, x) and P(l, y) calculated in step S31, that the fader unit FUx is not ON (No in step S32) or that the fader unit FUy is not ON (No in step S33), the crosstalk speech recognition apparatus 1 uses the FU state determination unit 4c to set the attenuation rates of the attenuators 8a and 8b, that is, the attenuation rates of both the speech data x and y, to zero (step S47), and ends the operation.
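The branching of steps S32 to S47 can be summarized in a short sketch. It follows the flow described above under several labeled assumptions: the "sufficiently large" attenuation is represented by a hypothetical constant, the first decision after start-up is accepted immediately, a frame that falls through to step S46 is counted toward the duration of the previous decision, and the stored decision is cleared in step S47. The class, function, and parameter names are all hypothetical.

```python
from dataclasses import dataclass

LARGE_ATTENUATION = 60.0   # hypothetical stand-in for "sufficiently large"


@dataclass
class DecisionState:
    speaker: str = ""      # decision carried over from the previous frame
    run_length: int = 0    # frames for which that decision has lasted
    atten_x: float = 0.0   # attenuation rate of attenuator 8a
    atten_y: float = 0.0   # attenuation rate of attenuator 8b


def set_attenuation(state, fu_x_on, fu_y_on, r_x, r_y, d_x, d_y,
                    th_r, th_d, min_frames):
    """Update state for one frame pair and return (atten_x, atten_y)."""
    if not (fu_x_on and fu_y_on):                  # S32 or S33: No -> S47
        state.speaker, state.run_length = "", 0    # assumed reset
        state.atten_x = state.atten_y = 0.0
        return state.atten_x, state.atten_y

    x_active = r_x >= th_r or d_x >= th_d          # S37
    y_active = r_y >= th_r or d_y >= th_d          # S38 / S42
    candidate = ""
    if x_active != y_active:                       # exactly one channel qualifies
        candidate = "X" if x_active else "Y"

    same = candidate == state.speaker              # S39 / S43
    held = state.speaker == "" or state.run_length > min_frames   # S40 / S44
    if candidate and (same or held):               # S41 / S45
        state.run_length = state.run_length + 1 if same else 1
        state.speaker = candidate
        state.atten_x = 0.0 if candidate == "X" else LARGE_ATTENUATION
        state.atten_y = 0.0 if candidate == "Y" else LARGE_ATTENUATION
    else:                                          # S46: keep previous values
        state.run_length += 1
    return state.atten_x, state.atten_y


state = DecisionState()
x_talks = dict(r_x=2.0, r_y=-2.0, d_x=0.9, d_y=-0.9)
y_talks = dict(r_x=-2.0, r_y=2.0, d_x=-0.9, d_y=0.9)
common = dict(th_r=1.0, th_d=0.5, min_frames=2)

results = [set_attenuation(state, True, True, **x_talks, **common)]
results += [set_attenuation(state, True, True, **y_talks, **common) for _ in range(3)]
results.append(set_attenuation(state, False, True, **y_talks, **common))
```

In this run the switch from speaker X to speaker Y is delayed until the X decision has lasted more than `min_frames` frames, which is the hold behaviour that steps S40 and S44 provide against brief misdetections.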

Through the above operation, the crosstalk speech recognition apparatus 1 determines whether each frame of the speech data is a crosstalk component. If the frame is a crosstalk component, the apparatus sets the attenuation rate of the attenuator (8a or 8b) used when outputting that frame to a sufficiently large value; if the frame is not a crosstalk component, it sets the attenuation rate of the attenuator (8a or 8b) used when outputting that frame to zero.

FIG. 1 is a block diagram showing the configuration of the crosstalk speech recognition apparatus according to the present invention.
FIG. 2 is an explanatory diagram for describing how the speaker speech determination means determines the speaker: (a) is a graph showing the speakers' utterance sections and the change in the log power ratio over time, (b) is a graph showing the change in the cross-correlation ratio over time, and (c) is a diagram showing the speaker determination results produced by the speaker speech determination means.
FIG. 3 is a flowchart showing the operation of the crosstalk speech recognition apparatus according to the present invention.
FIG. 4 is a flowchart showing the operation in which the crosstalk speech recognition apparatus according to the present invention determines, for each frame, whether the frame is speech data of speaker X or Y or a crosstalk component, and sets the attenuation rate for attenuating the crosstalk component (speaker determination / attenuation rate setting operation).

Explanation of symbols

1 Crosstalk speech recognition apparatus (specific speaker voice output device)
2 Speech data input means
3 Frame extraction means
4 Power calculation means
4a Speech data power calculation unit (power calculation means)
4b Speech data power calculation unit (power calculation means)
5 Cross-correlation coefficient calculation means
6 Smoothing means
7 Speaker speech determination means
8a, 8b Attenuators (voice data output means)
9a, 9b Storage means
10a, 10b Speech recognition means
11 Speech recognition result output means
Mx, My Microphones
FUx, FUy Fader units
Ax, Ay Amplifiers

Claims

A specific speaker voice output device that receives voice data from microphones provided for the respective speakers and outputs, from at least one of the voice data, the voice data of the speaker corresponding to the microphone that output that voice data, the device comprising:
voice data input means for inputting the voice data from the microphones;
frame extraction means for extracting frames of a predetermined data length from each of the voice data input from the voice data input means;
power calculating means for calculating the magnitude of the power of each frame output from the frame extraction means;
cross-correlation coefficient calculating means for calculating, for each of the other voice data, cross-correlation coefficients indicating the correlation between a target frame, which is a frame of one of the plurality of voice data extracted by the frame extraction means, and a frame of the other voice data whose time axis is shifted in steps of a predetermined time width relative to the time axis of the target frame;
speaker voice determination means for determining whether the target frame is voice data of a specific speaker, who is the speaker corresponding to the microphone that output the voice data of the target frame, based on the magnitude of the power of the frame of each voice data calculated by the power calculating means and, among the cross-correlation coefficients calculated by the cross-correlation coefficient calculating means, lead cross-correlation coefficients, which are cross-correlation coefficients obtained by shifting the time axis of the frame of the other voice data in the advancing direction in steps of the predetermined time width relative to the time axis of the target frame, and delay cross-correlation coefficients, which are cross-correlation coefficients obtained by shifting the time axis of the frame of the other voice data in the delaying direction in steps of the predetermined time width; and
voice data output means for outputting the target frame determined by the speaker voice determination means to be voice data of the specific speaker.

The specific speaker voice output device according to claim 1, wherein the speaker voice determination means determines that the target frame is voice data of the specific speaker when, for each of the other voice data, the difference between the sum of the lead cross-correlation coefficients and the sum of the delay cross-correlation coefficients exceeds a threshold.

A specific speaker determination program which, in order to receive voice data from microphones provided for the respective speakers and output, from at least one of the voice data, the voice data of the speaker corresponding to the microphone that output that voice data, functions as:
voice data input means for inputting the voice data from the microphones;
frame extraction means for extracting frames of a predetermined data length from each of the voice data input from the voice data input means;
power calculating means for calculating the magnitude of the power of each frame output from the frame extraction means;
cross-correlation coefficient calculating means for calculating, for each of the other voice data, cross-correlation coefficients indicating the correlation between a target frame, which is a frame of one of the plurality of voice data extracted by the frame extraction means, and a frame of the other voice data whose time axis is shifted in steps of a predetermined time width relative to the time axis of the target frame;
speaker voice determination means for determining whether the target frame is voice data of a specific speaker, who is the speaker corresponding to the microphone that output the voice data of the target frame, based on the magnitude of the power of the frame of each voice data calculated by the power calculating means and, among the cross-correlation coefficients calculated by the cross-correlation coefficient calculating means, lead cross-correlation coefficients, which are cross-correlation coefficients obtained by shifting the time axis of the frame of the other voice data in the advancing direction in steps of the predetermined time width relative to the time axis of the target frame, and delay cross-correlation coefficients, which are cross-correlation coefficients obtained by shifting the time axis of the frame of the other voice data in the delaying direction in steps of the predetermined time width; and
voice data output means for outputting the target frame determined by the speaker voice determination means to be voice data of the specific speaker.