JP2018084748A

JP2018084748A - Voice section detection device, voice section detection method and program

Info

Publication number: JP2018084748A
Application number: JP2016228953A
Authority: JP
Inventors: Takaaki Fukutomi (福冨 隆朗); Manabu Okamoto (岡本 学); Kiyoaki Matsui (松井 清彰)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2018-05-31
Anticipated expiration: 2036-11-25
Also published as: JP6618885B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice section detection technology that can detect a voice section of a target speaker even in a noisy environment.
SOLUTION: A voice section detection device comprises: a likelihood calculation unit 110 that calculates a voice likelihood L_{speech_i,j} and a non-voice likelihood L_{noise_i,j} (j is a frame number) for each frame of voice data picked up by a microphone i; a target speaker voice arrival direction estimation unit 120 that estimates a target speaker voice arrival direction θ_est,j from the voice data; a weight calculation unit 130 that calculates a weight w_i,j of the microphone i from θ_est,j and from a direction θ_i and a distance r_i representing the position of the microphone i; a target speaker voice arrival direction time change calculation unit 135 that calculates a target speaker voice arrival direction time change Δθ_j from θ_est,j; a weighted likelihood ratio calculation unit 140 that calculates a weighted likelihood ratio L_j from L_{speech_i,j}, L_{noise_i,j}, w_i,j and Δθ_j; and a voice section determination unit 150 that generates a voice section determination result j from L_j.
SELECTED DRAWING: Figure 2

Description

The present invention relates to voice section detection technology, and more particularly to a technique for detecting the voice sections uttered by a target speaker using voice signals picked up by a plurality of microphones.

In speech recognition, voice section detection is used to cut out only the sections in which a speaker is speaking (hereinafter, voice sections) before recognition is performed. Cutting out only the voice sections from the speech data to be recognized, and removing the noise sections in which the speaker is not speaking (hereinafter, noise sections are also called non-voice sections), allows speech to be recognized accurately (Non-Patent Document 1). One voice section detection method calculates, for each frame of the input speech data, a speech likelihood (how likely the frame is to contain speech) and a non-speech likelihood (how likely the frame is to contain no speech), and determines the voice sections from them.

Speech recognition is applied to communication robots (robots that can hold a conversation), digital signage devices, and the like. With such robots and devices, the speaker can be expected to stand more or less in front, so speech recognition can be performed after emphasizing the speech data picked up by a specific microphone (specifically, the microphone located at the front of the robot or device). This improves recognition accuracy.

Non-Patent Document 1: Masakiyo Fujimoto (藤本雅清), "Fundamentals and recent research trends of voice activity detection" (in Japanese), IEICE Technical Report, SP2010-23, pp. 7-12, 2010.

However, noise is present in the environments where communication robots and similar devices are actually used. In a noisy environment, the noise prevents voice sections from being detected accurately, and speech recognition accuracy falls as a result.

One way to address this problem is noise suppression, which can reduce the influence of noise to some extent. However, it is difficult to reduce the influence of interfering noise, namely the voices of speakers other than the target speaker whose speech is to be recognized, so sufficient performance cannot be achieved in crowds and similar settings.

An object of the present invention is therefore to provide a voice section detection technique that can detect the voice sections of a target speaker even in a noisy environment.

One aspect of the present invention is as follows. Let the microphone installed facing the direction in which the target speaker is expected to speak (hereinafter, the front direction) be microphone 0, and let the other M microphones (M≥1) be microphone 1, microphone 2, …, microphone M. Let the front direction as seen from microphone 0 be the 0 degree direction; let θ_i (0≤θ_i≤2π, with θ₀=0) be the direction of each of the M+1 microphones i (i=0, 1, 2,…, M), measured about microphone 0 from the 0 degree direction; and let r_i (r_i≥0, r₀=0) be the distance between microphone 0 and microphone i. The aspect includes: a likelihood calculation unit that divides speech data x_i (i=0, 1, 2,…, M), picked up by microphone i and containing the target speaker's voice, into frames and calculates, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number); a target speaker voice arrival direction estimation unit that estimates, for each frame, from the speech data x_i (i=0, 1, 2,…, M), a target speaker voice arrival direction θ_est,j (0≤θ_est,j≤2π), which is the direction of the target speaker measured about microphone 0 from the 0 degree direction; a weight calculation unit that calculates, for each frame, a weight w_i,j of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_est,j and from the direction θ_i and distance r_i representing the position of microphone i; a target speaker voice arrival direction time change calculation unit that calculates, for each frame, from the target speaker voice arrival direction θ_est,j, a target speaker voice arrival direction time change Δθ_j, which is the change in the target speaker voice arrival direction over the past t frames (t is an integer of 1 or more); a weighted likelihood ratio calculation unit that calculates, for each frame, a weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), the weight w_i,j (i=0, 1, 2,…, M), and the target speaker voice arrival direction time change Δθ_j; and a voice section determination unit that generates, for each frame, from the weighted likelihood ratio L_j, a voice section determination result j, which is the result of determining whether the frame is a voice section.

Another aspect of the present invention is as follows. Let the microphone installed facing the direction in which the target speaker is expected to speak (hereinafter, the front direction) be microphone 0, and let the other M microphones (M≥1) be microphone 1, microphone 2, …, microphone M. Let the front direction as seen from microphone 0 be the 0 degree direction; let θ_i (0≤θ_i≤2π, with θ₀=0) be the direction of each of the M+1 microphones i (i=0, 1, 2,…, M), measured about microphone 0 from the 0 degree direction; and let r_i (r_i≥0, r₀=0) be the distance between microphone 0 and microphone i. The aspect includes: a likelihood calculation unit that divides speech data x_i (i=0, 1, 2,…, M), picked up by microphone i and containing the target speaker's voice, into frames and calculates, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number); a target speaker voice arrival direction estimation unit that estimates, for each frame, from the speech data x_i (i=0, 1, 2,…, M), a target speaker voice arrival direction θ_est,j (0≤θ_est,j≤2π), which is the direction of the target speaker measured about microphone 0 from the 0 degree direction; a weight calculation unit that calculates, for each frame, a weight w_i,j of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_est,j and from the direction θ_i and distance r_i representing the position of microphone i; a weighted likelihood ratio calculation unit that calculates, for each frame, a weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), and the weight w_i,j (i=0, 1, 2,…, M); and a voice section determination unit that generates, for each frame, from the weighted likelihood ratio L_j, a voice section determination result j, which is the result of determining whether the frame is a voice section.

According to the present invention, the direction from which the target speaker's voice arrives is estimated, and the voice sections of the target speaker are detected after emphasizing the voice from that direction. This makes it possible to detect the target speaker's voice sections accurately even in a noisy environment.

FIG. 1 is a diagram showing an example of a sound collection system.
FIG. 2 is a diagram showing an example of the configuration of the voice section detection device 100.
FIG. 3 is a diagram showing an example of the operation of the voice section detection device 100.
FIG. 4 is a diagram showing an example of the configuration of the likelihood calculation unit 110.
FIG. 5 is a diagram showing an example of the operation of the likelihood calculation unit 110.
FIG. 6 is a diagram showing an example of the target speaker voice arrival direction.
FIG. 7 is a diagram showing an example of the configuration of the weighted likelihood ratio calculation unit 140.
FIG. 8 is a diagram showing an example of the operation of the weighted likelihood ratio calculation unit 140.
FIG. 9 is a diagram showing an example of the configuration of the voice section detection device 200.
FIG. 10 is a diagram showing an example of the operation of the voice section detection device 200.
FIG. 11 is a diagram showing an example of the configuration of the weighted likelihood ratio calculation unit 240.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.

<Definitions>
The terms used in the embodiments are explained below.

First, the sound collection system, which consists of a plurality of microphones that pick up the target speaker's speech, is described.

[Sound collection system]
The sound collection system consists of one main microphone and M sub-microphones (M≥1). The main microphone is the microphone installed facing the direction in which the target speaker is expected to speak (hereinafter, the front direction). For a microphone mounted on a communication robot, for example, it is the microphone attached to the front of the robot and pointed in the front direction. A sub-microphone is any microphone other than the main microphone.

Hereinafter, the main microphone is called microphone 0, and the M sub-microphones are called microphone 1, microphone 2, …, microphone M.

Drop a straight line from microphone 0 perpendicular to the ground, and consider the plane P that passes through microphone 0 and has this line as its normal (see FIG. 1). On plane P, the direction in which the target speaker is expected to speak, that is, the front direction as seen from microphone 0, is taken as the 0 degree direction.

Let θ_i (i=0, 1, 2,…, M) be the direction of each of the M+1 microphones i (i=0, 1, 2,…, M), measured about microphone 0 from the 0 degree direction (hereinafter, the direction of microphone i relative to microphone 0), where 0≤θ_i≤2π and θ₀=0. Let r_i (i=0, 1, 2,…, M) be the distance between microphone 0 and microphone i, where r_i≥0 and r₀=0. The position of microphone i is thus represented by the direction θ_i of microphone i relative to microphone 0 and the distance r_i between microphone 0 and microphone i.

FIG. 1 shows an example (M=2) of a sound collection system consisting of a main microphone (microphone 0) and two sub-microphones (microphone 1 and microphone 2). In FIG. 1 the sub-microphones lie on plane P, but they need not. When a sub-microphone does not lie on plane P, the direction θ_i of microphone i relative to microphone 0 (i is an integer from 1 to M) is defined using the foot of the perpendicular dropped from the sub-microphone to plane P.
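The position definition above can be illustrated with a short sketch. The coordinate frame is an assumption made only for this example and does not come from the source: microphone 0 is at the origin, the +x axis is the front (0 degree) direction, angles increase counterclockwise, and z is vertical so that plane P is z = 0. The function name `mic_polar_position` is hypothetical, and taking r_i as the straight-line distance (rather than the distance within plane P) is likewise an assumption.

```python
import math

def mic_polar_position(x, y, z=0.0):
    """Return (theta_i, r_i) for a microphone at (x, y, z) relative to microphone 0.

    Assumed frame (not from the source): microphone 0 at the origin, +x is the
    front (0 degree) direction, z is vertical, so plane P is the plane z = 0.
    """
    # Straight-line distance to microphone 0 (assumed meaning of r_i).
    r = math.sqrt(x * x + y * y + z * z)
    if x == 0.0 and y == 0.0:
        # Direction is undefined directly above/below microphone 0; use 0,
        # matching theta_0 = 0 for microphone 0 itself.
        return 0.0, r
    # Dropping z projects the microphone onto plane P; wrap the angle to [0, 2*pi).
    theta = math.atan2(y, x) % (2.0 * math.pi)
    return theta, r
```

For instance, a sub-microphone one unit to the left of microphone 0 in this frame, `mic_polar_position(0.0, 1.0)`, gives a direction of π/2 and a distance of 1.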

<First embodiment>
The voice section detection device 100 is described below with reference to FIGS. 2 and 3.

As shown in FIG. 2, the voice section detection device 100 includes a likelihood calculation unit 110, a target speaker voice arrival direction estimation unit 120, a weight calculation unit 130, a target speaker voice arrival direction time change calculation unit 135, a weighted likelihood ratio calculation unit 140, a voice section determination unit 150, and a recording unit 190. The recording unit 190 records, as appropriate, the information needed for the processing of the voice section detection device 100. For example, the direction θ_i and distance r_i representing the position of each microphone i (i=0, 1, 2,…, M), used by the weight calculation unit 130, and the threshold th used by the voice section determination unit 150 are recorded in advance.

The voice section detection device 100 takes as input the M+1 pieces of speech data picked up by the sound collection system, that is, the speech data x_i picked up by microphone i (i=0, 1, 2,…, M) and containing the target speaker's voice, and generates and outputs a voice section determination result, which is the result of determining whether each frame is a voice section.

The operation of the voice section detection device 100 is described with reference to FIG. 3. The likelihood calculation unit 110 divides the speech data x_i (i=0, 1, 2,…, M), picked up by microphone i (i=0, 1, 2,…, M) and containing the target speaker's voice, into frames and calculates, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number) (S110). The index j may be taken as, for example, j=0, 1, ….

The likelihood calculation unit 110 is described below with reference to FIGS. 4 and 5. As shown in FIG. 4, the likelihood calculation unit 110 includes a feature extraction unit 111 and a likelihood calculation unit 113. The operation of the likelihood calculation unit 110 is described with reference to FIG. 5. The feature extraction unit 111 divides the input speech data x_i (i=0, 1, 2,…, M) into frames and extracts a feature quantity for each frame j (S111). MFCCs (Mel-Frequency Cepstrum Coefficients) or power may be used as the feature quantity. The likelihood calculation unit 113 calculates, for each frame, the speech likelihood L_{speech_i,j} and the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M) from the feature quantity extracted in S111 (S113). One method, for example, uses a Gaussian mixture model (GMM) (Reference Non-Patent Document 1). In this method, the speech likelihood and the non-speech likelihood are calculated from a speech GMM and a non-speech GMM, respectively.
(Reference Non-Patent Document 1) Masakiyo Fujimoto, Yasuo Ariki, "Noisy speech recognition using a combination of GMM-based speech signal estimation and time-domain SVD-based speech enhancement" (in Japanese), IEICE Transactions D-II, Vol. J88-D-II, No. 2, pp. 250-265, 2005.
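As an illustration of steps S111 and S113, the following is a minimal sketch that uses frame log-power as the only feature and evaluates a one-dimensional Gaussian mixture by hand. It is not the method of the cited reference: the function names, the frame length and hop, and any GMM parameters passed in are hypothetical, and a practical system would more likely use MFCC vectors and GMMs trained on speech and noise data.

```python
import numpy as np

def split_into_frames(x, frame_len, hop):
    """Split a 1-D signal into frames (rows). Requires len(x) >= frame_len."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[k * hop : k * hop + frame_len] for k in range(n)])

def log_power(frames, eps=1e-10):
    """One scalar feature per frame: log of the mean squared amplitude."""
    return np.log(np.mean(frames ** 2, axis=1) + eps)

def gmm_likelihood(feat, weights, means, variances):
    """Likelihood of each scalar feature under a 1-D Gaussian mixture.

    feat: shape (N,); weights, means, variances: shape (C,) for C components.
    """
    f = feat[:, None]                                   # (N, 1) against (C,)
    comp = np.exp(-0.5 * (f - means) ** 2 / variances)  # unnormalized Gaussians
    comp /= np.sqrt(2.0 * np.pi * variances)            # normalize each component
    return comp @ weights                               # mix the components
```

Calling `gmm_likelihood` once with the parameters of a speech GMM and once with those of a non-speech GMM, on the features of microphone i, would yield the per-frame values that play the roles of L_{speech_i,j} and L_{noise_i,j}.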

The target speaker voice arrival direction estimation unit 120 estimates, for each frame, from the input speech data x_i (i=0, 1, 2,…, M), the target speaker voice arrival direction θ_est,j (0≤θ_est,j≤2π), which is the direction of the target speaker measured about microphone 0 from the 0 degree direction (S120). FIG. 6 shows the positional relationship between the target speaker and microphone 0. The technique of Reference Non-Patent Document 2 can be used to estimate the target speaker voice arrival direction.
(Reference Non-Patent Document 2) C. H. Knapp, G. C. Carter, "The Generalized Correlation Method for Estimation of Time Delay", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 24, No. 4, pp. 320-327, 1976.
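The generalized correlation method estimates the time delay between two microphone signals, from which an arrival direction can then be derived. The sketch below shows one common variant, the phase transform (PHAT) weighting; the source does not say which weighting is used, so this choice, like the function name `gcc_phat_delay`, is an assumption. Converting the delays of several microphone pairs into θ_est,j depends on the array geometry and is not shown.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref`, in seconds, with GCC-PHAT.

    A positive result means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)                  # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

In a frame-wise setting, `sig` and `ref` would be the same frame j taken from two different microphones, and `fs` the sampling rate in Hz.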

With this estimation technique, multiple microphones can be used to focus on the target speaker and reduce the influence of ambient noise.

The weight calculation unit 130 calculates, for each frame, the weight w_i,j of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_est,j estimated in S120 and from the direction θ_i and distance r_i representing the position of microphone i (S130). Here, the weight of microphone i is a real number determined by the distance between microphone 0 and microphone i and by the deviation between the direction of microphone i relative to microphone 0 and the target speaker voice arrival direction. For example, the weights w_i,j (i=0, 1, 2,…, M) are calculated using Equation (1).

Here, 1/(2(1+r_i)) is the term that depends on the distance r_i from microphone 0; it is larger the closer the microphone is to microphone 0. cos(θ_i-θ_est,j) is the term that depends on the direction θ_i relative to microphone 0 and the target speaker voice arrival direction θ_est,j; its value decreases as θ_i-θ_est,j goes from 0 to π (that is, as the deviation between θ_i and θ_est,j grows). w_0,j is the weight of microphone 0 for frame j, for which Equation (1) reduces to Equation (2) below.

Therefore, when the target speaker voice arrival direction θ_est,j is the front direction (θ_est,j=0), w_0,j=1.
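Equations (1) and (2) are not reproduced in this text. One form that is consistent with the two terms described above, and with w_0,j=1 at θ_est,j=0, would be the following; it is an assumed reconstruction, not a quotation of the source equations (a sum of the two terms rather than a product would, for example, also give w_0,j=1).

```latex
w_{i,j} = \frac{1}{2\,(1 + r_i)} \bigl( 1 + \cos(\theta_i - \theta_{est,j}) \bigr)
\qquad\Longrightarrow\qquad
w_{0,j} = \frac{1}{2} \bigl( 1 + \cos\theta_{est,j} \bigr)
\quad (r_0 = 0,\ \theta_0 = 0)
```

With this form, substituting θ_est,j=0 gives w_0,j = (1/2)(1+1) = 1, and the weight falls to 0 when the microphone direction is opposite to the arrival direction.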

In general, the weight w_i,j is calculated using a function w(r, θ-θ_est) of the distance r and of the deviation θ-θ_est between the direction θ and the target speaker voice arrival direction θ_est. Here r is the distance between microphone 0 and a microphone, θ is the direction of that microphone measured about microphone 0 from the 0 degree direction, and θ_est is the target speaker voice arrival direction, that is, the direction of the target speaker measured in the same way (r is a real number of 0 or more; θ-θ_est is a real number). The function w(r, θ-θ_est) is monotonically decreasing in r. In θ-θ_est, it is a function of period 2π that decreases monotonically from its maximum at θ-θ_est=0 to its minimum at θ-θ_est=π and satisfies w(r, -(θ-θ_est))=w(r, θ-θ_est). Any function w(r, θ-θ_est) with these properties may be used to calculate the weight w_i,j.
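A function with the properties listed above can be sketched as follows. The specific form, 1/(2(1+r)) multiplied by (1+cos(θ-θ_est)), is an assumption chosen to match the described terms; the function name `mic_weight` is hypothetical.

```python
import math

def mic_weight(r, theta, theta_est):
    """Weight w(r, theta - theta_est) for one microphone and one frame.

    Assumed form (the source equation is not reproduced in this text):
    (1 + cos(theta - theta_est)) / (2 * (1 + r)).
    It decreases with r, is even and 2*pi-periodic in the angular deviation,
    and is largest at deviation 0 and smallest at deviation pi.
    """
    return (1.0 + math.cos(theta - theta_est)) / (2.0 * (1.0 + r))
```

For microphone 0 (r=0, θ=0) and a speaker directly in front (θ_est=0) this returns 1, in agreement with the statement that w_0,j=1 in that case.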

The target speaker voice arrival direction time change calculation unit 135 calculates, for each frame, from the target speaker voice arrival direction θ_est,j estimated in S120, the target speaker voice arrival direction time change Δθ_j, which is the change in the target speaker voice arrival direction over the past t frames (t is an integer of 1 or more) (S135). For example, when one frame is 20 msec, Δθ_j may be calculated with t of about 5. That is, the difference θ_est,j-θ_est,j-5 between the target speaker voice arrival direction θ_est,j of the current frame and the target speaker voice arrival direction θ_est,j-5 five frames earlier is taken as Δθ_j (Δθ_j=θ_est,j-θ_est,j-5). Δθ_j can be calculated by, for example, recording the target speaker voice arrival directions sequentially in the recording unit 190. This processing assumes that the position of the target speaker does not move much over a very short time interval (that is, that the target speaker's voice is being picked up normally), and is performed so that the time change Δθ_j of the target speaker voice arrival direction can be taken into account in the processing of the weighted likelihood ratio calculation unit 140 described later.
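The difference over t frames can be sketched as below. Two details are assumptions added for the example and are not stated in the source: the difference is wrapped to the interval (-π, π] so that a change from just below 2π to just above 0 counts as small, and 0 is returned for the first t frames, where no earlier estimate exists. The function name `doa_time_change` is hypothetical.

```python
import math

def doa_time_change(theta_est, j, t=5):
    """Return delta_theta_j = theta_est[j] - theta_est[j - t], wrapped to (-pi, pi].

    theta_est is the sequence of per-frame arrival directions in radians.
    Wrapping, and the 0.0 returned for the first t frames, are assumptions.
    """
    if j < t:
        return 0.0                          # no estimate t frames back yet
    d = (theta_est[j] - theta_est[j - t]) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d
```

With 20 msec frames and t=5, this compares the current direction estimate with the one from 100 msec earlier.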

The weighted likelihood ratio calculation unit 140 calculates, for each frame, the weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j} and non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M) calculated in S110, the weights w_i,j (i=0, 1, 2,…, M) calculated in S130, and the target speaker voice arrival direction time change Δθ_j calculated in S135 (S140). The weighted likelihood ratio calculation unit 140 is described below with reference to FIGS. 7 and 8. As shown in FIG. 7, the weighted likelihood ratio calculation unit 140 includes a weighted likelihood calculation unit 141 and a weighted likelihood ratio calculation unit 143. The operation of the weighted likelihood ratio calculation unit 140 is described with reference to FIG. 8.

The weighted likelihood calculation unit 141 calculates, for each frame, a weighted speech likelihood L_{speech_all,j} and a weighted non-speech likelihood L_{noise_all,j} from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), and the weights w_i,j (i=0, 1, 2,…, M) (S141). The weighted speech likelihood L_{speech_all,j} and the weighted non-speech likelihood L_{noise_all,j} are calculated by weighted addition, as in Equations (3a) and (3b), respectively.
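Equations (3a) and (3b) are not reproduced in this text. Read literally as weighted addition over the M+1 microphones, they would take the form below. This is an assumed reconstruction; whether the source additionally normalizes by the sum of the weights is not known.

```latex
L_{speech\_all,j} = \sum_{i=0}^{M} w_{i,j}\, L_{speech_i,j},
\qquad
L_{noise\_all,j} = \sum_{i=0}^{M} w_{i,j}\, L_{noise_i,j}
```

Because the same weights appear in both sums, any common normalization would cancel when the ratio of the two quantities is taken.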

The weighted likelihood ratio calculation unit 143 calculates, for each frame, the weighted likelihood ratio L_j from the weighted speech likelihood L_{speech_all,j} and weighted non-speech likelihood L_{noise_all,j} calculated in S141 and from the target speaker voice arrival direction time change Δθ_j (S143). The weighted likelihood ratio L_j is calculated using Equation (4).

Here, K is a parameter for adjusting how readily a frame is determined to be a voice section. Normally K=1 suffices. If falsely detecting non-speech sections is judged preferable to missing correct voice sections, that is, frames that contain speech, K may be set to a value greater than 1 so that frames are more readily determined to be voice sections.

In addition, the weighted likelihood ratio L_j becomes smaller as the target speaker voice arrival direction time change Δθ_j becomes larger. The reason for reflecting Δθ_j in the calculation of L_j in this way is as follows. Consider, for example, a conversation between a communication robot and the target speaker: the target speaker can basically be expected to stand still while speaking. Accordingly, it can be judged that the target speaker is speaking when the speaker's position has not shifted (Δθ_j is small) and that the target speaker is not speaking when the position has shifted (Δθ_j is large). The target speaker voice arrival direction time change Δθ_j is therefore used for weighting, in the form of Equation (4), when calculating the weighted likelihood ratio L_j.

In general, the weighted likelihood ratio L_j is calculated using a function L(L_{speech_all}/L_{noise_all}, Δθ) of two quantities: L_{speech_all}/L_{noise_all}, the ratio of the weighted speech likelihood L_{speech_all} to the weighted non-speech likelihood L_{noise_all}, and the target speaker voice arrival direction time change Δθ, the change in the target speaker voice arrival direction over the past t frames (t is an integer of 1 or more). The function L(L_{speech_all}/L_{noise_all}, Δθ) is monotonically increasing in L_{speech_all}/L_{noise_all}. With respect to Δθ, it is a function of period 2π that decreases monotonically so as to be maximum at Δθ = 0 and minimum at Δθ = π, and that satisfies L(L_{speech_all}/L_{noise_all}, −Δθ) = L(L_{speech_all}/L_{noise_all}, Δθ). The weighted likelihood ratio L_j may be calculated using any function L(L_{speech_all}/L_{noise_all}, Δθ) having these properties.
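As an illustration of the properties listed above, the sketch below uses a raised-cosine factor in Δθ. This is only one function that satisfies the stated conditions and is not presented as Equation (4). The name `weighted_likelihood_ratio` and the parameters `floor` and `eps` are hypothetical.

```python
import math


def weighted_likelihood_ratio(l_speech_all: float, l_noise_all: float,
                              delta_theta: float, k: float = 1.0,
                              floor: float = 0.1, eps: float = 1e-12) -> float:
    """One possible L(L_speech_all / L_noise_all, delta_theta).

    Assumed form, not Equation (4): the likelihood ratio is scaled by K and
    by a direction factor that is even in delta_theta, has period 2*pi, and
    falls from 1 at delta_theta = 0 to `floor` at delta_theta = pi.
    """
    ratio = l_speech_all / max(l_noise_all, eps)  # guard against division by zero
    direction_factor = floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(delta_theta))
    return k * ratio * direction_factor
```

Keeping `floor` above 0 means the result stays strictly increasing in the likelihood ratio even at Δθ = π. Raising `k` above 1 makes every frame more likely to pass a fixed threshold.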

The speech section determination unit 150 generates, for each frame, a speech section determination result j, which indicates whether or not the frame is a speech section, from the weighted likelihood ratio L_j calculated in S140 (S150). For example, the weighted likelihood ratio L_j is compared with a threshold th. If L_j is greater than th (or greater than or equal to th), the frame is determined to be a speech section, and a speech section determination result j indicating that frame j is a speech section is generated. Otherwise, the frame is determined to be a non-speech section, and a speech section determination result j indicating that frame j is a non-speech section is generated.
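The threshold comparison in S150 can be sketched as follows. The function name `decide_speech_frames` and the `inclusive` flag, which switches between "greater than th" and "greater than or equal to th", are illustrative assumptions.

```python
from typing import List, Sequence


def decide_speech_frames(ratios: Sequence[float], th: float,
                         inclusive: bool = False) -> List[bool]:
    """Return True for each frame j whose weighted likelihood ratio L_j
    exceeds the threshold th (or equals it, when inclusive is True)."""
    if inclusive:
        return [r >= th for r in ratios]
    return [r > th for r in ratios]
```

For instance, calling `decide_speech_frames([0.5, 1.0, 2.0], th=1.0)` marks only the last frame as a speech section, while `inclusive=True` also marks the second.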

According to the invention of this embodiment, the direction from which the target speaker's voice arrives is estimated using voice data picked up by a plurality of microphones, and the target speaker's speech sections are detected using speech and non-speech likelihoods weighted so as to emphasize voice arriving from that direction. This makes it possible to detect the target speaker's speech sections accurately even in a noisy environment.

In addition, because speech sections can be cut out accurately, the reliability of speech recognition can be improved.

<Second embodiment>
In the first embodiment, whether or not a frame is a speech section is determined based on a weighted likelihood ratio that reflects the target speaker voice arrival direction time change, which indicates how much the target speaker has moved. However, the weighted likelihood ratio may also be calculated without taking the target speaker's movement into account. This embodiment therefore describes a method of determining whether or not a frame is a speech section without using the target speaker voice arrival direction time change.

The speech section detection device 200 is described below with reference to FIGS. 9 and 10.

As shown in FIG. 9, the speech section detection device 200 includes a likelihood calculation unit 110, a target speaker voice arrival direction estimation unit 120, a weight calculation unit 130, a weighted likelihood ratio calculation unit 240, a speech section determination unit 150, and a recording unit 190.

The speech section detection device 200 receives as input the M+1 pieces of voice data picked up by the sound collection system, that is, the voice data x_i containing the target speaker's voice picked up by microphone i (i=0, 1, 2,…, M), and generates and outputs a speech section determination result indicating whether or not each frame is a speech section.

The operation of the speech section detection device 200 is described with reference to FIG. 10. The likelihood calculation unit 110 divides the voice data x_i (i=0, 1, 2,…, M) containing the target speaker's voice picked up by microphone i (i=0, 1, 2,…, M) into frames and calculates, for each frame, the speech likelihood L_{speech_i,j} and the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number) (S110).
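The framing part of S110 can be sketched as below. The frame length and frame shift are not given in this passage, so `frame_len` and `frame_shift` are hypothetical parameters, and the function name `split_into_frames` is illustrative. The likelihood model that is then applied to each frame is outside the scope of this sketch.

```python
from typing import List, Sequence


def split_into_frames(x: Sequence[float], frame_len: int,
                      frame_shift: int) -> List[Sequence[float]]:
    """Split one channel of voice data x_i into overlapping frames.

    Frame j covers samples [j * frame_shift, j * frame_shift + frame_len).
    Trailing samples that do not fill a whole frame are dropped (assumption).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    last_start = len(x) - frame_len
    return [x[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]
```

Applying the same `frame_len` and `frame_shift` to all M+1 channels keeps the frame index j aligned across microphones, which the later per-frame weighting relies on.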

The target speaker voice arrival direction estimation unit 120 estimates, for each frame, the target speaker voice arrival direction θ_{est,j} (0≦θ_{est,j}≦2π) from the input voice data x_i (i=0, 1, 2,…, M) (S120). Here, θ_{est,j} is the direction of the target speaker relative to the 0-degree direction, with microphone 0 as the center.

The weight calculation unit 130 calculates, for each frame, the weight w_{i,j} of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_{est,j} estimated in S120 and from the direction θ_i and distance r_i that represent the position of microphone i (i=0, 1, 2,…, M) (S130).
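The claims characterize the weight as a function w(r, θ−θ_est) that is monotonically decreasing in r and, in θ−θ_est, is even, has period 2π, and falls from a maximum at 0 to a minimum at π. The sketch below is one function with those properties, built from an exponential distance decay and a raised cosine. The specific form, the name `microphone_weight`, and the parameters `r_scale` and `floor` are assumptions made for illustration.

```python
import math


def microphone_weight(r: float, theta: float, theta_est: float,
                      r_scale: float = 1.0, floor: float = 0.05) -> float:
    """One possible w(r, theta - theta_est) for a microphone at (r, theta).

    Assumed form: exp(-r / r_scale) decreases monotonically with the
    distance r from microphone 0, and the raised-cosine factor is largest
    when the microphone lies in the estimated arrival direction.
    """
    d = theta - theta_est
    directional = floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(d))
    distance = math.exp(-r / r_scale)
    return distance * directional
```

For frame j, the weight w_{i,j} would then be `microphone_weight(r_i, theta_i, theta_est_j)` for each microphone i.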

The weighted likelihood ratio calculation unit 240 calculates, for each frame, the weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j} and the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M) calculated in S110 and from the weight w_{i,j} (i=0, 1, 2,…, M) calculated in S130 (S240). The weighted likelihood ratio calculation unit 240 is described below with reference to FIG. 11. As shown in FIG. 11, the weighted likelihood ratio calculation unit 240 includes a weighted likelihood calculation unit 141 and a weighted likelihood ratio calculation unit 243.

The weighted likelihood calculation unit 141 calculates, for each frame, the weighted speech likelihood L_{speech_all,j} and the weighted non-speech likelihood L_{noise_all,j} from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), and the weight w_{i,j} (i=0, 1, 2,…, M) (S141).
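The formula used in S141 is not reproduced in this passage. The sketch below assumes the simplest combination, a weighted sum of the per-microphone likelihoods for one frame. Both that choice and the name `weighted_likelihoods` are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def weighted_likelihoods(l_speech: Sequence[float], l_noise: Sequence[float],
                         w: Sequence[float]) -> Tuple[float, float]:
    """Combine per-microphone likelihoods for one frame j.

    l_speech[i], l_noise[i] and w[i] belong to microphone i (i = 0..M).
    Assumed form: L_speech_all = sum_i w[i] * l_speech[i], and likewise
    for the non-speech likelihood.
    """
    if not (len(l_speech) == len(l_noise) == len(w)):
        raise ValueError("all inputs must have one entry per microphone")
    l_speech_all = sum(wi * ls for wi, ls in zip(w, l_speech))
    l_noise_all = sum(wi * ln for wi, ln in zip(w, l_noise))
    return l_speech_all, l_noise_all
```

Under this assumption, microphones with small weights (far from microphone 0 or away from the estimated arrival direction) contribute little to either combined likelihood.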

The weighted likelihood ratio calculation unit 243 calculates, for each frame, the weighted likelihood ratio L_j from the weighted speech likelihood L_{speech_all,j} and the weighted non-speech likelihood L_{noise_all,j} calculated in S141 (S243). The weighted likelihood ratio L_j is calculated using Equation (5).

Note that a form that does not use K, that is, K = 1, may also be used.
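Equation (5) itself is not reproduced in this text. Since the second embodiment drops the direction-change term and the note above treats K = 1 as "not using K", a plausible reading is a likelihood ratio scaled by K. The sketch below implements that reading as an assumption. The name `likelihood_ratio_without_direction` and the `eps` guard are hypothetical.

```python
def likelihood_ratio_without_direction(l_speech_all: float, l_noise_all: float,
                                       k: float = 1.0, eps: float = 1e-12) -> float:
    """Assumed form of the second-embodiment ratio for one frame:
    L_j = K * L_speech_all / L_noise_all, with K = 1 by default."""
    return k * l_speech_all / max(l_noise_all, eps)  # eps avoids division by zero
```

With K = 1 this reduces to the plain ratio of the weighted speech likelihood to the weighted non-speech likelihood.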

The speech section determination unit 150 generates, for each frame, a speech section determination result j, which indicates whether or not the frame is a speech section, from the weighted likelihood ratio L_j calculated in S240 (S150).

<Modification>
The present invention is not limited to the embodiments described above, and it goes without saying that it may be modified as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM, which are memory; an external storage device, which is a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of that program. (Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device.) Data obtained through the processing of the program is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing in accordance with the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or it may execute processing in accordance with the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

A speech section detection device wherein a microphone installed in the direction in which the target speaker is assumed to speak (hereinafter, the front direction) is microphone 0, and M microphones (M≧1) are microphone 1, microphone 2, …, microphone M,
the front direction as viewed from microphone 0 is the 0-degree direction,
the direction of each of the M+1 microphones i (i=0, 1, 2,…, M) relative to the 0-degree direction, with microphone 0 as the center, is θ_i (0≦θ_i≦2π, where θ_0=0), and the distance between microphone 0 and each of the M+1 microphones i (i=0, 1, 2,…, M) is r_i (r_i≧0, where r_0=0),
the speech section detection device comprising:
a likelihood calculation unit that divides voice data x_i (i=0, 1, 2,…, M) containing the target speaker's voice picked up by microphone i (i=0, 1, 2,…, M) into frames and calculates, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number);
a target speaker voice arrival direction estimation unit that estimates, for each frame, from the voice data x_i (i=0, 1, 2,…, M), a target speaker voice arrival direction θ_{est,j} (0≦θ_{est,j}≦2π), which is the direction of the target speaker relative to the 0-degree direction with microphone 0 as the center;
a weight calculation unit that calculates, for each frame, a weight w_{i,j} of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_{est,j} and from the direction θ_i and the distance r_i representing the position of microphone i (i=0, 1, 2,…, M);
a target speaker voice arrival direction time change calculation unit that calculates, for each frame, from the target speaker voice arrival direction θ_{est,j}, a target speaker voice arrival direction time change Δθ_j, which is the change in the target speaker voice arrival direction over the past t frames (t is an integer of 1 or more);
a weighted likelihood ratio calculation unit that calculates, for each frame, a weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), the weight w_{i,j} (i=0, 1, 2,…, M), and the target speaker voice arrival direction time change Δθ_j; and
a speech section determination unit that generates, for each frame, from the weighted likelihood ratio L_j, a speech section determination result j indicating whether or not the frame is a speech section.

A speech section detection device wherein a microphone installed in the direction in which the target speaker is assumed to speak (hereinafter, the front direction) is microphone 0, and M microphones (M≧1) are microphone 1, microphone 2, …, microphone M,
the front direction as viewed from microphone 0 is the 0-degree direction,
the direction of each of the M+1 microphones i (i=0, 1, 2,…, M) relative to the 0-degree direction, with microphone 0 as the center, is θ_i (0≦θ_i≦2π, where θ_0=0), and the distance between microphone 0 and each of the M+1 microphones i (i=0, 1, 2,…, M) is r_i (r_i≧0, where r_0=0),
the speech section detection device comprising:
a likelihood calculation unit that divides voice data x_i (i=0, 1, 2,…, M) containing the target speaker's voice picked up by microphone i (i=0, 1, 2,…, M) into frames and calculates, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number);
a target speaker voice arrival direction estimation unit that estimates, for each frame, from the voice data x_i (i=0, 1, 2,…, M), a target speaker voice arrival direction θ_{est,j} (0≦θ_{est,j}≦2π), which is the direction of the target speaker relative to the 0-degree direction with microphone 0 as the center;
a weight calculation unit that calculates, for each frame, a weight w_{i,j} of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_{est,j} and from the direction θ_i and the distance r_i representing the position of microphone i (i=0, 1, 2,…, M);
a weighted likelihood ratio calculation unit that calculates, for each frame, a weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), and the weight w_{i,j} (i=0, 1, 2,…, M); and
a speech section determination unit that generates, for each frame, from the weighted likelihood ratio L_j, a speech section determination result j indicating whether or not the frame is a speech section.

The speech section detection device according to claim 1 or 2, wherein
the weight w_{i,j} (i=0, 1, 2,…, M) is calculated using a function w(r, θ−θ_est) of a distance r and a difference θ−θ_est, where r is the distance between microphone 0 and a microphone, θ is the direction of that microphone relative to the 0-degree direction with microphone 0 as the center, θ_est is the target speaker voice arrival direction, which is the direction of the target speaker relative to the 0-degree direction with microphone 0 as the center, and θ−θ_est is the difference between the direction θ and the target speaker voice arrival direction θ_est,
the function w(r, θ−θ_est) is monotonically decreasing in r, and
the function w(r, θ−θ_est) is, with respect to θ−θ_est, a function of period 2π that decreases monotonically so as to be maximum when θ−θ_est=0 and minimum when θ−θ_est=π, and that satisfies w(r, −(θ−θ_est)) = w(r, θ−θ_est).

A speech section detection method wherein a microphone installed in the direction in which the target speaker is assumed to speak (hereinafter, the front direction) is microphone 0, and M microphones (M≧1) are microphone 1, microphone 2, …, microphone M,
the front direction as viewed from microphone 0 is the 0-degree direction,
the direction of each of the M+1 microphones i (i=0, 1, 2,…, M) relative to the 0-degree direction, with microphone 0 as the center, is θ_i (0≦θ_i≦2π, where θ_0=0), and the distance between microphone 0 and each of the M+1 microphones i (i=0, 1, 2,…, M) is r_i (r_i≧0, where r_0=0),
the speech section detection method comprising:
a likelihood calculation step of dividing voice data x_i (i=0, 1, 2,…, M) containing the target speaker's voice picked up by microphone i (i=0, 1, 2,…, M) into frames and calculating, for each frame, a speech likelihood L_{speech_i,j} and a non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M; j is an index representing the frame number);
a target speaker voice arrival direction estimation step in which the speech section detection device estimates, for each frame, from the voice data x_i (i=0, 1, 2,…, M), a target speaker voice arrival direction θ_{est,j} (0≦θ_{est,j}≦2π), which is the direction of the target speaker relative to the 0-degree direction with microphone 0 as the center;
a weight calculation step in which the speech section detection device calculates, for each frame, a weight w_{i,j} of microphone i (i=0, 1, 2,…, M) from the target speaker voice arrival direction θ_{est,j} and from the direction θ_i and the distance r_i representing the position of microphone i (i=0, 1, 2,…, M);
a target speaker voice arrival direction time change calculation step in which the speech section detection device calculates, for each frame, from the target speaker voice arrival direction θ_{est,j}, a target speaker voice arrival direction time change Δθ_j, which is the change in the target speaker voice arrival direction over the past t frames (t is an integer of 1 or more);
a weighted likelihood ratio calculation step in which the speech section detection device calculates, for each frame, a weighted likelihood ratio L_j from the speech likelihood L_{speech_i,j}, the non-speech likelihood L_{noise_i,j} (i=0, 1, 2,…, M), the weight w_{i,j} (i=0, 1, 2,…, M), and the target speaker voice arrival direction time change Δθ_j; and
a speech section determination step in which the speech section detection device generates, for each frame, from the weighted likelihood ratio L_j, a speech section determination result j indicating whether or not the frame is a speech section.

A program for causing a computer to function as the speech section detection device according to any one of claims 1 to 3.