JP6125953B2 - Voice section detection apparatus, method and program - Google Patents

Voice section detection apparatus, method and program

Info

Publication number
JP6125953B2
Authority
JP
Japan
Prior art keywords
section
signal
voice
main speaker
reverberation
Prior art date
Legal status
Active
Application number
JP2013175584A
Other languages
Japanese (ja)
Other versions
JP2014186295A (en)
Inventor
記良 鎌土
慶介 木下
裕司 青野
哲 小橋川
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2013175584A
Publication of JP2014186295A
Application granted
Publication of JP6125953B2


Landscapes

  • Circuit For Audible Band Transducer (AREA)

Description

The present invention relates to a technique for detecting voice sections in a digital voice signal, and to speech recognition performed on the detected voice sections.

Patent Document 1 is known as a technique for detecting voice sections from a multichannel digital voice signal. Patent Document 2 is known as a technique for detecting the voice sections of a target speaker (hereinafter also "main speaker") from a digital voice signal that contains noise and the voices of people other than the target speaker (hereinafter also "other speakers"). In Patent Document 2, the digital voice signal is first split into frames of a predetermined length; the signal in each frame is analyzed to decide whether it contains the target speaker's voice, and the result is recorded as a speech/non-speech decision value. Next, speech recognition is performed on the digital voice signal to obtain a sequence of recognition units and the utterance-time information of each recognition unit. Finally, using the speech/non-speech decision values together with the recognition-unit sequence and the utterance-time information, each recognition unit is judged to have been uttered by the target speaker or not, based on the magnitude of the aggregated speech/non-speech decision values of the frames corresponding to that unit's utterance time.

[Patent Document 1] JP 2009-031604 A
[Patent Document 2] JP 2012-048119 A

Patent Document 1 detects all voice sections regardless of whether they belong to the main speaker or to other speakers, and extracts voices other than the main speaker's, so recognition accuracy for the main speaker can degrade substantially. It also presupposes the use of multiple microphones, and therefore cannot be used for speech recognition in a mobile environment where sound is captured by a single microphone. Patent Document 2 performs speech recognition and uses the recognition result; under reverberation, recognition accuracy drops, and as a result the accuracy of estimating the sections of the digital voice signal that contain the main speaker's voice (hereinafter also "main-speaker voice sections") degrades.

An object of the present invention is to provide a voice section detection technique that can detect the main speaker's voice with high accuracy even from a digital voice signal in which the voices of multiple speakers, the main speaker and other speakers, are mixed at a single microphone in a real, reverberant environment.

To solve the above problem, according to a first aspect of the present invention, a voice section detection apparatus includes a reverberation estimation unit that estimates the reverberation component contained in a digital voice signal and obtains a reverberation signal, and a per-speaker section detection unit that detects at least one of main-speaker voice sections and non-main-speaker sound sections based on the reverberation signal.

To solve the above problem, according to another aspect of the present invention, a voice section detection method includes a reverberation estimation step of estimating the reverberation component contained in a digital voice signal and obtaining a reverberation signal, and a per-speaker section detection step of detecting at least one of main-speaker voice sections and non-main-speaker sound sections based on the reverberation signal.

According to the present invention, the main speaker's voice can be detected with high accuracy even from a digital voice signal in which the voices of multiple speakers, the main speaker and other speakers, are mixed at a single microphone in a real, reverberant environment.

FIG. 1 is a functional block diagram of the voice section detection apparatus according to the first embodiment, and FIG. 2 shows its processing flow. FIG. 3A shows an image of the analog voice signal corresponding to the main speaker's voice, and FIG. 3B an image of the analog voice signal corresponding to non-main-speaker sound. FIG. 4A shows a vector image of the reverberation signal corresponding to the main speaker's voice, and FIG. 4B that corresponding to non-main-speaker sound. FIG. 5A shows a vector image of the difference corresponding to the main speaker's voice, and FIG. 5B that corresponding to non-main-speaker sound. FIG. 6 is a functional block diagram of the voice section detection apparatus according to the second embodiment, and FIG. 7 shows its processing flow. FIG. 8 is a functional block diagram of the voice section detection apparatus according to the third and fourth embodiments, and FIG. 9 shows its processing flow.
FIG. 10 is a functional block diagram of the voice section detection apparatus according to the fifth embodiment, and FIG. 11 shows its processing flow. FIG. 12A shows a vector image of the power of the reverberation signal of the main speaker's voice, and FIG. 12B that of non-main-speaker sound. FIG. 13 is a functional block diagram of the voice section detection apparatus according to the sixth embodiment, and FIG. 14 shows its processing flow. FIG. 15A shows a vector image of the power of the reverberation signal of the main speaker's voice when the other speakers' voices contained in the digital voice signal are small and their reverberation components are small, and FIG. 15B the corresponding image for non-main-speaker sound. FIG. 16 is a functional block diagram of the voice section detection apparatus according to the seventh embodiment, and FIG. 17 shows its processing flow.
FIG. 18 is a diagram for explaining the arrangement of any one of the voice section detection apparatuses according to the first to seventh embodiments together with the speech recognition apparatus according to the seventh embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components with the same function and steps performing the same processing are given the same reference numerals, and duplicate explanation is omitted. In the text, symbols such as "~" that should appear directly above the preceding character are written immediately after it because of the limitations of text notation; in the formulas they are written in their proper position. Unless otherwise noted, processing performed on the elements of a vector or matrix applies to every element of that vector or matrix.

<Points of the first embodiment>
In this embodiment, the reverberation component contained in the digital voice signal is estimated, and non-main-speaker sound is emphasized from the reverberation component. In other words, the non-main-speaker sound sections contained in the input signal to the microphone (the analog voice signal) are extracted from the reverberation component contained in the input signal. Here, "non-main-speaker sound" means sound whose main component is something other than the main speaker's voice. As the method of estimating the reverberation component contained in the digital voice signal, for example, the estimation method of Reference 1 can be used.
[Reference 1] International Publication No. WO 2007/100137

In a mobile environment, the difference between the reverberation components contained in the main speaker's voice and in non-main-speaker sound recorded by a single microphone is large, so non-main-speaker sound can be emphasized with high accuracy. This embodiment is not, however, limited to a mobile environment or a single microphone; it is also applicable to analog voice signals obtained in other environments or from multiple microphones.

<Voice section detection apparatus 100 according to the first embodiment>
FIG. 1 shows a functional block diagram of the voice section detection apparatus 100 according to the first embodiment, and FIG. 2 shows its processing flow. The apparatus 100 includes a voice signal acquisition unit 110, a reverberation estimation unit 120, a gain adjustment unit 130, a non-main-speaker sound enhancement unit 140, a non-main-speaker sound section detection unit 160, and a main-speaker voice extraction unit 170.
The apparatus 100 receives an analog voice signal and outputs the digital voice signal corresponding to the main speaker's voice.

<Voice signal acquisition unit 110>
Input: analog voice signal
Output: digital voice signal
The voice signal acquisition unit 110 receives the analog voice signal, converts it into a digital voice signal (s110), and outputs it. FIG. 3 shows the voice signal picked up by the microphone 90 represented as vectors: FIG. 3A corresponds to the main speaker's voice and FIG. 3B to non-main-speaker sound. As FIG. 3A shows, in the signal corresponding to the main speaker's voice the direct sound D is large and the reflected sound R (reverberation component) is small. Conversely, as FIG. 3B shows, in the signal corresponding to another speaker's voice the direct sound D is small and the reflected sound R (reverberation component) is large. Exploiting this property, the non-main-speaker sound section detection unit 160 described later detects at least one of main-speaker voice sections and non-main-speaker sound sections.

<Reverberation estimation unit 120>
Input: digital voice signal
Output: reverberation signal
The reverberation estimation unit 120 estimates the reverberation component contained in the digital voice signal (s120) and obtains the reverberation signal. FIGS. 4A and 4B show vector images of the reverberation signal corresponding to the main speaker's voice and to non-main-speaker sound, respectively.

The method of estimating the reverberation component is outlined below.

The original speech signal s(z) is a white signal u(z) passed through a short auto-regressive (hereinafter also "AR") process, as in Expression (1). Let the z-transform of the AR process be v(z) = 1/(1 - b(z)), where 1 - b(z) is a polynomial.

s(z) = v(z)u(z) = u(z) / (1 - b(z))   (1)

This original speech signal s(z) propagates through the room, and the signal x(z) observed at the microphone is, from Expression (1),

x(z) = h(z)s(z) = h(z)u(z) / (1 - b(z))   (2)

Here, h(z) is the room transfer function from the sound source to the microphone. The speech signal has a strong short-term correlation governed by v(z). By applying the pre-whitening of Expression (3), a linear prediction that removes this short-term correlation, v(z) can be regarded as approximately a white signal, so that v(z) ≈ 1 holds.

x~(n) = x(n) - Σ_{p=1}^{P} b(p)x(n - p)   (3)

Here, b(p) is a linear prediction coefficient that effectively suppresses v(z); it is obtained from the normal equations (4) built from the autocorrelation of the observed signal,

Σ_{p=1}^{P} b(p)r(|i - p|) = r(i),  i = 1, ..., P   (4)

Here, r(i) denotes the autocorrelation coefficient of the observed signal x when it is shifted by i samples. This linear prediction is carried out with a 30 ms filter length, and is expected to remove the early-reflection components within 30 ms as well as the short-term correlation of the speech itself.
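As an illustration of the pre-whitening of Expressions (3) and (4), the following is a minimal numpy sketch. It is not code from the patent; the function name and the prediction order are illustrative. It estimates b(p) from the autocorrelation via the normal equations and then subtracts the predicted short-term component:

```python
import numpy as np

def prewhiten(x, order):
    """Pre-whitening by linear prediction.

    b(p) is found from the autocorrelation r(i) via the normal
    (Yule-Walker) equations, then Expression (3) is applied:
        x~(n) = x(n) - sum_p b(p) x(n - p)
    """
    n = len(x)
    # autocorrelation r(0), ..., r(order) (unnormalized)
    r = np.array([np.dot(x[:n - i], x[i:]) for i in range(order + 1)])
    # solve sum_p b(p) r(|i - p|) = r(i) for i = 1..order
    R = np.array([[r[abs(i - p)] for p in range(1, order + 1)]
                  for i in range(1, order + 1)])
    b = np.linalg.solve(R, r[1:])
    # prediction residual = pre-whitened signal x~(n)
    xw = x.copy()
    for p in range(1, order + 1):
        xw[p:] -= b[p - 1] * x[:-p]
    return xw, b
```

Applied to a strongly autocorrelated signal, the residual is close to white and its variance is much smaller than that of the input.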

With D the step size (delay) and L the filter length, the reverberation signal d(n) can be formulated as

d(n) = Σ_{l=0}^{L-1} a(l)x~(n - D - l)   (5)

Here, a(l) (lowercase L) is a linear prediction coefficient, and x~(n) is the pre-whitened observation obtained by Expression (3). The z-transform a(z) of a(l) is given by Expression (6).

(Expression (6): rendered only as an image in the original; not reproduced here.)

Here, h_min(z) and h_max(z) are, respectively, the minimum-phase component of h(z) (the component whose zeros lie inside the unit circle of the z-plane) and its maximum-phase component (the component whose zeros lie outside the unit circle), and min[h_max(z)] denotes the function that converts h_max(z) to minimum phase.

In general, D is set to a value corresponding to 10-200 ms and L to a value corresponding to 100-500 ms.
This method is described in detail in, for example, Reference 1.
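The delayed (multi-step) linear prediction of Expression (5) can be sketched as follows. This is an illustrative numpy implementation, not code from the patent: it fits a(l) by least squares on the observed data rather than through the closed form of Expression (6), and the delay D and filter length L are chosen in samples rather than milliseconds.

```python
import numpy as np

def estimate_reverberation(xw, D, L):
    """Estimate the reverberation signal of Expression (5):
        d(n) = sum_{l=0}^{L-1} a(l) x~(n - D - l),
    with a(l) fitted so that d(n) best predicts x~(n)."""
    N = len(xw)
    A = np.zeros((N, L))              # column l holds x~(n - D - l)
    for l in range(L):
        A[D + l:, l] = xw[:N - D - l]
    # least-squares fit of the prediction coefficients a(l)
    a, *_ = np.linalg.lstsq(A, xw, rcond=None)
    return A @ a, a                   # (d(n), a(l))
```

Because only samples at least D steps in the past are used, the predictable part of the signal is the late reverberation; on a signal containing an echo beyond the delay D, d(n) captures a substantial part of its energy.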

Using the method above or another existing reverberation estimation technique, the reverberation estimation unit 120 estimates the reverberation component contained in the digital voice signal x(n) and obtains the reverberation signal d(n).

<Gain adjustment unit 130>
Input: reverberation signal
Output: gain-adjusted reverberation signal
The gain adjustment unit 130 receives the reverberation signal, multiplies it by a gain G (s130), and outputs the gain-adjusted reverberation signal. For G, a value smaller than 1 and larger than 0 is used, for example a value of 0.8 to 1.0. This reduces the distortion that arises when the non-main-speaker sound enhancement unit 140 described later takes the difference between the digital voice signal and the reverberation signal.

<Non-main-speaker sound enhancement unit 140>
Input: digital voice signal, gain-adjusted reverberation signal
Output: digital voice signal with non-main-speaker sound emphasized
The non-main-speaker sound enhancement unit 140 receives the digital voice signal and the gain-adjusted reverberation signal, computes the difference between them (s140), and outputs it as a digital voice signal in which non-main-speaker sound is emphasized. FIGS. 5A and 5B show vector images of the difference corresponding to the main speaker's voice and to non-main-speaker sound, respectively; the small arrows in the figures represent the reverberation component R' that could not be removed. This processing brings out the power difference between the main speaker's voice and non-main-speaker sound, so the non-main-speaker sound sections in the digital voice signal can be extracted with high accuracy. The processing of the gain adjustment unit 130 and the non-main-speaker sound enhancement unit 140 together can be realized by the known spectral subtraction method (see Reference 2).
[Reference 2] S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. Acoust., Speech, Signal Processing, 1979, vol. ASSP-27, pp. 113-120
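The combination of gain adjustment and difference computation can, as the text notes, be realized by spectral subtraction. Below is a minimal sketch of that idea (frame-wise magnitude subtraction with overlap-add); it is illustrative only, and the window, FFT size, and gain value are placeholders, not parameters taken from the patent or from Reference 2's exact formulation:

```python
import numpy as np

def spectral_subtract(x, d, gain=0.9, n_fft=256):
    """Subtract the gain-scaled magnitude spectrum of the reverberation
    estimate d from the observed signal x, frame by frame, keeping the
    phase of x (basic spectral subtraction)."""
    out = np.zeros(len(x))
    win = np.hanning(n_fft)
    hop = n_fft // 2
    for start in range(0, len(x) - n_fft + 1, hop):
        X = np.fft.rfft(x[start:start + n_fft] * win)
        Dm = np.abs(np.fft.rfft(d[start:start + n_fft] * win))
        mag = np.maximum(np.abs(X) - gain * Dm, 0.0)   # floor at zero
        Y = mag * np.exp(1j * np.angle(X))
        out[start:start + n_fft] += np.fft.irfft(Y, n_fft) * win
    return out
```

Subtracting the signal's own spectrum with gain 1 drives the output to zero, while subtracting nothing leaves a nonzero reconstruction, which matches the intended behavior of s130 + s140.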

<Non-main-speaker sound section detection unit 160>
Input: digital voice signal with non-main-speaker sound emphasized
Output: section information
The non-main-speaker sound section detection unit 160 receives the digital voice signal in which non-main-speaker sound is emphasized and, from its power, detects at least one of main-speaker voice sections and non-main-speaker sound sections (s160), outputting section information indicating at least one of the two. For example, the power of the emphasized signal is compared with a threshold: (1) if the power is larger than the threshold, the section is judged to be a main-speaker voice section; (2) if it is smaller, a non-main-speaker sound section. The threshold is determined in advance using, for example, training data labeled with correct main-speaker voice sections and non-main-speaker sound sections. The processing is performed, for example, per section consisting of the N samples before and after each point (a number of samples corresponding to 0.1-0.3 ms). As section information one can use, for example, the start and end times of at least one of the main-speaker voice sections and non-main-speaker sound sections. Alternatively, the digital voice signal with flags for at least one of the two section types attached (hereinafter also a "flagged digital voice signal") may be used. The section information is not limited to these examples; any information indicating at least one of main-speaker voice sections and non-main-speaker sound sections may be used.
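The threshold decision of s160 can be sketched as a frame-wise power comparison. This is a hypothetical minimal example, not the patent's implementation; the frame length and threshold are placeholders to be tuned on labeled training data as the text describes:

```python
import numpy as np

def detect_sections(enhanced, frame_len, threshold):
    """Per-frame power decision on the enhanced signal:
    True  -> main-speaker voice section (power above threshold)
    False -> non-main-speaker sound section (power at or below threshold)"""
    flags = []
    for k in range(len(enhanced) // frame_len):
        frame = enhanced[k * frame_len:(k + 1) * frame_len]
        flags.append(float(np.mean(frame ** 2)) > threshold)
    return flags
```

The list of per-frame flags is one possible concrete form of the "section information" the unit outputs.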

<Main-speaker voice extraction unit 170>
Input: digital voice signal, section information
Output: digital voice signal corresponding to the main speaker's voice
The main-speaker voice extraction unit 170 receives the digital voice signal and the section information, uses the section information to extract from the digital voice signal the portions corresponding to the main speaker's voice (s170), and outputs them as the output value of the voice section detection apparatus 100.

For example, when start and end times are used as section information, the samples between a start time and an end time are set to 1; further, to secure margins around the start and end times, margin processing adds N samples of 1 (a sample length corresponding to 0.1-0.4 ms) before the start time and M samples of 1 (a sample length corresponding to 0.1-0.4 ms) after the end time at which the value switches from 1 back to 0. Multiplying the digital voice signal, time sample by time sample, by this margin-processed main-speaker voice section (that is, a time-sample sequence that is 1 from N samples before the start time to M samples after the end time and 0 elsewhere) extracts the main speaker's voice.

When a digital voice signal flagged with main-speaker voice sections is used as section information, margin processing is applied to that signal (that is, the main-speaker voice flag is also given to the N samples before the start and the M samples after the end of each flagged run), and the digital voice signal corresponding to the flagged portions is extracted. When a digital voice signal flagged with non-main-speaker sound sections is used as section information, margin processing is applied to the unflagged portions, and the digital voice signal corresponding to the portions without the non-main-speaker sound flag is extracted.
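The margin processing above can be sketched as a 0/1 mask multiplied into the signal sample by sample. This is an illustrative example only; the section boundaries and margin lengths below are arbitrary sample counts, not values from the patent:

```python
import numpy as np

def extract_main_speaker(x, start, end, n_margin, m_margin):
    """Build a mask that is 1 from (start - N) to (end + M) and 0
    elsewhere, then multiply it into the signal sample by sample."""
    mask = np.zeros(len(x))
    lo = max(0, start - n_margin)            # N-sample margin before start
    hi = min(len(x), end + m_margin)         # M-sample margin after end
    mask[lo:hi] = 1.0
    return x * mask
```

Samples inside the widened section pass through unchanged; everything else is zeroed.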

<Effect>
With this configuration, the main speaker's voice can be detected with high accuracy even from a digital voice signal in which the voices of multiple speakers, the main speaker and other speakers, are mixed at a single microphone in a real, reverberant environment. As a result, the number of microphones used can be reduced, and the hardware configuration can be made lighter.

<Modifications>
When it receives a digital voice signal as its input, the voice section detection apparatus 100 need not include the voice signal acquisition unit 110.
The apparatus 100 need not include the gain adjustment unit 130; in that case, the non-main-speaker sound enhancement unit 140 uses the reverberation signal as-is, without gain adjustment.
The apparatus 100 need not include the main-speaker voice extraction unit 170; the output value (section information) of the non-main-speaker sound section detection unit 160 is then output as the output value of the apparatus 100.

<Second embodiment>
The description focuses on the differences from the first embodiment.
The sound picked up by the microphone 90 may contain noise other than voices. In this embodiment, therefore, the noise contained in the digital voice signal is suppressed before the processing of the first embodiment. This prevents noise-induced accuracy degradation in the non-main-speaker sound enhancement unit and the non-main-speaker sound section detection unit, so the main speaker's voice can be extracted with high accuracy even in noisy environments.

FIG. 6 shows a functional block diagram of the voice section detection apparatus 200 according to the second embodiment, and FIG. 7 shows its processing flow. In addition to the configuration of the apparatus 100, the apparatus 200 further includes a noise suppression unit 210.

<Noise Suppression Unit 210>
Input: audio digital signal
Output: audio digital signal with noise suppressed
The noise suppression unit 210 receives the audio digital signal, suppresses the non-speech noise contained in it (s210), and outputs the noise-suppressed audio digital signal. Any known noise suppression technique can be used; for example, the MMSE-STSA method (see Reference 3) is one possibility.
[Reference 3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator", IEEE Trans. Acoust. Speech Signal Process., April 1985, vol. ASSP-33, no. 2, pp. 443-445
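As an illustration of where the noise suppression step (s210) fits, the following is a minimal sketch using a simple power spectral subtraction gain instead of the MMSE-STSA method cited above; the function name, the spectral floor, and the assumption that per-bin magnitude spectra of one frame are already available are illustrative and not part of the embodiment.

```python
# Minimal noise-suppression sketch (stand-in for MMSE-STSA in step s210).
# Inputs: per-bin magnitude spectra of one frame of the noisy signal and of
# the estimated noise (assumed precomputed by some front end).

def suppress_noise(noisy_mag, noise_mag, floor=0.1):
    """Attenuate each frequency bin by power spectral subtraction."""
    enhanced = []
    for y, n in zip(noisy_mag, noise_mag):
        # Estimated clean power = |Y|^2 - |N|^2, kept above a spectral floor
        clean_power = max(y * y - n * n, (floor * y) ** 2)
        enhanced.append(clean_power ** 0.5)
    return enhanced

noisy = [1.0, 0.5, 2.0, 0.3]
noise = [0.4, 0.4, 0.4, 0.4]
print(suppress_noise(noisy, noise))
```

The spectral floor prevents the subtraction from producing negative power in bins dominated by noise, which is the usual failure mode of plain subtraction.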

Note that the audio digital signal used in the reverberation estimation unit 120, the main speaker outside sound section detection unit 160, and the main speaker speech extraction unit 170 is the audio digital signal whose noise has been suppressed by the noise suppression unit 210.

<Effect>
With this configuration, the same effects as in the first embodiment can be obtained. In addition, degradation of the extraction accuracy of the main speaker voice section due to noise can be prevented.

<Third embodiment>
A description will be given centering on differences from the first embodiment.
In this embodiment, the main speaker outside sound section is detected by also using the power of the entire audio digital signal, which contains both the main speaker's voice and the main speaker outside sound.

FIG. 8 is a functional block diagram of the speech section detection apparatus 300 according to the third embodiment, and FIG. 9 shows its processing flow. The speech section detection apparatus 300 includes a speech signal acquisition unit 110, a reverberation estimation unit 120, a gain adjustment unit 130, a main speaker external sound enhancement unit 140, a main speaker outside sound section detection unit 360, and a main speaker speech extraction unit 170, and further includes an audio signal power calculation unit 350.

<Audio signal power calculator 350>
Input: audio digital signal
Output: power of the audio digital signal
The audio signal power calculation unit 350 receives the audio digital signal, calculates its power (s350), and outputs it. The audio signal power calculation unit 350 may internally include a signal smoothing unit 351 (shown by a broken line in the figure). The signal smoothing unit 351 smooths the audio digital signal by averaging over the N samples before and after each sample, where N corresponds to 0.1 to 0.3 ms (s351, shown by a broken line in the figure). The audio signal power calculation unit 350 may then calculate the power of the smoothed audio digital signal. Smoothing emphasizes the signal so that the main speaker voice section or the main speaker outside sound is easier to detect.
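The smoothing and power computation just described can be sketched as follows; treating power as the mean squared amplitude and the handling of window edges are illustrative assumptions, since the embodiment does not fix either.

```python
# Sketch of the audio signal power calculation unit 350 with the optional
# signal smoothing unit 351.

def smooth(signal, n):
    """Moving average over the n samples before and after each sample (s351)."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - n), min(len(signal), i + n + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def power(signal):
    """Mean squared amplitude of the (smoothed) signal (s350)."""
    return sum(x * x for x in signal) / len(signal)

x = [0.0, 1.0, -1.0, 1.0, 0.0]
print(power(smooth(x, 1)))
```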

<Main speaker outside sound section detection unit 360>
Input: power of the audio digital signal; audio digital signal in which the main speaker outside sound is emphasized
Output: main speaker voice section or main speaker outside sound section
The main speaker outside sound section detection unit 360 receives the power of the audio digital signal and the audio digital signal in which the main speaker outside sound is emphasized, detects at least one of the main speaker voice section and the main speaker outside sound section from the power of the "audio digital signal in which the main speaker outside sound is emphasized" normalized by the power of the audio digital signal (s360), and outputs section information indicating at least one of the main speaker voice section and the main speaker outside sound section. For example, the power of the "audio digital signal in which the main speaker outside sound is emphasized" normalized by the power of the audio digital signal is compared with a threshold. Except that this normalized power is used in place of the audio digital signal in which the main speaker outside sound is emphasized, the processing is the same as in the main speaker outside sound section detection unit 160.
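The normalization-and-threshold decision above can be sketched per section as follows; the threshold value, the label strings, and the handling of zero signal power are illustrative assumptions.

```python
# Sketch of the decision rule in the main speaker outside sound section
# detection unit 360: per-section power of the emphasized signal is
# normalized by per-section power of the original audio digital signal.

def classify_sections(emphasized_power, signal_power, threshold=0.5):
    """Return one 'outside' / 'main' label per section from normalized power."""
    labels = []
    for pe, ps in zip(emphasized_power, signal_power):
        ratio = pe / ps if ps > 0 else 0.0   # normalization by signal power
        labels.append('outside' if ratio > threshold else 'main')
    return labels

print(classify_sections([0.9, 0.1, 0.6], [1.0, 1.0, 1.0]))
```

The normalization makes the decision depend on the relative strength of the emphasized component rather than on the absolute recording level.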

<Effect>
With this configuration, a reverberation component that the reverberation estimation unit has erroneously estimated to be a direct sound component can be attenuated using the power of the audio digital signal, and the same effects as in the first embodiment can be obtained. The modifications of the first embodiment or the second embodiment may be combined with this embodiment.

<Fourth embodiment>
A description will be given centering on differences from the third embodiment.
In this embodiment, silent sections containing neither the main speaker's voice nor the main speaker outside sound are estimated using the power of the entire audio digital signal, and the main speaker voice section or the main speaker outside sound section is detected outside the silent sections, which improves the detection accuracy.

FIG. 8 is a functional block diagram of the speech section detection apparatus 400 according to the fourth embodiment, and FIG. 9 shows its processing flow. The speech section detection apparatus 400 includes a main speaker outside sound section detection unit 460 in place of the main speaker outside sound section detection unit 360.

<Main speaker outside sound section detection unit 460>
Input: power of the audio digital signal; audio digital signal in which the main speaker outside sound is emphasized
Output: section information
The main speaker outside sound section detection unit 460 receives the power of the audio digital signal and the audio digital signal in which the main speaker outside sound is emphasized. The main speaker outside sound section detection unit 460 internally includes a silent section extraction unit 461 (shown by a broken line in the figure). The silent section extraction unit 461 receives the power of the audio digital signal and extracts silent sections from that power (s461, shown by a broken line in the figure). For example, the power of the audio digital signal is compared with a threshold. When the power is defined as P = 10 log10(squared value of the audio digital signal) [dB], the threshold is set to a value between -10 and 10 [dB], and P is compared with this threshold. A section whose power is below the threshold is judged to be a silent section, and a section whose power is above it is judged to be outside the silent sections. The silence decision is made section by section, for example with 0.1 to 0.3 ms as one section.
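The silence decision above can be sketched per section as follows; averaging the squared value over the section and the particular threshold of 0 dB are illustrative assumptions within the stated -10 to 10 dB range.

```python
import math

# Sketch of the silent section extraction unit 461: section power is
# expressed as P = 10*log10(squared value) [dB] and compared with a
# threshold chosen between -10 and 10 dB.

def is_silent(section, threshold_db=0.0):
    mean_sq = sum(x * x for x in section) / len(section)
    if mean_sq == 0.0:
        return True                      # all-zero signal: trivially silent
    p_db = 10.0 * math.log10(mean_sq)    # P = 10 log10(squared value) [dB]
    return p_db < threshold_db

print(is_silent([0.01, -0.02, 0.01]))   # quiet section
print(is_silent([3.0, -2.5, 2.8]))      # loud section
```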

Furthermore, the main speaker outside sound section detection unit 460 detects at least one of the main speaker voice section and the main speaker outside sound section from the audio digital signal outside the silent sections (s460). For example, outside the silent sections, the audio digital signal in which the main speaker outside sound is emphasized is normalized by the power of the audio digital signal, and at least one of the main speaker voice section and the main speaker outside sound section is detected from the power of this normalized signal. Except that this normalized power, computed outside the silent sections, is used in place of the plain "audio digital signal in which the main speaker outside sound is emphasized", the processing is the same as in the main speaker outside sound section detection unit 160.

The main speaker outside sound section detection unit 460 may internally include a signal smoothing unit 462 (shown by a broken line in the figure). The signal smoothing unit 462 performs the same processing as the signal smoothing unit 351 described above. Smoothing emphasizes the signal so that the main speaker voice section or the main speaker outside sound is easier to detect. In this case, the main speaker outside sound section detection unit 460 detects at least one of the main speaker voice section and the main speaker outside sound section from the smoothed power, normalized outside the silent sections by the power of the audio digital signal, of the "audio digital signal in which the main speaker outside sound is emphasized".

<Effect>
With this configuration, the same effects as in the third embodiment can be obtained. Furthermore, detecting the main speaker voice section or the main speaker outside sound section only outside the silent sections improves the detection accuracy. A configuration that omits the silent section extraction unit 461 and includes only the signal smoothing unit 462 is also possible.

<Fifth embodiment>
A description will be given centering on differences from the first embodiment.
FIG. 10 is a functional block diagram of the speech section detection apparatus 500 according to the fifth embodiment, and FIG. 11 shows its processing flow. The speech section detection apparatus 500 includes a speech signal acquisition unit 110, a reverberation estimation unit 120, a gain adjustment unit 130, a reverberation signal power calculation unit 540, a main speaker outside sound section detection unit 560, and a main speaker speech extraction unit 170.

<Reverberation signal power calculation unit 540>
Input: gain-adjusted reverberation signal
Output: power of the reverberation signal
The reverberation signal power calculation unit 540 receives the gain-adjusted reverberation signal, calculates its power (s540), and outputs it. The processing in the reverberation signal power calculation unit 540 is the same as that in the audio signal power calculation unit 350, except that the gain-adjusted reverberation signal is used instead of the audio digital signal. For example, it may internally include a signal smoothing unit 351. FIGS. 12A and 12B show, as vectors, images of the power of the reverberation signal for the main speaker's voice and for the main speaker outside sound, respectively. As can be seen from FIGS. 12A and 12B and FIGS. 4A and 4B, the power of the reverberation signal is small for the main speaker's voice and large for the main speaker outside sound. The main speaker outside sound section detection unit 560 described later exploits this property to obtain the section information.

<Main speaker outside sound section detection unit 560>
Input: power of the reverberation signal
Output: section information
The main speaker outside sound section detection unit 560 receives the power of the reverberation signal, detects at least one of the main speaker voice section and the main speaker outside sound section from that power (s560), and outputs section information indicating at least one of the main speaker voice section and the main speaker outside sound section. For example, the power of the reverberation signal is compared with a threshold: (1) if it is larger than the threshold, the section is judged to be a main speaker outside sound section; (2) if it is smaller than the threshold, the section is judged to be a main speaker voice section. The threshold is determined in advance using, for example, training data with correct labels for main speaker voice sections and main speaker outside sound sections. The processing is performed section by section, for example treating N consecutive samples of the audio digital signal (N corresponding to 0.1 to 0.3 ms) as one section.
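The rule above, together with one simple way the threshold might be fixed in advance from labeled data, can be sketched as follows; the midpoint-between-class-means learning rule is an illustrative assumption, since the embodiment only states that the threshold is predetermined from labeled training data.

```python
# Sketch of the detection rule of the main speaker outside sound section
# detection unit 560, with a hypothetical threshold-learning step.

def learn_threshold(labeled):
    """labeled: list of (reverberation_power, 'voice' | 'outside') pairs."""
    voice = [p for p, lab in labeled if lab == 'voice']
    outside = [p for p, lab in labeled if lab == 'outside']
    # Midpoint between the two class means (assumed learning rule)
    return (sum(voice) / len(voice) + sum(outside) / len(outside)) / 2.0

def detect(reverb_powers, threshold):
    # (1) above threshold -> outside sound section, (2) below -> voice section
    return ['outside' if p > threshold else 'voice' for p in reverb_powers]

train = [(0.1, 'voice'), (0.2, 'voice'), (0.7, 'outside'), (0.9, 'outside')]
th = learn_threshold(train)          # (0.15 + 0.8) / 2 = 0.475
print(detect([0.05, 0.6], th))
```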

<Effect>
With this configuration, the same effects as in the first embodiment can be obtained. The modifications of the first embodiment may be combined with this embodiment, and the noise suppression unit 210 of the second embodiment, or the audio signal power calculation unit 350, silent section extraction unit 461, and signal smoothing unit 462 of the fourth embodiment, may also be combined with this embodiment. In that case, however, the silent section extraction unit 461 and the signal smoothing unit 462 use the power of the reverberation signal instead of the power of the audio digital signal in which the main speaker outside sound is emphasized.

<Sixth embodiment>
A description will be given centering on differences from the fifth embodiment.
FIG. 13 is a functional block diagram of the speech section detection apparatus 600 according to the sixth embodiment, and FIG. 14 shows its processing flow. The speech section detection apparatus 600 further includes an audio signal power calculation unit 350 (see the third embodiment), and includes a main speaker outside sound section detection unit 660 in place of the main speaker outside sound section detection unit 560.

<Main speaker outside sound section detection unit 660>
Input: power of the audio digital signal; power of the reverberation signal
Output: section information
The main speaker outside sound section detection unit 660 receives the power of the audio digital signal and the power of the reverberation signal, detects at least one of the main speaker voice section and the main speaker outside sound section from the power of the reverberation signal normalized by the power of the audio digital signal (s660), and outputs section information indicating at least one of the main speaker voice section and the main speaker outside sound section. For example, the power of the reverberation signal normalized by the power of the audio digital signal is compared with a threshold. Except that this normalized power is used in place of the plain "power of the reverberation signal", the processing is the same as in the main speaker outside sound section detection unit 560.

<Effect>
With this configuration, a direct sound component that the reverberation estimation unit has erroneously estimated to be a reverberation component can be attenuated using the power of the audio digital signal, and the same effects as in the fifth embodiment can be obtained. The modifications of the first embodiment may be combined with this embodiment, and the noise suppression unit 210 of the second embodiment, or the audio signal power calculation unit 350, silent section extraction unit 461, and signal smoothing unit 462 of the fourth embodiment, may also be combined with this embodiment. In that case, however, the silent section extraction unit 461 and the signal smoothing unit 462 use the power of the reverberation signal instead of the power of the audio digital signal in which the main speaker outside sound is emphasized.

<Seventh embodiment>
A description will be given centering on differences from the first embodiment.
<Points of the seventh embodiment>
In this embodiment, the reverberation component contained in the audio digital signal is estimated, and the main speaker sound is emphasized based on that reverberation component. In other words, the main speaker sound section contained in the input signal to the microphone (the audio analog signal) is extracted from the reverberation component contained in the input signal.

In a mobile environment, the reverberation component contained in another speaker's voice recorded with a single microphone can be difficult to estimate because its direct sound is unclear. The main speaker sound section detection unit 760 described later exploits this property to obtain the section information.

In the fifth embodiment, it was explained that the power of the reverberation signal is small for the main speaker's voice and large for the main speaker outside sound. This holds, however, only when the other speaker's voice contained in the audio digital signal is clear enough for its reverberation component to be estimated accurately. When the other speaker's voice contained in the audio digital signal is quiet and its reverberation component is small, the conventional reverberation component estimation method (see Reference 1) cannot estimate the reverberation component accurately. In such a case, the inventors found that, conversely, the reverberation component contained in the other speaker's voice becomes smaller than the reverberation component contained in the main speaker's voice. FIGS. 15A and 15B show, as vectors, images of the power of the reverberation signal for the main speaker's voice and for the main speaker outside sound, respectively, when the other speaker's voice contained in the audio digital signal is quiet and its reverberation component is small. As can be seen from FIGS. 15A and 15B, the power of the reverberation signal is large for the main speaker's voice and small for the main speaker outside sound. This embodiment uses this finding to emphasize the main speaker sound with high accuracy. Note that this embodiment is not limited to a mobile environment or a single microphone; it is also applicable to audio analog signals obtained in other environments or from a plurality of microphones.

<Speech section detection apparatus 700 according to the seventh embodiment>
FIG. 16 is a functional block diagram of the speech section detection apparatus 700 according to the seventh embodiment, and FIG. 17 shows its processing flow. The speech section detection apparatus 700 includes a speech signal acquisition unit 110, a reverberation estimation unit 120, a main speaker sound section detection unit 760, and a main speaker speech extraction unit 170. The speech section detection apparatus 700 need not include the gain adjustment unit 130 or the main speaker external sound enhancement unit 140, and includes the main speaker sound section detection unit 760 in place of the main speaker outside sound section detection unit 160.

<Main speaker sound section detection unit 760>
Input: reverberation signal
Output: section information
The main speaker sound section detection unit 760 receives the reverberation signal, detects at least one of the main speaker voice section and the main speaker outside sound section from the power of the reverberation signal (s760), and outputs section information indicating at least one of the main speaker voice section and the main speaker outside sound section. For example, the power of the reverberation signal is compared with a threshold: (1) if it is larger than the threshold, the section is judged to be a main speaker voice section; (2) if it is smaller than the threshold, the section is judged to be a main speaker outside sound section. The threshold is determined in advance using, for example, training data with correct labels for main speaker voice sections and main speaker outside sound sections.

<Effect>
With this configuration, the main speaker's voice can be detected with high accuracy in a real reverberant environment, even from an audio digital signal in which the voices of a plurality of speakers, including the main speaker and other speakers, are mixed at a single microphone, and in particular when the other speakers' voices contained in the audio digital signal are quiet and their reverberation components are small. As a result, the number of microphones used can be reduced, and the hardware configuration can be made lighter.

<Modification>
This embodiment may be combined with the other embodiments.
For example, when the second to fourth embodiments are combined with this embodiment, the speech section detection apparatuses 200 to 400 need not include the gain adjustment unit 130 or the main speaker external sound enhancement unit 140, and each includes a main speaker sound section detection unit in place of the main speaker outside sound section detection unit 160, 360, or 460, respectively. The main speaker sound section detection unit performs the same processing as the main speaker outside sound section detection units 160, 360, and 460, except that the reverberation signal is used instead of the audio digital signal in which the main speaker outside sound is emphasized.

Also, for example, when the fifth and sixth embodiments are combined with this embodiment, a main speaker sound section detection unit is included in place of the main speaker outside sound section detection units 560 and 660, respectively. In this case, in the main speaker sound section detection unit, the decision made when the power of the reverberation signal, or the normalized power of the reverberation signal, is compared with the threshold is the reverse of the decision made by the main speaker outside sound section detection units 560 and 660. For example, when the fifth embodiment is combined with this embodiment, the main speaker sound section detection unit performs the following processing.

<Main speaker sound section detection unit>
Input: power of the reverberation signal
Output: section information
The main speaker sound section detection unit receives the power of the reverberation signal, detects at least one of the main speaker voice section and the main speaker outside sound section from that power, and outputs section information indicating at least one of the main speaker voice section and the main speaker outside sound section. For example, the power of the reverberation signal is compared with a threshold: (1) if it is larger than the threshold, the section is judged to be a "main speaker voice section"; (2) if it is smaller than the threshold, it is judged to be a "main speaker outside sound section". The threshold is determined in advance using, for example, training data with correct labels for main speaker voice sections and main speaker outside sound sections.

By combining this embodiment with the other embodiments in this way, the effects described in each embodiment can be obtained even when the other speaker's voice contained in the audio digital signal is quiet and its reverberation component is small.

A configuration is also possible in which the speech section detection apparatuses according to the first to sixth embodiments and the speech section detection apparatus according to this embodiment (or an apparatus combining the second to sixth embodiments with this embodiment) are switched according to the loudness of the other speakers' voices contained in the audio digital signal. When the other speakers' voices contained in the audio digital signal are loud, the speech section detection apparatuses according to the first to sixth embodiments are used; when they are quiet, the speech section detection apparatus according to this embodiment (or a combination of the second to sixth embodiments with this embodiment) is used. The main speaker's voice can thereby be detected with high accuracy in either situation.
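The switching configuration just described can be sketched as a simple dispatch; the loudness measure for the other speakers' voices, the threshold, and the detector stand-ins are all illustrative assumptions, since the embodiment does not specify how the loudness is measured.

```python
# Sketch of switching between the first-to-sixth-embodiment detectors and
# the seventh-embodiment detector by the other speakers' loudness.

def choose_detector(other_speaker_power, detector_1_to_6, detector_7,
                    loudness_threshold=0.3):
    if other_speaker_power >= loudness_threshold:
        return detector_1_to_6   # clear other-speaker speech: embodiments 1-6
    return detector_7            # quiet other-speaker speech: embodiment 7

det = choose_detector(0.05, lambda sig: 'apparatus 100-600',
                      lambda sig: 'apparatus 700')
print(det(None))
```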

Note that the term "speaker-specific section detection unit" is a concept that encompasses both the "main speaker outside sound section detection unit" and the "main speaker sound section detection unit" described above.

<Eighth embodiment>
FIG. 18 is a diagram for explaining the arrangement of any one of the speech section detection apparatuses 100 to 700 and a speech recognition apparatus 800. One of the speech section detection apparatuses 100 to 700 is placed in front of the speech recognition apparatus 800.

The speech recognition apparatus 800 takes a speech signal as input and performs speech recognition on the speech signal using the signal obtained by any one of the speech section detection apparatuses 100 to 700 described above. Here, "speech signal" is a concept that includes both audio digital signals and audio analog signals.

For example, the speech recognition apparatus 800 receives the audio digital signal corresponding to the main speaker's voice obtained by any one of the speech section detection apparatuses 100 to 700, and outputs the speech recognition result.

Also, for example, when the section information is output as the output value of the speech section detection apparatus 100 (see the modification of the first embodiment) and the start and end times of at least one of the main speaker voice section and the main speaker outside sound section are used as the section information, speech recognition is performed on the portion of the speech signal corresponding to the section information, and the speech recognition result is output.

Also, for example, when the section information is output as the output value of the speech section detection apparatus 100 (see the modification of the first embodiment) and a signal in which flags for at least one of the main speaker voice section and the main speaker outside sound section are attached to the audio digital signal is used as the section information, speech recognition is performed on the portions of the audio digital signal flagged as main speaker voice sections, and the speech recognition result is output.
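The flag-based selection just described can be sketched as follows; the per-sample flag representation and the recognizer stub are illustrative assumptions standing in for the actual section-information format and the speech recognition apparatus 800.

```python
# Sketch of feeding only main-speaker-flagged portions of the audio digital
# signal to speech recognition, as in the eighth embodiment.

def extract_main_speaker(samples, flags):
    """Keep samples whose section flag marks the main speaker voice section."""
    return [s for s, f in zip(samples, flags) if f == 'main']

def recognize(samples):
    """Stand-in for the speech recognition apparatus 800."""
    return f"recognized {len(samples)} samples"

audio = [0.1, 0.2, 0.9, 0.8, 0.1]
flags = ['other', 'other', 'main', 'main', 'other']
print(recognize(extract_main_speaker(audio, flags)))
```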

In this way, by using the speech digital signal corresponding to the main speaker's speech or the section information obtained by any one of the voice section detection apparatuses 100 to 700, non-main-speaker sounds, silence, noise, and the like can be removed from the input speech (speech signal) used for recognition, so that speech recognition is performed only on the main speaker's speech, improving its accuracy. Normally, non-main-speaker sounds and noise are not judged to be non-speech; they are recognized as speech, and the recognition result contains spurious insertions. By detecting only the main-speaker speech sections with high accuracy, the voice section detection apparatuses 100 to 700 reduce the adverse effect of out-of-target speech and noise on the speech recognition system.

<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various kinds of processing described above need not be executed in time sequence according to the description; they may be executed in parallel or individually according to the processing capability of the apparatus executing them, or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each apparatus described in the above embodiments and modifications may be realized by a computer. In that case, the processing content of the functions that each apparatus should have is described by a program, and by executing this program on a computer, the various processing functions of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage unit. When executing processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another form of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may successively execute processing according to the program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each apparatus is configured by executing a predetermined program on a computer as described above, at least part of this processing content may be realized in hardware.

100-700 Voice section detection apparatus
110 Speech signal acquisition unit
120 Reverberation estimation unit
130 Gain adjustment unit
140 Non-main-speaker sound enhancement unit
160, 360, 460, 560, 660 Non-main-speaker sound section detection unit
760 Main-speaker sound section detection unit
170 Main-speaker speech extraction unit
210 Noise suppression unit
350 Speech signal power calculation unit
351 Signal smoothing unit
461 Silent section extraction unit
462 Signal smoothing unit
540 Reverberation signal power calculation unit
800 Speech recognition apparatus

Claims (10)

A voice section detection apparatus comprising:
a reverberation estimation unit that estimates a reverberation component contained in a speech digital signal and obtains a reverberation signal; and
a speaker-specific section detection unit that detects at least one of a main-speaker speech section and a non-main-speaker sound section based on the reverberation signal,
wherein the speaker-specific section detection unit compares the obtained reverberation signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
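Outside the claim language, the comparison in claim 1 can be sketched as a per-frame threshold test. Treating a small reverberation value as "main speaker" (a close talker producing relatively little reverberation) is an illustrative assumption; the claim itself specifies only a comparison with a predetermined value.

```python
# Illustrative sketch of the claim-1 comparison: label each frame by testing
# the estimated reverberation value against a predetermined threshold.
# The low-reverberation-means-main-speaker reading and the threshold value
# are assumptions, not taken from the patent.
def detect_sections(reverb, threshold):
    # True = main-speaker speech frame, False = non-main-speaker sound frame
    return [abs(r) < threshold for r in reverb]

reverb = [0.01, 0.02, 0.5, 0.6, 0.03]
assert detect_sections(reverb, threshold=0.1) == [True, True, False, False, True]
```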
A voice section detection apparatus comprising:
a reverberation estimation unit that estimates a reverberation component contained in a speech digital signal and obtains a reverberation signal;
a speaker-specific section detection unit that detects at least one of a main-speaker speech section and a non-main-speaker sound section based on the reverberation signal; and
a reverberation signal power calculation unit that calculates the power of the reverberation signal,
wherein the speaker-specific section detection unit compares the power of the reverberation signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
The voice section detection apparatus according to claim 1, further comprising
a speech signal power calculation unit that calculates the power of the speech digital signal,
wherein the speaker-specific section detection unit compares a value obtained by normalizing the reverberation signal by the power of the speech digital signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
The voice section detection apparatus according to claim 2, further comprising
a speech signal power calculation unit that calculates the power of the speech digital signal,
wherein the speaker-specific section detection unit compares the power of the reverberation signal normalized by the power of the speech digital signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
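The normalization in claims 3 and 4 can be sketched as computing a reverberation-to-signal power ratio per frame before the threshold comparison. The frame contents and the epsilon guard against division by zero are illustrative assumptions.

```python
# Illustrative sketch of the claims 3-4 normalization: the reverberation
# power is divided by the power of the input speech digital signal, so the
# threshold is applied to a relative level rather than an absolute one.
def frame_power(frame):
    return sum(x * x for x in frame) / len(frame)

def normalized_reverb_power(signal_frame, reverb_frame, eps=1e-12):
    return frame_power(reverb_frame) / (frame_power(signal_frame) + eps)

sig = [1.0, -1.0, 1.0, -1.0]   # power 1.0
rev = [0.5, -0.5, 0.5, -0.5]   # power 0.25
assert abs(normalized_reverb_power(sig, rev) - 0.25) < 1e-6
```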
The voice section detection apparatus according to claim 1 or 3,
wherein the speaker-specific section detection unit further includes
a signal smoothing unit that smooths the speech digital signal, or the power of the reverberation signal, or a signal obtained by normalizing the power of the reverberation signal by the power of the speech digital signal, and
detects at least one of the main-speaker speech section and the non-main-speaker sound section from the smoothed signal.
The voice section detection apparatus according to claim 2 or 4,
wherein the speaker-specific section detection unit further includes
a signal smoothing unit that smooths the speech digital signal, or the power of the difference signal between the speech digital signal and the reverberation signal, or a signal obtained by normalizing the power of the difference signal between the speech digital signal and the reverberation signal by the power of the speech digital signal, and
detects at least one of the main-speaker speech section and the non-main-speaker sound section from the smoothed signal.
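The smoothing in claims 5 and 6 can be sketched with a centered moving average standing in for whatever smoother the signal smoothing unit actually uses; the window length is an assumption.

```python
# Illustrative sketch of the claims 5-6 smoothing step: a centered moving
# average over the per-frame feature suppresses spurious one-frame flips
# before the threshold comparison. Window length is an assumption.
def smooth(values, window=3):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

assert smooth([0.0, 1.0, 0.0], window=3) == [0.5, 1.0 / 3.0, 0.5]
```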
The voice section detection apparatus according to any one of claims 2, 4, and 6, further comprising
a gain adjustment unit that multiplies the reverberation signal by a gain.
A voice section detection method comprising:
a reverberation estimation step of estimating a reverberation component contained in a speech digital signal and obtaining a reverberation signal; and
a speaker-specific section detection step of detecting at least one of a main-speaker speech section and a non-main-speaker sound section based on the reverberation signal,
wherein the speaker-specific section detection step compares the obtained reverberation signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
A voice section detection method comprising:
a reverberation estimation step of estimating a reverberation component contained in a speech digital signal and obtaining a reverberation signal;
a speaker-specific section detection step of detecting at least one of a main-speaker speech section and a non-main-speaker sound section based on the reverberation signal; and
a reverberation signal power calculation step of calculating the power of the reverberation signal,
wherein the speaker-specific section detection step compares the power of the reverberation signal with a predetermined value to detect at least one of the main-speaker speech section and the non-main-speaker sound section.
A program for causing a computer to function as the voice section detection apparatus according to any one of claims 1 to 7.
JP2013175584A 2013-02-21 2013-08-27 Voice section detection apparatus, method and program Active JP6125953B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013175584A JP6125953B2 (en) 2013-02-21 2013-08-27 Voice section detection apparatus, method and program

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2013032133 2013-02-21
JP2013032133 2013-02-21
JP2013175584A JP6125953B2 (en) 2013-02-21 2013-08-27 Voice section detection apparatus, method and program

Publications (2)

Publication Number Publication Date
JP2014186295A JP2014186295A (en) 2014-10-02
JP6125953B2 true JP6125953B2 (en) 2017-05-10

Family

ID=51833893

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013175584A Active JP6125953B2 (en) 2013-02-21 2013-08-27 Voice section detection apparatus, method and program

Country Status (1)

Country Link
JP (1) JP6125953B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019016938A1 (en) * 2017-07-21 2019-01-24 三菱電機株式会社 Speech recognition device and speech recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5519689B2 (en) * 2009-10-21 2014-06-11 パナソニック株式会社 Sound processing apparatus, sound processing method, and hearing aid

Also Published As

Publication number Publication date
JP2014186295A (en) 2014-10-02

Similar Documents

Publication Publication Date Title
US8775173B2 (en) Erroneous detection determination device, erroneous detection determination method, and storage medium storing erroneous detection determination program
JP4774100B2 (en) Reverberation removal apparatus, dereverberation removal method, dereverberation removal program, and recording medium
JPH09212196A (en) Noise suppressor
CN106558315B (en) Heterogeneous microphone automatic gain calibration method and system
US9378755B2 - Detecting a user's voice activity using dynamic probabilistic models of speech features
Cohen et al. Spectral enhancement methods
Cohen Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation
JP2006215568A (en) Speech enhancement apparatus and method and computer-readable medium having program recorded thereon
JP6464005B2 (en) Noise suppression speech recognition apparatus and program thereof
JP2009271359A (en) Processing unit, speech recognition apparatus, speech recognition system, speech recognition method, and speech recognition program
CN112053702B (en) Voice processing method and device and electronic equipment
CN112002307B (en) Voice recognition method and device
Tu et al. Fast distributed multichannel speech enhancement using novel frequency domain estimators of magnitude-squared spectrum
JP6125953B2 (en) Voice section detection apparatus, method and program
JP6106618B2 (en) Speech section detection device, speech recognition device, method thereof, and program
JP6439174B2 (en) Speech enhancement device and speech enhancement method
JP4612468B2 (en) Signal extraction device
JP6653687B2 (en) Acoustic signal processing device, method and program
JP6633579B2 (en) Acoustic signal processing device, method and program
JP6599408B2 (en) Acoustic signal processing apparatus, method, and program
CN111226278B (en) Low complexity voiced speech detection and pitch estimation
JP2020018015A (en) Acoustic signal processing device, method and program
JP2016080767A (en) Frequency component extraction device, frequency component extraction method and frequency component extraction program
WO2023228785A1 (en) Acoustic signal processing device, acoustic signal processing method, and program
CN116504264B (en) Audio processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20150731

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20160825

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20160927

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20161102

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20170404

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20170406

R150 Certificate of patent or registration of utility model

Ref document number: 6125953

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150