JP2013125085A

JP2013125085A - Target sound extraction device and target sound extraction program

Info

Publication number: JP2013125085A
Application number: JP2011272620A
Authority: JP
Inventors: Katsuyuki Takahashi; 克之高橋
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2011-12-13
Filing date: 2011-12-13
Publication date: 2013-06-24
Anticipated expiration: 2031-12-13
Also published as: JP5772562B2

Abstract

PROBLEM TO BE SOLVED: When a target sound detection result value is subjected to long-term averaging to keep small-amplitude portions of a target sound section from being dropped during voice switch processing, to also prevent the loss of target sound caused by differences in utterance speed, which long-term averaging alone cannot cover, and thereby further reduce sound interruption.

SOLUTION: A target sound extraction device calculates a coherence value from a plurality of directivity signals, each having a blind spot in a predetermined direction. It determines from the coherence value whether the current frame corresponds to a target sound section, and computes a long-term average of the resulting detection result value by weighted averaging of the current frame's detection result value and a value based on the detection result values of past frames. It then controls the gain applied to the input signal on the basis of that long-term average. The device comprises weighting coefficient control means that controls the weighting coefficient used in the weighted averaging according to the utterance speed of the target sound, estimated from the coherence value.

Description

The present invention relates to a target sound extraction device and a target sound extraction program, and can be applied, for example, to voice communication devices used for telephone calls, video conferences, and similar voice communication.

One technique for extracting a desired voice from an input signal is called a voice switch. A target speech section detection function detects the sections of the input signal in which the speaker is talking (target speech sections); a target speech section is output unprocessed, while the amplitude of a non-target speech section is attenuated.

FIG. 2 is a flowchart showing voice switch processing. In FIG. 2, when the input signal input is received (S901), the target speech section detection unit determines whether or not it is a target speech section (S902).

If input is a target speech section, the voice switch gain VS_GAIN is set to 1.0 (S903); if input is a non-target speech section, VS_GAIN is set to α (an arbitrary value satisfying 0.0 ≦ α < 1.0) (S904). VS_GAIN is then multiplied by input to obtain the output signal output (S905).
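The flow of steps S903 to S905 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `voice_switch`, the frame-as-list representation, and the default value of `alpha` are assumptions made for the example.

```python
def voice_switch(frame, is_target, alpha=0.1):
    """Apply the voice switch gain to one frame of samples.

    frame     : list of float samples (the signal called "input" in the text)
    is_target : result of the target speech section decision (S902)
    alpha     : attenuation gain for non-target sections, 0.0 <= alpha < 1.0
                (the default 0.1 is an illustrative assumption)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0.0 <= alpha < 1.0")
    vs_gain = 1.0 if is_target else alpha   # S903 / S904
    return [vs_gain * x for x in frame]     # S905


frame = [0.5, -0.25, 1.0]
passed = voice_switch(frame, is_target=True)
attenuated = voice_switch(frame, is_target=False, alpha=0.5)
```

A target frame passes through unchanged, while a non-target frame is scaled by `alpha`.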

This voice switch processing can be applied to voice communication equipment such as video conference devices and mobile phones. It suppresses non-target speech sections (noise) and improves call quality.

Non-target speech can be divided into “interfering speech”, which is the voice of a person other than the speaker, and “background noise”, such as office noise and road noise.

When a non-target speech section contains only background noise, the target speech section detection unit can determine accurately whether or not a section is a target speech section. When interfering speech is superimposed on a non-target speech section, however, the detection unit treats the interfering speech as target speech as well, so erroneous determinations can occur. As a result, the voice switch cannot suppress the interfering speech and cannot provide sufficient call quality.

This problem can be mitigated by changing the feature quantity referenced by the target speech section detection unit from the input signal level fluctuation used so far to coherence.

Put simply, coherence is a feature quantity that reflects the arrival direction of the input signal. When use of a mobile phone or the like is assumed, the speaker's voice (target speech) arrives from the front, whereas interfering speech tends strongly to arrive from directions other than the front. Focusing on the arrival direction therefore makes it possible to distinguish target speech from interfering speech, which was not possible before.

FIG. 3 is a block diagram showing the functional configuration of a voice switch 90A that uses coherence for the target speech detection function.

In FIG. 3, input signals s1(n) and s2(n) are supplied to the FFT unit 91 from microphones m1 and m2, respectively, via AD converters (not shown).

Here, n is an index representing the input order of the samples and is a positive integer. In this text, a smaller n denotes an older input sample and a larger n a newer one.

The FFT unit 91 receives the input signal sequences s1 and s2 from microphones m1 and m2 and applies a fast Fourier transform (or discrete Fourier transform) to them, so that the input signals s1 and s2 can be expressed in the frequency domain. Before the fast Fourier transform is performed, analysis frames FRAME1(K) and FRAME2(K), each consisting of a predetermined number N of samples, are formed from the input signals s1(n) and s2(n). An example of forming FRAME1 from the input signal s1 is given below.

FRAME1(1) = {s1(1), s1(2), ..., s1(i), ..., s1(N)}
  :
  :
FRAME1(K) = {s1(N×(K-1)+1), s1(N×(K-1)+2), ..., s1(N×(K-1)+i), ..., s1(N×(K-1)+N)}

K is an index representing the order of the frames and is a positive integer. In this text, a smaller K denotes an older analysis frame and a larger K a newer one. In the following description of the operation, unless otherwise noted, the index of the latest analysis frame to be analyzed is K.
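The framing step can be sketched as below. The sketch assumes consecutive, non-overlapping frames in which FRAME1(1) starts at s1(1), which is how the FRAME1(1) example reads; the function name `make_frames` and the choice to drop a trailing partial frame are illustrative assumptions.

```python
def make_frames(samples, n):
    """Split a sample sequence into consecutive, non-overlapping frames.

    With 1-based indices as in the text, frame K holds samples
    N*(K-1)+1 .. N*(K-1)+N. A trailing partial frame is dropped
    (an assumption; the text does not say how it is handled).
    """
    count = len(samples) // n
    return [samples[k * n:(k + 1) * n] for k in range(count)]


s1 = list(range(1, 11))      # stands in for s1(1) .. s1(10)
frames = make_frames(s1, 4)  # N = 4
```

Here `frames[0]` corresponds to FRAME1(1) and `frames[1]` to FRAME1(2); the last two samples do not fill a frame and are left out.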

The FFT unit 91 applies the fast Fourier transform to each analysis frame. It supplies the frequency domain signal X1(f, K), obtained by Fourier transforming the analysis frame FRAME1(K) formed from the input signal s1, and the frequency domain signal X2(f, K), obtained by Fourier transforming the analysis frame FRAME2(K) formed from the input signal s2, to the first directivity forming unit 92 and the second directivity forming unit 93. Here, f is an index representing frequency. Note that X1(f, K) is not a single value but consists of the spectral components of a plurality of frequencies f1 to fm:

X1(f, K) = {X1(f1, K), X1(f2, K), ..., X1(fi, K), ..., X1(fm, K)}

The same applies to X2(f, K) and to B1(f, K) and B2(f, K), which appear in the directivity forming units in the subsequent stage.
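To make concrete that X1(f, K) is a set of per-frequency complex values, the following sketch transforms one frame with a naive discrete Fourier transform. It uses only the standard library; a real implementation would use an FFT routine, which yields the same values faster. The function name `dft` and the test frame are assumptions for illustration.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one analysis frame.

    Returns one complex spectral component per frequency index f.
    """
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * i / n) for i, x in enumerate(frame))
        for f in range(n)
    ]


frame = [1.0, 0.0, -1.0, 0.0]  # one cycle of a cosine over 4 samples
X1 = dft(frame)                # list of spectral components for this frame
```

For this frame the energy falls into frequency indices 1 and 3 (the mirrored bin), and indices 0 and 2 are zero up to rounding.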

The first directivity forming unit 92 receives the frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 91, forms a signal B1(f, K) having strong directivity in a specific direction, and supplies B1(f, K) to the coherence calculation unit 94.

The second directivity forming unit 93 receives the frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 91, forms a signal B2(f, K) having strong directivity in a specific direction, and supplies B2(f, K) to the coherence calculation unit 94.

The first directivity forming unit 92 and the second directivity forming unit 93 can use an existing method to form signals with strong directivity in a specific direction; for example, the signals can be obtained by calculation according to equations (1) and (2).

The first directivity forming unit 92 performs the calculation of equation (1) to obtain a signal B1(f, K) that, as described later, has strong directivity in a specific direction (rightward) relative to the sound source direction. The second directivity forming unit 93 performs the calculation of equation (2) to obtain a signal B2(f, K) that has strong directivity in a specific direction (leftward). The frame index K does not take part in the calculation and is therefore omitted from the equations.

The meaning of equations (1) and (2) is explained with reference to FIGS. 4 and 5. In FIG. 4(A), microphones m1 and m2 are placed a distance l apart. A sound wave arrives at microphones m1 and m2 from the direction at angle θ relative to the front direction of the plane passing through the two microphones.

A time difference therefore arises between the arrival of the sound wave at microphone m1 and at microphone m2. With d denoting the difference in path length, d = l × sin θ, so the arrival time difference τ is given by equation (2-1).

τ = l × sin θ / c (c: speed of sound) ... (2-1)

The signal s1(n - τ), obtained by delaying the input signal s1(n) by the arrival time difference τ, can be regarded as identical to s2(n).
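Equation (2-1) can be evaluated directly. In the sketch below, the microphone spacing of 0.034 m and the speed of sound of 340 m/s are assumed values chosen for illustration; neither appears in the text.

```python
import math


def arrival_time_difference(l, theta_deg, c=340.0):
    """Equation (2-1): tau = l * sin(theta) / c.

    l         : microphone spacing in metres
    theta_deg : arrival angle from the front direction, in degrees
    c         : speed of sound in m/s (340.0 is an assumed value)
    """
    return l * math.sin(math.radians(theta_deg)) / c


tau_front = arrival_time_difference(0.034, 0.0)   # sound from the front
tau_side = arrival_time_difference(0.034, 90.0)   # sound from the side
```

Sound from the front reaches both microphones at the same time, while sound from 90 degrees arrives about 0.1 ms apart under these assumed values.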

Accordingly, the difference signal y(n) = s2(n) - s1(n - τ) is a signal from which sound arriving from the θ direction has been removed. As a result, the microphone array has the directivity shown in FIG. 4(B).
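The cancellation described above can be checked with a small time-domain sketch. It assumes the delay is a whole number of samples and treats samples before the start of the signal as zero; both are simplifications for the example, and the function name `delay_and_subtract` is hypothetical.

```python
def delay_and_subtract(s1, s2, delay):
    """Compute y(n) = s2(n) - s1(n - delay) for an integer sample delay.

    Samples of s1 before the start of the sequence are treated as 0.0
    (an assumption made for this illustration).
    """
    out = []
    for n in range(len(s2)):
        delayed = s1[n - delay] if n - delay >= 0 else 0.0
        out.append(s2[n] - delayed)
    return out


source = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.0]
delay = 2                                 # arrival time difference in samples
s1 = source                               # microphone m1
s2 = [0.0] * delay + source[:-delay]      # microphone m2: s2(n) = s1(n - delay)

y_matched = delay_and_subtract(s1, s2, delay)   # steered at the true direction
y_unmatched = delay_and_subtract(s1, s2, 0)     # steered elsewhere
```

When the applied delay matches the true arrival time difference, the output is zero everywhere, which is the blind spot in that direction. With a mismatched delay the sound is not removed.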

The description above uses time-domain calculation, but the same effect is obtained in the frequency domain. Equations (1) and (2) are examples of the corresponding frequency-domain expressions.

When the arrival direction is θ = 90 degrees, the directivity is as shown in FIGS. 5(A) and 5(B). The forward, backward, right, and left directions are defined as shown in FIG. 5. The directivity formed by the first directivity forming unit 92 is then strong in the left direction, as shown in FIG. 5(A), and that formed by the second directivity forming unit 93 is strong in the right direction, as shown in FIG. 5(B).

For convenience, the following description assumes θ = 90 degrees, but the present invention is not limited to this setting.

The signals B1(f, K) and B2(f, K) obtained in this way are supplied to the coherence calculation unit 94, which obtains the coherence COH by performing the calculations of equations (3) and (4). Here too, the frame index K does not take part in the calculation and is omitted from the equations.
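Equations (3) and (4) are not reproduced in this text. The sketch below is therefore only an assumed form, not the patent's formula: it computes a per-frequency normalized cross-correlation of B1 and B2 and averages it over all frequency components. The function name `coherence` and the handling of zero-power bins are likewise assumptions.

```python
def coherence(b1, b2):
    """Assumed coherence measure between two directivity signals.

    Per frequency: |B1(f) * conj(B2(f))| / (0.5 * (|B1(f)|^2 + |B2(f)|^2)),
    then averaged over all frequencies. This form is an assumption;
    the source text does not reproduce equations (3) and (4).
    """
    if not b1 or len(b1) != len(b2):
        raise ValueError("b1 and b2 must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(b1, b2):
        power = 0.5 * (abs(x) ** 2 + abs(y) ** 2)
        if power > 0.0:            # bins with no energy contribute 0 (assumption)
            total += abs(x * y.conjugate()) / power
    return total / len(b1)


spectrum = [1 + 1j, 2 - 1j, -0.5j]
coh_same = coherence(spectrum, spectrum)             # identical signals
coh_one_sided = coherence(spectrum, [0j, 0j, 0j])    # energy on one side only
```

Under this assumed form, two identical signals give a value of about 1, and a signal present in only one of the two beams gives 0, which matches the qualitative behavior the text attributes to COH.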

Next, the target speech section detection unit 95 compares the coherence COH(K) with a target speech section determination threshold Θ. If COH(K) is larger than Θ, the frame is regarded as a target speech section and 1.0 is assigned to the detection result storage variable VAD_RES(K). If COH(K) is smaller than Θ, the frame is regarded as a non-target speech section (interfering speech or background noise) and 0.0 is assigned to VAD_RES(K).

The gain control unit 96 then sets the gain VS_GAIN to 1.0 if VAD_RES(K) = 1.0, and to an arbitrary positive value α less than 1.0 if VAD_RES(K) = 0.0.
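The threshold decision and the gain selection can be sketched together as follows. The function names, the threshold value 0.5, and the attenuation 0.2 are illustrative assumptions. The text does not say what happens when COH(K) equals Θ exactly; this sketch treats that case as non-target.

```python
def detect_target(coh, theta):
    """Return VAD_RES: 1.0 if coh exceeds the threshold theta, else 0.0.

    coh == theta is treated as non-target (an assumption; the text
    only covers the strictly larger and strictly smaller cases).
    """
    return 1.0 if coh > theta else 0.0


def gain_from_vad(vad_res, alpha):
    """Return VS_GAIN: 1.0 for a target frame, alpha otherwise."""
    return 1.0 if vad_res == 1.0 else alpha


coh_per_frame = [0.1, 0.7, 0.9, 0.3]   # hypothetical COH(K) values
gains = [gain_from_vad(detect_target(c, theta=0.5), alpha=0.2)
         for c in coh_per_frame]
```

Frames whose coherence is above the threshold keep a gain of 1.0, and the others are attenuated to `alpha`.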

The reason the target speech section can be detected from the magnitude of the coherence is outlined briefly here. Coherence can be thought of as the correlation between the signal arriving from the right and the signal arriving from the left of the front direction.

A small coherence COH therefore means that the correlation between signals B1 and B2 is small, and a large COH means that the correlation between B1 and B2 is large.

The correlation of the input signal is small either when the arrival direction is strongly biased to the right or to the left, or when there is no bias but the signal has little clear regularity, as with background noise.

A section in which the coherence COH is small can therefore be regarded as an interfering speech section or a background noise section (a non-target speech section).

When COH is large, on the other hand, there is no bias in the arrival direction, so the input signal can be regarded as arriving from the front. Since the target speech is assumed to arrive from the front, a section with large COH can be regarded as a target speech section.

The VS_GAIN obtained in this way is multiplied by the signal s1(n) in the voice switch gain multiplication unit 97 to obtain the output signal y(n).

In the configuration of FIG. 3, however, the coherence COH becomes small in small-amplitude sections such as the onset of speech, because even when target speech is present there is no clear pitch structure and correlation is hard to obtain.

As a result, as illustrated in FIG. 6(A), even target speech is misjudged as interfering speech in the small-amplitude section at its onset and attenuated by the voice switch processing. Parts of the speech are lost, the output sounds choppy, and the sound quality becomes unnatural.

To address this problem, there is a voice switch 90B, illustrated in FIG. 7, that has a detection result long-term averaging unit 98 which applies long-term averaging to the target speech section detection result.

In the voice switch 90B of FIG. 7, the detection result long-term averaging unit 98 applies long-term averaging to the detection result storage variable VAD_RES, and the voice switch is controlled according to whether or not the averaged value is larger than a voice switch operation determination threshold. This suppresses the loss of small-amplitude portions of the target sound.

For example, the detection result long-term averaging unit 98 obtains the long-term average VAD_RES_LONG(K) of the detection result using the expression given as equation (5). The gain control unit 99 then compares VAD_RES_LONG(K) with the voice switch operation determination threshold Ψ. If VAD_RES_LONG(K) < Ψ, it sets the voice switch gain VS_GAIN = α (0.0 ≦ α < 1.0); otherwise it sets VS_GAIN = 1.0.

Because the voice switch operates after the fluctuation of VAD_RES in the small-amplitude portions of the target speech has been smoothed, the loss of those portions can be suppressed, as shown in FIG. 6(B).

The long-term average parameter δ satisfies 0.0 < δ < 1.0. The meaning of equation (5) is as follows. Equation (5) computes the weighted average of the determination value VAD_RES(K) for the input speech in the current frame (the K-th frame counted from the start of operation) and the long-term average VAD_RES_LONG(K-1) obtained in the preceding frame. The magnitude of δ adjusts how much the instantaneous value VAD_RES(K) contributes to the average.

If δ is set to a small value close to 0, the instantaneous value contributes little to the average, so fluctuations of VAD_RES are suppressed. If δ is close to 1, the instantaneous value contributes more, so the effect of long-term averaging is weakened.
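Equation (5) is not reproduced in this text. The description of δ suggests a recursion of the form VAD_RES_LONG(K) = δ × VAD_RES(K) + (1 - δ) × VAD_RES_LONG(K-1), and the sketch below assumes that form. The function names, the initial value, and the values of δ, Ψ, and α are illustrative assumptions.

```python
def long_term_average(vad_res, delta, initial=0.0):
    """Assumed form of equation (5), applied frame by frame:

        VAD_RES_LONG(K) = delta * VAD_RES(K) + (1 - delta) * VAD_RES_LONG(K-1)

    delta   : long-term average parameter, 0.0 < delta < 1.0
    initial : value used as VAD_RES_LONG before the first frame (assumption)
    """
    out = []
    prev = initial
    for v in vad_res:
        prev = delta * v + (1.0 - delta) * prev
        out.append(prev)
    return out


def gain_from_average(vad_res_long, psi, alpha):
    """Return alpha if the long-term average is below psi, else 1.0."""
    return alpha if vad_res_long < psi else 1.0


vad_res = [1.0, 1.0, 1.0, 0.0, 1.0]   # one frame misjudged as non-target
smoothed = long_term_average(vad_res, delta=0.5, initial=1.0)
gains = [gain_from_average(v, psi=0.4, alpha=0.2) for v in smoothed]
```

The single misjudged frame only lowers the average to 0.5, which stays above the threshold of 0.4, so the frame is no longer attenuated. Without the averaging, that frame would have received the gain `alpha`.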

JP 2006-197552 A
Japanese translation of PCT publication 2010-532879

Because coherence represents the correlation of the input signals, its behavior differs between consonants and vowels even within a section of arriving speech.

For example, when “sa” is uttered, the waveform of the consonant part “s” has low regularity, so the coherence is small, while the waveform of the vowel part “a” has high regularity, so the coherence is large.

When the utterance speed changes, it is the length of the vowel part, not the consonant part, that changes. For example, when “sa” is uttered slowly, the vowel part “a” becomes longer rather than the consonant part “s”; when it is uttered quickly, the vowel part “a” becomes shorter.

When the utterance speed is slow, even if a small-amplitude portion such as a consonant is misjudged as non-target speech, the large-amplitude vowel portion occupies a large share of the speech section. The misjudgment contributes little to the long-term average of the detection result, so small-amplitude portions are unlikely to be lost.

When the utterance speed is fast, however, the share of the large-amplitude vowel portion in the speech section falls and the misjudgments in the small-amplitude portions contribute more to the long-term average. The fluctuation of VAD_RES can no longer be reduced sufficiently, and small-amplitude portions are lost.
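The effect of utterance speed on a fixed δ can be illustrated with a toy simulation. Everything here is a simplifying assumption: every consonant frame is misjudged (VAD_RES = 0.0), every vowel frame is detected (VAD_RES = 1.0), the consonant always lasts 3 frames, and the averaging follows the weighted form described for equation (5). The frame counts and the values of δ and Ψ are made up for the example.

```python
def long_term_average(vad_res, delta):
    """Weighted average of the current value and the previous average."""
    out, prev = [], 0.0
    for v in vad_res:
        prev = delta * v + (1.0 - delta) * prev
        out.append(prev)
    return out


def syllables(consonant_frames, vowel_frames, count):
    """VAD_RES pattern for repeated syllables: consonant (0.0) then vowel (1.0)."""
    return ([0.0] * consonant_frames + [1.0] * vowel_frames) * count


def steady_minimum(vad_res, delta, tail):
    """Lowest long-term average over the last `tail` frames."""
    return min(long_term_average(vad_res, delta)[-tail:])


DELTA, PSI = 0.2, 0.4                 # illustrative parameter values
slow = syllables(3, 12, 20)           # long vowels: slow speech
fast = syllables(3, 3, 20)            # short vowels: fast speech
slow_min = steady_minimum(slow, DELTA, tail=5 * 15)
fast_min = steady_minimum(fast, DELTA, tail=5 * 6)
```

With these values the long-term average for slow speech never falls below roughly 0.49, so it stays above Ψ, while for fast speech it dips to roughly 0.34 at the end of each consonant and the voice switch would attenuate those frames.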

The conventional voice switch illustrated in FIG. 7 thus has the problem that, depending on the utterance speed, small-amplitude portions of the target speech are lost even when the detection result is averaged over the long term.

There is therefore a need for a target sound extraction device and a target sound extraction program that, when using long-term averaging to prevent the loss of small-amplitude portions of the target speech, also prevent the loss of speech sections that can arise from differences in utterance speed and so reduce speech interruption.

To solve this problem, the first aspect of the present invention is a target sound extraction device comprising:
(1) frequency analysis means for converting an input signal from the time domain to the frequency domain;
(2) directivity forming means for forming, from the signals obtained by the frequency analysis means, a plurality of signals each having directivity with a blind spot in a predetermined direction;
(3) coherence calculation means for obtaining a coherence value from the plurality of directivity signals formed by the directivity forming means;
(4) target sound determination means for determining, from the coherence value obtained by the coherence calculation means, whether or not target sound is included, and outputting a detection result value corresponding to the determination;
(5) long-term average processing means for applying weighted averaging to the detection result value calculated for the current input frame by the target sound determination means and past detection result values, to obtain a long-term average of the detection result value for the current input frame;
(6) utterance speed detection means for detecting, from the coherence value obtained by the coherence calculation means, the utterance speed of the target sound included in the input signal;
(7) weighting coefficient control means for controlling the weighting coefficient used in the weighted averaging of the long-term average processing means according to the utterance speed detected by the utterance speed detection means;
(8) gain control means for controlling the gain applied to the input signal on the basis of the long-term average of the detection result value for the current input frame obtained by the long-term average processing means; and
(9) multiplication means for multiplying the input signal by the gain controlled by the gain control means.

The second aspect of the present invention is a target sound extraction program that causes a computer to function as:
(1) frequency analysis means for converting an input signal from the time domain to the frequency domain;
(2) directivity forming means for forming, from the signals obtained by the frequency analysis means, a plurality of signals each having directivity with a blind spot in a predetermined direction;
(3) coherence calculation means for obtaining a coherence value from the plurality of directivity signals formed by the directivity forming means;
(4) target sound determination means for determining, from the coherence value obtained by the coherence calculation means, whether or not target sound is included, and outputting a detection result value corresponding to the determination;
(5) long-term average processing means for applying weighted averaging to the detection result value calculated for the current input frame by the target sound determination means and past detection result values, to obtain a long-term average for the current input frame;
(6) utterance speed detection means for detecting, from the coherence value obtained by the coherence calculation means, the utterance speed of the target sound included in the input signal;
(7) weighting coefficient control means for controlling the weighting coefficient used in the weighted averaging of the long-term average processing means according to the utterance speed detected by the utterance speed detection means;
(8) gain control means for controlling the gain applied to the input signal on the basis of the long-term average of the detection result value for the current input frame obtained by the long-term average processing means; and
(9) multiplication means for multiplying the input signal by the gain controlled by the gain control means.

According to the present invention, when the loss of small-amplitude portions in a speech section of the target speech is to be prevented, the long-term average parameter used in the long-term averaging is controlled according to the utterance speed. This prevents the loss of speech sections that can arise from differences in utterance speed and further reduces speech interruption.
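One way to control the long-term average parameter from the utterance speed is a table lookup. The sketch below is hypothetical throughout: the speed bands, the units, the δ values, and the function name `delta_for_rate` are all invented for illustration. The direction of the mapping (faster speech gets a smaller δ, that is, stronger smoothing) is an inference from the problem described above, not something this section states.

```python
import bisect

# Hypothetical correspondence table. RATE_BOUNDS are the upper edges of the
# utterance speed bands (units are illustrative), and DELTAS holds one
# long-term average parameter per band.
RATE_BOUNDS = [4.0, 7.0]
DELTAS = [0.3, 0.2, 0.1]   # slower speech -> larger delta (assumed direction)


def delta_for_rate(rate):
    """Look up the long-term average parameter for an utterance speed."""
    return DELTAS[bisect.bisect_right(RATE_BOUNDS, rate)]


delta_slow = delta_for_rate(3.0)
delta_medium = delta_for_rate(5.0)
delta_fast = delta_for_rate(9.0)
```

The selected δ would then be used as the weight in the long-term averaging for the current frame. A speed exactly on a band edge falls into the faster band in this sketch.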

FIG. 1 is a configuration diagram showing the configuration of the voice switch of the first embodiment.
FIG. 2 is a flowchart showing conventional voice switch processing.
FIG. 3 is a configuration diagram showing the configuration of a conventional voice switch that uses coherence for the target speech detection function.
FIG. 4 is an explanatory diagram showing how sound waves arrive at microphones m1 and m2.
FIG. 5 is an explanatory diagram showing the directivity produced by the first directivity forming unit and the second directivity forming unit.
FIG. 6 is an explanatory diagram showing how a target speech section is lost when it is misjudged as non-target speech.
FIG. 7 is a configuration diagram showing the configuration of a conventional voice switch that prevents the loss of small-amplitude portions by long-term averaging of the target speech detection result.
FIG. 8 is an internal configuration diagram showing the detailed internal configuration of the long-term average parameter control unit of the first embodiment.
FIG. 9 is an explanatory diagram showing the correspondence table that associates the utterance speed v(K) with the long-term average parameter δ in the first embodiment.
FIG. 10 is a configuration diagram showing the configuration of the voice switch of the second embodiment.
FIG. 11 is a configuration diagram showing the configuration of the voice switch of the third embodiment.
FIG. 12 is an explanatory diagram showing the directivity produced by the third directivity forming unit of the third embodiment.
FIG. 13 is a configuration diagram showing the configuration of the voice switch of the fourth embodiment.
FIG. 14 is a configuration diagram showing the configuration of the voice switch of the fifth embodiment.

(A) First Embodiment
A first embodiment of the target sound extraction device and target sound extraction program of the present invention is described in detail below with reference to the drawings.

The first embodiment illustrates a case in which the present invention is applied to a voice switch.

(A-1) Configuration of the First Embodiment
(A-1-1) Overall Configuration
FIG. 1 is a block diagram showing the configuration of the voice switch 100A of the first embodiment. The voice switch 100A has, for example, a CPU, a ROM, a RAM, an EEPROM, and an input/output interface, and its functions are realized by the CPU executing a processing program stored in the ROM. The target sound extraction program may also be installed over a network; in that case as well, it constitutes the components shown in FIG. 1.

In FIG. 1, the voice switch 100A of the first embodiment has at least the microphones m1 and m2, an FFT unit 101, a first directivity forming unit 102, a second directivity forming unit 103, a coherence calculation unit 104, an utterance speed detection unit 105, a long-term average parameter control unit 106, a target speech section detection unit 107, a detection result long-term average unit 108, a gain control unit 109, and a voice switch gain multiplication unit 110.

The microphones m1 and m2 capture incoming sound waves, convert them into audio signals, and supply the audio signals to the FFT unit 101. Although not shown in FIG. 1, an AD conversion unit is provided between the microphones m1 and m2 and the FFT unit 101. The AD conversion unit converts the analog audio signals of the microphones m1 and m2 into digital signals and supplies the signal sequences s1(n) and s2(n) to the FFT unit 101, where n indicates the input order of the samples.

The FFT unit 101 receives the input signals s1 and s2 from the microphones m1 and m2 and applies a fast Fourier transform (or a discrete Fourier transform) to each frame consisting of a predetermined number of samples. This expresses the input signal sequences s1 and s2 in the frequency domain. The FFT unit 101 supplies the frequency domain signal X1(f, K) obtained from the input signal s1 and the frequency domain signal X2(f, K) obtained from the input signal s2 to the first directivity forming unit 102 and the second directivity forming unit 103.
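The framing and transform step can be sketched as follows. This is only an illustration: the function names are hypothetical, non-overlapping frames without a window are assumed because the text does not specify either, and a naive DFT stands in for the FFT (both give the same spectrum).

```python
import cmath

def split_frames(samples, frame_len):
    """Split a sample sequence into consecutive non-overlapping frames.
    Trailing samples that do not fill a whole frame are dropped (assumption)."""
    n_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]

def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame.
    An FFT computes the same values more efficiently."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

# X1[K][f] plays the role of X1(f, K): spectrum of frame K of signal s1.
s1 = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
X1 = [dft(frame) for frame in split_frames(s1, 4)]
```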

The first directivity forming unit 102 receives the frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 101, forms a signal B1(f, K) having strong directivity in a specific direction, and supplies the signal B1(f, K) to the coherence calculation unit 104.

The second directivity forming unit 103 receives the frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 101, forms a signal B2(f, K) having strong directivity in a specific direction different from that of the first directivity forming unit 102, and supplies the signal B2(f, K) to the coherence calculation unit 104.

As a method by which the first directivity forming unit 102 and the second directivity forming unit 103 form a signal whose directivity has a null (dead angle) in a specific direction, a computation according to equations (1) and (2), for example, can be applied. In that case, the first directivity forming unit 102 performs the computation of equation (1) to form the signal B1(f, K) having strong directivity toward the right, and the second directivity forming unit 103 performs the computation of equation (2) to form the signal B2(f, K) having strong directivity toward the left.
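Equations (1) and (2) are not reproduced in this text. As a rough sketch only, the following assumes the common delay-and-subtract form, in which one spectrum is phase-shifted by a delay of `tau` samples and subtracted from the other so that sound from one direction cancels. The function name, the parameter names, and which call corresponds to the right or left direction are all assumptions.

```python
import cmath

def form_directivity(x_keep, x_delayed, tau, frame_len):
    """Delay-and-subtract beamformer for one frame (assumed form).
    For each bin f: B(f) = x_keep(f) - x_delayed(f) * exp(-j*2*pi*f*tau/frame_len).
    A source whose sound reaches the 'delayed' microphone tau samples earlier
    is cancelled, which places a null in that direction."""
    return [xk - xd * cmath.exp(-2j * cmath.pi * f * tau / frame_len)
            for f, (xk, xd) in enumerate(zip(x_keep, x_delayed))]

# Hypothetical example: a source that reaches microphone 2 one sample earlier.
N = 8
X1 = [1 + 0j] * N
X2 = [cmath.exp(2j * cmath.pi * f / N) for f in range(N)]
B1 = form_directivity(X1, X2, tau=1, frame_len=N)  # source falls in the null
B2 = form_directivity(X2, X1, tau=1, frame_len=N)  # source is not cancelled
```

Swapping the roles of the two spectra, as in the last two lines, gives two signals with nulls in opposite directions, which is the property the coherence calculation relies on.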

The coherence calculation unit 104 obtains the coherence COH(K) based on the signal B1(f, K) acquired from the first directivity forming unit 102 and the signal B2(f, K) acquired from the second directivity forming unit 103. The coherence calculation unit 104 supplies the obtained coherence COH(K) to the utterance speed detection unit 105 and the target speech section detection unit 107.

A wide variety of methods can be applied to the coherence calculation in the coherence calculation unit 104. For example, the coherence calculation unit 104 can obtain the coherence using equations (3) and (4).
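Equations (3) and (4) are not reproduced in this text. The sketch below assumes one widely used definition: a per-bin coherence coefficient |E[B1·B2*]| / sqrt(E[|B1|²]·E[|B2|²]), averaged over all bins to give a single value per frame. The expectations are approximated by recursive smoothing over frames. The class name and the `smoothing` constant are hypothetical.

```python
class CoherenceEstimator:
    """Frame-by-frame coherence between two directional signals (assumed form)."""

    def __init__(self, n_bins, smoothing=0.9):
        self.smoothing = smoothing      # weight of past frames in the expectations
        self.cross = [0j] * n_bins      # smoothed cross-spectrum E[B1 * conj(B2)]
        self.p1 = [0.0] * n_bins        # smoothed power E[|B1|^2]
        self.p2 = [0.0] * n_bins        # smoothed power E[|B2|^2]

    def update(self, b1, b2):
        """Consume one frame of B1(f, K) and B2(f, K) and return COH(K)."""
        a = self.smoothing
        coefs = []
        for f, (u, v) in enumerate(zip(b1, b2)):
            self.cross[f] = a * self.cross[f] + (1 - a) * u * v.conjugate()
            self.p1[f] = a * self.p1[f] + (1 - a) * abs(u) ** 2
            self.p2[f] = a * self.p2[f] + (1 - a) * abs(v) ** 2
            denom = (self.p1[f] * self.p2[f]) ** 0.5
            coefs.append(abs(self.cross[f]) / denom if denom > 0 else 0.0)
        # COH(K) is the mean of the per-bin coefficients.
        return sum(coefs) / len(coefs)
```

With this definition COH(K) lies between 0 and 1: it stays near 1 while the two signals keep a stable phase relationship and falls when that relationship changes from frame to frame.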

The target speech section detection unit 107 receives the coherence COH(K) from the coherence calculation unit 104 and compares it with the target speech section determination threshold Θ. When COH(K) is larger than Θ, the unit determines that the frame belongs to a target speech section; when COH(K) is equal to or smaller than Θ, it determines that the frame belongs to a non-target speech section.

The target speech section detection unit 107 also supplies a detection result variable VAD_RES(K) indicating the determination result to the detection result long-term average unit 108. Specifically, VAD_RES(K) = 1.0 for a target speech section and VAD_RES(K) = 0.0 for a non-target speech section.
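The threshold decision described above can be written as a one-line function. The function and parameter names are hypothetical, and the threshold value in the example is arbitrary.

```python
def detect_target_section(coh, theta):
    """Return VAD_RES(K): 1.0 if COH(K) exceeds the threshold, otherwise 0.0.
    A coherence exactly equal to the threshold counts as non-target speech."""
    return 1.0 if coh > theta else 0.0

vad_res = detect_target_section(coh=0.8, theta=0.5)
```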

The utterance speed detection unit 105 receives from the coherence calculation unit 104 the coherence COH(K) obtained from the current input frame and obtains the utterance speed based on COH(K). The utterance speed detection unit 105 supplies the detected utterance speed v(K) to the long-term average parameter control unit 106.

The long-term average parameter control unit 106 receives the utterance speed v(K) from the utterance speed detection unit 105, obtains the long-term average parameter δ according to v(K), and supplies δ to the detection result long-term average unit 108. The method by which the long-term average parameter control unit 106 controls the long-term average parameter is described in detail later.

The detection result long-term average unit 108 receives the detection result variable VAD_RES(K) from the target speech section detection unit 107 and the long-term average parameter δ from the long-term average parameter control unit 106, applies long-term averaging to the target speech section detection result, and obtains the long-term average value VAD_RES_LONG(K).

The long-term averaging performed by the detection result long-term average unit 108 is not limited to a particular method, and various methods can be applied. For example, the value can be obtained using equation (5).
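Equation (5) is not reproduced in this text. One form that agrees with the description elsewhere in this embodiment (a larger δ lowers the contribution of the current VAD_RES(K) and raises the contribution of the previous VAD_RES_LONG(K−1), with 0.0 < δ < 1.0) is the following recursive weighted average. It is given here as an assumption, not as the published equation.

```latex
\mathrm{VAD\_RES\_LONG}(K) = \delta \cdot \mathrm{VAD\_RES\_LONG}(K-1) + (1 - \delta) \cdot \mathrm{VAD\_RES}(K),
\qquad 0.0 < \delta < 1.0
```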

The gain control unit 109 receives the long-term average value VAD_RES_LONG(K) from the detection result long-term average unit 108 and supplies a gain value VS_GAIN corresponding to VAD_RES_LONG(K) to the voice switch gain multiplication unit 110.

The voice switch gain multiplication unit 110 receives the gain value VS_GAIN from the gain control unit 109, multiplies the input signal s1(n) by VS_GAIN, and outputs the signal y(n).

(A-1-2) Detailed Configuration of the Long-Term Average Parameter Control Unit
FIG. 8 is an internal block diagram showing the detailed internal configuration of the long-term average parameter control unit 106 of the first embodiment.

In FIG. 8, the long-term average parameter control unit 106 of the first embodiment has at least an utterance speed input unit 201, a long-term average parameter matching unit 202, a storage unit 203, and a long-term average parameter output unit 204.

The utterance speed input unit 201 receives the utterance speed v(K) from the utterance speed detection unit 105 and supplies it to the long-term average parameter matching unit 202.

The storage unit 203 stores a correspondence table that associates the utterance speed v(K) with the long-term average parameter δ (0.0 < δ < 1.0).

FIG. 9 is an explanatory diagram of the correspondence table that associates the utterance speed v(K) with the long-term average parameter δ. For example, in FIG. 9, when the utterance speed v(K) detected by the utterance speed detection unit 105 satisfies x ≤ v(K) < w, the long-term average parameter is determined as δ = a. In FIG. 9, the utterance speeds satisfy … < z < y < x < w, and the long-term average parameters satisfy a < b < c < …. That is, the slower the utterance speed v(K), the smaller the long-term average parameter δ, and the faster the utterance speed v(K), the larger the long-term average parameter δ. As a result, the faster the utterance speed v(K), the lower the contribution of VAD_RES for the current target speech section, which reduces the contribution of misjudgments to the long-term average.
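A range-based table lookup of this kind can be sketched with a sorted list of range boundaries. The boundary values, the δ values, and the unit of the utterance speed below are hypothetical placeholders, not values from FIG. 9; the only property taken from the text is that δ lies strictly between 0.0 and 1.0 and grows as the utterance speed grows.

```python
from bisect import bisect_right

# Hypothetical table: boundaries between utterance speed ranges and one delta per range.
SPEED_BOUNDS = [2.0, 4.0, 6.0]
DELTAS = [0.2, 0.4, 0.6, 0.8]  # len(DELTAS) == len(SPEED_BOUNDS) + 1

def lookup_delta(v):
    """Return the delta of the range containing v.
    Each range includes its lower boundary and excludes its upper boundary."""
    return DELTAS[bisect_right(SPEED_BOUNDS, v)]

delta = lookup_delta(4.0)  # falls in the range 4.0 <= v < 6.0
```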

The long-term average parameter matching unit 202 receives the utterance speed v(K) from the utterance speed input unit 201, refers to the correspondence table stored in the storage unit 203, and obtains the long-term average parameter δ (0.0 < δ < 1.0) corresponding to v(K).

In the first embodiment, the long-term average parameter is determined by the long-term average parameter matching unit 202 referring to the correspondence table illustrated in FIG. 9 and obtaining the parameter corresponding to the utterance speed. The determination method is not limited to this.

For example, instead of the correspondence table illustrated in FIG. 9, a reference value of the utterance speed and a reference value of the long-term average parameter at that utterance speed may be set in advance, and the storage unit 203 may store a correspondence table that associates the difference between the reference utterance speed and the input utterance speed with a correction value for the long-term average parameter. The long-term average parameter matching unit 202 then refers to this table, obtains the correction value corresponding to the difference from the reference utterance speed, and obtains the long-term average parameter δ from the correction value and the reference value of the long-term average parameter.
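The reference-plus-correction variant might look like the sketch below. All numeric values are hypothetical, and combining the reference value and the correction value by addition, followed by clamping to keep δ strictly inside (0.0, 1.0), is an assumption; the text only says that both values are used.

```python
from bisect import bisect_right

REF_SPEED = 5.0   # hypothetical reference utterance speed
REF_DELTA = 0.5   # hypothetical long-term average parameter at REF_SPEED

# Hypothetical table: boundaries of the speed difference and one correction per range.
DIFF_BOUNDS = [-3.0, -1.0, 1.0, 3.0]
CORRECTIONS = [-0.3, -0.15, 0.0, 0.15, 0.3]

def delta_with_correction(v):
    """Look up a correction from the difference to the reference speed
    and add it to the reference delta (additive combination is assumed)."""
    diff = v - REF_SPEED
    correction = CORRECTIONS[bisect_right(DIFF_BOUNDS, diff)]
    # Clamp so that the result stays strictly between 0.0 and 1.0.
    return min(0.99, max(0.01, REF_DELTA + correction))
```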

As another example, the storage unit 203 may hold a correspondence table that associates the difference between the reference utterance speed and the input utterance speed with a value of the long-term average parameter, and the long-term average parameter matching unit 202 may refer to this table to obtain the long-term average parameter corresponding to the difference from the reference utterance speed.

As yet another method, a relational expression in which the long-term average parameter δ becomes smaller as the utterance speed becomes slower may be prepared, and δ may be obtained by substituting the input utterance speed into the expression. This allows the long-term average parameter to be obtained accurately according to the utterance speed.
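The text does not give a concrete relational expression. As one possible instance, the sketch below uses a clipped linear mapping, which is monotonically non-decreasing in the utterance speed and keeps δ inside (0.0, 1.0). The function name and every default value are hypothetical.

```python
def delta_from_speed(v, v_min=2.0, v_max=8.0, d_min=0.1, d_max=0.9):
    """Map utterance speed v to delta with a clipped linear function.
    Speeds at or below v_min give d_min, speeds at or above v_max give d_max,
    and speeds in between are interpolated linearly."""
    ratio = (v - v_min) / (v_max - v_min)
    ratio = min(1.0, max(0.0, ratio))
    return d_min + (d_max - d_min) * ratio
```

Compared with a table, a continuous expression avoids jumps in δ at range boundaries, at the cost of having to choose the shape of the curve.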

The long-term average parameter output unit 204 supplies the long-term average parameter δ obtained by the long-term average parameter matching unit 202 to the detection result long-term average unit 108.

(A-2) Operation of the First Embodiment
Next, the operation of the target sound extraction processing in the voice switch 100A of the first embodiment is described.

When audio signals are input to the microphones m1 and m2, they are converted into digital signals by the AD conversion unit (not shown), and the input signal sequences s1(n) and s2(n) are supplied to the FFT unit 101.

In the FFT unit 101, the signals s1 and s2 are divided into analysis frames of a predetermined number of samples and converted from the time domain to the frequency domain by a fast Fourier transform. The converted signals X1(f, K) and X2(f, K) are supplied to the first directivity forming unit 102 and the second directivity forming unit 103.

When the signals X1(f, K) and X2(f, K) are input, the first directivity forming unit 102 forms, based on X1(f, K) and X2(f, K) and according to, for example, equations (1) and (2), the signal B1(f, K) having a null in a specific direction.

Similarly, the second directivity forming unit 103, whose directivity is oriented differently from that of the first directivity forming unit 102, forms, based on X1(f, K) and X2(f, K) and according to, for example, equations (1) and (2), the signal B2(f, K) having a null in a specific direction different from that of the first directivity forming unit 102.

When the signals B1(f, K) and B2(f, K), each having a null in a specific direction, are supplied to the coherence calculation unit 104, the coherence calculation unit 104 calculates the coherence COH(K) based on B1(f, K) and B2(f, K) according to, for example, equations (3) and (4).

The target speech section detection unit 107 compares the coherence COH(K) obtained by the coherence calculation unit 104 with the target speech section determination threshold Θ. When COH(K) is larger than Θ, the frame is regarded as a target speech section, 1.0 is assigned to VAD_RES(K), and the result is supplied to the detection result long-term average unit 108. When COH(K) is equal to or smaller than Θ, the frame is regarded as a non-target speech section, 0.0 is assigned to VAD_RES(K), and the result is supplied to the detection result long-term average unit 108.

The coherence COH(K) obtained by the coherence calculation unit 104 is also supplied to the utterance speed detection unit 105, which obtains the utterance speed v(K) according to COH(K).

A wide variety of methods can be applied to the utterance speed detection in the utterance speed detection unit 105, as long as the utterance speed is obtained based on the coherence COH. For example, the utterance speed detection unit 105 can detect the utterance speed as follows.

Because coherence is the cross-correlation of two signals, the coherence COH is large for a signal arriving from a sound source located in front of the microphones m1 and m2. In contrast, the coherence COH is small for a signal arriving from a sound source located, for example, to the right or left of the microphones m1 and m2.

Even for a signal from the front, the coherence COH depends on the type of sound. The signal of a vowel part (for example, the "a" portion when "sa" is pronounced) has a highly correlated waveform with a certain degree of periodicity, so the coherence COH is large. The signal of a consonant part has a weakly periodic, poorly correlated waveform, so the coherence COH is small.

Furthermore, when the utterance speed changes, the length of the consonant part stays the same while the length of the vowel part changes. Owing to the human speech production mechanism, when the utterance becomes slower, for example, the consonant part of "sa" keeps its length but the vowel part "a" becomes longer. Conversely, when the utterance becomes faster, the consonant part keeps its length but the vowel part "a" becomes shorter.

In addition, when the utterance speed is fast, the coherence COH in a vowel part decreases rapidly, whereas when the utterance speed is slow, the coherence COH in a vowel part decreases slowly. This phenomenon is even more pronounced in sections where vowels continue, such as diphthongs.

Using these characteristics of the coherence COH, the utterance speed detection unit 105 may, for example, obtain the difference between the coherence COH of the current frame and the coherence COH of the immediately preceding frame, and determine that the utterance speed is fast when the difference is large and slow when the difference is small.

Specifically, the utterance speed detection unit 105 can hold a correspondence table that associates coherence differences with utterance speeds in advance, and refer to this table to obtain the utterance speed corresponding to the difference between the coherence COH(K) obtained from the current frame and the coherence COH(K−1) obtained from the immediately preceding frame. The method by which the utterance speed detection unit 105 obtains the utterance speed is not limited to this example.
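The difference-based lookup can be sketched as follows. The table contents and the unit of the utterance speed are hypothetical, and taking the absolute value of the difference is an assumption; the text only refers to "the difference" between COH(K) and COH(K−1).

```python
from bisect import bisect_right

# Hypothetical table: boundaries of |COH(K) - COH(K-1)| and one speed per range.
COH_DIFF_BOUNDS = [0.05, 0.15, 0.30]
SPEEDS = [2.0, 4.0, 6.0, 8.0]

def detect_utterance_speed(coh_now, coh_prev):
    """Return v(K): a larger frame-to-frame coherence change maps to a faster speed."""
    diff = abs(coh_now - coh_prev)
    return SPEEDS[bisect_right(COH_DIFF_BOUNDS, diff)]

v = detect_utterance_speed(coh_now=0.4, coh_prev=0.9)  # large change, fast speech
```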

Next, the long-term average parameter control unit 106 obtains the long-term average parameter δ according to the utterance speed v(K) obtained by the utterance speed detection unit 105.

In the long-term average parameter control unit 106, the long-term average parameter matching unit 202 receives the utterance speed v(K) input through the utterance speed input unit 201, refers to the correspondence table stored in the storage unit 203, and acquires the long-term average parameter δ corresponding to v(K). The long-term average parameter δ is then supplied from the long-term average parameter output unit 204 to the detection result long-term average unit 108.

The detection result long-term average unit 108 is given VAD_RES(K) from the target speech section detection unit 107 and the long-term average parameter δ from the long-term average parameter control unit 106, and obtains the long-term average value VAD_RES_LONG(K) according to, for example, equation (5).

The gain control unit 109 then compares VAD_RES_LONG(K) with the voice switch operation determination threshold Ψ, as in the conventional method. When VAD_RES_LONG(K) is smaller than Ψ, the voice switch gain is set to VS_GAIN = α (0.0 ≤ α < 1.0); otherwise, it is set to VS_GAIN = 1.0.
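The gain decision follows directly from the comparison above. The function and parameter names are hypothetical, and the values of Ψ and α in the example are arbitrary.

```python
def voice_switch_gain(vad_res_long, psi, alpha):
    """Return VS_GAIN: alpha (attenuate) when the long-term average is below
    the threshold psi, otherwise 1.0 (pass the signal unchanged)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0.0 <= alpha < 1.0")
    return alpha if vad_res_long < psi else 1.0

gain = voice_switch_gain(vad_res_long=0.3, psi=0.5, alpha=0.1)
```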

Here, the long-term average parameter δ takes a larger value (that is, a value closer to 1.0) as the utterance speed v(K) becomes faster, and a smaller value (that is, a value closer to 0.0) as the utterance speed v(K) becomes slower.

In equation (5), this means that when the utterance speed is fast, the contribution of VAD_RES(K) obtained in the current frame is made small and the contribution of VAD_RES_LONG(K−1) from the immediately preceding frame is made large. Consequently, when the utterance speed is fast, the contribution to the long-term average value of misjudgments occurring in small-amplitude portions within a target speech section can be reduced. This raises the likelihood that VAD_RES_LONG(K) exceeds the determination threshold Ψ, so loss of the target speech can be prevented.
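The effect described above can be checked numerically. The sketch assumes the recursive form VAD_RES_LONG(K) = δ·VAD_RES_LONG(K−1) + (1−δ)·VAD_RES(K) for equation (5), which is not reproduced in this text; the two δ values and the threshold Ψ = 0.5 are hypothetical.

```python
def long_term_average(vad_res_seq, delta, initial=0.0):
    """Apply the assumed recursive weighted average to a sequence of VAD_RES values
    and return the resulting sequence of VAD_RES_LONG values."""
    out = []
    prev = initial
    for vad_res in vad_res_seq:
        prev = delta * prev + (1.0 - delta) * vad_res
        out.append(prev)
    return out

PSI = 0.5  # hypothetical voice switch operation determination threshold

# Two consecutive frames inside target speech are misjudged as non-target (0.0),
# starting from a long-term average of 1.0.
misjudged = [0.0, 0.0]
with_large_delta = long_term_average(misjudged, delta=0.9, initial=1.0)
with_small_delta = long_term_average(misjudged, delta=0.3, initial=1.0)

# With the large delta the average stays above PSI, so the gain stays at 1.0;
# with the small delta it drops below PSI and the voice switch would attenuate.
stays_open = with_large_delta[-1] >= PSI
would_close = with_small_delta[-1] < PSI
```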

When the utterance speed v(K) is slow, the contribution of VAD_RES(K) is made larger and the contribution of the long-term average value VAD_RES_LONG(K−1) is made smaller than when the utterance speed is fast. This reflects the fact that, when the utterance speed is slow, vowel parts occupy a large proportion of the target speech section and misjudgments are therefore rare, so letting the instantaneous value of VAD_RES(K) contribute strongly to the long-term average is more effective in preventing loss of speech. Because the long-term average parameter δ is thus controlled appropriately even when the utterance speed is slow, loss of the target speech can be prevented.

The voice switch gain multiplication unit 110 then multiplies the input signal s1(n) by VS_GAIN from the gain control unit 109 to generate and output the output signal y(n).

(A-3) Effects of the First Embodiment
As described above, according to the first embodiment, loss of the target speech can be prevented even when the utterance speed changes, so the resulting degradation of sound quality can be eliminated.

Accordingly, applying the present invention to a communication device such as a video conference system or a mobile phone can be expected to improve call sound quality.

(B) Second Embodiment
Next, a second embodiment of the target sound extraction device and the target sound extraction program of the present invention is described with reference to the drawings.

(B-1) Configuration and Operation of the Second Embodiment
FIG. 10 is a block diagram showing the configuration of the voice switch 100B of the second embodiment. In FIG. 10, the voice switch 100B of the second embodiment has at least the microphones m1 and m2, the FFT unit 101, the first directivity forming unit 102, the second directivity forming unit 103, the coherence calculation unit 104, the utterance speed detection unit 105, the long-term average parameter control unit 106, the target speech section detection unit 107, the detection result long-term average unit 108, the gain control unit 109, the voice switch gain multiplication unit 110, a non-target speech section monitoring unit 301, and a long-term average value initialization unit 302.

The second embodiment differs from the first embodiment in that it further includes the non-target speech section monitoring unit 301 and the long-term average value initialization unit 302 in addition to the components of the first embodiment.

The first embodiment controls the long-term average parameter δ according to the utterance speed. However, when the contribution of the current VAD_RES(K) is made small, the device can no longer respond accurately to the start of a target speech section. For example, when the input switches from a non-target speech section to a target speech section, the long-term averaging may cause the section to be misjudged as a non-target speech section even though it is actually a target speech section, and the beginning of the utterance may be cut off by the voice switch.

The second embodiment therefore adds the non-target speech section monitoring unit 301 and the long-term average value initialization unit 302 to the configuration of the first embodiment to prevent the beginning of an utterance from being lost.

In FIG. 10, components identical to those of the first embodiment are given the same reference numerals. Their functions and operations are the same as in the first embodiment, so a detailed description is omitted here.

The non-target speech section monitoring unit 301 monitors non-target speech sections based on the detection result of the target speech section detection unit 107. Specifically, the non-target speech section monitoring unit 301 receives VAD_RES(K) obtained by the target speech section detection unit 107 and monitors the number of consecutive frames for which VAD_RES is 0.0.

The long-term average value initialization unit 302 receives the number of consecutive non-target speech frames from the non-target speech section monitoring unit 301 and, when this number exceeds a threshold, initializes the long-term average value and the long-term average parameter that the detection result long-term average unit 108 uses in its computation.

A state in which non-target speech frames continue beyond the threshold can be regarded as a state in which the speaker's speech (the target speech) is not being input. During such a period, the long-term average value initialization unit 302 initializes the long-term average value and the long-term average parameter, erasing the contribution of the non-target speech section accumulated in the long-term average value. This prevents the beginning of the next utterance from being lost.
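The monitoring and initialization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the long-term average has the form avg(K) = δ × VAD_RES(K) + (1 - δ) × avg(K-1), and the class name, the threshold of 3 frames, and the reset values `init_delta` and `init_avg` are hypothetical choices made only for the example.

```python
from dataclasses import dataclass


@dataclass
class LongTermAverage:
    delta: float              # weight of the current VAD_RES(K)
    threshold: int = 30       # hypothetical limit on consecutive non-target frames
    init_delta: float = 0.5   # hypothetical value of delta after initialization
    init_avg: float = 0.0     # hypothetical long-term average after initialization
    avg: float = 0.0          # long-term average of VAD_RES
    run: int = 0              # consecutive frames with VAD_RES == 0.0

    def update(self, vad_res: float) -> float:
        # Role of unit 301: count consecutive non-target frames.
        self.run = self.run + 1 if vad_res == 0.0 else 0
        # Role of unit 302: initialize once the count exceeds the threshold.
        if self.run > self.threshold:
            self.avg = self.init_avg
            self.delta = self.init_delta
        # Assumed weighted average of the current result and the previous average.
        self.avg = self.delta * vad_res + (1.0 - self.delta) * self.avg
        return self.avg


with_reset = LongTermAverage(delta=0.05, threshold=3)
without_reset = LongTermAverage(delta=0.05, threshold=10**9)
for vad_res in [0.0] * 5 + [1.0]:
    a = with_reset.update(vad_res)
    b = without_reset.update(vad_res)
print(a, b)
```

In this sketch the averager that was initialized during the silent run reacts much more strongly to the first target frame than the one that kept a small δ. In the device, the long-term average parameter control unit 106 would continue to adjust δ afterwards, as in the first embodiment.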

The operation after the target speech is input is the same as in the first embodiment, so a detailed description is omitted here.

(B-2) Effects of the Second Embodiment
As described above, the second embodiment provides the effects of the first embodiment and additionally prevents the beginning of an utterance from being lost, which further improves sound quality.

(C) Third Embodiment
Next, a third embodiment of the target sound extraction device and target sound extraction program of the present invention will be described in detail with reference to the drawings.

(C-1) Configuration and Operation of the Third Embodiment
FIG. 11 is a block diagram showing the configuration of the voice switch 100C of the third embodiment. In FIG. 11, the voice switch 100C includes at least a microphone m1 and a microphone m2, an FFT unit 101, a first directivity forming unit 102, a second directivity forming unit 103, a coherence calculation unit 104, an utterance speed detection unit 105, a long-term average parameter control unit 106, a target speech section detection unit 107, a detection result long-term average unit 108, a gain control unit 109, a voice switch gain multiplication unit 110, and a frequency subtraction unit 40.

The third embodiment adds a frequency subtraction unit 40 to the components of the first embodiment. This makes it possible to suppress interfering speech (voices of people other than the speaker) and background noise superimposed on the target speech section, which the voice switch alone cannot suppress, and thus achieves higher noise suppression performance than the first and second embodiments.

The frequency subtraction unit 40 subtracts the non-target speech signal component from the input signal. As shown in FIG. 11, the frequency subtraction unit 40 includes at least a third directivity forming unit 401, a subtraction unit 402, and an IFFT unit 403.

The third directivity forming unit 401 receives the signals X1(f, K) and X2(f, K) from the FFT unit 101 and, as shown in FIG. 12, forms a directivity signal B3(f, K) that has a blind spot in the front direction.

The third directivity forming unit 401 forms a directivity with a blind spot in the front direction in order to obtain the noise signal component contained in the input signal. Because the speaker is assumed to speak from directly in front of the microphones m1 and m2, forming a blind spot toward the front yields the non-target speech arriving from the sides.

For example, the third directivity forming unit 401 obtains the signal B3(f, K) according to Equation (6).

B3(f, K) = X1(f, K) - X2(f, K) … (6)
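The effect of Equation (6) can be checked numerically with a short sketch. It assumes an idealized far-field model that the text does not spell out: a frontal source reaches both microphones with identical spectra, and a lateral source reaches microphone m2 with a pure delay. The sampling rate, frame length, and 3-sample delay are hypothetical values chosen only for the illustration.

```python
import numpy as np


def front_null(x1_spec, x2_spec):
    """Equation (6): B3(f, K) = X1(f, K) - X2(f, K)."""
    return x1_spec - x2_spec


fs = 16000                      # hypothetical sampling rate [Hz]
n = 512                         # hypothetical frame length
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
source = np.fft.rfft(np.random.default_rng(0).standard_normal(n))

# Frontal source: identical spectra at both microphones.
b3_front = front_null(source, source)

# Lateral source: microphone m2 sees the same signal delayed by tau.
tau = 3.0 / fs                  # hypothetical inter-microphone delay [s]
b3_side = front_null(source, source * np.exp(-2j * np.pi * freqs * tau))

print(np.linalg.norm(b3_front), np.linalg.norm(b3_side))
```

Under these assumptions the frontal component cancels in B3 while the lateral component remains, which is the sense in which B3 serves as a noise reference.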
The subtraction unit 402 receives the signal B3(f, K) from the third directivity forming unit 401 and removes B3(f, K), the noise component, from the signal X1(f, K). For example, the subtraction unit 402 obtains the noise-removed signal D(f, K) according to Equation (7).

D(f, K) = X1(f, K) - B3(f, K) … (7)

The IFFT unit 403 receives the noise-removed signal D(f, K) from the subtraction unit 402, converts this frequency-domain signal into the time domain, and supplies the converted signal q(n) to the gain multiplication unit 110.

As in the first embodiment, the long-term average parameter δ is controlled according to the utterance speed, and the gain control unit 109 outputs VS_GAIN to the gain multiplication unit 110.

The gain multiplication unit 110 multiplies the signal q(n) obtained from the IFFT unit 403 by VS_GAIN obtained from the gain control unit 109 and outputs the result as the output signal y(n).

(C-2) Effects of the Third Embodiment
As described above, the third embodiment provides the effects of the first embodiment and additionally removes noise components superimposed on the target speech section, which further improves sound quality.

(D) Fourth Embodiment
Next, a fourth embodiment of the target sound extraction device and target sound extraction program of the present invention will be described with reference to the drawings.

(D-1) Configuration and Operation of the Fourth Embodiment
FIG. 13 is a block diagram showing the configuration of the voice switch 100D of the fourth embodiment. In FIG. 13, the voice switch 100D includes at least a microphone m1 and a microphone m2, an FFT unit 101, a first directivity forming unit 102, a second directivity forming unit 103, a coherence calculation unit 104, an utterance speed detection unit 105, a long-term average parameter control unit 106, a target speech section detection unit 107, a detection result long-term average unit 108, a gain control unit 109, a voice switch gain multiplication unit 110, and a coherence filter calculation unit 50.

The fourth embodiment adds a coherence filter calculation unit 50 to the components of the first embodiment. This makes it possible to suppress noise components superimposed on the target speech section, which the voice switch alone cannot suppress, and thus achieves higher noise suppression performance than the first and second embodiments.

The coherence filter calculation unit 50 receives coef(f, K), which the coherence calculation unit 104 obtains using Equation (3), and multiplies the input signal X1(f, K) by coef(f, K) for each frequency. This suppresses signal components whose arrival direction is biased, background noise components whose waveforms have little regularity, and the like.

The coherence filter calculation unit 50 includes at least a coherence filter coefficient multiplication unit 501 and an IFFT unit 502.

The coherence filter coefficient multiplication unit 501 receives coef(f, K) from the coherence calculation unit 104 and, according to Equation (8), multiplies the signal X1(f, K) by coef(f, K) to generate the noise-suppressed signal D(f, K).

D(f, K) = X1(f, K) × coef(f, K) … (8)

The IFFT unit 502 receives the noise-suppressed signal D(f, K) from the coherence filter coefficient multiplication unit 501, converts this frequency-domain signal into the time domain, and supplies the converted signal q(n) to the gain multiplication unit 110.

The gain multiplication unit 110 multiplies the signal q(n) from the IFFT unit 502 by VS_GAIN from the gain control unit 109 to obtain the output signal y(n), and outputs y(n).
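The per-frame signal path of this embodiment (Equation (8), inverse transform, voice switch gain) can be sketched as below. The sketch treats coef(f, K) and VS_GAIN as given inputs, uses NumPy's real FFT as a stand-in for the FFT unit 101 and IFFT unit 502, and ignores windowing and overlap-add. The function name and the frame length are hypothetical.

```python
import numpy as np


def coherence_filter_frame(x1_spec, coef, vs_gain, frame_len):
    """Process one frame K of the fourth embodiment's signal path."""
    d = x1_spec * coef                   # Equation (8): D = X1 x coef, per frequency
    q = np.fft.irfft(d, n=frame_len)     # IFFT unit 502: back to the time domain
    return vs_gain * q                   # gain multiplication unit 110: y(n)


frame_len = 256                          # hypothetical frame length
x = np.random.default_rng(1).standard_normal(frame_len)
x1_spec = np.fft.rfft(x)

# coef = 1 at every frequency and VS_GAIN = 1 leave the frame unchanged.
y_pass = coherence_filter_frame(x1_spec, np.ones(x1_spec.shape), 1.0, frame_len)
# coef = 0 at every frequency removes the frame entirely.
y_mute = coherence_filter_frame(x1_spec, np.zeros(x1_spec.shape), 1.0, frame_len)
```

Values of coef(f, K) between these two extremes attenuate individual frequency bins, while VS_GAIN scales the whole frame.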

(D-2) Effects of the Fourth Embodiment
As described above, the fourth embodiment provides the effects of the first embodiment and additionally suppresses noise components superimposed on the target speech section, which further improves sound quality.

(E) Fifth Embodiment
Next, a fifth embodiment of the target sound extraction device and target sound extraction program of the present invention will be described with reference to the drawings.

(E-1) Configuration and Operation of the Fifth Embodiment
FIG. 14 is a block diagram showing the configuration of the voice switch 100E of the fifth embodiment. In FIG. 14, the voice switch 100E includes at least a microphone m1 and a microphone m2, an FFT unit 101, a first directivity forming unit 102, a second directivity forming unit 103, a coherence calculation unit 104, an utterance speed detection unit 105, a long-term average parameter control unit 106, a target speech section detection unit 107, a detection result long-term average unit 108, a gain control unit 109, a voice switch gain multiplication unit 110, and a Wiener filter calculation unit 60.

The fifth embodiment adds a Wiener filter calculation unit 60 to the components of the first embodiment. This makes it possible to suppress background noise superimposed on the target speech section, which the voice switch alone cannot suppress, and thus achieves higher noise suppression performance than the first and second embodiments.

The Wiener filter calculation unit 60 removes noise components by multiplying the signal by coefficients obtained by estimating the noise characteristics for each frequency from the signal in noise sections. Existing techniques can be used for this processing, for example the technique described in Patent Document 2, so a detailed description is omitted here.

The Wiener filter calculation unit 60 includes a Wiener filter coefficient calculation unit 601, a Wiener filter coefficient multiplication unit 602, and an IFFT unit 603.

The Wiener filter coefficient calculation unit 601 determines, based on the detection result VAD_RES from the target speech section detection unit 107, whether the current frame is a non-target speech section. In a non-target speech section, it estimates the Wiener filter coefficient wf_coef(f, K), for example by the calculation of Equation 3 described in Patent Document 2. In a target speech section, it does not estimate the Wiener filter coefficient.

The Wiener filter coefficient multiplication unit 602 multiplies the signal X1(f, K) by the Wiener filter coefficient wf_coef(f, K) obtained by the Wiener filter coefficient calculation unit 601, according to Equation (9), to obtain the noise-suppressed signal D(f, K).

D(f, K) = X1(f, K) × wf_coef(f, K) … (9)

The IFFT unit 603 receives the noise-suppressed signal D(f, K) from the Wiener filter coefficient multiplication unit 602, converts this frequency-domain signal into the time domain, and supplies the converted signal q(n) to the gain multiplication unit 110.

The gain multiplication unit 110 multiplies the signal q(n) from the IFFT unit 603 by VS_GAIN from the gain control unit 109 to obtain the output signal y(n), and outputs y(n).
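The gating of the coefficient update by VAD_RES can be sketched as follows. The actual coefficient estimate is defined in Patent Document 2 and is not reproduced here, so the sketch takes it as a caller-supplied function. `estimate_coef`, the stand-in estimator used in the usage lines, and the all-ones initial coefficient are hypothetical placeholders, and the sketch assumes that VAD_RES equal to 0.0 marks a non-target frame.

```python
import numpy as np


def wiener_stage(frames, vad_flags, estimate_coef):
    """Apply Equation (9) frame by frame, updating wf_coef only in non-target frames."""
    wf_coef = np.ones_like(frames[0], dtype=float)   # hypothetical initial coefficient
    out = []
    for x1_spec, vad_res in zip(frames, vad_flags):
        if vad_res == 0.0:
            # Non-target speech section: re-estimate the coefficient (unit 601).
            wf_coef = estimate_coef(x1_spec)
        # Target speech section: the previous coefficient is kept as is.
        out.append(x1_spec * wf_coef)                # Equation (9) (unit 602)
    return out


# Stand-in estimator for illustration only; not the method of Patent Document 2.
def stand_in(x1_spec):
    return np.full(x1_spec.shape, 0.5)


frames = [np.ones(4, dtype=complex) for _ in range(3)]
d = wiener_stage(frames, [1.0, 0.0, 1.0], stand_in)
```

In the usage lines, the first frame passes through with the initial coefficient, the second frame triggers an update, and the third frame reuses the coefficient estimated during the preceding non-target frame.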

(E-2) Effects of the Fifth Embodiment
As described above, the fifth embodiment provides the effects of the first embodiment and additionally suppresses background noise components superimposed on the target speech section, which further improves sound quality.

(F) Other Embodiments
(F-1) The third to fifth embodiments described noise suppression by the frequency subtraction technique, the coherence filter, and the Wiener filter, respectively. Any one, any two, or all of these three techniques may be used in combination, which achieves still higher noise suppression performance.

(F-2) The first to fifth embodiments described above illustrate the case in which the voice switch has two microphones, m1 and m2, and obtains the coherence from the directivity signals B1(f) and B2(f), which have blind spots in the right and left directions, respectively.

However, the invention is not limited to this. The voice switch may have four microphones and four directivity forming units that form four directivity signals (up, down, left, and right), and may obtain the coherence COH from the signal B1(f) with a blind spot in the right direction, the signal B2(f) with a blind spot in the left direction, the signal B3(f) with a blind spot in the upward direction, and the signal B4(f) with a blind spot in the downward direction.

In this case, the coherence calculation unit may obtain the coherence COH according to Equations (10) and (4).

… (10)

(F-3) The description above controls the long-term average parameter δ according to the utterance speed. However, loss of the target speech is caused not only by the utterance speed but also by variation in the distance between the microphones and the speaker. This problem can also be mitigated by applying the present invention. In that case, a distance detection unit that estimates the distance between the microphones and the speaker by a known method is provided in place of the utterance speed detection unit, and a correspondence table between distance and the long-term average parameter is stored in the storage unit so that the long-term average parameter control unit controls the long-term average parameter according to the distance.
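A correspondence table of the kind described in (F-3) could be looked up as in the sketch below. The text gives no table contents, so the distance boundaries, the δ values, and even the direction in which δ changes with distance are hypothetical placeholders that would have to be tuned for a real device.

```python
import bisect

# Hypothetical correspondence table: upper bounds of distance ranges [m]
# and the long-term average parameter delta assigned to each range.
DISTANCE_UPPER_BOUNDS_M = [0.3, 0.6, 1.0]
DELTA_FOR_RANGE = [0.30, 0.20, 0.10, 0.05]   # one more entry than bounds


def delta_for_distance(distance_m: float) -> float:
    """Return the delta stored for the range that contains distance_m."""
    index = bisect.bisect_right(DISTANCE_UPPER_BOUNDS_M, distance_m)
    return DELTA_FOR_RANGE[index]


near = delta_for_distance(0.2)
far = delta_for_distance(2.0)
```

The same lookup structure would serve the utterance speed table of the first embodiment, with speed ranges in place of distance ranges.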

100A-100B ... voice switch,
101 ... FFT unit, 102 ... first directivity forming unit,
103 ... second directivity forming unit, 104 ... coherence calculation unit,
105 ... utterance speed detection unit, 106 ... long-term average parameter control unit,
107 ... target speech section detection unit, 108 ... detection result long-term average unit,
109 ... gain control unit, 110 ... gain multiplication unit,
301 ... non-target speech section monitoring unit, 302 ... long-term average value initialization unit,
40 ... frequency subtraction unit, 50 ... coherence filter calculation unit,
60 ... Wiener filter calculation unit,
201 ... utterance speed input unit, 202 ... long-term average parameter matching unit, 203 ... storage unit, 204 ... long-term average parameter output unit.

Claims

A target sound extraction device comprising:
Frequency analysis means for converting an input signal from the time domain to the frequency domain;
Directivity forming means for forming, based on the signal obtained by the frequency analysis means, a plurality of directivity signals each having a blind spot in a predetermined direction;
Coherence calculation means for obtaining a coherence value based on the plurality of directivity signals formed by the directivity forming means;
Target sound determination means for determining whether or not the target sound is included based on the coherence value obtained by the coherence calculation means, and outputting a detection result value corresponding to the determination result;
Long-term average processing means for obtaining a long-term average value of the detection result value in an input frame by performing weighted averaging processing on the detection result value in the input frame obtained from the target sound determination means and the long-term average value of the detection result value obtained in the frame immediately before the input frame;
Utterance speed detection means for detecting the utterance speed of the target sound included in the input signal based on the coherence value obtained by the coherence calculation means;
Weight coefficient control means for controlling a weight coefficient used in the weighted averaging processing of the long-term average processing means according to the utterance speed detected by the utterance speed detection means;
Gain control means for controlling a gain for the input signal based on the long-term average value of the detection result value in the input frame obtained by the long-term average processing means; and
Gain multiplication means for multiplying the input signal by the gain controlled by the gain control means.

The target sound extraction device according to claim 1, wherein the weight coefficient control means comprises:
A storage unit that stores a correspondence table in which the utterance speed is associated with the weight coefficient used by the long-term average processing means;
A weight coefficient determination unit that refers to the correspondence table and determines the weight coefficient corresponding to the utterance speed obtained from the utterance speed detection means; and
An output unit that supplies the weight coefficient determined by the weight coefficient determination unit to the long-term average processing means.

The target sound extraction device according to claim 1, further comprising:
Non-target sound monitoring means for observing the detection result value of the target sound determination means and monitoring the length of a non-target sound period in which the target sound is not included; and
Initializing means for initializing parameters used in the weighted averaging processing of the long-term average processing means when, based on the monitoring result of the non-target sound monitoring means, the length of the non-target sound period exceeds a threshold.

The target sound extraction device according to any one of claims 1 to 3, further comprising a frequency subtraction unit that includes:
Non-target sound signal generating means for forming a blind spot in the target sound direction from the signal obtained by the frequency analysis means and obtaining a non-target sound signal;
Subtracting means for subtracting the non-target sound signal from the input signal obtained by the frequency analysis means; and
Inverse frequency conversion means for converting the noise-removed signal obtained by the subtraction into the time domain.

The target sound extraction device according to claim 1, further comprising a coherence filter calculation unit that includes:
Coherence filter coefficient multiplication means for multiplying the signal obtained by the frequency analysis means by a coherence coefficient obtained by the coherence calculation means, to obtain a noise-suppressed signal in which signal components having a bias in arrival direction and background noise are suppressed; and
Inverse frequency conversion means for converting the signal after the coherence filter coefficient multiplication into the time domain.

The target sound extraction device according to any one of the preceding claims, further comprising Wiener filter calculation means that includes:
A Wiener filter coefficient calculation unit that, based on the detection result value from the target sound determination means, updates a Wiener filter coefficient by a predetermined method only in a non-target sound section;
A Wiener filter coefficient multiplication unit that multiplies the input signal obtained from the frequency analysis means by the Wiener filter coefficient obtained by the Wiener filter coefficient calculation unit; and
An inverse frequency conversion unit that converts the frequency-domain signal obtained by the Wiener filter coefficient multiplication unit into the time domain and supplies it to the gain multiplication means.

A target sound extraction program for causing a computer to function as:
Frequency analysis means for converting an input signal from the time domain to the frequency domain;
Directivity forming means for forming, based on the signal obtained by the frequency analysis means, a plurality of directivity signals each having a blind spot in a predetermined direction;
Coherence calculation means for obtaining a coherence value based on the plurality of directivity signals formed by the directivity forming means;
Target sound determination means for determining whether or not the target sound is included based on the coherence value obtained by the coherence calculation means, and outputting a detection result value corresponding to the determination result;
Long-term average processing means for obtaining a long-term average value of the detection result value in an input frame by performing weighted averaging processing on the detection result value in the input frame obtained from the target sound determination means and the long-term average value of the detection result value obtained in the frame immediately before the input frame;
Utterance speed detection means for detecting the utterance speed of the target sound included in the input signal based on the coherence value obtained by the coherence calculation means;
Weight coefficient control means for controlling a weight coefficient used in the weighted averaging processing of the long-term average processing means according to the utterance speed detected by the utterance speed detection means;
Gain control means for controlling a gain for the input signal based on the long-term average value of the detection result value in the input frame obtained by the long-term average processing means; and
Gain multiplication means for multiplying the input signal by the gain controlled by the gain control means.