JP2017040752A

JP2017040752A - Voice determining device, method, and program, and voice signal processor

Info

Publication number: JP2017040752A
Application number: JP2015161954A
Authority: JP
Inventors: 克之高橋; Katsuyuki Takahashi
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2015-08-19
Filing date: 2015-08-19
Publication date: 2017-02-23
Anticipated expiration: 2035-08-19
Also published as: JP6638248B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice determination device and a voice signal processing device that accurately determine the presence of interfering voice.
SOLUTION: A voice determination device 1 includes: a front suppression signal generation unit 20 that acquires frequency domain input signals, obtained by converting the input signals from a plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on the difference between the frequency domain input signals of the microphones; a coherence calculation unit 30 that calculates coherence from the input signals obtained from the plurality of microphones; and a determination unit 40 that calculates a feature quantity representing the relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature quantity.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice determination device, method, and program, and to a voice signal processing device. It can be applied, for example, to determining the presence or absence of non-target sound other than the target sound (for example, interfering voice) in voice processing for telephone calls or video conferences, or in speech recognition processing.

In recent years, devices such as smartphones and car navigation systems that support various voice processing functions, such as voice calls and speech recognition, have become widespread (hereinafter, these devices are collectively referred to as "voice processing devices"). As a result, voice processing devices are now used in harsher noise environments than before, such as crowded streets and moving cars. Demand is therefore growing for voice processing devices that can maintain call quality and speech recognition performance even in noisy environments.

In a conventional voice processing device, non-target sounds other than the target sound are suppressed when the target sound is extracted and acquired.

One conventional voice processing technique for suppressing non-target sounds is described in Patent Document 1.

The device described in Patent Document 1 applies delay-and-subtract processing to the input voice signals to form first and second directional signals having blind spots in first and second predetermined directions, and obtains the coherence of these two directional signals. The device then compares the obtained coherence with a determination threshold to determine whether the input voice signal is a target voice section arriving from the target direction or a non-target voice section, sets a gain according to the determination result, and multiplies the input voice signal by the gain to attenuate the non-target voice.

Patent Document 1: JP 2013-182044 A

The components of non-target sound can be broadly divided into background noise (for example, the bustle of a city street or the running noise of a car) and interfering voice (for example, the speech of people other than the user of the voice processing device). Various effective suppression methods have been proposed for background noise, on the premise that its frequency characteristics and power are stationary. Interfering voice, in contrast, has non-stationary signal power and frequency characteristics and, like the target voice (the voice of the user of the voice processing function), is a human voice. Consequently, unlike background noise, it is difficult for a conventional voice processing device to detect interfering voice from a difference in behavior relative to the target voice. When a conventional voice processing device tries to suppress interfering sound, it therefore applies suppression regardless of whether interfering sound is present, so that either excessive suppression causes noticeable distortion of sound quality, or insufficient suppression leaves residual interfering sound and the call quality or speech recognition performance falls short of the required level.

In view of these problems, there is a need for a voice determination device, method, and program, and a voice signal processing device, that can accurately determine the presence of non-target sound (for example, interfering voice).

The voice determination device of the first aspect of the present invention comprises: (1) a front suppression signal generation unit that acquires frequency domain input signals, obtained by converting input signals from a plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on the difference between the acquired frequency domain input signals of the microphones; (2) a coherence calculation unit that calculates coherence from the input signals obtained from the plurality of microphones; and (3) a determination unit that calculates a feature quantity representing the relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature quantity.

The voice determination program of the second aspect of the present invention causes a computer to function as: (1) a front suppression signal generation unit that acquires frequency domain input signals, obtained by converting input signals from a plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on the difference between the acquired frequency domain input signals of the microphones; (2) a coherence calculation unit that calculates coherence from the input signals obtained from the plurality of microphones; and (3) a determination unit that calculates a feature quantity representing the relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature quantity.

The third aspect of the present invention is a determination method for input signals obtained from a plurality of microphones, in which: (1) a front suppression signal generation unit, a coherence calculation unit, and a determination unit are provided; (2) the front suppression signal generation unit acquires frequency domain input signals, obtained by converting the input signals from the plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on the difference between the acquired frequency domain input signals of the microphones; (3) the coherence calculation unit calculates coherence from the input signals obtained from the plurality of microphones; and (4) the determination unit calculates a feature quantity representing the relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature quantity.

In the fourth aspect of the present invention, a voice processing device that performs voice processing on input signals obtained from a plurality of microphones performs that voice processing using the determination result of the voice determination device of the first aspect.

According to the present invention, a voice determination device and a voice signal processing device that accurately determine interfering voice can be provided.

A block diagram showing the functional configuration of the voice determination device according to the embodiment.
An explanatory diagram showing an example of the arrangement of the microphones according to the embodiment.
A diagram (part 1) showing the characteristics of a directional signal applied in the voice determination device according to the embodiment.
A diagram (part 2) showing the characteristics of a directional signal applied in the voice determination device according to the embodiment.
A flowchart (part 1) showing an example of the operation of the voice determination device according to the embodiment.
A flowchart (part 2) showing an example of the operation of the voice determination device according to the embodiment.

(A) Main Embodiment
Hereinafter, an embodiment of the voice determination device, method, and program, and of the voice signal processing device, method, and program, according to the present invention will be described in detail with reference to the drawings.

(A-1) Configuration of the Embodiment
FIG. 1 is a block diagram showing the overall configuration of the voice determination device 1 of this embodiment.

The voice determination device 1 acquires input signals s1(n) and s2(n) from a pair of microphones m_1 and m_2, respectively, via AD converters (not shown). Here, n is an index representing the input order of the samples and is a positive integer. In this text, a smaller n denotes an older input sample and a larger n a newer one.

The voice determination device 1 determines whether the input signals captured by the microphones m_1 and m_2 contain non-target sound, and supplies the determination result to the voice processing device 2. The voice processing device 2 processes the input signals using the determination result supplied from the voice determination device 1. The functions of the voice processing device 2 and the processing it applies to the input signals are not limited. For example, the voice processing device 2 uses the determination result in a communication device such as a video conference system or a mobile phone terminal, or in preprocessing for a speech recognition function. As one example, the voice processing device 2 uses the determination result for processing that suppresses non-target sound (for example, interfering voice).

FIG. 2 is an explanatory diagram showing an example of the arrangement of the microphones m_1 and m_2.

As shown in FIG. 2, in this embodiment the microphones m_1 and m_2 are arranged so that the plane containing the two microphones is perpendicular to the direction from which the target sound arrives (the direction of the target sound source). In the following, as shown in FIG. 2, the arrival direction of the target sound as seen from a position between the two microphones m_1 and m_2 is called the forward direction or the front direction. Likewise, the terms right, left, and rear refer to the respective directions as seen when facing the arrival direction of the target sound from a position between the two microphones. This embodiment is described on the assumption that the target sound arrives from the front of the microphones m_1 and m_2 and that non-target sound, including interfering voice, arrives from the left or right (the lateral direction).

The voice determination device 1 includes an FFT unit 10, a front suppression signal generation unit 20, a coherence calculation unit 30, and a determination unit 40.

The voice determination device 1 may be realized by installing a program (a program including the voice determination program according to the embodiment) on a computer having a processor, memory, and the like. Even in that case, the voice determination device 1 can be represented functionally as shown in FIG. 1. Part or all of the voice determination device 1 may instead be realized in hardware.

The FFT unit 10 receives the input signal sequences s1 and s2 from the microphones m_1 and m_2 and applies a fast Fourier transform (or a discrete Fourier transform) to them, so that the input signals s1 and s2 are expressed in the frequency domain. To perform the fast Fourier transform, the FFT unit 10 forms analysis frames FRAME1(K) and FRAME2(K), each consisting of a predetermined number N of samples (N is an arbitrary integer), from the input signals s1(n) and s2(n). Equation (1) below shows an example of forming FRAME1 from the input signal s1. In equation (1), K is an index representing the order of the frames and is a positive integer. In the following, a smaller K denotes an older analysis frame and a larger K a newer one. In the description of the operation that follows, unless otherwise noted, the index of the latest analysis frame to be analyzed is K.
FRAME1(1) = {s1(1), s1(2), …, s1(i), …, s1(N)}
FRAME1(K) = {s1(N×K+1), s1(N×K+2), …, s1(N×K+i), …, s1(N×K+N)} … (1)
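The framing step can be sketched as follows. This is a minimal illustration, assuming consecutive non-overlapping frames of N samples; the function name `split_into_frames` and the sample values are hypothetical. The list index is zero-based, so `frames[0]` holds the first N samples; how that index maps onto the frame index K of equation (1) is an assumption of the sketch.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split a sample sequence into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    return [list(samples[k * frame_len:(k + 1) * frame_len]) for k in range(n_frames)]


# Stand-in for the input samples s1(1) ... s1(10); values are illustrative only.
s1 = list(range(1, 11))
frames = split_into_frames(s1, frame_len=4)  # N = 4 in this toy example
```

With ten samples and N = 4, two complete frames are produced and the last two samples wait for the next frame.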

By applying the fast Fourier transform to each analysis frame, the FFT unit 10 obtains a frequency domain signal X1(f, K) by Fourier-transforming the analysis frame FRAME1(K) formed from the input signal s1, and a frequency domain signal X2(f, K) by Fourier-transforming the analysis frame FRAME2(K) formed from the input signal s2. Here, f is an index representing frequency. X1(f, K) is not a single value but, as shown in equation (2), consists of m spectral components (m is an arbitrary integer) at a plurality of frequencies f1 to fm.

The FFT unit 10 supplies the frequency domain signals X1(f, K) and X2(f, K) to the front suppression signal generation unit 20 and the coherence calculation unit 30.

Note that X1(f, K) is a complex number consisting of a real part and an imaginary part. The same applies to X2(f, K) and to N(f, K), which is described later for the front suppression signal generation unit 20.
X1(f, K) = {X1(f1, K), X1(f2, K), …, X1(fi, K), …, X1(fm, K)} … (2)
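The conversion of one analysis frame into complex spectral components can be sketched as follows. This is a minimal illustration using a naive discrete Fourier transform from the Python standard library rather than an optimized FFT; the function name `dft` and the test frame are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform of one analysis frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(frame))
        for f in range(n)
    ]


# One period of a cosine sampled four times (illustrative frame only).
frame = [1.0, 0.0, -1.0, 0.0]
X1 = dft(frame)  # list of complex spectral components, one per frequency bin
```

Each element of `X1` is a complex number with a real and an imaginary part, matching the description of X1(f, K) above. In practice an FFT routine would replace the naive sum.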

Next, the front suppression signal generation unit 20 will be described.

The front suppression signal generation unit 20 suppresses, for each frequency, the signal component arriving from the front direction in the signals supplied from the FFT unit 10. In other words, the front suppression signal generation unit 20 functions as a directional filter that suppresses the front-direction component.

For example, as shown in FIG. 3, the front suppression signal generation unit 20 forms a directional filter that suppresses the front-direction component of the signals supplied from the FFT unit 10, using a figure-eight bidirectional filter having a blind spot in the front direction.

Specifically, the front suppression signal generation unit 20 performs the calculation of equation (3) below on the signals X1(f, K) and X2(f, K) supplied from the FFT unit 10 to generate a front suppression signal N(f, K) for each frequency. The calculation of equation (3) corresponds to forming the figure-eight bidirectional filter with a blind spot in the front direction shown in FIG. 3.
N(f, K) = X1(f, K) − X2(f, K) … (3)
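The effect of the per-frequency subtraction can be sketched as follows. This is a minimal illustration, assuming that a sound from the front reaches both microphones with identical spectra and that a lateral sound reaches the second microphone as a phase-shifted copy; the spectra and phase values are hypothetical.

```python
import cmath
from typing import List, Sequence


def front_suppression(x1: Sequence[complex], x2: Sequence[complex]) -> List[complex]:
    """Per-frequency difference N(f) = X1(f) - X2(f)."""
    if len(x1) != len(x2):
        raise ValueError("spectra must have the same number of bins")
    return [a - b for a, b in zip(x1, x2)]


# Front arrival: both microphones observe the same spectrum, so it cancels.
x_front = [1 + 1j, 0.5 - 2j, -3 + 0j]
n_front = front_suppression(x_front, x_front)

# Lateral arrival: microphone 2 observes a phase-shifted copy.
# The per-bin phase shifts (radians) are illustrative values only.
phase = [0.3, 0.6, 0.9]
x_side = [a * cmath.exp(-1j * p) for a, p in zip(x_front, phase)]
n_side = front_suppression(x_front, x_side)
```

In this idealized case the front component vanishes entirely, while the lateral component survives in every bin.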

Then, the front suppression signal generation unit 20 uses equation (4) to calculate an average front suppression signal AVE_N(K), obtained by averaging N(f, K) over all frequencies.
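One possible form of this averaging can be sketched as follows. Equation (4) itself is not reproduced in this text, so the sketch assumes that the magnitudes |N(f, K)| of the m spectral components are averaged; that choice, and the function name `average_front_suppression`, are assumptions rather than the form given in equation (4).

```python
from typing import Sequence


def average_front_suppression(n_spec: Sequence[complex]) -> float:
    """Average of |N(f)| over all frequency bins (assumed form of the average)."""
    if not n_spec:
        raise ValueError("spectrum must not be empty")
    return sum(abs(v) for v in n_spec) / len(n_spec)


# Illustrative front suppression spectrum for one frame (m = 3 bins).
ave_n = average_front_suppression([3 + 4j, 0j, -6 + 8j])
```

Averaging magnitudes yields a single non-negative real value per frame, which is convenient for the later comparison with the coherence.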

Next, the processing of the coherence calculation unit 30 will be described.

For the frequency domain signals X1(f, K) and X2(f, K), the coherence calculation unit 30 calculates the coherence COH(K) based on a signal processed by a filter with strong directivity in the left direction (first direction) (for example, the unidirectional characteristic shown in FIG. 4(a); hereinafter, "directional signal B1(f)") and a signal processed by a filter with strong directivity in the right direction (second direction) (for example, the unidirectional characteristic shown in FIG. 4(b); hereinafter, "directional signal B2(f)"). The directions of directivity of B1(f) and B2(f) may be any directions other than the front direction, provided that B1(f) and B2(f) use different directions.

The specific process (for example, the formulas) for calculating the coherence COH(K) is not limited. For example, the same processing as in Patent Document 1 (the calculations of equations (3) to (7) described in Patent Document 1) can be applied, so the details are omitted here.
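Because the coherence formulas are deferred to Patent Document 1 and are not reproduced in this text, the following is only a hypothetical stand-in and should not be read as the formula of Patent Document 1. It assumes that the two directional signals are formed by delay-and-subtract with a per-bin phase rotation, and that the coherence is the mean over frequency of the normalized cross term; all names and values are illustrative.

```python
import cmath
from typing import Sequence


def coherence(x1: Sequence[complex], x2: Sequence[complex],
              delay_phase: Sequence[float]) -> float:
    """Hypothetical coherence between two delay-and-subtract directional signals."""
    coefs = []
    for a, b, p in zip(x1, x2, delay_phase):
        rot = cmath.exp(-1j * p)
        b1 = a - b * rot  # illustrative directional signal B1(f)
        b2 = b - a * rot  # illustrative directional signal B2(f)
        power = 0.5 * (abs(b1) ** 2 + abs(b2) ** 2)
        coefs.append(abs(b1 * b2.conjugate()) / power if power > 0 else 0.0)
    return sum(coefs) / len(coefs)


phase = [0.3, 0.6, 0.9]                 # illustrative per-bin phase shifts
x_front = [1 + 1j, 0.5 - 2j, -3 + 0j]   # illustrative spectrum
x_side = [a * cmath.exp(-1j * p) for a, p in zip(x_front, phase)]

coh_front = coherence(x_front, x_front, phase)  # same spectrum at both microphones
coh_side = coherence(x_front, x_side, phase)    # delayed copy at microphone 2
```

Under these assumptions, a sound from the front gives two equal directional signals and a coherence near 1, while a purely lateral sound nulls one of them and drives the coherence toward 0.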

Next, the processing of the determination unit 40 will be described.

The determination unit 40 determines the presence or absence of non-target sound using the front suppression signal N(f, K), which has directivity in directions other than the front (specifically, its average AVE_N(K)), and the coherence COH(K).

The description here assumes that the target sound arrives from the front of the microphones m_1 and m_2 and that non-target sound, including interfering voice, arrives from the left or right (the lateral direction). For example, when the microphones m_1 and m_2 are used as the microphone section of the handset of a telephone terminal (for example, a mobile phone terminal), the voice of the speaker (user), which is the target sound, arrives from the front of the microphones, and voices other than that of the speaker arrive from the left or right.

Therefore, when, for example, "no interfering voice is present" and "the target sound is present", the average front suppression signal AVE_N(K) of the front suppression signal N(f, K) takes a value proportional to the magnitude of the target sound component. This is because, as shown in FIG. 3, the directivity used when generating the front suppression signal N(f, K) (and hence the average front suppression signal AVE_N(K)) still passes some signal component arriving from the front direction, even in this case. The gain for components arriving from the front is, however, very small compared with the gain in the lateral direction. In addition, the gain of the front suppression signal N(f, K) when "no interfering voice is present" and "the target sound is present" is smaller than when interfering voice is present.

Put simply, the coherence COH(K) is the correlation (a feature quantity) between the signal arriving from the first direction (left) and the signal arriving from the second direction (right). A small coherence COH(K) therefore means that the correlation between the two directional signals B1(f) and B2(f) is small, and a large coherence COH(K) means that the correlation is large. The correlation is small either when the arrival direction of the target sound is strongly biased to the right or to the left, or when, even without such a bias, the signal has little clear regularity, as with noise. For example, when the microphones m_1 and m_2 are used as the microphone section of the handset of a telephone terminal (for example, a mobile phone terminal), the speaker's voice (the target voice) arrives from the front, and interfering voice tends strongly to arrive from directions other than the front. The coherence COH(K) is thus a feature quantity closely related to the arrival direction of the input signal. Accordingly, when "no interfering voice is present" and "the target sound is present", the value of COH(K) tends to be large, and when "interfering voice is present", the value of COH(K) tends to be small.

Organizing the behavior of these values in terms of whether interfering voice is present shows that its presence can be determined under the following conditions. The determination method is described below for two cases: the condition that "no interfering voice is present" and "the target sound is present" (hereinafter, the "first condition"), and the condition that "interfering voice is present" (hereinafter, the "second condition").

Under the first condition ("no interfering voice is present" and "the target sound is present"), the coherence COH(K) takes a relatively large value, and the average front suppression signal AVE_N(K) takes a value proportional to the magnitude of the target sound component.

Under the second condition ("interfering voice is present"), the coherence COH(K) tends to take a small value and the average front suppression signal AVE_N(K) tends to take a large value.

Therefore, if a correlation coefficient cor(K) between the average front suppression signal AVE_N(K) and the coherence COH(K) is introduced, the relationship between cor(K) and the presence or absence of interfering voice is as follows.

When no interfering voice is present, the correlation coefficient cor(K) tends to be positive (a value at or above a predetermined value, indicating high correlation). When interfering voice is present, cor(K) tends to be negative (a value below the predetermined value, indicating low correlation).

That is, by introducing the correlation coefficient cor(K) between the average front suppression signal AVE_N(K) and the coherence COH(K), the presence or absence of interfering voice can be determined by simple processing, for example by checking the sign of cor(K).

Accordingly, the determination unit 40 of this embodiment first obtains the correlation coefficient cor(K) and then determines the presence or absence of interfering voice based on cor(K).

The specific method by which the determination unit 40 calculates the correlation coefficient cor(K) is not limited. For example, the determination unit 40 may calculate cor(K) using equation (5) below. In equation (5), Cov[AVE_N(K), COH(K)] is the covariance of the average front suppression signal AVE_N(K) and the coherence COH(K), σN(f, K) is the standard deviation of AVE_N(K), and σCOH(K) is the standard deviation of COH(K). When equation (5) is used, the standard deviations and the covariance may be calculated from the results of a predetermined number i of the most recently processed frames for AVE_N(K) and COH(K). Specifically, the COH and AVE_N values of the i most recently processed frames (the (K−i)-th frame, the (K−(i−1))-th frame, …, the (K−1)-th frame, and the K-th frame) may be used to calculate the standard deviations σN(f, K) and σCOH(K) and the covariance Cov[AVE_N(K), COH(K)]. In other words, in calculating cor(K), the determination unit 40 may use the i most recently obtained values of AVE_N and COH as the samples for the standard deviations and covariance in equation (5).
cor(K) = Cov[AVE_N(K), COH(K)] / (σN(f, K) × σCOH(K)) … (5)
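The windowed correlation described above can be sketched as follows. This is a minimal illustration, assuming population (divide-by-count) covariance and standard deviations over the stored history, and returning 0.0 when either series is constant; both choices, the function name, and the sample values are assumptions.

```python
import math
from typing import Sequence


def correlation_coefficient(ave_n_hist: Sequence[float],
                            coh_hist: Sequence[float]) -> float:
    """Correlation coefficient of recent AVE_N and COH values."""
    n = len(ave_n_hist)
    if n != len(coh_hist) or n < 2:
        raise ValueError("need two equally long histories of at least 2 frames")
    mean_a = sum(ave_n_hist) / n
    mean_c = sum(coh_hist) / n
    cov = sum((a - mean_a) * (c - mean_c) for a, c in zip(ave_n_hist, coh_hist)) / n
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in ave_n_hist) / n)
    sd_c = math.sqrt(sum((c - mean_c) ** 2 for c in coh_hist) / n)
    if sd_a == 0 or sd_c == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (sd_a * sd_c)


# Illustrative histories over the most recent frames.
cor_pos = correlation_coefficient([1.0, 2.0, 3.0, 4.0], [0.2, 0.4, 0.6, 0.8])
cor_neg = correlation_coefficient([1.0, 2.0, 3.0, 4.0], [0.8, 0.6, 0.4, 0.2])
```

When the two histories rise and fall together the coefficient is positive, and when one rises while the other falls it is negative.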

For example, the determination unit 40 may output a value indicating that no interfering voice is present (for example, "0") when the correlation coefficient cor(K) is equal to or greater than a threshold Th, and may output a value indicating that interfering voice is present when cor(K) is smaller than the threshold Th. In this embodiment, the threshold is set to Th = 0 in accordance with the examination described above. Accordingly, the determination unit 40 determines that no interfering voice is present when cor(K) is greater than 0 (cor(K) is positive; cor(K) > 0), and determines that interfering voice is present when cor(K) is 0 or less (cor(K) is 0 or negative; 0 ≥ cor(K)).

The determination unit 40 also outputs a signal R(K) indicating the determination result. The format of the signal R(K) is not limited; for example, a value indicating "interfering voice present" (for example, "1") or a value indicating "no interfering voice" (for example, "0") may be output. In this embodiment, the determination unit 40 supplies the signal R(K) to the voice processing device 2. The method by which the determination unit 40 outputs the signal R(K) and the destination of the signal are not limited.

(A-2) Operation of the Embodiment
Next, the operation of the voice determination device 1 of this embodiment having the above configuration (the determination method of the embodiment) will be described.

First, the overall operation of the voice determination device 1 will be described with reference to FIG. 1.

Assume that input signals s1(n) and s2(n) for one frame (one processing unit) are supplied to the FFT unit 10 from the microphones m_1 and m_2, respectively, via an AD converter (not shown). The FFT unit 10 applies a Fourier transform to the analysis frames FRAME1(K) and FRAME2(K), which are based on the one-frame input signals s1(n) and s2(n), and obtains the frequency-domain signals X1(f, K) and X2(f, K). The signals X1(f, K) and X2(f, K) generated by the FFT unit 10 are then supplied to the front suppression signal generation unit 20 and the coherence calculation unit 30.

The front suppression signal generation unit 20 calculates the front suppression signal N(f, K) from the supplied X1(f, K) and X2(f, K). It then calculates the average front suppression signal AVE_N(K) from the front suppression signal N(f, K) and supplies AVE_N(K) to the determination unit 40.
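The step performed by the front suppression signal generation unit 20 can be sketched as follows. This is only an illustration under stated assumptions: N(f, K) is taken to be the plain per-bin difference X1(f, K) - X2(f, K), and AVE_N(K) is taken to be the mean magnitude of N(f, K) over the frequency bins. The exact definitions are those given by the equations of the embodiment, which take precedence over this sketch; the function names are hypothetical.

```python
def front_suppression(x1, x2):
    """Return N(f, K) for one frame as the per-bin difference X1 - X2 (assumed form)."""
    if len(x1) != len(x2):
        raise ValueError("both spectra must have the same number of bins")
    return [a - b for a, b in zip(x1, x2)]


def average_front_suppression(n_fk):
    """Return AVE_N(K) as the mean magnitude over frequency bins (assumed form)."""
    if not n_fk:
        raise ValueError("spectrum must contain at least one bin")
    return sum(abs(v) for v in n_fk) / len(n_fk)
```

Under these assumptions, a sound arriving from the front reaches both microphones with the same phase, so the two spectra cancel and AVE_N(K) approaches zero, which corresponds to the blind spot toward the front.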

Meanwhile, the coherence calculation unit 30 generates the coherence COH(K) from the supplied X1(f, K) and X2(f, K) and supplies it to the determination unit 40.

The determination unit 40 calculates the correlation coefficient cor(K) from the average front suppression signal AVE_N(K) and the coherence COH(K), determines the presence or absence of interfering voice based on the calculated cor(K), and outputs the determination result as the signal R(K).

Next, the operation of the determination unit 40 will be described in detail with reference to the flowcharts of FIGS. 5 and 6.

FIG. 5 is a flowchart showing the process by which the determination unit 40 determines the presence or absence of interfering voice. FIG. 6 is a flowchart showing part of the process in the flowchart of FIG. 5. Each time the average front suppression signal AVE_N(K) and the coherence COH(K) (data for one frame) are supplied, the determination unit 40 determines the presence or absence of interfering voice by the processes of the flowcharts of FIGS. 5 and 6 and outputs the signal R(K).

When the average front suppression signal AVE_N(K) and the coherence COH(K) are supplied (S101), the determination unit 40 calculates the correlation coefficient cor(K) from AVE_N(K) and COH(K) (S102).

Next, the determination unit 40 determines the presence or absence of interfering voice based on the calculated correlation coefficient cor(K) (S103), and generates and outputs the signal R(K) indicating the determination result (S104).

Next, a specific example of the determination process performed by the determination unit 40 in step S103 will be described with reference to the flowchart of FIG. 6.

When the determination process starts, the determination unit 40 checks the value of the correlation coefficient cor(K) (S201) and determines the presence or absence of interfering sound according to that value.

Specifically, the determination unit 40 determines "no interfering voice" when the correlation coefficient cor(K) is greater than 0 (cor(K) is a positive value; cor(K) > 0) (S202), and determines "interfering voice present" when cor(K) is 0 or less (cor(K) is 0 or a negative value; 0 ≥ cor(K)) (S203).
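The branch in steps S201 to S203 can be sketched as a single comparison. The sketch below is a minimal illustration: it treats cor(K) = 0 as "interfering voice present", following the condition 0 ≥ cor(K), and uses the example output values "0" and "1" for the signal R(K). The function name, the constant names, and the `threshold` parameter are hypothetical.

```python
NO_INTERFERING_VOICE = 0  # example value of R(K) for "no interfering voice"
INTERFERING_VOICE = 1     # example value of R(K) for "interfering voice present"


def judge_interfering_voice(cor_k: float, threshold: float = 0.0) -> int:
    """Return R(K) from cor(K): S202 when cor(K) > threshold, otherwise S203."""
    if cor_k > threshold:
        return NO_INTERFERING_VOICE  # S202: positive correlation
    return INTERFERING_VOICE         # S203: zero or negative correlation
```

For example, `judge_interfering_voice(0.3)` yields 0, while `judge_interfering_voice(0.0)` and `judge_interfering_voice(-0.2)` both yield 1.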

(A-3) Effects of the Embodiment
According to this embodiment, the following effects can be obtained.

The voice determination device 1 of this embodiment determines the presence or absence of interfering voice based on the value of the correlation coefficient cor(K). Because the voice determination device 1 can thereby determine the presence or absence of interfering voice accurately, the destination of the determination result (for example, the voice processing device 2) can perform voice processing suited to the presence or absence of interfering voice. That is, applying the determination result of the voice determination device 1 of this embodiment to the voice processing of the voice processing device 2 (for example, a communication device such as a video conference system or a mobile phone, or preprocessing for a voice recognition function) can be expected to improve the performance of the voice processing device 2 (for example, its performance in suppressing non-target sounds such as interfering voice).

(B) Other Embodiments
The present invention is not limited to the above embodiment, and modified embodiments such as those exemplified below are also possible.

(B-1) In the above embodiment, the voice determination device 1 and the voice processing device 2 are described as separate components, but they may be constructed as a single voice processing device (one device that includes the voice determination device).

(B-2) The voice determination device 1 of the above embodiment is described using an example in which processing is based on input signals supplied from two microphones, but the voice determination device 1 may perform the determination process based on input signals supplied from three or more microphones. For example, based on input signals supplied from three or more microphones, the voice determination device 1 may acquire the front suppression signal N(f, K), which has a blind spot toward the front, and the directivity signals B1(f) and B2(f), which have directivity in predetermined directions other than the front, and then perform the same processing as in the above embodiment. That is, in the voice determination device 1, the microphone configuration used to acquire the front suppression signal N(f, K) and the directivity signals B1(f) and B2(f) is not limited.

(B-3) The determination unit 40 of the above embodiment uses the correlation coefficient cor(K) between the average front suppression signal AVE_N(K) and the coherence COH(K) as the feature amount representing the relationship between the two, but another kind of value may be used as the feature amount. For example, the determination unit 40 may use the covariance between the average front suppression signal AVE_N(K) and the coherence COH(K) as the feature amount representing their relationship.
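A minimal sketch of the covariance variant follows. Whenever both series vary, their standard deviations are positive, so the covariance has the same sign as the correlation coefficient; a sign-based determination therefore gives the same result without the division by the standard deviations. The function names, the use of population covariance, and the output values 0 and 1 are assumptions made for illustration.

```python
def covariance(ave_n, coh):
    """Population covariance of equally long AVE_N and COH sample lists."""
    if len(ave_n) != len(coh) or not ave_n:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(ave_n)
    mean_n = sum(ave_n) / n
    mean_c = sum(coh) / n
    return sum((a - mean_n) * (c - mean_c) for a, c in zip(ave_n, coh)) / n


def judge_by_covariance_sign(ave_n, coh) -> int:
    """Return 0 (no interfering voice) for positive covariance, otherwise 1."""
    return 0 if covariance(ave_n, coh) > 0 else 1
```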

DESCRIPTION OF SYMBOLS: 1 ... voice determination device, 2 ... voice processing device, 10 ... FFT unit, 20 ... front suppression signal generation unit, 30 ... coherence calculation unit, 40 ... interfering sound determination unit, m_1, m_2 ... microphones.

Claims

A voice determination device comprising:
a front suppression signal generation unit that acquires frequency domain input signals obtained by converting input signals obtained from a plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on a difference between the acquired frequency domain input signals of the respective microphones;
a coherence calculation unit that calculates coherence from the input signals obtained from the plurality of microphones; and
a determination unit that calculates a feature amount representing a relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature amount.

The voice determination device according to claim 1, wherein the feature amount is a correlation coefficient between the front suppression signal and the coherence.

The voice determination device according to claim 2, wherein the determination unit determines the presence or absence of interfering voice based on the sign of the correlation coefficient serving as the feature amount.

The voice determination device according to claim 1, wherein the feature amount is a covariance between the front suppression signal and the coherence.

The voice determination device according to claim 2, wherein the determination unit determines the presence or absence of interfering voice based on the sign of the covariance serving as the feature amount.

A voice determination program that causes a computer to function as:
a front suppression signal generation unit that acquires frequency domain input signals obtained by converting input signals obtained from a plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on a difference between the acquired frequency domain input signals of the respective microphones;
a coherence calculation unit that calculates coherence from the input signals obtained from the plurality of microphones; and
a determination unit that calculates a feature amount representing a relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature amount.

A voice determination method for input signals obtained from a plurality of microphones, wherein
a front suppression signal generation unit, a coherence calculation unit, and a determination unit are provided,
the front suppression signal generation unit acquires frequency domain input signals obtained by converting the input signals obtained from the plurality of microphones from the time domain to the frequency domain, and generates a front suppression signal having a blind spot toward the front based on a difference between the acquired frequency domain input signals of the respective microphones,
the coherence calculation unit calculates coherence from the input signals obtained from the plurality of microphones, and
the determination unit calculates a feature amount representing a relationship between the coherence calculated by the coherence calculation unit and the front suppression signal, and determines the presence or absence of interfering voice based on the value of the feature amount.

6. A voice processing device that performs voice processing on input signals obtained from a plurality of microphones, wherein the voice processing is performed using the determination result of the voice determination device according to claim 1.