JP2018142822A

JP2018142822A - Acoustic signal processing device, method and program

Info

Publication number: JP2018142822A
Application number: JP2017035305A
Authority: JP
Inventors: Katsuyuki Takahashi (高橋克之)
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2017-02-27
Filing date: 2017-02-27
Publication date: 2018-09-13

Abstract

PROBLEM TO BE SOLVED: To determine (classify) in real time, accurately and with a low processing load, whether an input signal belongs to a target sound, interfering sound, or background noise signal section.
SOLUTION: An acoustic signal processing device according to the present invention comprises: a front suppression signal generation unit that generates a front suppression signal having a blind spot (null) in the front direction, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from a plurality of microphones from the time domain to the frequency domain; a coherence calculation unit that calculates coherence based on signals obtained from the plurality of input signals; and a classification unit that classifies the signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence, and on a variable obtained in the course of calculating the feature quantity.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an acoustic signal processing device, method, and program. It can be applied, for example, to communication devices or communication software that use voice, such as telephones and video conference systems, or to acoustic signal processing used as preprocessing for speech recognition.

In recent years, voice assistants have spread rapidly thanks to fast improvements in speech recognition performance. Voice assistants are not limited to smartphone applications. An increasing number are deployed as dedicated terminals (speakers with a speech recognition function), such as the Amazon Echo (registered trademark).

Needless to say, speech recognition is a key technology for voice assistants of any form. Dedicated terminals in particular must perform speech recognition while allowing for changes in the positional relationship between the speaker and the terminal. This is because speech recognition must be preceded by preprocessing. First, the direction of the sound source is estimated with a sound source localization technique such as the MUSIC method. The estimated direction is then used to enhance the target sound or suppress noise, for example with beamforming. Such preprocessing requires a sound source localization technique that can estimate the source direction accurately even in noise. Moreover, user satisfaction with a voice assistant depends on a short response time, that is, the time from when the user speaks until the assistant responds. Sound source localization must therefore be not only accurate but also lightweight enough to guarantee real-time operation.

Patent Document 1: JP 2013-182044 A
Patent Document 2: JP 2015-189919 A

Non-Patent Document 1: Futoshi Asano, Acoustical Society of Japan (ed.), "Acoustic Technology Series 16: Array Signal Processing of Sound: Localization, Tracking and Separation of Sound Sources", Corona Publishing Co., Ltd., February 25, 2011, p. 115.

A representative sound source localization technique is the MUSIC method. Consider the eigenspace spanned by the eigenvectors of the correlation matrix of the signals observed by a plurality of microphones. In this eigenspace, the signal subspace and the noise subspace are orthogonal. The MUSIC method uses this property to find the peaks of a spatial spectrum, and takes the direction of each peak as the arrival direction of a sound source (see Non-Patent Document 1).
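The orthogonality property described above can be illustrated with a minimal narrowband MUSIC sketch. It assumes:

- a uniform linear array of four microphones with half-wavelength spacing;
- a single far-field source;
- a known number of sources.

These choices and the helper names `steering_vector` and `music_spectrum` are illustrative assumptions. This is not the formulation of Non-Patent Document 1.

```python
import numpy as np


def steering_vector(theta_deg, num_mics, spacing_wl=0.5):
    """Far-field narrowband steering vector of a uniform linear array.

    spacing_wl is the element spacing in wavelengths (assumed 0.5).
    """
    k = np.arange(num_mics)
    phase = 2 * np.pi * spacing_wl * k * np.sin(np.deg2rad(theta_deg))
    return np.exp(-1j * phase)


def music_spectrum(snapshots, num_sources, angles_deg, spacing_wl=0.5):
    """MUSIC pseudo-spectrum for snapshots of shape (num_mics, num_snapshots)."""
    num_mics, num_snap = snapshots.shape
    # Spatial correlation matrix of the observed signals.
    R = snapshots @ snapshots.conj().T / num_snap
    # eigh sorts eigenvalues in ascending order, so the first
    # (num_mics - num_sources) eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    En = eigvecs[:, : num_mics - num_sources]
    spectrum = np.empty(len(angles_deg))
    for idx, theta in enumerate(angles_deg):
        proj = En.conj().T @ steering_vector(theta, num_mics, spacing_wl)
        # A true source direction is (nearly) orthogonal to the noise
        # subspace, so the projection energy is small and the spectrum peaks.
        spectrum[idx] = 1.0 / np.real(np.vdot(proj, proj))
    return spectrum


rng = np.random.default_rng(0)
M, T, true_deg = 4, 200, 20.0
source = rng.standard_normal(T) + 1j * rng.standard_normal(T)
noise = 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
X = np.outer(steering_vector(true_deg, M), source) + noise
angles = np.arange(-90.0, 90.5, 0.5)
P = music_spectrum(X, num_sources=1, angles_deg=angles)
estimate = angles[np.argmax(P)]
print(estimate)  # expected near 20.0
```

In this sketch, the noise is spatially white, which is the favorable case. The colored-noise problem discussed next is exactly where this simple version breaks down.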

The MUSIC method is useful because it achieves higher spatial resolution than conventional beamformer-based localization. However, its localization accuracy degrades in noise. A common countermeasure is to compute the noise correlation matrix during pauses in the target sound. This matrix is then used in the eigenvalue decomposition to reduce the influence of the noise.

Noise, however, falls broadly into two types: background noise, which is spatially white, and interfering sound, which is spatially colored (for example, speech from someone other than the target speaker). Its effect on the estimation accuracy of the MUSIC method differs greatly between the spatially white and spatially colored cases. In particular, when the noise is colored, the orthogonality between the signal subspace and the noise subspace is less likely to hold. Processing must therefore be controlled according to the spatial coloring of the noise, for example by whitening the noise. Furthermore, the target sound component acts as a disturbance when estimating background noise characteristics, so background noise must also be distinguished from the target sound. It is therefore not enough merely to detect pauses in the target sound as described above. What is needed is a method that classifies the input signal according to whether it belongs to a target sound section, an interfering sound section, or a background noise section.

Patent Document 1 describes one conventional method of classifying signal sections. It detects target sound sections from the magnitude of a feature quantity called coherence, which is directly related to the arrival direction of the sound. By focusing on the arrival direction, it can distinguish the target sound from colored noise (interfering sound), which was previously difficult. Its drawback is that when the target sound and colored noise are superimposed, the section is judged to be a target sound section whether or not colored noise is present.

To address this problem, the present inventor proposed the technique described in Japanese Patent Application No. 2015-189919 (hereinafter, the reference). It determines whether colored noise (interfering sound) is present from the sign of the correlation between the coherence (Equation (7) of Patent Document 1) and the front suppression signal. With this method, colored noise can be detected simply by monitoring whether the correlation is negative, even when the noise is superimposed on the target sound.

However, in sections without colored noise, the correlation is positive whether the section contains the target sound or only background noise. The reference therefore cannot detect background noise sections. Target sound, interfering sound, and background noise sections could be classified by running several signal section detection methods together, for example by combining Patent Document 1 with the reference. However, as noted above, voice assistants require real-time operation. What is needed is a method that performs the classification with a single detection method rather than several running simultaneously.

In view of the above problems, there is a need for an acoustic signal processing device, method, and program that can determine (classify) in real time, accurately and with a low processing load, whether an input signal belongs to a target sound, interfering sound, or background noise signal section.

To solve this problem, an acoustic signal processing device according to a first aspect of the present invention comprises:
(1) a front suppression signal generation unit that generates a front suppression signal having a blind spot in the front direction, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
(2) a coherence calculation unit that calculates coherence based on signals obtained from the plurality of input signals; and
(3) a classification unit that classifies the signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence, and on a variable obtained in the course of calculating the feature quantity.

An acoustic signal processing program according to a second aspect of the present invention causes a computer to function as:
(1) a front suppression signal generation unit that generates a front suppression signal having a blind spot in the front direction, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
(2) a coherence calculation unit that calculates coherence based on signals obtained from the plurality of input signals; and
(3) a classification unit that classifies the signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence, and on a variable obtained in the course of calculating the feature quantity.

An acoustic signal processing method according to a third aspect of the present invention comprises:
(1) generating, by a front suppression signal generation unit, a front suppression signal having a blind spot in the front direction, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
(2) calculating, by a coherence calculation unit, coherence based on signals obtained from the plurality of input signals; and
(3) classifying, by a classification unit, the signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence, and on a variable obtained in the course of calculating the feature quantity.

The present invention makes it possible to determine (classify) in real time, accurately and with a low processing load, whether an input signal belongs to a target sound, interfering sound, or background noise signal section.

FIG. 1 is a block diagram showing the overall configuration of the acoustic signal processing device according to the embodiment.
FIG. 2 is an explanatory diagram illustrating an example arrangement of the microphones according to the embodiment.
FIG. 3 is a diagram showing the characteristic of the directional signal used in the acoustic signal processing device according to the embodiment.
FIG. 4 is a block diagram showing the internal configuration of the signal classification unit according to the embodiment.
FIG. 5 is a flowchart showing the signal classification process in the classification unit according to the embodiment.

(A) Main Embodiment
A main embodiment of the acoustic signal processing device, method, and program according to the present invention is described in detail below with reference to the drawings.

(A-1) Configuration of the Embodiment
FIG. 1 is a block diagram showing the overall configuration of the acoustic signal processing device 1 according to the embodiment.

As shown in FIG. 1, the acoustic signal processing device 1 acquires input signals s1(n) and s2(n) from a plurality of microphones m_1 and m_2 (FIG. 1 shows the case of two microphones). Here, n is an index indicating the input order of samples and is a positive integer. A smaller n denotes an older input sample and a larger n a newer one.

The acoustic signal processing device 1 determines (classifies) whether the input signals acquired from the microphones m_1 and m_2 belong to a target sound, interfering sound, or background noise signal section. It then supplies the determination result to the downstream audio processing device 2.

The audio processing device 2 uses the determination result from the acoustic signal processing device 1 to perform predetermined processing on the input signals acquired from the microphones m_1 and m_2. For example, the determination result can be used in communication devices or communication software such as video conference systems and telephone terminals, or in preprocessing for speech recognition. The processing performed by the audio processing device 2 is not particularly limited. Examples include controlling the noise correlation matrix calculation in sound source localization and controlling interfering sound suppression. The acoustic signal processing device 1 and the audio processing device 2 need only be able to exchange signals. They may be connected by circuit wiring, or may exchange signals over a network via a wired or wireless line, for example.

FIG. 2 is an explanatory diagram illustrating an example arrangement of the microphones m_1 and m_2.

As shown in FIG. 2, the microphones m_1 and m_2 are arranged so that the plane containing the two microphones is perpendicular to the arrival direction of the target sound (the direction of the target sound source). Also as shown in FIG. 2, the arrival direction of the target sound, viewed from the midpoint between the two microphones m_1 and m_2, is hereinafter called the front direction. Likewise, the right, left, and rear directions are defined as seen from the midpoint between the two microphones when facing the arrival direction of the target sound. In this embodiment, the target sound is assumed to arrive from the front of the microphones m_1 and m_2. Non-target sounds, including interfering speech, are assumed to arrive from the left and right (lateral) directions.

As shown in FIG. 1, the acoustic signal processing device 1 includes an FFT unit 11, a front suppression signal generation unit 12, a coherence calculation unit 13, and a signal classification unit 14.

The acoustic signal processing device 1 may be realized by installing a program (the acoustic signal processing program) on a computer having a processor, memory, and the like. In that case, the device 1 can be represented functionally as in FIG. 1. Part or all of the acoustic signal processing device 1 may also be implemented in hardware.

The FFT unit 11 receives the input signals s1 and s2 from the microphones m_1 and m_2 via AD converters (not shown). It applies a fast Fourier transform (or discrete Fourier transform) to them, so that s1 and s2 are represented in the frequency domain.

To perform the fast Fourier transform, the FFT unit 11 forms analysis frames FRAME1(K) and FRAME2(K) from the input signals s1(n) and s2(n). Each frame consists of a predetermined number N of samples (N is an arbitrary integer). Equation (1) below shows an example of forming FRAME1 from the input signal s1.

In Equation (1), K is an index indicating the order of frames and is a positive integer. A smaller K denotes an older analysis frame and a larger K a newer one. In the following description, unless otherwise noted, K denotes the index of the latest analysis frame to be analyzed.
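Equation (1) is not reproduced in this text. As a hedged sketch only, the framing step can be pictured as cutting each input signal into consecutive N-sample blocks. Non-overlapping, unwindowed framing and the helper name `make_frames` are assumptions of this illustration.

```python
import numpy as np


def make_frames(s, N):
    """Split a 1-D signal into consecutive non-overlapping N-sample frames.

    Row K-1 holds FRAME(K) for K = 1, 2, ...; trailing samples that do not
    fill a whole frame are dropped. Non-overlapping frames are an assumption;
    practical systems often use overlapping, windowed frames.
    """
    s = np.asarray(s)
    num_frames = len(s) // N
    return s[: num_frames * N].reshape(num_frames, N)


s1 = np.arange(10)          # samples n = 1..10 stored at positions 0..9
frames = make_frames(s1, 4)
print(frames)
# [[0 1 2 3]
#  [4 5 6 7]]
```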

For each analysis frame, the FFT unit 11 applies the fast Fourier transform to produce two frequency-domain signals:

- X1(f, K), obtained by transforming the analysis frame FRAME1(K) formed from the input signal s1;
- X2(f, K), obtained by transforming the analysis frame FRAME2(K) formed from the input signal s2.

It supplies both signals to the front suppression signal generation unit 12 and the coherence calculation unit 13.

Here, f is an index representing frequency. X1(f, K) is not a single value. As shown in Equation (2), it consists of m spectral components (m is an arbitrary integer) at the frequencies f1 to fm:
$$X_1(f,K) = \{X_1(f_1,K),\ X_1(f_2,K),\ \ldots,\ X_1(f_m,K)\} \quad (2)$$

In Equation (2), X1(f, K) is a complex number with a real part and an imaginary part. The same applies to X2(f, K) and to the front suppression signal N(f, K), which is described below under the front suppression signal generation unit 12.
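A sketch of the per-frame transform follows. It assumes NumPy's real-input FFT (`numpy.fft.rfft`), which returns the complex spectrum of each frame. Taking its N/2 + 1 bins as the m components f1 to fm, and the 16 kHz / 512-sample settings, are assumptions of this illustration.

```python
import numpy as np


def frames_to_spectra(frames):
    """FFT each analysis frame (one frame per row).

    Row K-1 of the result holds X(f, K). numpy.fft.rfft keeps the
    N // 2 + 1 non-negative frequency bins of a real signal; using those
    bins as the m components f1..fm is an assumption of this sketch.
    """
    return np.fft.rfft(frames, axis=1)


fs, N = 16000, 512                       # assumed sampling rate and frame length
t = np.arange(4 * N) / fs
s1 = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone
X1 = frames_to_spectra(s1.reshape(4, N))
print(X1.shape, X1.dtype)                # (4, 257) complex128
print(np.argmax(np.abs(X1[0])))          # 32, since 1000 * 512 / 16000 = 32
```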

The front suppression signal generation unit 12 suppresses, for each frequency, the front-direction component of the signals supplied from the FFT unit 11. In other words, it acts as a directional filter that suppresses the front-direction component.

For example, as shown in FIG. 3, the front suppression signal generation unit 12 uses a figure-eight bidirectional filter with a blind spot in the front direction. This filter suppresses the front-direction component of the signals supplied from the FFT unit 11.

Specifically, the front suppression signal generation unit 12 computes the front suppression signal N(f, K) for each frequency from the signals X1(f, K) and X2(f, K) supplied by the FFT unit 11, according to Equation (3). This computation corresponds to forming the figure-eight bidirectional filter with a blind spot in the front direction shown in FIG. 3.
$$N(f,K) = X_1(f,K) - X_2(f,K) \quad (3)$$
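Equation (3) is a plain difference of the two spectra. The sketch below checks the intended behavior under an idealized assumption. A frontal source reaches both microphones simultaneously, so it cancels. A lateral source survives; it is modeled here as a 3-sample circular delay at m_2, which is an assumption.

```python
import numpy as np


def front_suppression(X1, X2):
    """Equation (3): N(f, K) = X1(f, K) - X2(f, K)."""
    return X1 - X2


fs, N = 16000, 512
t = np.arange(N) / fs
front = np.sin(2 * np.pi * 1000 * t)
# A frontal source reaches both microphones at the same time (identical
# signals), so the difference cancels it: the null of the figure-eight pattern.
Nf = front_suppression(np.fft.rfft(front), np.fft.rfft(front))
print(np.max(np.abs(Nf)))        # 0.0
# A lateral source reaches m_2 slightly later (circular shift as a stand-in
# for the inter-microphone delay) and is not cancelled.
side2 = np.roll(front, 3)
Ns = front_suppression(np.fft.rfft(front), np.fft.rfft(side2))
print(np.max(np.abs(Ns)) > 1.0)  # True
```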

In this way, the front suppression signal generation unit 12 obtains the components at each of the frequencies f1 to fm (the power in each frequency band for one frame).

The front suppression signal generation unit 12 also calculates the average front suppression signal AVE_N(K) according to Equation (4). AVE_N(K) is the average of the front suppression signal N(f, K) over all the frequencies f1 to fm.
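Equation (4) is not shown here. The minimal sketch below assumes that the average is taken over the magnitudes |N(f, K)| of the m bins. The text does not state whether the magnitude or the complex value is averaged.

```python
import numpy as np


def average_front_suppression(N_spec):
    """AVE_N(K): mean over the m frequency bins of each frame K.

    N(f, K) is complex; averaging the magnitude |N(f, K)| is an assumption
    of this sketch. Rows are frames, columns are frequency bins.
    """
    return np.mean(np.abs(N_spec), axis=-1)


N_spec = np.array([[3 + 4j, 0j, 1 + 0j],    # frame K = 1: |.| = 5, 0, 1
                   [0j, 2j, 0j]])           # frame K = 2: |.| = 0, 2, 0
ave_n = average_front_suppression(N_spec)   # frame 1 -> 2.0, frame 2 -> 2/3
```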

From the frequency-domain signals X1(f, K) and X2(f, K) supplied by the FFT unit 11, the coherence calculation unit 13 forms signals with strong directivity in specific directions. It then calculates the coherence COH(K) from those signals.

The calculation of the coherence COH(K) in the coherence calculation unit 13 is described below.

The coherence calculation unit 13 forms two signals from the frequency-domain signals X1(f, K) and X2(f, K):

- a signal B1(f, K), by processing them with a filter having strong directivity in a first direction (for example, to the left);
- a signal B2(f, K), by processing them with a filter having strong directivity in a second direction (for example, to the right).

Any existing method can be used to form the signals B1(f) and B2(f). Here, by way of example, the signal B1 is formed by Equation (5) and the signal B2 by Equation (6).

In Equations (5) and (6):

- S is the sampling frequency;
- N is the FFT analysis frame length;
- τ is the difference in sound-wave arrival time between the microphones m_1 and m_2;
- i is the imaginary unit;
- f is the frequency.
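Equations (5) and (6) are not reproduced in this text. The sketch below assumes a common delay-and-subtract form that uses the same symbols S, N, τ, i, and f:

- B1(f, K) = X2(f, K) − X1(f, K)·exp(−i2πfSτ/N)
- B2(f, K) = X1(f, K) − X2(f, K)·exp(−i2πfSτ/N)

Treat this as a hypothetical stand-in, not the patent's exact equations. The usage check confirms that B1 cancels a source heard τ earlier at m_1.

```python
import numpy as np


def directional_pair(X1, X2, fs, tau):
    """Hypothetical stand-in for Equations (5) and (6).

    X1, X2: complex rfft spectra with N // 2 + 1 bins (N even).
    fs: sampling frequency S in Hz; tau: inter-microphone delay in seconds.
    B1 cancels a source heard tau seconds earlier at m_1 than at m_2;
    B2 cancels the mirror-image source heard earlier at m_2.
    """
    n_bins = X1.shape[-1]
    N = 2 * (n_bins - 1)                       # FFT analysis frame length
    f = np.arange(n_bins)                      # bin index; bin f is f * fs / N Hz
    delay = np.exp(-1j * 2 * np.pi * f * fs * tau / N)
    B1 = X2 - X1 * delay
    B2 = X1 - X2 * delay
    return B1, B2


fs, N = 16000, 512
tau = 2 / fs                                   # assumed 2-sample delay
rng = np.random.default_rng(0)
X1 = rng.standard_normal(N // 2 + 1) + 1j * rng.standard_normal(N // 2 + 1)
bins = np.arange(N // 2 + 1)
# Source on the m_1 side: m_2 hears the same sound tau seconds later.
X2 = X1 * np.exp(-1j * 2 * np.pi * bins * fs * tau / N)
B1, B2 = directional_pair(X1, X2, fs, tau)
print(np.allclose(B1, 0))                      # True: cancelled in B1
```

Because each beam places its null on one side, the two beams respond differently to lateral sound but identically to frontal sound (X1 = X2). The coherence between them exploits exactly this difference.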

Next, the coherence calculation unit 13 applies the operations of Equations (7) and (8) to the signals B1(f) and B2(f) obtained above, yielding the coherence COH(K). Here, B2(f, K)* in Equation (7) is the complex conjugate of B2(f, K).

coef(f, K) represents the coherence of the component at an arbitrary frequency f (one of the frequencies f1 to fm) in the frame with index K, that is, in the analysis frames FRAME1(K) and FRAME2(K).

When computing coef(f, K), the directivity directions of the signals B1(f) and B2(f) may each be any direction other than the front, as long as they differ from each other. The method of calculating coef(f, K) is also not limited to the one described above.
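Equations (7) and (8) are likewise not reproduced here. As a hypothetical sketch:

- coef(f, K) is the magnitude of the cross term B1·B2*, normalized by the mean power of B1 and B2;
- COH(K) is the average of coef(f, K) over the bins.

This normalization is an assumption, chosen so that coef lies in [0, 1].

```python
import numpy as np


def coherence(B1, B2, eps=1e-12):
    """Hypothetical per-bin coef(f, K) and frame coherence COH(K).

    coef = |B1 * conj(B2)| / (0.5 * (|B1|^2 + |B2|^2)) lies in [0, 1]
    (arithmetic-geometric mean inequality); COH(K) averages coef over bins.
    eps guards against division by zero in silent bins.
    """
    num = np.abs(B1 * np.conj(B2))
    den = 0.5 * (np.abs(B1) ** 2 + np.abs(B2) ** 2) + eps
    coef = num / den
    return coef, coef.mean(axis=-1)


rng = np.random.default_rng(1)
B = rng.standard_normal(257) + 1j * rng.standard_normal(257)
_, coh_equal = coherence(B, B)                    # equal beams: frontal-like
_, coh_one_sided = coherence(np.zeros(257), B)    # one beam cancelled: lateral
print(round(float(coh_equal), 6), round(float(coh_one_sided), 6))  # 1.0 0.0
```

Equal-magnitude beam outputs give a COH near 1, while a signal cancelled in one beam gives 0. This matches the behavior of COH(K) described in this section: large for frontal arrivals and small for lateral ones.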

The signal classification unit 14 uses two inputs: the front suppression signal N(f, K) (that is, the average front suppression signal AVE_N(K)), which has directivity in directions other than the front, and the coherence COH(K). From these it determines (classifies) interfering sound, target sound, and background noise signal sections, and outputs a signal R(K) indicating the result.

FIG. 4 is a block diagram showing the configuration of the signal classification unit 14 according to the embodiment.

As shown in FIG. 4, the signal classification unit 14 includes a front suppression signal acquisition unit 21, a coherence acquisition unit 22, a correlation coefficient calculation unit 23, a classification unit 24, and a result output unit 25.

The front suppression signal acquisition unit 21 acquires the average front suppression signal AVE_N(K) from the front suppression signal generation unit 12. The coherence acquisition unit 22 acquires the coherence COH(K).

The correlation coefficient calculation unit 23 calculates the correlation coefficient cor(K). This is a feature quantity indicating the relationship between the average front suppression signal AVE_N(K) and the coherence COH(K).

The method of calculating the correlation coefficient cor(K) is not limited. For example, cor(K) can be obtained for each frame using Equation (9):
$$\mathrm{cor}(K) = \frac{\mathrm{cov}[\mathrm{AVE\_N}(K),\ \mathrm{COH}(K)]}{\sigma_{\mathrm{AVE\_N}}(K)\,\sigma_{\mathrm{COH}}(K)} \quad (9)$$
In Equation (9):

- cov[AVE_N(K), COH(K)] is the covariance of the average front suppression signal AVE_N(K) and the coherence COH(K);
- σ_AVE_N(K) is the standard deviation of AVE_N(K);
- σ_COH(K) is the standard deviation of COH(K).

The resulting correlation coefficient cor(K) takes values from −1.0 to 1.0.
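Equation (9) needs the covariance and standard deviations of AVE_N and COH, which are only meaningful over several frames. The sketch below assumes they are taken over a sliding window of the most recent `window` frames (a hypothetical parameter). It also returns the intermediate covariance and standard deviations, because the classification unit uses a variable obtained while computing cor(K). Which of these variables the second determination unit uses is not specified in this section.

```python
import numpy as np


def running_correlation(ave_n, coh, window=20):
    """Sliding-window version of Equation (9).

    For each frame (0-based index k), the covariance and standard deviations
    are taken over the last `window` frames; `window` is a hypothetical
    parameter. Entries stay NaN until a full window is available, and cor
    stays NaN when either series is constant within the window.
    """
    ave_n = np.asarray(ave_n, dtype=float)
    coh = np.asarray(coh, dtype=float)
    n = len(ave_n)
    cor, cov, sd_a, sd_c = (np.full(n, np.nan) for _ in range(4))
    for k in range(window - 1, n):
        a = ave_n[k - window + 1 : k + 1]
        c = coh[k - window + 1 : k + 1]
        cov[k] = np.mean((a - a.mean()) * (c - c.mean()))
        sd_a[k], sd_c[k] = a.std(), c.std()
        if sd_a[k] * sd_c[k] > 0:
            cor[k] = cov[k] / (sd_a[k] * sd_c[k])
    return cor, cov, sd_a, sd_c


ave = np.linspace(0.1, 1.0, 30)
cor_neg, *_ = running_correlation(ave, 1.0 - ave, window=10)  # COH falls as AVE_N rises
cor_pos, *_ = running_correlation(ave, 2.0 * ave, window=10)  # COH rises with AVE_N
print(cor_neg[9], cor_pos[9])  # approximately -1.0 and 1.0
```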

The classification unit 24 includes two determination units:

- a first determination unit 241, which determines whether interfering sound is present from the sign of the correlation coefficient cor(K) between the average front suppression signal AVE_N(K) and the coherence COH(K);
- a second determination unit 242, which, when no interfering sound is present, determines background noise and target sound signal sections based on a variable obtained in the course of calculating cor(K).

First, this section describes how the first determination unit 241 of the classification unit 24 determines whether interfering sound is present. The determination is based on the correlation coefficient cor(K) between the average front suppression signal AVE_N(K) and the coherence COH(K).

Here, a sound source emitting the target sound is assumed to be located in front of the microphones m_1 and m_2. Interfering sound is assumed to arrive from a direction other than the front, for example from the lateral directions of the microphones m_1 and m_2 (that is, from the left or right).

For example, when no interfering sound is present and the target sound is present, the front suppression signal N(f, K) takes a value proportional to the magnitude of the target sound component. However, as shown in FIG. 3, the gain in the front direction is smaller than the gain in the lateral directions. This value is therefore smaller than when interfering sound is present.

The coherence COH(K) is a feature quantity closely related to the direction of arrival of the input signal and can be regarded as the correlation between two signal components. This is because equation (6) calculates the correlation for one frequency component and equation (7) averages these correlation values over all frequency components. A small COH(K) therefore indicates a small correlation between the two signal components, and a large COH(K) indicates a large correlation. When COH(K) is small, the arrival direction of the input signal is strongly biased toward either the right or the left, so the signal can be regarded as arriving from a direction other than the front. Conversely, when COH(K) is large, the arrival direction is not strongly biased, and the signal can be regarded as arriving from the front.
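Equation (7), as described above, reduces the per-frequency correlation values of one frame to a single value by averaging. The following is a minimal sketch of only that averaging step. It assumes the per-frequency values coef(f,K) from equation (6) are already available as a list (equation (6) itself is not reproduced in this section), and the helper name `frame_coherence` and the sample values are hypothetical.

```python
from typing import Sequence


def frame_coherence(coef: Sequence[float]) -> float:
    """Average per-frequency correlation values coef(f, K) into COH(K).

    Hypothetical helper: `coef` is assumed to hold the equation (6)
    values for every frequency bin f of frame K.
    """
    if not coef:
        raise ValueError("need at least one frequency bin")
    return sum(coef) / len(coef)


# Illustrative, made-up per-bin values: a frame dominated by a frontal
# source (high per-bin correlation) and one dominated by a lateral source.
frontal = frame_coherence([0.9, 0.8, 0.95, 0.85])
lateral = frame_coherence([0.2, 0.3, 0.1, 0.25])
```

Averaging over all bins gives one scalar per frame, which is what the later correlation with AVE_N(K) operates on.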

Accordingly, when no interfering sound is present and the target sound is present, COH(K) takes a large value; when an interfering sound is present together with the target sound, COH(K) takes a small value.

Organizing the above behavior by the presence or absence of interfering sound gives the following relationships.
・When no interfering sound is present and the target sound is present, COH(K) takes a large value, and the front suppression signal N(f,K) (average front suppression signal AVE_N(K)) takes a value proportional to the magnitude of the target sound component.
・When an interfering sound is present, COH(K) takes a small value, and the front suppression signal N(f,K) (average front suppression signal AVE_N(K)) takes a large value.

Given this behavior, introducing the correlation coefficient cor(K) between the front suppression signal N(f,K) (average front suppression signal AVE_N(K)) and COH(K) yields the following, because both quantities rise together when only the target sound is present, whereas COH(K) falls as AVE_N(K) rises when an interfering sound is present.
・When no interfering sound is present, cor(K) is positive (cor(K) > 0).
・When an interfering sound is present, cor(K) is not positive (cor(K) ≤ 0).

Therefore, the first determination unit 241 observes the sign of the correlation coefficient cor(K) between AVE_N(K) and COH(K), and determines that no interfering sound is present when cor(K) is positive and that an interfering sound is present when cor(K) is not positive.
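The sign rule can be illustrated numerically. The sketch below computes a Pearson correlation coefficient between short sequences of AVE_N(K) and COH(K) values and applies the first determination (interfering sound when cor(K) ≤ 0). The sequences are synthetic, made-up values chosen only to mimic the two behaviors described above; computing cor(K) over a short window of recent frames is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (sx * sy)


def interference_present(cor: float) -> bool:
    """First determination: interfering sound if cor(K) is not positive."""
    return cor <= 0.0


# Target speech switching on and off (no interferer): AVE_N and COH rise together.
ave_n_target = [0.1, 0.4, 0.5, 0.1, 0.45]
coh_target = [0.3, 0.8, 0.9, 0.3, 0.85]

# Lateral interferer switching on and off: AVE_N rises while COH falls.
ave_n_interf = [0.2, 1.5, 1.6, 0.2, 1.4]
coh_interf = [0.7, 0.2, 0.15, 0.7, 0.25]

cor_target = pearson(ave_n_target, coh_target)
cor_interf = pearson(ave_n_interf, coh_interf)
print(interference_present(cor_target), interference_present(cor_interf))  # prints: False True
```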

Next, the process by which the second determination unit 242 of the classification unit 24, when no interfering sound is present, distinguishes background noise sections from target sound sections based on a variable obtained while calculating the correlation coefficient cor(K) between AVE_N(K) and COH(K) is described.

As described above, the correlation coefficient cor(K) can be calculated using equation (9).

In equation (9), the numerator is the covariance of the average front suppression signal AVE_N(K) and the coherence COH(K), and the denominator is the product of the standard deviation of the coherence σ(COH(K)) and the standard deviation of the average front suppression signal σ(AVE_N(K)).
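Written out, the structure just described corresponds to the following form. This is a sketch, not a reproduction of equation (9): it assumes the statistics are taken over the M most recent frames k = K−M+1, …, K, and the window length M and the overbar notation for window means are assumptions, since this section does not specify them.

```latex
\begin{aligned}
\mathrm{cor}(K) &= \frac{\mathrm{cov}\bigl(\mathrm{AVE\_N}(K),\,\mathrm{COH}(K)\bigr)}
                        {\sigma\bigl(\mathrm{COH}(K)\bigr)\,\sigma\bigl(\mathrm{AVE\_N}(K)\bigr)},\\[4pt]
\mathrm{cov}\bigl(\mathrm{AVE\_N}(K),\,\mathrm{COH}(K)\bigr)
  &= \frac{1}{M}\sum_{k=K-M+1}^{K}\bigl(\mathrm{AVE\_N}(k)-\overline{\mathrm{AVE\_N}}\bigr)\bigl(\mathrm{COH}(k)-\overline{\mathrm{COH}}\bigr),\\[4pt]
\sigma\bigl(\mathrm{COH}(K)\bigr)
  &= \sqrt{\frac{1}{M}\sum_{k=K-M+1}^{K}\bigl(\mathrm{COH}(k)-\overline{\mathrm{COH}}\bigr)^{2}},
\end{aligned}
```

with σ(AVE_N(K)) defined analogously. The normalization factor (1/M or 1/(M−1)) cancels in cor(K), but it does change the scale of σ(COH(K)) and therefore the appropriate value of the threshold Θ used below.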

Assuming the target sound arrives from the front, the variable in equation (9) that changes most between target sound sections and background noise sections is the standard deviation of the coherence σ(COH(K)). This is because COH(K) takes a large value whenever a signal component arriving from the front is present, so its fluctuation range is large in target sound sections, whereas in background noise sections its fluctuation range is small.

Using this property, sections in which no interfering sound is present can be further classified into target sound sections and background noise sections.

That is, when no interfering sound is present, the second determination unit 242 compares the standard deviation of the coherence σ(COH(K)), a variable obtained while calculating cor(K), with a threshold Θ, and determines whether the section is a target sound section or a background noise section as follows.
・A signal section in which cor(K) is positive and σ(COH(K)) is greater than or equal to Θ is a target sound section.
・A signal section in which cor(K) is positive and σ(COH(K)) is less than Θ is a background noise section.

As described above, by also taking into account σ(COH(K)), a variable obtained while calculating the correlation coefficient cor(K) between COH(K) and AVE_N(K), the second determination unit 242 of the classification unit 24 can classify signal sections in detail with a single method. As a result, even when the downstream audio processing device 2 requires real-time operation, the determination result for each signal section can be reported promptly.
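Because σ(COH(K)) is already part of the denominator of equation (9), the second determination needs no extra pass over the data. The sketch below assumes the statistics are taken over a window of recent frames and returns the intermediate values alongside cor(K). The function and field names are hypothetical, and returning cor = 0 for a window with zero spread is also a hypothetical convention (the source does not address that case); note that the first determination would then treat such a window as containing an interfering sound.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CorrelationStats:
    cor: float        # correlation coefficient cor(K), cf. equation (9)
    var_coh: float    # variance of COH over the window
    sigma_coh: float  # standard deviation sigma(COH(K))
    sigma_n: float    # standard deviation sigma(AVE_N(K))


def correlation_with_intermediates(
    ave_n: Sequence[float], coh: Sequence[float]
) -> CorrelationStats:
    n = len(ave_n)
    if n == 0 or n != len(coh):
        raise ValueError("sequences must be non-empty and equal in length")
    mean_n = sum(ave_n) / n
    mean_c = sum(coh) / n
    cov = sum((a - mean_n) * (c - mean_c) for a, c in zip(ave_n, coh)) / n
    var_n = sum((a - mean_n) ** 2 for a in ave_n) / n
    var_c = sum((c - mean_c) ** 2 for c in coh) / n
    sigma_n, sigma_c = math.sqrt(var_n), math.sqrt(var_c)
    denom = sigma_c * sigma_n
    # Hypothetical convention: a window with zero spread yields cor = 0.
    cor = cov / denom if denom > 0.0 else 0.0
    return CorrelationStats(cor, var_c, sigma_c, sigma_n)


# One call gives both inputs of the classification: cor(K) and sigma(COH(K)).
stats = correlation_with_intermediates([0.1, 0.4, 0.5], [0.3, 0.8, 0.9])
```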

The result output unit 25 outputs a signal R(K) indicating the determination results of the first determination unit 241 and the second determination unit 242 to the downstream audio processing device 2. The format of R(K) is not limited. For example, a value indicating the presence of an interfering sound (e.g., R(K) = 0) may be output when an interfering sound is present, a value indicating a target sound section (e.g., R(K) = 1) in a target sound section, and a value indicating a background noise section (e.g., R(K) = 2) in a background noise section. In this embodiment, for ease of explanation, the result output unit 25 outputs R(K) to the audio processing device 2, but the output destination is not limited to this.

(A-2) Operation of the Embodiment
Next, the operation of classifying signal sections of the input signal in the acoustic signal processing device 1 according to the embodiment is described in detail with reference to the drawings.

First, input signals s1(n) and s2(n) for one frame (one processing unit) are supplied from microphones m_1 and m_2, respectively, to the FFT unit 11 via AD converters (not shown). The FFT unit 11 applies a Fourier transform to the analysis frames FRAME1(K) and FRAME2(K), which are based on the one-frame input signals s1(n) and s2(n), to obtain the frequency-domain signals X1(f,K) and X2(f,K). The signals X1(f,K) and X2(f,K) generated by the FFT unit 11 are supplied to the front suppression signal generation unit 12 and the coherence calculation unit 13.
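The transform step can be sketched with a direct discrete Fourier transform. This is illustrative only: a practical FFT unit would use an FFT library and typically a window function, and the 8-sample frame below is an arbitrary assumption, since this section does not specify how FRAME1(K) and FRAME2(K) are built from s1(n) and s2(n).

```python
import cmath
import math
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Direct O(N^2) DFT of one analysis frame (stand-in for the FFT unit 11)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


# Hypothetical 8-sample analysis frame from microphone m_1: one cycle of a sine.
frame1 = [math.sin(2 * math.pi * t / 8) for t in range(8)]
x1 = dft(frame1)  # X1(f, K); the same call on FRAME2(K) would give X2(f, K)
```

For a unit-amplitude sine completing one cycle per frame, the energy lands in bin 1 (and its mirror bin 7) with magnitude N/2 = 4, and the DC bin is essentially zero.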

The front suppression signal generation unit 12 calculates the front suppression signal N(f,K) from the signals X1(f,K) and X2(f,K) supplied by the FFT unit 11. It then calculates the average front suppression signal AVE_N(K) from N(f,K) and supplies it to the signal classification unit 14.
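A minimal sketch of unit 12 follows. It assumes N(f,K) is the plain difference X1(f,K) − X2(f,K) (the abstract states that the front suppression signal is based on the difference of the frequency-domain inputs, but the exact expression, which may include delay or gain terms, is not given in this section) and that AVE_N(K) is the mean magnitude over frequency bins; both forms, the one-sample delay, and the function names are assumptions. The example shows why a difference creates a blind spot in front: a frontal source reaches both microphones at the same time, so its components cancel, whereas a lateral source arrives with an inter-microphone delay and does not.

```python
import cmath
import math
from typing import List, Sequence


def front_suppression(x1: Sequence[complex], x2: Sequence[complex]) -> List[complex]:
    """N(f, K) as the difference X1 - X2 (assumed form)."""
    return [a - b for a, b in zip(x1, x2)]


def average_front_suppression(n_fk: Sequence[complex]) -> float:
    """AVE_N(K) as the mean magnitude over frequency bins (assumed form)."""
    return sum(abs(v) for v in n_fk) / len(n_fk)


num_bins = 8
x1 = [complex(1.0, 0.0)] * num_bins  # unit-magnitude spectrum at mic m_1

# Frontal source: identical spectra at both microphones.
x2_front = list(x1)
# Lateral source: mic m_2 hears it one sample later (hypothetical delay),
# i.e. a phase shift of -2*pi*f/num_bins in bin f.
x2_side = [x1[f] * cmath.exp(-2j * math.pi * f / num_bins) for f in range(num_bins)]

ave_front = average_front_suppression(front_suppression(x1, x2_front))
ave_side = average_front_suppression(front_suppression(x1, x2_side))
print(ave_front, ave_side > ave_front)  # prints: 0.0 True
```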

The coherence calculation unit 13 calculates the coherence COH(K) from the signals X1(f,K) and X2(f,K) supplied by the FFT unit 11 and supplies it to the signal classification unit 14.

In the signal classification unit 14, the correlation coefficient calculation unit 23 calculates, for example using equation (9), the correlation coefficient cor(K), a feature quantity indicating the relationship between AVE_N(K) and COH(K).

The classification unit 24 then observes cor(K) calculated by the correlation coefficient calculation unit 23 and the standard deviation of the coherence σ(COH(K)) obtained while calculating cor(K), determines whether an interfering sound is present and whether the section is a target sound section or a background noise section, and outputs a signal R(K) indicating the determination result.

FIG. 5 is a flowchart showing the signal classification processing in the classification unit 24 according to the embodiment.

First, in the classification unit 24, the first determination unit 241 determines whether the correlation coefficient cor(K) between AVE_N(K) and COH(K) is positive (S101). If cor(K) is not positive (cor(K) ≤ 0), the first determination unit 241 determines that an interfering sound is present (S102).

If cor(K) is positive (cor(K) > 0), that is, in a signal section with no interfering sound, the second determination unit 242 determines whether the standard deviation of the coherence σ(COH(K)), obtained while calculating cor(K), is greater than or equal to the threshold Θ (S103).

If σ(COH(K)) is greater than or equal to Θ, the second determination unit 242 determines that the section is a target sound section (S104); if σ(COH(K)) is less than Θ, it determines that the section is a background noise section (S105).
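The flowchart of FIG. 5 (S101 to S105) can be sketched as a small decision function. The numeric codes follow the example values for R(K) given earlier (0: interfering sound, 1: target sound section, 2: background noise section). The class and function names and the threshold value are hypothetical; this section does not give a value for Θ.

```python
from enum import IntEnum


class SectionType(IntEnum):
    """Example encoding of the output signal R(K)."""
    INTERFERENCE = 0
    TARGET = 1
    BACKGROUND = 2


def classify_section(cor: float, sigma_coh: float, theta: float) -> SectionType:
    # S101/S102: a non-positive correlation means an interfering sound is present.
    if cor <= 0.0:
        return SectionType.INTERFERENCE
    # S103: no interfering sound; compare sigma(COH(K)) with the threshold.
    if sigma_coh >= theta:
        return SectionType.TARGET      # S104
    return SectionType.BACKGROUND      # S105


THETA = 0.1  # hypothetical threshold, for illustration only
print(classify_section(-0.4, 0.3, THETA).name)   # prints: INTERFERENCE
print(classify_section(0.6, 0.3, THETA).name)    # prints: TARGET
print(classify_section(0.6, 0.02, THETA).name)   # prints: BACKGROUND
```

Using `IntEnum` keeps the result usable directly as the integer R(K) while giving each code a readable name.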

As described above, the signal classification unit 14 calculates the correlation coefficient cor(K) between AVE_N(K) and COH(K) and determines from its sign whether an interfering sound is present. When no interfering sound is present, the signal classification unit 14 further determines, from the standard deviation of the coherence σ(COH(K)), a variable obtained while calculating cor(K), whether the section is a target sound section or a background noise section. A single method can therefore classify each section as containing an interfering sound, as a target sound section, or as a background noise section.

The result output unit 25 then outputs a signal R(K) indicating the determination result of the classification unit 24.

(A-3) Effects of the Embodiment
As described above, this embodiment exploits the following characteristic behavior: the correlation coefficient between the front suppression signal and the coherence is negative when an interfering sound is present and positive when it is absent, and, when no interfering sound is present, a large standard deviation of the coherence (an intermediate variable) indicates a target sound section while a small one indicates a background noise section. This allows input signal sections to be classified simply. The classification result enables more appropriate downstream processing; for example, accuracy can be improved by controlling sound source localization according to whether the noise is an interfering sound (spatially colored) or background noise (spatially white). In addition, because the processing load is small, real-time operation can be ensured. Accordingly, applying the method of this embodiment to the speech recognition preprocessing used in voice assistants and similar applications can be expected to improve user satisfaction.

(B) Other Embodiments
Although various modified embodiments have been mentioned in the embodiment described above, the present invention can also be applied to the following modified embodiments.

(B-1) The embodiment described above classifies target sound sections and background noise sections, within signal sections with no interfering sound, based on the magnitude of the standard deviation of the coherence. However, the variable used for this determination is not limited to the standard deviation, because what the classification relies on is the size of the coherence fluctuation range when no interfering sound is present. Target sound sections and background noise sections may therefore be classified based on the variance of the coherence instead. In this case as well the processing load is small, because the variance is obtained in the course of calculating the standard deviation.
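A minimal sketch of the variance-based variant, assuming the coherence statistics are taken over a window of recent frames: because the variance is the square of the standard deviation and both are non-negative, comparing the variance against Θ² gives the same decision as comparing σ(COH(K)) against Θ (for Θ ≥ 0), so switching variables only changes how the threshold is expressed and saves one square root. The function names, window values, and threshold are hypothetical.

```python
import math
from typing import Sequence


def coherence_variance(coh: Sequence[float]) -> float:
    """Population variance of COH over a window of frames (assumed window)."""
    m = sum(coh) / len(coh)
    return sum((c - m) ** 2 for c in coh) / len(coh)


def is_target_by_variance(coh: Sequence[float], theta_var: float) -> bool:
    """Target sound section if var(COH) >= theta_var (no interference assumed)."""
    return coherence_variance(coh) >= theta_var


window = [0.3, 0.8, 0.9, 0.3, 0.85]  # made-up COH values from a speech section
theta = 0.1                           # hypothetical standard-deviation threshold
var = coherence_variance(window)
same_decision = is_target_by_variance(window, theta ** 2) == (math.sqrt(var) >= theta)
```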

Reference Signs List: 1 ... acoustic signal processing device, 11 ... FFT unit, 12 ... front suppression signal generation unit, 13 ... coherence calculation unit, 14 ... signal classification unit, 21 ... front suppression signal acquisition unit, 22 ... coherence acquisition unit, 23 ... correlation coefficient calculation unit, 24 ... classification unit, 241 ... first determination unit, 242 ... second determination unit, 25 ... result output unit.

Claims

An acoustic signal processing device comprising:
a front suppression signal generation unit that generates a front suppression signal having a blind spot in front, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
a coherence calculation unit that calculates coherence based on signals obtained from the plurality of input signals; and
a classification unit that classifies signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence and on a variable obtained in the course of calculating the feature quantity.

The acoustic signal processing device according to claim 1, wherein the classification unit comprises:
a first determination unit that uses the correlation coefficient between the front suppression signal and the coherence as the feature quantity and determines the presence or absence of an interfering sound based on whether the correlation coefficient is positive or negative; and
a second determination unit that, in a signal section in which the interfering sound is absent, determines a target sound section and a background noise section based on a variable that affects the magnitude of the fluctuation range of the coherence.

The acoustic signal processing device according to claim 2, wherein the second determination unit compares the value of the standard deviation of the coherence with a threshold, and determines a target sound section if the standard deviation is greater than or equal to the threshold and a background noise section otherwise.

An acoustic signal processing program causing a computer to function as:
a front suppression signal generation unit that generates a front suppression signal having a blind spot in front, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
a coherence calculation unit that calculates coherence based on signals obtained from the plurality of input signals; and
a classification unit that classifies signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence and on a variable obtained in the course of calculating the feature quantity.

An acoustic signal processing method comprising:
generating, by a front suppression signal generation unit, a front suppression signal having a blind spot in front, based on the difference between a plurality of frequency-domain input signals obtained by converting the input signals from each of a plurality of microphones from the time domain to the frequency domain;
calculating, by a coherence calculation unit, coherence based on signals obtained from the plurality of input signals; and
classifying, by a classification unit, signal sections of the input signals based on the value of a feature quantity indicating the relationship between the front suppression signal and the coherence and on a variable obtained in the course of calculating the feature quantity.