JP2009080309A

JP2009080309A - Speech recognition device, speech recognition method, speech recognition program and recording medium in which speech recognition program is recorded

Info

Publication number: JP2009080309A
Application number: JP2007249648A
Authority: JP
Inventors: Kaoru Suzuki; 薫鈴木; Miwako Doi; 美和子土井; Toshiyuki Koga; 敏之古賀; Koichi Yamamoto; 幸一山本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2007-09-26
Filing date: 2007-09-26
Publication date: 2009-04-16

Abstract

PROBLEM TO BE SOLVED: To detect, in the sound source localization process, time periods in which noise is dominant, and to handle those periods appropriately in the speech recognition process.

SOLUTION: A phase difference and a power are calculated for each frequency component from first and second acoustic signals captured at two points, and a scatter diagram is generated whose coordinates are the frequency and the phase difference of each frequency component. Arrangements of frequency components that show a predetermined linearity on the scatter diagram are detected together with a straight-line score that depends on the power of those components. An arrangement whose straight-line score is at or above a predetermined threshold is detected as a straight line indicating the existence of a sound source. A sound source stream consisting of information on the straight lines, their straight-line scores, and so on is extracted, and reliability information is attached to each time of the sound source stream based on how high the straight-line score is at that time.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus, and in particular to an apparatus that detects and localizes a target sound source in a noisy environment, separates and extracts the sound data of that target source from the noise, and recognizes it.

In recent years, in the field of robot audition research, methods have been proposed for estimating the number and directions of multiple target sound sources in a noisy environment (sound source localization), separating and extracting the sound from each source (sound source separation), and recognizing the separated sound (speech recognition).

For example, a method has been proposed that uses a pair of microphones to localize directional sound sources (target sound sources) and separate their sound in a diffuse-noise environment (see, for example, Patent Document 1). In this method, the two acoustic signals captured by the two microphones are each Fourier transformed, the phase difference between the two signals is obtained for each frequency from the resulting frequency-resolved data, and a scatter diagram is generated by plotting these phase differences in a frequency-phase difference coordinate system. If frequency components with the same arrival time difference are regarded as originating from the same sound source, these components are distributed along a straight line through the origin of the scatter diagram. Exploiting this, the method detects sound sources and localizes their directions by straight-line detection on the scatter diagram using the Hough transform and Hough voting. If two straight lines are detected whose Hough vote value (also called a score) exceeds a predetermined threshold, there are two sound sources in different directions, and the slope of each line indicates the direction of the corresponding source. Furthermore, the straight lines obtained at each time are grouped into a time series according to their slopes to form a target sound source stream, and the sound from that source (the target speech) is extracted by beamforming with directivity toward the source direction. The separated and extracted speech (separated speech) is then recognized, and the linguistic information of the target speech is estimated.

This prior art first detects the presence of each sound source and estimates its direction (sound source localization) based on the difference in arrival time at the two microphones of sound emitted from a local region in space (a directional sound source, i.e., a target sound source). It then separates and extracts each source's sound from other sounds (other directional sources and diffuse environmental noise) by beamforming with directivity toward each source direction (sound source separation), and recognizes the separated speech (speech recognition).

However, when multiple sound sources emit sound at the same frequency simultaneously, the amplitude vector obtained by the Fourier transform for that frequency is, on the complex plane, the vector sum of the amplitude vectors of the individual sources, so the phase difference computed for that frequency does not represent a correct arrival time difference. Such a frequency component fits none of the source directions and is lost during beamforming, so the extracted separated speech is distorted. If the acoustic model used for speech recognition has not been trained on this distortion, the acoustic score in the likelihood calculation does not rise, which causes misrecognition.

To suppress the influence of this distortion, a speech recognition method has been presented that masks the feature components affected by spectral distortion and excludes them from the likelihood calculation (see, for example, Non-Patent Document 1). Noise is estimated during the sound source separation process, and the mask is generated automatically from the estimated noise information. As that document itself states, this is a technique that "integrates sound source separation and speech recognition".

In addition, even while a person is uttering a single phrase, the sound pressure varies, so in weak portions the speech is overwhelmed by environmental noise and the separated speech for that period cannot be observed. In such a case, part of an utterance that would be audible in a quiet environment goes unobserved in the noisy environment, so speech recognition that is given this utterance as grammar information is likely to fail to interpret that period and to misrecognize. Because such periods grow longer as the noise intensity increases, the risk increases accordingly.

As an example of detecting and using changes in speech intensity during an utterance, a method has been presented that controls the likelihood calculation according to how silence-like the input speech is (see, for example, Patent Document 2). Its embodiment states that it "... uses beam search, and is characterized by narrowing the beam width in silent sections". In a beam search that prunes at each time while keeping several high-likelihood hypotheses, the method keeps fewer hypotheses in silent sections and thereby reduces the amount of recognition processing there. The rationale is that a silent section contains no speech information in the first place, so wasted computation there should be reduced. Two methods of determining whether a section is silent are shown: (A) judging by whether a period in which the power of the input speech is higher than a predetermined threshold continues, and (B) judging by whether a period continues in which the acoustic score matched against silence acoustic features is higher than the acoustic score matched against non-silence acoustic features.

On the other hand, when the sound source localization process of Patent Document 1 is applied, the Hough vote value can be defined to increase with the power of each frequency component, so the magnitude of the vote value for each direction can be made to reflect the power of the sound arriving from that direction. As a result, periods in which environmental noise is dominant can be detected by the fact that no straight line is detected or that the line's vote value is small. Therefore, even during a period in which, because of strong environmental sound, the input is dominated by strong noise rather than being silent, it becomes possible to detect that the sound from a given source is being intermittently buried in that environmental sound. The "silence detection means" of Patent Document 2 has no such capability.

It is also expected that the speech recognition process can suitably adjust its processing during periods in which such environmental noise is dominant. In contrast to Non-Patent Document 1, which "integrates sound source separation and speech recognition", this corresponds to an approach that "integrates sound source localization and speech recognition". The prior art described above does not suggest this integration.

Regarding straight-line detection, a technique for detecting straight lines on a frequency-phase difference scatter diagram has been disclosed (see, for example, Patent Document 3). This technique resembles Patent Document 1 in that it evaluates hypothesized straight lines of various slopes on the scatter diagram, but the scheme differs. Patent Document 3 detects the straight line with the least squared error with respect to the arrangement of frequency components on the scatter diagram, and its evaluation value (the vote value of a hypothesized line) is the sum of squared distances between each frequency component on the scatter diagram and the line, so the vote value does not depend on power as it does in Patent Document 1. To learn from the vote value when noise is dominant, it is more convenient to obtain the vote value as a function of power (or amplitude).

Patent Document 1: JP 2006-254226 A
Patent Document 2: Japanese Patent Laid-Open No. 11-85180
Patent Document 3: JP 2003-337164 A
Non-Patent Document 1: Shunichi Yamamoto et al., "Simultaneous speech recognition based on automatic generation of missing feature masks by integration with sound source separation", Journal of the Robotics Society of Japan, Vol. 25, No. 1, issued January 15, 2007

The present invention has been made in view of the above problems and considerations. Its object is to provide a speech recognition apparatus, a speech recognition method, a speech recognition program, and a recording medium storing the speech recognition program that can (1) detect, in the sound source localization process, periods in which noise is dominant and (2) recognize the separated speech of a target sound source stream while suppressing the adverse effect of noise.

A speech recognition apparatus according to one aspect of the present invention comprises: input means for inputting first and second acoustic signals captured at two points; calculation means for frequency-resolving each of the first and second acoustic signals to obtain frequency components and calculating a phase difference and a power for each frequency component; generation means for generating a scatter diagram whose coordinate values are the frequency component values and the phase difference values; detection means for detecting, on the scatter diagram, arrangements of frequency components that exhibit linearity, together with a straight-line score that depends on the power, and detecting an arrangement whose straight-line score is equal to or greater than a threshold as a straight line indicating the existence of a sound source; extraction means for extracting a sound source stream that groups at least one straight line detected by the detection means along the time axis while tolerating, within fixed ranges, periods with no detected line and fluctuation of the line slope, the sound source stream including information containing the slope of the straight line, the straight-line score, and information on the time at which the straight line was detected; classification means for attaching, to each time of the sound source stream, reliability information based on the level of the straight-line score, thereby classifying each frame of the sound source stream; sound source separation means for extracting the sound data of the sound source stream, and thus separating the source, based on a sound source existence angle calculated from the slope information of the straight line included in the sound source stream; and speech recognition means for expanding sentence hypotheses defined in grammar information into a search tree of states and transitions, extracting predetermined acoustic features from the sound data of the sound source stream, calculating the likelihood of state transition paths of the search tree for the sequence of acoustic features, and recognizing the linguistic content of the sound source stream by searching for a state transition path of high likelihood, wherein the search for the state transition path is controlled based on the reliability information.

According to the present invention, (1) periods in which noise is dominant can be detected in the sound source localization process, and (2) the separated speech of a target sound source stream can be recognized while suppressing the adverse effect of noise.

Embodiments of the speech recognition apparatus, speech recognition method, speech recognition program, and recording medium storing the speech recognition program according to the present invention are described below with reference to the drawings.

FIG. 1 shows the functional blocks of a speech recognition apparatus according to an embodiment of the present invention. The apparatus comprises microphones 1a and 1b arranged at spatially different positions, an acoustic signal input unit 2, a sound source stream extraction/classification unit 3, a sound source separation unit 4, a vocabulary recognition unit 5, a speaker recognition unit 6, an object sound recognition unit 7, an output unit 8, and a user interface unit 9.

The two streams of amplitude data from microphones 1a and 1b are input to the sound source stream extraction/classification unit 3 via the acoustic signal input unit 2. At each discrete time (frame), repeated at a predetermined interval (the frame shift), the sound source stream extraction/classification unit 3 first (1) frequency-resolves a predetermined number (the frame length) of amplitude samples by FFT and (2) obtains the phase difference between the two inputs for each frequency component. At the same time, it obtains a representative power value for each frequency component, for example the average of that component's power values in the two inputs.

Next, the sound source stream extraction/classification unit 3 (3) turns the per-frequency phase differences of a predetermined number of consecutive frames into a two-dimensional scatter diagram on the frequency-phase difference plane, and (4) detects predetermined straight lines from this scatter diagram together with their straight-line scores. A detected straight line suggests the existence of a directional sound source (target sound source). The frequency components distributed near the line approximate the spectrum, at that time (frame), of the sound emitted by the target source (the source sound), and the straight-line score calculated from the representative power values of those components gives a measure of the total power of the source sound. In this embodiment, the straight-line score is computed as the Hough vote value. The slope θ of the detected line is in one-to-one correspondence with the existence angle φ of the target source relative to the line segment connecting microphones 1a and 1b (the opening angle of the conical surface on which the target source lies).

Then, (5) on data in which the line slopes θ are arranged on an angle-time plane, lines are grouped along the time axis while tolerating, within predetermined ranges, periods with no detected line and fluctuation of the line slope. A grouped series of detected lines whose length, including the no-detection periods, is at least a predetermined duration is detected, together with its series of slopes θ, the series of existence angles φ calculated from them, and its existence period (the period between the first and last frames of the grouping), as the information of a sound stream emitted from one target source (a target sound source stream).
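A power-weighted vote of the kind described in step (4) might look like the following minimal sketch. It is not the procedure of Patent Document 1: it only tests candidate lines through the origin, ignores phase wrap-around, and the names `line_scores`, `detect_lines`, `tol`, and `threshold` are illustrative assumptions.

```python
def line_scores(points, powers, slopes, tol=0.25):
    """Score each candidate line x = slope * k by the power of nearby points.

    points: list of (x, k) with x the phase difference and k the bin index.
    powers: representative power of each point (same order as points).
    """
    scores = []
    for slope in slopes:
        total = 0.0
        for (x, k), power in zip(points, powers):
            # A point votes for the line if its phase difference is close
            # to the value the line predicts for bin k.
            if abs(x - slope * k) <= tol:
                total += power
        scores.append(total)
    return scores


def detect_lines(points, powers, slopes, threshold, tol=0.25):
    """Keep the candidate slopes whose score reaches the threshold."""
    scores = line_scores(points, powers, slopes, tol)
    return [(s, sc) for s, sc in zip(slopes, scores) if sc >= threshold]
```

Because each vote is weighted by power, a frame in which the source is buried in noise yields a low score or no detected line at all, which is the property the later frame classification relies on.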

In particular, the sound source stream extraction/classification unit 3 of this embodiment (6) classifies each frame between the start and end of the target sound source stream as reliable if a straight line in that frame was grouped into the stream, and as unreliable otherwise. This reliable/unreliable distinction represents the clarity of the stream's source sound at each time (frame).
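Steps (5) and (6) can be sketched as follows, under simplifying assumptions: at most one detected line per frame, a single stream open at a time, and hypothetical parameter names `max_gap`, `max_dev`, and `min_len` for the tolerated no-detection period, the tolerated slope fluctuation, and the minimum stream duration.

```python
def extract_streams(slopes, max_gap=2, max_dev=0.05, min_len=3):
    """Group per-frame line slopes into streams and label each frame.

    slopes: one entry per frame, the detected slope or None if no line.
    Returns dicts with 'start', 'end', and a per-frame 'reliable' list.
    """
    streams, cur = [], None
    for t, s in enumerate(slopes):
        if cur is not None and s is not None and abs(s - cur["last"]) <= max_dev:
            cur["frames"].append(t)   # line joins the open stream
            cur["last"] = s
            continue
        # No matching line in this frame: close the stream once the run
        # of unmatched frames exceeds the tolerated gap.
        if cur is not None and t - cur["frames"][-1] > max_gap:
            streams.append(cur)
            cur = None
        if cur is None and s is not None:
            cur = {"frames": [t], "last": s}
    if cur is not None:
        streams.append(cur)

    result = []
    for st in streams:
        start, end = st["frames"][0], st["frames"][-1]
        if end - start + 1 < min_len:
            continue                  # too short to count as a stream
        grouped = set(st["frames"])
        result.append({
            "start": start,
            "end": end,
            # reliable: a line in this frame was grouped into the stream
            "reliable": [t in grouped for t in range(start, end + 1)],
        })
    return result
```

With `slopes = [None, 0.10, 0.11, None, None, 0.10, 0.12, None, None, None, 0.50]` and the default parameters, this sketch returns one stream covering frames 1 to 6 whose two middle frames are marked unreliable.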

The processes (1) to (5) above can be carried out with the technique disclosed in Patent Document 1.

The sound source separation unit 4 separates the sound data of the target sound source stream (the target sound data) from the environmental noise by applying beamforming to the input sound data based on the existence angle series.

Through the above processing, the sound data of the target sound source stream (called the target sound data or separated sound data) and the reliability information for each time (frame) are obtained. An unreliable frame can be interpreted as a period in which no straight line could be clearly detected because (a) the target sound was overwhelmed by strong environmental noise or (b) the target sound was weak or absent to begin with. In case (a) in particular, part of an utterance that would be audible in a quiet environment goes unobserved in the noisy environment, so conventional vocabulary recognition that is given the utterance as grammar information is likely to fail to interpret that period and to misrecognize.

Likewise, in speaker recognition and object sound recognition, attempting to recognize sound from a period in which noise is dominant carries a high risk of misrecognition.

The vocabulary recognition unit 5 is a means for recognizing the linguistic content of the target sound data. In the likelihood calculation for interpreting the data according to grammar information, it uses only frames classified as reliable and does not prune in frames classified as unreliable, thereby suppressing misrecognition.
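The effect of the reliability labels on the search can be illustrated with a deliberately reduced sketch. It keeps one accumulated log-likelihood per hypothesis and omits the state-transition search tree entirely; `update_hypotheses`, `frame_loglik`, and `beam_width` are hypothetical names, not taken from the source.

```python
def update_hypotheses(hyps, frame_loglik, reliable, beam_width):
    """Advance the hypothesis set by one frame.

    hyps: dict mapping hypothesis name -> accumulated log-likelihood.
    frame_loglik: dict mapping hypothesis name -> log-likelihood of this frame.
    reliable: reliability label of this frame.
    """
    if not reliable:
        # Unreliable frame: do not score it and do not prune, so no
        # hypothesis is discarded on the basis of noise-dominated audio.
        return dict(hyps)
    scored = {h: s + frame_loglik[h] for h, s in hyps.items()}
    # Reliable frame: ordinary beam pruning keeps the best hypotheses.
    kept = sorted(scored, key=scored.get, reverse=True)[:beam_width]
    return {h: scored[h] for h in kept}
```

Skipping both scoring and pruning in unreliable frames means a hypothesis that matches the clearly observed parts of the utterance survives a noisy stretch unchanged.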

The speaker recognition unit 6 is a means for recognizing whose voice the target sound data is. It suppresses misrecognition by performing recognition only on frames classified as reliable.

The object sound recognition unit 7 is a means for recognizing what kind of object sound the target sound data is. It suppresses misrecognition by performing recognition only on frames classified as reliable.

The output unit 8 generates and outputs sound source information that includes at least the number of target sound source streams, the existence angle series of each target sound source stream, and the results of recognizing the target sound data.

The user interface unit 9 presents the various setting values to the user, accepts setting input from the user, saves setting values to an external storage device, reads setting values from the external storage device, and presents the various processing results to the user.

The operation of each functional block of the speech recognition apparatus according to this embodiment is described in detail below.

(Basic concept: estimating sound sources from the phase difference of each frequency component)
Microphones 1a and 1b are two microphones placed a predetermined distance apart in a medium such as air; they convert the vibrations of the medium (sound waves) at two different points into electrical signals (acoustic signals). Hereinafter, microphones 1a and 1b taken together are called the "microphone pair".

The acoustic signal input unit 2 is a means that periodically A/D-converts the two electrical signals (acoustic signals) from microphones 1a and 1b at a predetermined sampling frequency Fr, producing digitized amplitude data for the two signals as time series. When this amplitude data is decomposed into per-frequency phase differences and analyzed, then even if multiple sources are active at the same time, the frequency components characteristic of each source show, between the two data streams, a phase difference corresponding to that source's direction. Therefore, if the per-frequency phase differences can be divided into groups sharing the same direction, it should be possible, for a wide variety of sound sources, to determine how many sources exist, in which direction each lies, what kind of sound each mainly emits, and how strong or powerful it is.

(Sound source stream extraction/classification unit 3)
FIG. 2 shows the internal configuration of the sound source stream extraction/classification unit 3, which realizes the basic concept above. The unit consists of a frequency decomposition unit 301, a phase difference calculation unit 302, a scatter diagram generation unit 303, a voting unit 304, a straight line detection unit 305, a time series tracking unit 306, a duration evaluation unit 307, and a frame classification unit 308.

(Frequency decomposition unit 301)
The frequency decomposition unit 301 takes as input the amplitude data a and b, which the acoustic signal input unit 2 generates by digitizing the acoustic signals captured by microphones 1a and 1b, and generates frequency-resolved data a and b by decomposing each into frequency components. The fast Fourier transform (FFT) is a common way to decompose amplitude data into frequency components; the Cooley-Tukey algorithm is a well-known example.

From the amplitude data supplied by the acoustic signal input unit 2, the frequency decomposition unit 301 extracts N consecutive amplitude samples starting at a given time (the T-th frame) and applies the FFT, and repeats this at each discrete time (frame T+1, frame T+2, ...) while advancing the extraction position by a predetermined frame shift Fs. As a result, frequency-resolved data consisting of a power value and a phase value for each frequency component of the input amplitude data is generated as a time series.
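The framing and decomposition described above can be sketched as follows. To stay dependency-free, the sketch uses a naive O(N²) discrete Fourier transform in place of an FFT (the two give the same values); the function names are illustrative assumptions, and the use of a plain rectangular window is a simplification the source does not specify.

```python
import cmath
import math


def split_frames(samples, n, shift):
    """Cut the sample sequence into frames of n samples, advancing by shift."""
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, shift)]


def dft_power_phase(frame):
    """Return (power, phase) for bins 0..n/2 of one frame.

    Naive DFT standing in for the FFT: same result, slower.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        c = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
        spectrum.append((abs(c) ** 2, cmath.phase(c)))
    return spectrum
```

Applying `dft_power_phase` to every frame of both microphone signals yields the two time series of frequency-resolved data that the later units compare.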

(Phase difference calculation unit 302)
The phase difference calculation unit 302 compares the two sets of frequency-resolved data a and b obtained for the same period by the frequency decomposition unit 301 and generates a-b phase difference data by computing, for each frequency component, the difference between their phase values. As shown in FIG. 3, the phase difference ΔPh(fk) of a frequency component fk is computed as the difference between the phase value Ph1(fk) at microphone 1a and the phase value Ph2(fk) at microphone 1b, reduced modulo 2π so that it falls within −π < ΔPh(fk) ≤ π.
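The reduction modulo 2π into the half-open interval (−π, π] can be written as below. The helper names `wrap_phase` and `phase_difference` are illustrative assumptions; only the interval convention comes from the text.

```python
import math


def wrap_phase(d):
    """Map an angle d (radians) into the interval (-pi, pi]."""
    return d - 2.0 * math.pi * math.ceil((d - math.pi) / (2.0 * math.pi))


def phase_difference(ph1, ph2):
    """Wrapped phase difference Ph1 - Ph2 for one frequency component."""
    return wrap_phase(ph1 - ph2)
```

Using `ceil` rather than `floor` is what makes the upper bound π inclusive and the lower bound −π exclusive, matching the stated range.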

(Scatter diagram generation unit 303)
Based on the a-b phase difference data obtained by the phase difference calculation unit 302, the scatter diagram generation unit 303 determines coordinate values so that each pair of a frequency and its phase difference can be treated as a point in a predetermined two-dimensional XY coordinate system. The X coordinate x(fk) and Y coordinate y(fk) corresponding to the phase difference ΔPh(fk) of a frequency component fk are determined by the equations shown in FIG. 4: the X coordinate is the phase difference ΔPh(fk), and the Y coordinate is the frequency component number k. The scatter diagram is the plot of these points in the XY coordinate system.
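As a sketch, the scatter points can be built directly from the two complex spectra. Here the phase difference is taken as the argument of a·conj(b), which already lies in [−π, π]; carrying the averaged power along with each point follows the representative power value mentioned earlier. The function name and the tuple layout are assumptions made for illustration.

```python
import cmath


def scatter_points(spec_a, spec_b):
    """Build (x, y, power) points from two complex spectra of equal length.

    x: phase difference between microphone a and b for bin k
    y: frequency component number k
    power: average of the two power values (representative power)
    """
    points = []
    for k, (a, b) in enumerate(zip(spec_a, spec_b)):
        x = cmath.phase(a * b.conjugate())   # arg(a) - arg(b), wrapped
        power = (abs(a) ** 2 + abs(b) ** 2) / 2.0
        points.append((x, k, power))
    return points
```

For example, `scatter_points([1 + 0j, 1j], [1 + 0j, 1 + 0j])` places bin 0 at x = 0 and bin 1 at x = π/2, each with power 1.0.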

(Frequency proportionality of the phase difference for a common time difference)
The phase differences calculated for each frequency component by the phase difference calculation unit 302, as shown in FIG. 3, should all represent the same arrival time difference when they originate from the same sound source (the same direction). The phase value of a frequency obtained by the FFT, and hence the phase difference between the two microphones, is expressed with one period of that frequency taken as 2π, so for the same time interval the phase doubles when the frequency doubles. The phase difference for a common time difference ΔT therefore increases in proportion to frequency. When the phase differences of frequency components that are emitted from the same sound source and share ΔT are plotted on the two-dimensional coordinate system using the coordinate calculation of FIG. 4, the points lie on a straight line. The larger ΔT is, that is, the more the distances from the two microphones to the sound source differ, the greater the inclination of this line.
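The proportionality can be checked numerically with a short sketch. It assumes an N-point FFT at sampling frequency Fr, so that bin k has frequency fk = k·Fr/N, and an ideal noise-free source; the function name and the parameter values are illustrative only.

```python
import math

def scatter_points(delta_t, sample_rate, fft_size, k_max):
    """Ideal (x, y) = (phase difference, bin number k) for one source."""
    points = []
    for k in range(1, k_max + 1):
        f_k = k * sample_rate / fft_size      # assumed bin frequency [Hz]
        points.append((2.0 * math.pi * f_k * delta_t, k))
    return points

# A small delta_t keeps every phase difference inside (-pi, pi], so no wrapping occurs.
pts = scatter_points(delta_t=2.0e-5, sample_rate=16000.0, fft_size=512, k_max=100)
slopes = [x / k for x, k in pts]              # constant if the points are collinear
```

Every entry of `slopes` is the same value, which is the straight line through the origin that the text describes.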

(Cyclicity of the phase difference)
However, the phase difference between the two microphones is proportional to frequency over the whole scatter diagram only if the true phase difference stays within ±π from the lowest to the highest frequency analyzed. This holds when ΔT is shorter than half a period of the maximum frequency Fr/2 [Hz] (half the sampling frequency Fr), that is, shorter than 1/Fr [seconds]. If ΔT is 1/Fr or longer, it must be taken into account that the phase difference can be obtained only as a cyclic value, as described next.
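The bound on ΔT follows directly from requiring the true phase difference to stay within ±π at the highest analyzed frequency. Writing the true phase difference at frequency f as 2πfΔT (the symbol with the subscript "true" is introduced here only for illustration):

```latex
\Delta Ph_{\mathrm{true}}(f) = 2\pi f\,\Delta T,
\qquad
\left|\Delta Ph_{\mathrm{true}}\!\left(\tfrac{F_r}{2}\right)\right|
  = \pi F_r\,|\Delta T| < \pi
\;\Longleftrightarrow\;
|\Delta T| < \frac{1}{F_r}.
```

As an illustration with an assumed sampling frequency of Fr = 16 kHz, the limit is 1/Fr = 62.5 µs.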

The phase value available for each frequency component is an angle in the complex plane and can be obtained only within a range of width 2π [radians] (in this embodiment, from −π to π). Consequently, even if the actual phase difference of a frequency component between the two microphones spans one period or more, this cannot be determined from the phase values produced by the frequency decomposition. In this embodiment the phase difference is therefore obtained in the range from −π to π. The true phase difference caused by ΔT, however, may be the obtained value plus or minus 2π, or plus or minus 4π, 6π, and so on.

FIG. 5 is a scatter diagram that illustrates this schematically. When the phase difference ΔPh(fk) at frequency fk is +π, as shown by black circle 140, the true phase difference at the next higher frequency fk+1 exceeds +π, as shown by white circle 141. The calculated phase difference ΔPh(fk+1), however, is the true value minus 2π, which is slightly larger than −π, as shown by black circle 142. Although not shown, a similar value is also obtained at three times that frequency, where it equals the actual phase difference minus 4π. In this way, as the frequency increases, the phase difference cycles between −π and π as a residue modulo 2π. As in this example, when ΔT is large, the true phase difference (white circle) at and above some frequency fk+1 wraps around to the opposite side (black circle).
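The wrap-around in FIG. 5 can be reproduced with a small sketch. The helper names are hypothetical, and the value 1.1π is an arbitrary example of a true phase difference slightly beyond +π.

```python
import math

def wrap_phase(angle):
    """Fold an angle in radians into the half-open interval (-pi, pi]."""
    wrapped = math.fmod(angle, 2.0 * math.pi)
    if wrapped > math.pi:
        wrapped -= 2.0 * math.pi
    elif wrapped <= -math.pi:
        wrapped += 2.0 * math.pi
    return wrapped

def true_phase_candidates(measured, max_wraps):
    """All values measured + 2*pi*n that are consistent with a measurement."""
    return [measured + 2.0 * math.pi * n
            for n in range(-max_wraps, max_wraps + 1)]

true_phase = 1.1 * math.pi            # corresponds to white circle 141
measured = wrap_phase(true_phase)     # about -0.9*pi, like black circle 142
candidates = true_phase_candidates(measured, max_wraps=2)
```

The measured value alone does not identify which candidate is the true phase difference; that ambiguity is what the line groups introduced later resolve.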

The problem of estimating the number and directions of sound sources in the present invention thus reduces to detecting straight lines such as those illustrated on this scatter diagram. The problem of estimating the approximate frequency components of each sound source reduces to selecting the frequency components plotted close to a detected line. The scatter diagram data output by the scatter diagram generation unit 303 in this embodiment is therefore a point group determined as a function of frequency and phase difference from the frequency decomposition data of the frequency decomposition unit 301. The voting unit 304 detects linear arrangements as figures in the point group given as this scatter diagram data.

(Voting unit 304)
The voting unit 304 applies the linear Hough transform, described below, to each frequency component given (x, y) coordinates by the scatter diagram generation unit 303, and votes the resulting locus into a Hough voting space by a predetermined method. The Hough transform is explained in Akio Okazaki, "Hajimete no Gazo Shori" ("First Image Processing"), Kogyo Chosakai, published October 20, 2000, pp. 100-102.

(Linear Hough transform)
Infinitely many straight lines can pass through a point p(x, y) on the two-dimensional coordinate plane. If each line is described by the inclination θ, measured from the X axis, of the perpendicular dropped from the origin O onto the line and by the length ρ of that perpendicular, then θ and ρ are uniquely determined for each line, and the set of (θ, ρ) pairs of the lines passing through a given point (x, y) is known to trace a locus in the θρ coordinate system that is specific to the value of (x, y): ρ = x cos θ + y sin θ. This locus is called a Hough curve, and the transformation from an (x, y) coordinate value to the (θ, ρ) locus of the lines that can pass through it is called the linear Hough transform. Here θ is positive when the line is tilted to the left, 0 when it is vertical, and negative when it is tilted to the right, and θ never leaves the domain {θ: −π < θ ≦ π}.

A Hough curve can be obtained independently for each point in the XY coordinate system. A straight line passing through three points p1, p2, and p3, for example, is found as the line defined by the coordinates (θ0, ρ0) at which the three loci corresponding to p1, p2, and p3 intersect. The more points a line passes through, the more loci pass through the (θ, ρ) position representing that line.
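The intersection property can be illustrated with the Hough curve ρ = x cos θ + y sin θ. In this sketch the three points are placed on a line through the origin with θ0 = −30°, an arbitrary example value; the names are hypothetical.

```python
import math

def hough_rho(x, y, theta):
    """rho of the line through (x, y) whose normal makes angle theta with the X axis."""
    return x * math.cos(theta) + y * math.sin(theta)

# Three points on the line x = -y * tan(theta0), which passes through the origin.
theta0 = math.radians(-30.0)
points = [(-y * math.tan(theta0), y) for y in (1.0, 2.0, 3.0)]

# At theta0 the three Hough curves meet at the same rho (here 0).
rhos = [hough_rho(x, y, theta0) for x, y in points]

# At a different angle the curves pass through different rho values.
theta1 = math.radians(10.0)
rhos_other = [hough_rho(x, y, theta1) for x, y in points]
```

All three values in `rhos` are zero up to rounding, whereas the values in `rhos_other` differ, so only (θ0, 0) collects a contribution from every point.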

(Hough voting)
A technique called Hough voting is used to detect straight lines in a point group. Each locus votes for the (θ, ρ) pairs it passes through in a two-dimensional Hough voting space whose axes are θ and ρ, so that positions with many votes indicate (θ, ρ) pairs through which many loci pass, that is, the presence of a straight line.

The voting unit 304 performs Hough voting for frequency components that satisfy all of the following conditions. As a result, only frequency components within a predetermined frequency band whose power is at or above a predetermined threshold are voted.

(Voting condition 1) The frequency is within a predetermined range (low cut and high cut).

(Voting condition 2) The representative power P(fk) of the frequency component fk is at or above a predetermined threshold.

Voting condition 1 is used to cut the low band, which generally carries background noise, and the high band, where FFT accuracy deteriorates. The low-cut and high-cut ranges can be adjusted to suit the operating conditions. To use the widest possible frequency band, it is appropriate to cut only the DC component at the low end and only the maximum frequency at the high end.

For very weak frequency components at about the level of background noise, the FFT result is considered unreliable. Voting condition 2 excludes such unreliable frequency components from the vote by thresholding on power. With Po1(fk) the power at microphone 1a and Po2(fk) the power at microphone 1b, the representative power P(fk) of the frequency component fk is taken as the average of the two.
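The two voting conditions can be sketched as a simple filter over the frequency bins. The function and parameter names (`k_low`, `k_high`, `power_threshold`) are hypothetical; the band is expressed here in bin numbers rather than hertz as a simplifying assumption.

```python
def representative_power(power_a, power_b):
    """Average of the powers observed at microphones 1a and 1b."""
    return 0.5 * (power_a + power_b)

def eligible_bins(powers_a, powers_b, k_low, k_high, power_threshold):
    """Bin numbers k that satisfy voting conditions 1 and 2."""
    selected = []
    for k, (pa, pb) in enumerate(zip(powers_a, powers_b)):
        in_band = k_low <= k <= k_high                              # condition 1
        strong = representative_power(pa, pb) >= power_threshold   # condition 2
        if in_band and strong:
            selected.append(k)
    return selected

bins = eligible_bins([9.0, 1.0, 4.0, 0.1, 5.0], [9.0, 3.0, 4.0, 0.1, 5.0],
                     k_low=1, k_high=3, power_threshold=2.0)
```

In this example bin 0 and bin 4 fall outside the band and bin 3 is too weak, so only bins 1 and 2 take part in the vote.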

When voting, the voting unit 304 adds to each position the locus passes through a function value of the representative power P(fk) of the frequency component fk. With this scheme a line can obtain a high-ranking local maximum even if few points lie on it, provided that it includes frequency components with large power, so the scheme is suited to detecting lines (that is, sound sources) that have few frequency components but strong, high-power ones. The function value is calculated as G(P(fk)) by the formula shown in FIG. 6. The intermediate parameter V is the logarithm log₁₀(P(fk)) plus a predetermined offset α. G(P(fk)) takes the value V + 1 when V is positive and 1 when V is zero or less. Because at least 1 is always voted, not only lines (sound sources) containing high-power frequency components but also lines containing many frequency components rise to the top, which gives the vote a majority-rule character as well.
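The weighting function G described above can be written in a few lines. The function name is hypothetical, and returning the minimum vote of 1 for a non-positive power (where the logarithm is undefined) is an assumption added for robustness.

```python
import math

def vote_weight(power, alpha):
    """G(P): V = log10(P) + alpha; returns V + 1 if V > 0, otherwise 1."""
    if power <= 0.0:
        return 1.0          # assumption: treat zero power as the minimum vote
    v = math.log10(power) + alpha
    return v + 1.0 if v > 0.0 else 1.0
```

With α = −1, a power of 1000 gives V = 2 and a vote of 3, while a power of 0.01 gives a negative V and the minimum vote of 1.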

(Voting multiple FFT results together)
The voting unit 304 can vote for each FFT individually, but in general it votes collectively on the time series of m consecutive FFT results (m ≧ 1). The frequency components of a sound source fluctuate over the long term, but pooling the FFT results from several times within a suitably short period, during which the frequency components are stable, provides more data and therefore a more reliable Hough voting result. The value of m can be set as a parameter to suit the operating conditions.
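Pooled, weighted Hough voting over m frames can be sketched as below. The grid layout is an assumption: θ is sampled over [−π/2, π/2) in `n_theta` steps and ρ over [−rho_max, rho_max] in `n_rho` bins, and each point carries its own vote weight. None of these choices comes from the source.

```python
import math

def hough_vote(frames, n_theta, rho_max, n_rho):
    """Accumulate weighted votes from several frames into one voting space.

    frames: list of frames, each a list of (x, y, weight) points.
    Returns votes[theta_index][rho_index].
    """
    votes = [[0.0] * n_rho for _ in range(n_theta)]
    rho_step = 2.0 * rho_max / (n_rho - 1)
    for frame in frames:
        for x, y, weight in frame:
            for ti in range(n_theta):
                theta = -0.5 * math.pi + math.pi * ti / n_theta
                rho = x * math.cos(theta) + y * math.sin(theta)
                ri = int(round((rho + rho_max) / rho_step))
                if 0 <= ri < n_rho:
                    votes[ti][ri] += weight
    return votes

# Two frames (m = 2) whose points all lie on the line x = y, i.e. theta = -45 degrees.
frames = [
    [(1.0, 1.0, 2.0), (2.0, 2.0, 1.0)],
    [(3.0, 3.0, 1.5)],
]
votes = hough_vote(frames, n_theta=180, rho_max=10.0, n_rho=201)
```

The cell at θ = −45° and ρ = 0 (index [45][100] on this grid) receives the weights of all three points, even though they come from different frames.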

(Straight line detection unit 305)
The straight line detection unit 305 analyzes the vote distribution in the Hough voting space generated by the voting unit 304 and detects dominant straight lines. More accurate line detection is achieved by taking into account circumstances specific to this problem, such as the cyclicity of the phase difference described with reference to FIG. 5.

(Constraint ρ = 0)
When the signals of microphones 1a and 1b are A/D converted in phase by the acoustic signal input unit 2, a line to be detected always has ρ = 0, that is, it passes through the origin of the XY coordinate system. Ideally, therefore, the sound source estimation problem reduces to searching for local maxima in the vote distribution S(θ, 0) on the θ axis, where ρ = 0 in the Hough voting space.

(Definition of line groups that take phase difference cyclicity into account)
In practice, however, because of the cyclicity of the phase difference, a line obtained by translating a line through the origin by Δρ, which re-enters from the opposite side of the X axis, also represents the same arrival time difference. A line that appears cyclically from the opposite side, formed by the part of a line through the origin that extends beyond the X value range, is called a "cyclic extension line", and the line through the origin on which it is based is called a "reference line". The more the reference line is inclined, the more cyclic extension lines there are. With the coefficient a an integer of 0 or more, all lines sharing an arrival time difference form the line group (θ0, aΔρ), obtained by translating the reference line defined by (θ0, 0) in steps of Δρ. Here Δρ is a signed value defined as a function Δρ(θ) of the line inclination θ by the equation shown in FIG. 7.
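The line group can be illustrated under an explicit assumption. FIG. 7 is not reproduced in this text, so the sketch below does not use its formula; instead it assumes that X is the phase difference in radians, so that one wrap shifts a line by 2π along X and changes ρ by 2π·cos θ, with the sign following the sign of θ. All names are hypothetical.

```python
import math

def delta_rho(theta):
    """Assumed signed spacing of the line group: sign(theta) * 2*pi*cos(theta)."""
    if theta == 0.0:
        return 0.0
    return math.copysign(2.0 * math.pi * math.cos(theta), theta)

def wrap_phase(angle):
    """Fold an angle in radians into the half-open interval (-pi, pi]."""
    wrapped = math.fmod(angle, 2.0 * math.pi)
    if wrapped > math.pi:
        wrapped -= 2.0 * math.pi
    elif wrapped <= -math.pi:
        wrapped += 2.0 * math.pi
    return wrapped

theta0 = math.radians(-40.0)          # a reference line tilted to the right
multiples = []
for k in range(1, 12):
    x_true = k * math.tan(-theta0)    # point on the reference line at y = k
    x_obs = wrap_phase(x_true)        # what is actually observed
    rho = x_obs * math.cos(theta0) + k * math.sin(theta0)
    multiples.append(rho / delta_rho(theta0))
```

Under this assumption every observed point has ρ equal to a non-negative integer multiple a of Δρ(θ0): a = 0 for points on the reference line and a = 1 for points on the first cyclic extension line.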

(Local maximum detection that takes phase difference cyclicity into account)
As stated above, because of the cyclicity of the phase difference, a sound source should be represented not by a single line but by a line group consisting of a reference line and its cyclic extension lines. This must also be taken into account when detecting local maxima in the vote distribution.

FIG. 8 shows results obtained from actual speech uttered simultaneously by two people positioned about 20 degrees to the left and about 45 degrees to the right of the front of the microphone pair in a noisy room: the power spectra of the frequency components, the scatter diagram of phase differences for each frequency component obtained from five FFT results (m = 5), and the Hough voting result (vote distribution) obtained from the same five FFT results.

The amplitude data acquired by the microphone pair is converted by the frequency decomposition unit 301 into power and phase values for each frequency component. In the figure, 210 and 211 show the logarithmic power of each frequency component as brightness (darker means larger), with frequency on the vertical axis and time on the horizontal axis. One vertical line corresponds to one frame (one FFT result), and the frames are arranged in time order from left to right. The upper panel 210 is the result for the signal from microphone 1a and the lower panel 211 for microphone 1b; many frequency components are detected. From this frequency decomposition result, the phase difference calculation unit 302 obtains the phase difference for each frequency component, and the scatter diagram generation unit 303 calculates its (x, y) coordinates. In the figure, 212 is a scatter diagram plotting the phase differences obtained from the FFTs of five consecutive frames starting at time 213. It shows a point distribution along a reference line 214 tilted to the left from the origin and another along a reference line 215 tilted to the right. The voting unit 304 votes each of these points into the Hough voting space, forming the vote distribution 216.

FIG. 9 shows the result of searching for local maxima by summing the votes at several positions spaced Δρ apart, taking the cyclicity of the phase difference into account. The vote distribution 240 in FIG. 9(a) is the vote distribution 216 of FIG. 8 overlaid with broken lines 242 to 249, which mark the ρ positions of lines obtained by translating a line through the origin in steps of Δρ. The broken lines 242 to 245, and likewise the broken lines 246 to 249, are spaced at equal intervals from the θ axis 241, at natural-number multiples of Δρ(θ). No broken line is drawn at θ = 0, where a line is certain to reach the top of the scatter diagram without leaving the X value range.

The vote H(θ0) for a given θ0 is the sum of the vote on the θ axis 241 and the votes on the broken lines 242 to 249, viewed vertically at θ = θ0; that is, H(θ0) = Σ{S(θ0, aΔρ(θ0))}, where a ranges over the integers of 0 or more. This operation corresponds to summing the votes of the reference line with θ = θ0 and its cyclic extension lines. The bar graph 250 in FIG. 9(b) shows this vote distribution H(θ). Ten local maxima, indicated by 251 in FIG. 9(b), are detected from the vote distribution 250. Of these, local maxima 252 and 253 correspond to the two speakers: local maximum 253 corresponds to the line group that detects the speech from about 20 degrees left of the front of the microphone pair (reference line 254 and cyclic extension line 255 in FIG. 9(c)), and local maximum 252 to the line group that detects the speech from about 45 degrees right (reference line 256 and cyclic extension lines 257 and 258 in FIG. 9(c)). Searching for local maxima after summing the votes at positions spaced Δρ apart in this way allows lines to be detected stably, from slightly inclined to steeply inclined ones. Selecting the local maxima (lines) whose vote is at or above a predetermined threshold then yields candidates that are likely to be sound sources (sound source candidates).
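The summation H(θ0) = Σ S(θ0, aΔρ(θ0)) and the threshold selection can be sketched on a discretized voting space. The spacing Δρ(θ) = sign(θ)·2π·cos θ used here is an assumption standing in for the formula of FIG. 7, which is not reproduced in this text; the grid parameters and function names are likewise hypothetical.

```python
import math

def delta_rho(theta):
    """Assumed signed spacing of the line group (stand-in for FIG. 7)."""
    if theta == 0.0:
        return 0.0
    return math.copysign(2.0 * math.pi * math.cos(theta), theta)

def group_votes(votes, thetas, rho_min, rho_step):
    """H(theta): sum of S(theta, a * delta_rho(theta)) over a = 0, 1, 2, ..."""
    n_rho = len(votes[0])
    totals = []
    for row, theta in zip(votes, thetas):
        step = delta_rho(theta)
        total, a = 0.0, 0
        while True:
            ri = int(round((a * step - rho_min) / rho_step))
            if not 0 <= ri < n_rho:
                break
            total += row[ri]
            if abs(step) < rho_step:
                break       # extension lines would fall into the same bin
            a += 1
        totals.append(total)
    return totals

def peaks_above(totals, threshold):
    """Indices of local maxima of H(theta) with at least `threshold` votes."""
    found = []
    for i, h in enumerate(totals):
        left = totals[i - 1] if i > 0 else float("-inf")
        right = totals[i + 1] if i + 1 < len(totals) else float("-inf")
        if h >= threshold and h > left and h >= right:
            found.append(i)
    return found

thetas = [-0.5, 0.0, 0.5]
votes = [[0.0] * 41 for _ in thetas]   # rho bins from -10 to 10 in steps of 0.5
votes[0][20], votes[0][9] = 3.0, 2.0   # reference line and one extension line
votes[1][20] = 1.0
votes[2][20] = 4.0
totals = group_votes(votes, thetas, rho_min=-10.0, rho_step=0.5)
peaks = peaks_above(totals, threshold=2.0)
```

In this toy example the first θ collects 3 votes from its reference line and 2 from its extension line, so it outranks the third θ even though its reference line alone has fewer votes.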

(Existence angle estimation)
The straight line detection unit 305 further calculates, from the θ value of each detected line group, the range in which the corresponding sound source candidate can exist. When the sound source is sufficiently far away compared with the distance between the microphones, this range is a conical surface making a certain angle (the existence angle) with the line segment connecting the two microphones 1a and 1b (called the baseline of the microphone pair).

The arrival time difference ΔT between microphones 1a and 1b can vary within ±ΔTmax. When sound arrives from directly in front of the microphone pair, ΔT is 0 and the existence angle φ of the sound source, measured from the front, is 0°. When sound arrives from directly to the right of the microphone pair, that is, from the direction of microphone 1b, ΔT equals +ΔTmax and φ is +90°, with the clockwise direction from the front taken as positive. Similarly, when sound arrives from directly to the left, that is, from the direction of microphone 1a, ΔT equals −ΔTmax and φ is −90°. ΔT is thus defined to be positive when sound arrives from the right and negative when it arrives from the left. In general, the existence angle, including its sign, can therefore be calculated as φ = sin⁻¹(ΔT/ΔTmax).

ΔTmax is the inter-microphone distance L [m] divided by the speed of sound Vs [m/sec]: ΔTmax = L ÷ Vs [sec]. The speed of sound Vs is known to be approximated as a function of the temperature t [°C] by Vs = 331.5 + 0.607t [m/sec]. Suppose now that the straight line detection unit 305 has detected a line with inclination θ. If this line is tilted to the right, θ is negative. At y = k (frequency fk), the phase difference ΔPh indicated by this line is obtained as a function of k and θ by ΔPh(θ, k) = k·tan(−θ). ΔT [sec] is then ΔT = (ΔPh(θ, k)/2π) × (1/fk), that is, the fraction of 2π that the phase difference ΔPh(θ, k) represents, multiplied by one period (1/fk) [sec] of the frequency fk. Because θ is a signed quantity, ΔT is also signed: when sound arrives from the right (ΔPh positive), θ is negative, and when sound arrives from the left (ΔPh negative), θ is positive. This is why the sign of θ is inverted in the formula. In an actual calculation it suffices to use k = 1 (the frequency immediately above the DC component k = 0).
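The chain from a detected inclination θ to the existence angle φ can be sketched as follows. Several things here are assumptions rather than statements from the source: the bin frequency is taken as fk = k·Fr/N for an N-point FFT, the ratio ΔT/ΔTmax is clamped to [−1, 1] to guard against rounding, and the speed of sound is passed in by the caller (340 m/s in the example is only an illustrative value).

```python
import math

def arrival_angle_deg(theta, sample_rate, fft_size, mic_distance, sound_speed):
    """Existence angle phi in degrees for a line of inclination theta (radians)."""
    k = 1                                   # bin just above the DC component
    f_k = k * sample_rate / fft_size        # assumed bin frequency [Hz]
    delta_ph = k * math.tan(-theta)         # phase difference on the line at y = k
    delta_t = (delta_ph / (2.0 * math.pi)) * (1.0 / f_k)
    delta_t_max = mic_distance / sound_speed
    ratio = max(-1.0, min(1.0, delta_t / delta_t_max))   # assumption: clamp
    return math.degrees(math.asin(ratio))

right = arrival_angle_deg(-0.05, 16000.0, 512, 0.2, 340.0)
left = arrival_angle_deg(0.05, 16000.0, 512, 0.2, 340.0)
front = arrival_angle_deg(0.0, 16000.0, 512, 0.2, 340.0)
```

A line tilted to the right (negative θ) gives a positive angle, a line tilted to the left gives the mirrored negative angle, and a vertical line gives the front direction.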

(Time-series tracking unit 306)
As described above, the straight line detection unit 305 obtains line groups for each Hough vote by the voting unit 304. Hough voting is performed collectively on m consecutive FFT results (m ≧ 1), so line groups are obtained as a time series with a period of m frames (called the "line detection period"). Because θ of a line group corresponds one-to-one to the opening angle (existence angle) φ of the conical surface, the locus of θ (or φ) over time for a stable sound source is assumed to change continuously, whether the source is stationary or moving. On the other hand, depending on how the threshold is set, the line groups detected by the straight line detection unit 305 may include line groups that correspond to background noise (called "noise line groups"). The locus of θ (or φ) over time for such noise line groups, however, can be expected to be discontinuous or, if continuous, short.

The time-series tracking unit 306 obtains loci of θ (or φ) on the time axis (called sound source stream candidates) by collecting the θ (or φ) values obtained in each line detection period into groups that can be regarded as continuous on the time axis. The grouping method, using θ, is described with reference to FIG. 10.

(1) A sound source stream candidate buffer is prepared. The buffer is an array of sound source stream candidate data. One item of sound source stream candidate data Kd holds a start time Ts, an end time Te, an array of the line group data Ld that make up the candidate (the line group list), and a label number Ln. One item of line group data Ld consists of the θ and ρ values of one line group belonging to the candidate (from the straight line detection unit 305), the existence angle φ of the sound source corresponding to that line group (from the straight line detection unit 305), the line score (from the straight line detection unit 305), and the time at which these were acquired. As noted above, because of the cyclicity of the phase difference a sound source should be represented not by a single line but by a line group consisting of a reference line and its cyclic extension lines, so each element of the Ld array is one line group; if the cyclicity of the phase difference is deliberately ignored, each element is a single line. The sound source stream candidate buffer is initially empty. A new-label-number parameter is also prepared for issuing label numbers, with its initial value set to 0.

(2) In a frame T, for each newly detected line inclination θ (hereafter θn; assume that two have been obtained, shown as black circles 323 and 324 in the figure), the line group data Ld (black circles inside the rectangles) of the sound source stream candidate data Kd held in the buffer (rectangles 321 and 322) are examined, and the candidate data is detected that has an Ld whose θ value differs from θn (325 and 326 in the figure) by no more than a predetermined angle threshold Δθ (the tolerance for gaps in the angle direction) and whose acquisition time differs (327 and 328 in the figure) by no more than a predetermined time threshold Δt (the tolerance for gaps in the time direction). Assume that, as a result, candidate data 321 is detected for black circle 323, whereas for black circle 324 even the nearest candidate data 322 does not satisfy these conditions.

(3) If candidate data satisfying the conditions of (2) is found, as for black circle 323, θn is regarded as belonging to the same sound source stream candidate. θn, the corresponding φ and ρ values, and the current time T are added to the line group list of that candidate data Kd as new line group data, and the current time T becomes the new end time Te of the candidate data. If several items of candidate data are found, they are all regarded as forming one sound source stream candidate: they are merged into the candidate data with the smallest label number, and the rest are deleted from the buffer. The start time Ts of the merged candidate data is the earliest start time among the candidates before merging, the end time Te is the latest end time among them, and the line group list is the union of their line group lists. As a result, black circle 323 is added to candidate data 321.

（４）黒丸３２４のように、もし、（２）の条件を満たす音源ストリーム候補データが見つからなかった場合は、新規の音源ストリーム候補の始まりとし、音源ストリーム候補バッファの空き部分に新しい音源ストリーム候補データを作成し、開始時刻Ｔｓと終了時刻Ｔｅを共に現時刻Ｔとし、θｎとそれに対応したφ値とρ値と現時刻Ｔとを直線群リストの最初の直線群データとし、新規ラベル番号の値をこの音源ストリーム候補データのラベル番号Ｌｎとして与え、新規ラベル番号を１だけ増加させる。なお、新規ラベル番号が所定の最大値に達したときは、新規ラベル番号を０に戻す。この結果、黒丸３２４は新たな音源ストリーム候補データとして音源ストリーム候補バッファに登録される。 (4) If no sound source stream candidate data satisfying the conditions of (2) is found, as for the black circle 324, θn is treated as the start of a new sound source stream candidate. New sound source stream candidate data is created in a free area of the sound source stream candidate buffer; the start time Ts and the end time Te are both set to the current time T; θn, the corresponding φ and ρ values, and the current time T become the first straight line group data in the line group list; the current value of the new label number is assigned as the label number Ln of this data; and the new label number is incremented by 1. When the new label number reaches a predetermined maximum value, it is reset to 0. As a result, the black circle 324 is registered in the sound source stream candidate buffer as new sound source stream candidate data.

（５）もし、音源ストリーム候補バッファに保持されている音源ストリーム候補データで、最後に更新されてから（すなわちその終了時刻Ｔｅから）現時刻Ｔまでに前記所定時間Δｔを経過したものがあれば、追加すべき新たなθｎが見つからなかった、すなわちグルーピングを終えた音源ストリーム候補として、この音源ストリーム候補データを次段の継続時間評価部３０７に出力する。図の例では音源ストリーム候補データ３２２がこれに該当する。 (5) If the sound source stream candidate buffer holds sound source stream candidate data for which the predetermined time Δt has elapsed between its last update (that is, its end time Te) and the current time T, no new θn was found to add to it. It is therefore regarded as a sound source stream candidate whose grouping is complete, and is output to the duration evaluation unit 307 in the next stage. In the example in the figure, the sound source stream candidate data 322 is such a case.
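The grouping procedure of steps (2) to (5) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Line`, `Candidate`, and `Tracker`, the use of integer frame indices as times, and the inclusive comparisons against Δθ and Δt are all choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Line:
    theta: float  # straight line inclination θ
    phi: float    # corresponding φ value
    rho: float    # corresponding ρ value
    t: int        # acquisition time (frame index, an assumption)


@dataclass(eq=False)
class Candidate:
    label: int                                  # label number Ln
    ts: int                                     # start time Ts
    te: int                                     # end time Te
    lines: list = field(default_factory=list)   # line group list


class Tracker:
    def __init__(self, d_theta, d_t, max_label=1000):
        self.d_theta, self.d_t, self.max_label = d_theta, d_t, max_label
        self.buffer = []        # sound source stream candidate buffer
        self.next_label = 0     # "new label number"

    def _matches(self, cand, line):
        # Step (2): some stored Ld lies within Δθ in angle and Δt in time.
        return any(abs(ld.theta - line.theta) <= self.d_theta
                   and abs(line.t - ld.t) <= self.d_t
                   for ld in cand.lines)

    def step(self, t, detections):
        """Process frame t; detections is a list of (theta, phi, rho).
        Returns the candidates whose grouping has finished."""
        for theta, phi, rho in detections:
            line = Line(theta, phi, rho, t)
            hits = [c for c in self.buffer if self._matches(c, line)]
            if hits:
                # Step (3): merge every hit into the smallest label number.
                keep = min(hits, key=lambda c: c.label)
                for c in hits:
                    if c is not keep:
                        keep.ts = min(keep.ts, c.ts)
                        keep.te = max(keep.te, c.te)
                        keep.lines.extend(c.lines)
                        self.buffer.remove(c)
                keep.lines.append(line)
                keep.te = t
            else:
                # Step (4): start a new candidate and advance the label counter.
                self.buffer.append(Candidate(self.next_label, t, t, [line]))
                self.next_label += 1
                if self.next_label >= self.max_label:
                    self.next_label = 0
        # Step (5): hand over candidates not updated for Δt.
        done = [c for c in self.buffer if t - c.te >= self.d_t]
        self.buffer = [c for c in self.buffer if t - c.te < self.d_t]
        return done
```

Calling `step` once per frame with that frame's detected lines returns, at each frame, the candidates that would be passed on to the duration evaluation stage.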

（継続時間評価部３０７）
継続時間評価部３０７は、時系列追跡部３０６により出力された、グルーピングを終えた音源ストリーム候補データの開始時刻と終了時刻から当該音源ストリーム候補の継続時間を計算し、この継続時間が所定閾値を越えるものを音源音に基づく（安定な）音源ストリーム候補と認定し、それ以外を雑音に基づく（不安定な）音源ストリーム候補と認定する。音源音に基づく音源ストリーム候補データを音源ストリーム情報と呼ぶことにする。音源ストリーム情報には、当該音源ストリームの開始時刻Ｔｓ、終了時刻Ｔｅ、音源方向を表すθとρとφと直線スコアの時系列データが含まれる。 (Duration Evaluation Unit 307)
The duration evaluation unit 307 calculates the duration of each sound source stream candidate from the start time and end time of the grouped sound source stream candidate data output by the time series tracking unit 306. Candidates whose duration exceeds a predetermined threshold are recognized as (stable) sound source stream candidates based on a sound source sound, and the others as (unstable) sound source stream candidates based on noise. Sound source stream candidate data based on a sound source sound is hereinafter called sound source stream information. The sound source stream information includes the start time Ts and end time Te of the sound source stream, and time-series data of θ, ρ, and φ, which represent the sound source direction, and of the straight line score.

なお、直線検出部３０５による直線群の数が音源らしき候補の数を与えるが、そこには雑音源も含まれている。一方、継続時間評価部３０７による音源ストリーム情報の数は、雑音に基づくとされたものを除いた、信頼できる音源の数を与えてくれると考えられる。 Note that the number of straight line groups by the straight line detection unit 305 gives the number of candidates that seem to be sound sources, and includes noise sources. On the other hand, the number of sound source stream information by the duration evaluation unit 307 is considered to give the number of sound sources that can be trusted except for those that are supposed to be based on noise.

（フレーム分類部３０８）
図１０の例において、黒丸３２３は音源ストリーム候補データ３２１と同じ音源から発せられている一連の音声を表しているデータであると判断されたわけであるが、このとき、音源ストリーム候補データ３２１の終端と黒丸３２３の間には直線の検出されていないギャップ期間が存在していた。このギャップ期間を雑音の支配的な期間であると考え、その間の分離音声は信頼できるレベルにないだろうと仮定する。その判定を行うのがフレーム分類部３０８である。 (Frame classification unit 308)
In the example of FIG. 10, the black circle 323 was judged to be data representing a continuation of the sound emitted from the same sound source as the sound source stream candidate data 321. Between the end of the sound source stream candidate data 321 and the black circle 323, however, there was a gap period in which no straight line was detected. This gap period is regarded as a period in which noise is dominant, and the separated speech during it is assumed not to reach a reliable level. The frame classification unit 308 makes this determination.

フレーム分類部３０８は、継続時間評価部３０７により出力された音源ストリーム情報の各フレームに対して、次の２つの信頼可否判別方式のいずれかを用いて、信頼可否の別を表すフラグを与える。なお、いずれの方式を使用するかは運用に合わせて設定可能である。 The frame classifying unit 308 gives a flag indicating whether or not reliability is possible to each frame of the sound source stream information output from the duration evaluation unit 307 using one of the following two reliability determination methods. Note that which method is used can be set according to the operation.

（信頼可否判定方式１）
図１１に例示するように、時間軸を等間隔に刻んだ離散的な時刻をフレーム（図中３３１）とする。このとき、音源ストリーム３３２には自身に属する直線を検出できたフレーム（図中黒丸のある時刻）と検出できなかったフレーム（図中の３３３、３３４、３３５）とがある。直線を検出できたフレームには直線スコア（図中の３４６、３４７、３４８のグラフ）が与えられている。信頼可否判定方式１は、音源ストリーム情報毎に、直線を検出できたフレームに信頼可のフラグを与え、そうでないフレームに信頼不可のフラグを与える。 (Reliability determination method 1)
As illustrated in FIG. 11, a discrete time with the time axis cut at equal intervals is defined as a frame (331 in the figure). At this time, the sound source stream 332 includes a frame in which a straight line belonging to the sound source stream 332 can be detected (a time with a black circle in the figure) and a frame in which the straight line cannot be detected (333, 334, 335 in the figure). A straight line score (graphs 346, 347, and 348 in the figure) is given to a frame in which a straight line can be detected. In the reliability determination method 1, for each sound source stream information, a reliable flag is given to a frame in which a straight line has been detected, and an unreliable flag is given to a frame that is not.

（信頼可否判定方式２）
図１２に例示するように、図１１と同じ音源ストリーム３３２がある。直線検出周期毎に行われるハフ投票により、各フレームには得票分布Ｈ（θ）が得られている。図では、４つのフレーム（時刻）における得票分布を図中の３３９〜３４２に模式的に示す。信頼可否判定方式２は、直線の検出できなかったフレーム（図中の３３３、３３４、３３５）の直線傾きθを、θの時間連続性を仮定して、直線の検出できた前後のフレームから例えば線形補間で内挿して推定する。推定されたθを図中の白丸３３６、３３７、３３８で示す。そして、この内挿によって得たθ値に対応する時刻の得票分布Ｈ（θ）を読み出す。このとき、直線の検出できなかったフレーム３３３の傾き（内挿で求められた）をθｅ（図中の白丸３３６）とすれば、その時刻の得票分布Ｈ（θ）（図中の３４０）から得票値Ｈ（θｅ）（図中の３４３）を読み出して当該フレーム３３３の直線スコアとする。このようにしてギャップ期間の直線スコアが得られ、直線の検出できたフレームの直線スコアと合わせてストリーム全域にわたる直線スコア（図中の３４４）が出揃う。そして、所定閾値（図中の３４５）以上の直線スコアを得たフレームに信頼可のフラグを与え、それ以外のフレームには信頼不可のフラグを与える。このようにすることで、直線検出時の閾値とは別の閾値によって、信頼可否の情報を生成することができる。 (Reliability determination method 2)
As illustrated in FIG. 12, consider the same sound source stream 332 as in FIG. 11. The Hough voting performed at each straight line detection cycle yields a vote distribution H(θ) for each frame. The figure schematically shows the vote distributions of four frames (times) as 339 to 342. Reliability determination method 2 assumes that θ is continuous in time and estimates the straight line inclination θ of each frame in which no straight line could be detected (333, 334, and 335 in the figure) by interpolating, for example linearly, from the preceding and following frames in which a straight line was detected. The estimated θ values are shown as white circles 336, 337, and 338 in the figure. The vote distribution H(θ) of the corresponding time is then read out at each interpolated θ value. For example, if the interpolated inclination of the frame 333, in which no straight line was detected, is θe (white circle 336 in the figure), the vote value H(θe) (343 in the figure) is read from the vote distribution H(θ) of that time (340 in the figure) and used as the straight line score of the frame 333. In this way straight line scores are obtained for the gap periods and, together with the straight line scores of the frames in which a straight line was detected, a straight line score is available over the whole stream (344 in the figure). Frames whose straight line score is equal to or higher than a predetermined threshold (345 in the figure) are flagged as reliable, and the other frames are flagged as unreliable. This allows the reliability information to be generated with a threshold different from the one used for straight line detection.
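Reliability determination method 2 can be illustrated with the following Python sketch. It is a simplified example under assumptions, not the embodiment itself: the function names are hypothetical, a frame without a detected line is represented by `None`, the vote distribution H(θ) of each frame is modelled as a callable, and the stream is assumed to begin and end with a detected line so that every gap has a frame on both sides.

```python
def interpolate_theta(thetas):
    """Fill None entries by linear interpolation between the nearest
    frames, before and after, in which a line was detected."""
    filled = list(thetas)
    known = [i for i, th in enumerate(thetas) if th is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            filled[i] = (1 - w) * thetas[a] + w * thetas[b]
    return filled


def classify_frames(thetas, scores, votes, threshold):
    """Return one reliable/unreliable flag (True/False) per frame.

    thetas[i]: detected inclination, or None in a gap frame
    scores[i]: straight line score of a detected line, or None
    votes[i]:  callable H(theta) for frame i (an assumed representation)
    """
    filled = interpolate_theta(thetas)
    flags = []
    for i, theta_e in enumerate(filled):
        if thetas[i] is not None:
            score = scores[i]            # score from line detection
        else:
            score = votes[i](theta_e)    # H(θe) read at the interpolated θ
        flags.append(score >= threshold)
    return flags


# Toy stream: lines detected in frames 0 and 3, gap in frames 1 and 2.
thetas = [10.0, None, None, 16.0]
scores = [9.0, None, None, 7.0]
votes = [lambda th: 9.0, lambda th: 8.0, lambda th: 2.0, lambda th: 7.0]
flags = classify_frames(thetas, scores, votes, threshold=5.0)
```

In this toy stream the first gap frame still collects enough votes at the interpolated inclination to be flagged reliable, while the second does not.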

（音源分離部４）
音源分離部４の内部構成を図１３に示す。音源分離部４は、同相化部３７１とビームフォーミング部３７２より成る。 (Sound source separation unit 4)
The internal configuration of the sound source separation unit 4 is shown in FIG. The sound source separation unit 4 includes an in-phase unit 371 and a beam forming unit 372.

（同相化部３７１）
同相化部３７１は、音源ストリームの存在角度データを参照することで、当該ストリームの音源方向（存在角度）φの時間推移を得て、φの最大値φｍａｘと最小値φｍｉｎから中間値φｍｉｄ＝（φｍａｘ＋φｍｉｎ）／２を計算して幅φｗ＝φｍａｘ−φｍｉｄを求める。そして、当該音源ストリーム情報の元となった２つの周波数分解データａとｂの時系列データを、当該ストリームの開始時刻Ｔｓより所定時間遡った時刻Ｔｓ’から終了時刻Ｔｅより所定時間経過した時刻Ｔｅ’まで抽出して、中間値φｍｉｄで逆算される到達時間差をキャンセルするように補正することで粗く同相化する。 (In-phase unit 371)
The in-phase unit 371 refers to the existence angle data of the sound source stream to obtain the time course of the sound source direction (existence angle) φ of the stream, calculates the intermediate value φmid = (φmax + φmin) / 2 from the maximum value φmax and the minimum value φmin of φ, and obtains the width φw = φmax − φmid. It then extracts the time-series data of the two frequency-resolved data a and b from which the sound source stream information was derived, from a time Ts' a predetermined time before the start time Ts of the stream to a time Te' a predetermined time after the end time Te, and roughly brings them into phase by correcting them so as to cancel the arrival time difference back-calculated from the intermediate value φmid.

（ビームフォーミング部３７２）
同相化部３７１によって粗く同相化された２つの周波数分解データａとｂの時系列データは、あたかもマイク対の正面方向から入射したかのような信号となっている。但し、各時刻においては正確に正面０°というわけではなく、±φｗの範囲で変化している。ビームフォーミング部３７２は、この粗く同相化された２つの周波数分解データａとｂの時系列データを、正面０°に対して±β・φｗのマージン（βは１以上の適当な係数で、例えば１．１）を与えた角度の範囲を追尾範囲とする「話者追尾型適応アレイ」に掛けることで、当該ストリームの音声データの音源分離を高精度に行う。天田皇他、“音声認識のためのマイクロホンアレー技術”、東芝レビュー２００４、ＶＯＬ．５９、ＮＯ．９、２００４年には、話者追尾型適応アレイの一構成例が開示されている。 (Beam forming part 372)
The time-series data of the two frequency-resolved data a and b roughly brought into phase by the in-phase unit 371 behave as if the signal were incident from the front of the microphone pair. At each time, however, the direction is not exactly the frontal 0° but varies within ±φw. The beam forming unit 372 applies this roughly in-phase time-series data of the frequency-resolved data a and b to a "speaker tracking adaptive array" whose tracking range is the frontal 0° with a margin of ±β·φw (β is a suitable coefficient of 1 or more, for example 1.1), and thereby separates the audio data of the stream with high accuracy. A configuration example of a speaker tracking adaptive array is disclosed in Amada et al., "Microphone array technology for speech recognition", Toshiba Review, Vol. 59, No. 9, 2004.
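The quantities used by the in-phase unit 371 and the beam forming unit 372 can be worked through with a short Python sketch. The function name `tracking_range` and the sample angles are hypothetical; only the formulas φmid = (φmax + φmin) / 2, φw = φmax − φmid, and the margin ±β·φw come from the description.

```python
def tracking_range(phis, beta=1.1):
    """Return (phi_mid, phi_w, (low, high)) for a stream's angle history.

    phis: sound source directions φ over the stream, in degrees
    beta: margin coefficient, 1 or more (1.1 is the example value given)
    """
    phi_max, phi_min = max(phis), min(phis)
    phi_mid = (phi_max + phi_min) / 2   # direction cancelled by in-phasing
    phi_w = phi_max - phi_mid           # residual swing around the front
    # After in-phasing, the source appears around 0°, so the adaptive
    # array only has to track within ±beta * phi_w.
    return phi_mid, phi_w, (-beta * phi_w, beta * phi_w)


phi_mid, phi_w, (low, high) = tracking_range([20.0, 24.0, 30.0])
```

For a stream that moves between 20° and 30°, this gives φmid = 25° and φw = 5°, so the tracking range is about ±5.5° around the front.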

（語彙認識部５）
語彙認識部５の内部構成を図１４に示す。音響特徴抽出部２６１は、入力される音源ストリームの分離音声から所定の音響特徴（例えばＭＦＣＣ、ΔＭＦＣＣなど）の時系列データを抽出する。これを入力音響特徴と呼ぶことにする。音響特徴照合部２６２は、音響モデルデータベース２６３に記憶されている音響モデルと前記入力音響特徴とを照合し、入力音響特徴列を所定の音素記号列（入力系列）データに変換する。最尤文仮説探索部２６４は、文法・言語モデルデータベース２６５に記憶されている文法情報と言語モデルから文仮説を記述したＨＭＭ（隠れマルコフモデル）を生成し、前記入力系列を最も高い確率（尤度）で出力する文仮説をＨＭＭ上で探索して出力する。出力される文仮説（最尤文仮説）が前記音源ストリーム分離音声の言語的内容を認識した結果となる。 (Vocabulary recognition unit 5)
The internal configuration of the vocabulary recognition unit 5 is shown in FIG. 14. The acoustic feature extraction unit 261 extracts time-series data of predetermined acoustic features (for example MFCC and ΔMFCC) from the separated speech of the input sound source stream. These are called the input acoustic features. The acoustic feature matching unit 262 matches the input acoustic features against the acoustic model stored in the acoustic model database 263 and converts the input acoustic feature sequence into predetermined phoneme symbol string (input sequence) data. The maximum likelihood sentence hypothesis search unit 264 generates an HMM (hidden Markov model) describing the sentence hypotheses from the grammar information and language model stored in the grammar/language model database 265, and searches the HMM for the sentence hypothesis that outputs the input sequence with the highest probability (likelihood). The sentence hypothesis that is output (the maximum likelihood sentence hypothesis) is the result of recognizing the linguistic content of the separated speech of the sound source stream.

（音響特徴照合部３６２）
音響モデルデータベース３６３に記憶される音響モデルとは、各音素についての標準音響特徴とその音素記号を組にした情報である。なお、日本語においてどのような音素を擁しておくべきかが、鹿野清宏他、“音声認識システム”、オーム社出版局、２００１年５月１５日発行の４５ページに音節表として示されている。 (Acoustic feature matching unit 362)
The acoustic model stored in the acoustic model database 363 is information that pairs the standard acoustic features of each phoneme with its phoneme symbol. The phonemes that should be provided for Japanese are shown as a syllable table on page 45 of Kiyohiro Shikano et al., "Speech Recognition System", Ohmsha, published May 15, 2001.

音響特徴照合部３６２は、分離音声データから生成される入力音響特徴と音響モデルに記述される標準音響特徴とを照合して、その類似度（音響スコア）を計算する。そして、最も類似した上位Ｎ位（Ｎベスト）までの標準音響特徴の音素記号とその音響スコアを出力する。この結果、入力音響特徴列は音素記号列に変換されることになる。しかしながら、入力音響特徴列には信頼可フレームと信頼不可フレームのデータがある。このとき、信頼不可フレームのデータは目的音声よりも雑音の方が支配的になっているため、その期間の入力音響特徴が適切な音素記号に対応付けられる可能性は低い。そこで、音響特徴照合部３６２は、信頼不可フレームの入力音響特徴を標準音響特徴と照合する代わりに、これに所定のダミー音素記号とダミー音響スコアを対応付ける。この結果、信頼不可フレームでの照合処理を省略して計算コストを節約する。よって、音響特徴照合部３６２から入力状態系列データとして出力されるものは、ダミー音素記号を含む音素記号列データとなる。 The acoustic feature matching unit 362 compares the input acoustic feature generated from the separated speech data with the standard acoustic feature described in the acoustic model, and calculates the similarity (acoustic score). Then, the phoneme symbols of the standard acoustic features up to the most similar top N (N best) and their acoustic scores are output. As a result, the input acoustic feature string is converted into a phoneme symbol string. However, the input acoustic feature sequence includes data of reliable frames and unreliable frames. At this time, since the noise of the unreliable frame data is more dominant than the target speech, it is unlikely that the input acoustic feature in that period is associated with an appropriate phoneme symbol. Therefore, the acoustic feature matching unit 362 associates a predetermined dummy phoneme symbol with a dummy acoustic score instead of matching the input acoustic feature of the unreliable frame with the standard acoustic feature. As a result, the verification process in the unreliable frame is omitted and the calculation cost is saved. Therefore, what is output as input state series data from the acoustic feature matching unit 362 is phoneme symbol string data including dummy phoneme symbols.

（最尤文仮説探索部３６４）
文は１以上の単語から成るものとし、文法情報には発話に出現すると想定される単語とその連結関係が定義されている。各単語は１以上の音素から成るものとし、よって単語と文はそれぞれ１以上の音素の連結された音素列と看做すことができるわけである。 (Maximum likelihood sentence hypothesis search unit 364)
The sentence is composed of one or more words, and the grammatical information defines words assumed to appear in the utterance and their connection relations. Each word is composed of one or more phonemes, so that each word and sentence can be regarded as a phoneme string in which one or more phonemes are connected.

ＨＭＭは、状態と、状態間の遷移（同一状態への遷移も含む）から成り、状態にはその状態を取り得る確率（出力確率）を、遷移にはその遷移の起こり得る確率（遷移確率）をそれぞれ与えることで確率過程をモデル化する。このとき、状態を音素に対応させると、音素列である単語は所定の状態が連鎖したＨＭＭ（単語ＨＭＭ）で記述できる。そして、文は所定の単語ＨＭＭが連鎖したより大きなＨＭＭで記述できる。このとき、予め大量の例文（コーパス）から、各音素の次にどの音素が続くのか（バイフォン、トライフォン）、各単語の次にどの単語が続くのか（バイグラム、トライグラム）を確率で表した言語モデルを利用してＨＭＭの遷移確率を設定することができる。 An HMM is composed of states and transitions between states (including transitions to the same state). The state has a probability of being able to take that state (output probability), and the transition is a probability that the transition can occur (transition probability). To model the stochastic process. At this time, if the state is associated with a phoneme, a word that is a phoneme string can be described by an HMM (word HMM) in which a predetermined state is linked. A sentence can be described in a larger HMM in which predetermined word HMMs are chained. At this time, from a large number of example sentences (corpus), which phoneme follows each phoneme (biphone, triphone) and which word follows each word (bigram, trigram) is expressed by probability. The transition probability of the HMM can be set using a language model.

初期状態をシンボル的な文頭無音（ＳｉｌＢ）として、可能な文を構成する単語列を木構造に展開すると、木の枝葉の終端に配置されるこれもシンボル的な文末無音（ＳｉｌＥ）に達した最終状態までの経路で決まる音素列が文法情報で定義される可能な文（文仮説）を表す。 When the initial state is symbolic sentence silence (SilB) and the word strings that make up a possible sentence are expanded into a tree structure, this is also placed at the end of the tree branch and leaves, which also reaches symbolic sentence silence (SilE). A phoneme string determined by the route to the final state represents a possible sentence (sentence hypothesis) defined by grammatical information.

分離音声は入力系列としての音素記号列に変換されている。最尤文仮説探索部３６４は、文法情報から生成されるＨＭＭ上のどの文仮説がこの入力系列を最も良く説明できるかを探索する。この探索にはビームサーチを用いる。ビームサーチはその時点で有望な幾つかの仮説を残し、あまり有望でない残りの仮説を破棄しながら探索を進める手法である。幾つの仮説を残すかという基準を「ビーム幅」と呼び、仮説の破棄作業を「枝刈り」と呼ぶ。 The separated speech is converted into a phoneme symbol string as an input sequence. The maximum likelihood sentence hypothesis searching unit 364 searches which sentence hypothesis on the HMM generated from the grammatical information can best explain this input sequence. A beam search is used for this search. Beam search is a technique in which several hypotheses that are promising at that time are left and the search is advanced while the remaining hypotheses that are not very promising are discarded. The criterion for how many hypotheses are left is called “beam width”, and the hypothesis discarding operation is called “pruning”.

入力音素記号列の先頭の音素記号（Ｎベストなので最大Ｎ個ある）と同じ音素記号に対応する状態を、ＨＭＭ上の初期状態から遷移可能な全ての状態の中で探索する。同じ音素記号に対応する遷移可能な状態が見つかると、その出力確率をその音素の音響スコア、初期状態からその状態への遷移確率を言語スコアとして、音響スコアと言語スコアの積をこの遷移経路の尤度とする。そして、尤度で上位Ｍ位までを残して他の経路を破棄（枝刈り）する。 Among all states reachable from the initial state of the HMM, the states corresponding to the same phoneme symbol as the first phoneme symbol of the input phoneme symbol string (up to N of them, since the N best are given) are searched for. When such a reachable state is found, its output probability is taken to be the acoustic score of that phoneme and the transition probability from the initial state to that state is taken to be the language score, and the product of the acoustic score and the language score is the likelihood of this transition path. The paths with the top M likelihoods are kept and the other paths are discarded (pruned).

入力音素記号列の次の音素記号（Ｎベストなので最大Ｎ個ある）についても、枝刈りで残った状態から次に遷移可能な全ての状態の中で、同じ音素記号に対応する状態を探索する。そのような状態が見つかると、同様に音響スコアと言語スコアが定められ、その積をそこに至った遷移経路のこれまでの尤度に掛けて新しい尤度とする。そして、同様に新しい尤度で上位Ｍ位までを残して経路を枝刈りする。 For the next phoneme symbol of the input phoneme symbol string (again up to N of them), the states corresponding to the same phoneme symbol are likewise searched for among all states reachable from the states that survived pruning. When such a state is found, the acoustic score and the language score are determined in the same way, and their product is multiplied by the likelihood so far of the transition path leading there to give the new likelihood. The paths are again pruned, keeping those with the top M new likelihoods.

以上の処理を終端に達するまで繰り返し、終端到達時点で最も尤度の高い遷移経路を求め、その遷移経路が辿った単語列を分離音声データの言語的解釈、すなわち認識結果とする。 The above processing is repeated until the end is reached, the transition path having the highest likelihood at the end arrival time is obtained, and the word string followed by the transition path is used as the linguistic interpretation of the separated speech data, that is, the recognition result.
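The search described above can be sketched as a small beam search in Python. This is a toy illustration under assumptions rather than the embodiment's decoder: the HMM is reduced to a dictionary of transitions with one phoneme per state, the state names (`S`, `a1`, ...) and probabilities are invented for the example, and the result is simply the best surviving path that ends in a final state.

```python
def beam_search(transitions, phoneme_of, start, finals, frames, beam_width):
    """Return (likelihood, path) of the best path ending in a final state.

    transitions: state -> list of (next_state, transition_probability)
    phoneme_of:  state -> phoneme symbol
    frames:      per input position, the N-best list of
                 (phoneme_symbol, acoustic_score)
    beam_width:  M, the number of paths kept after each pruning
    """
    beam = [(1.0, [start])]
    for nbest in frames:
        scores = dict(nbest)
        expanded = []
        for like, path in beam:
            for nxt, p_trans in transitions.get(path[-1], []):
                ph = phoneme_of[nxt]
                if ph in scores:
                    # acoustic score x language score x likelihood so far
                    expanded.append((like * scores[ph] * p_trans, path + [nxt]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        beam = expanded[:beam_width]          # pruning: keep the top M
    finished = [h for h in beam if h[0] > 0 and h[1][-1] in finals]
    return max(finished, key=lambda h: h[0]) if finished else None


# Toy grammar with two sentence hypotheses, "a k i" and "a k o".
transitions = {"S": [("a1", 1.0)], "a1": [("k1", 1.0)],
               "k1": [("i1", 0.6), ("o1", 0.4)]}
phoneme_of = {"a1": "a", "k1": "k", "i1": "i", "o1": "o"}
frames = [[("a", 0.9), ("o", 0.1)], [("k", 0.8)], [("i", 0.3), ("o", 0.7)]]
best = beam_search(transitions, phoneme_of, "S", {"i1", "o1"}, frames, 2)
```

Here the path ending in `o1` wins even though its transition probability is lower, because its acoustic score in the last frame is higher.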

なお、確率を対数化して扱うと、確率の積や尤度の積を全て足し算で行うことができる。また、遷移確率を全て１として、尤度を音響スコアにのみ依存させることも可能である。また、認識後の後処理のために、最終的な尤度で上位Ｋ位までの遷移経路を認識結果（の候補）として出力することも可能である。 If the probabilities are treated logarithmically, all the products of probabilities and products of likelihood can be added. It is also possible to make the transition probabilities all 1 and make the likelihood depend only on the acoustic score. In addition, for post-processing after recognition, it is possible to output a transition path up to the top K position with a final likelihood as a recognition result (candidate).
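The remark about logarithms can be checked with a few lines of Python. The function name and the sample scores are hypothetical; the point is only that the product of acoustic and language scores along a path becomes a sum of their logarithms, which also avoids numerical underflow on long paths.

```python
import math


def path_log_likelihood(acoustic_scores, language_scores):
    """Sum of log scores along a path, equal to the log of their product."""
    return sum(math.log(a) + math.log(l)
               for a, l in zip(acoustic_scores, language_scores))


acoustic = [0.9, 0.8, 0.7]
language = [1.0, 1.0, 0.4]
log_like = path_log_likelihood(acoustic, language)
product = math.prod(a * l for a, l in zip(acoustic, language))
```

Setting every language score to 1 contributes log 1 = 0 to the sum, which corresponds to the case in which the likelihood depends only on the acoustic scores.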

以上のようにして認識結果が得られるわけだが、入力系列にはダミー音素記号が含まれている。最尤文仮説探索部３６４は、次の２つの方式のいずれかを使ってダミー音素記号を処理する。なお、どの方式を利用するかは設定により変更可能であるものとする。 Although the recognition result is obtained as described above, a dummy phoneme symbol is included in the input series. The maximum likelihood sentence hypothesis search unit 364 processes the dummy phoneme symbols using one of the following two methods. It should be noted that which method is used can be changed by setting.

（ダミー対応方式１：枝刈り停止、尤度計算停止）
この方式では、最尤文仮説探索部３６４は、ダミー音素記号に遭遇すると、現時点で有効な状態から次に遷移可能な全ての状態への遷移を枝刈りせずに残す。このとき、残された遷移経路の尤度を更新しない。すなわち、信頼可と分類されたフレームだけを尤度計算に用い、信頼不可と分類されたフレームで枝刈りをしないのである（枝外りの抑制）。このようにすることで、元々発話のない無音やポーズと異なり、発話の途中で目的音声が雑音にまぎれてしまっても、その期間を乗り越えて尤度計算を行いながら認識処理を継続することができるようになる。このように、仮に信頼不可期間が無音区間と偶然一致していたとしても、その期間で枝刈りをしないことが、特許文献２と異なる点である。この相違は、本発明が信頼不可として検出しようとする期間が、目的音声のない実際に無音となっている期間だけでなく、目的音声が存在し、それが雑音にまぎれて聞き取りにくくなっている期間をも包含するからである。次に目的音声が聞こえ始めたときには、発話はずっと先に進んでしまっている、という事態への対処である。 (Dummy support method 1: Stop pruning, stop likelihood calculation)
In this method, when the maximum likelihood sentence hypothesis search unit 364 encounters a dummy phoneme symbol, it keeps the transitions from the currently valid states to all states reachable next, without pruning them, and does not update the likelihoods of the retained transition paths. In other words, only frames classified as reliable are used for the likelihood calculation, and no pruning is performed in frames classified as unreliable (pruning is suppressed). Unlike silence or a pause, in which there is no utterance to begin with, the target speech may be masked by noise in the middle of an utterance; even then, the recognition process can get through that period and continue calculating likelihoods. Not pruning during an unreliable period, even if that period happens to coincide with a silent section, is what differs from Patent Document 2. The difference arises because the periods that the present invention detects as unreliable include not only periods that are actually silent, with no target speech, but also periods in which the target speech is present but is masked by noise and hard to hear. This handles the situation in which the utterance has already moved far ahead by the time the target speech becomes audible again.
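Dummy handling method 1 can be illustrated by a single beam-search step that treats dummy frames specially. The following Python sketch is an assumption-laden toy, not the embodiment's code: the marker `DUMMY`, the function `advance`, and the small HMM are all invented for the example.

```python
DUMMY = "<dummy>"  # hypothetical marker for the dummy phoneme symbol


def advance(beam, nbest, transitions, phoneme_of, beam_width):
    """Advance the beam by one frame.

    beam:        list of (likelihood, path)
    nbest:       list of (phoneme_symbol, acoustic_score) for this frame
    transitions: state -> list of (next_state, transition_probability)
    """
    if any(ph == DUMMY for ph, _ in nbest):
        # Unreliable frame: follow every possible transition, leave the
        # likelihoods unchanged, and do not prune.
        return [(like, path + [nxt])
                for like, path in beam
                for nxt, _ in transitions.get(path[-1], [])]
    scores = dict(nbest)
    expanded = [(like * scores[phoneme_of[nxt]] * p, path + [nxt])
                for like, path in beam
                for nxt, p in transitions.get(path[-1], [])
                if phoneme_of[nxt] in scores]
    expanded.sort(key=lambda h: h[0], reverse=True)
    return expanded[:beam_width]      # normal frame: keep the top M


transitions = {"S": [("a1", 1.0)], "a1": [("k1", 1.0)],
               "k1": [("i1", 0.6), ("o1", 0.4)]}
phoneme_of = {"a1": "a", "k1": "k", "i1": "i", "o1": "o"}
beam = [(1.0, ["S"])]
for nbest in [[("a", 0.9)], [("k", 0.8)], [(DUMMY, 1.0)]]:
    beam = advance(beam, nbest, transitions, phoneme_of, beam_width=1)
```

Even with a beam width of 1, both continuations of `k1` survive the dummy frame with the likelihood they had before it, so the search can resume once reliable frames return.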

（ダミー対応方式２：ダミー音素状態の挿入）
この方式では、最尤文仮説探索部３６４は、ＨＭＭ上の全ての遷移先と並列にダミー音素に対応した状態への遷移を挿入する。ダミー音素記号に遭遇すると、現時点で有効な状態からダミー音素状態への遷移が起こる。ダミー音素状態の出力確率（音響スコア）を１に、そこへの遷移確率（言語スコア）にも１を入れておくことで、遷移経路の尤度は変更されない。すなわち、ダミー音素を加味したＨＭＭを生成しておくだけで、後の計算は全て通常通りに行うことができ、ダミー対応方式１のような例外処理を必要としない。ただし、ダミー音素状態の挿入によりＨＭＭの規模が膨らむので、この方式は小規模な文法情報に対して用いると良い。 (Dummy support method 2: Insert dummy phoneme state)
In this method, the maximum likelihood sentence hypothesis search unit 364 inserts a transition to a state corresponding to a dummy phoneme in parallel with all transition destinations on the HMM. When a dummy phoneme symbol is encountered, a transition from the currently valid state to the dummy phoneme state occurs. By setting the output probability (acoustic score) of the dummy phoneme state to 1 and also entering 1 in the transition probability (language score) there, the likelihood of the transition path is not changed. That is, all the subsequent calculations can be performed as usual only by generating an HMM that takes dummy phonemes into account, and exception processing as in the dummy handling method 1 is not required. However, since the scale of the HMM is expanded by inserting the dummy phoneme state, this method is preferably used for small grammatical information.

なお、音響特徴照合部３６２が信頼不可フレームにダミー音素記号を対応付けるのではなく、最尤文仮説探索部３６４が直接信頼可否情報を参照して上述した例外処理（ダミー対応方式１）を行うようにすることも可能である。 Note that the acoustic feature matching unit 362 does not associate the dummy phoneme symbol with the unreliable frame, but the maximum likelihood sentence hypothesis searching unit 364 directly refers to the reliability information to perform the above-described exception processing (dummy handling method 1). It is also possible to do.

（話者認識部６）
話者認識部６は、例えばＡさんの声であるなど、入力される音源ストリームの分離音声が誰の声かを認識する。そのための話者認識部６の内部構成を図１５に示す。音響特徴抽出部２７１は、入力される音源ストリームの分離音声のうち、信頼可フレームの音声のみから所定の音響特徴（例えばフォルマントなど）の時系列データを抽出する。これを話者認識用の入力音響特徴と呼ぶことにする。音響特徴照合部２７２は、標準話者特徴データベース２７３に記憶されている話者毎の話者認識用の標準音響特徴と前記話者認識用の入力音響特徴とを照合し、話者認識用の入力音響特徴列全域にわたる類似度の平均を話者毎に計算する。この平均類似度が所定閾値以上で最大となる話者を当該分離音声の発話者として認定し、その話者ＩＤを出力する。もし、閾値以上の平均類似度が得られなければ、当該分離音声の話者は不明であることを表す特別なＩＤを出力する。 (Speaker recognition unit 6)
The speaker recognition unit 6 recognizes whose voice the separated speech of the input sound source stream is, for example that it is the voice of person A. FIG. 15 shows the internal configuration of the speaker recognition unit 6 for this purpose. The acoustic feature extraction unit 271 extracts time-series data of predetermined acoustic features (for example formants) only from the reliable frames of the separated speech of the input sound source stream. These are called the input acoustic features for speaker recognition. The acoustic feature matching unit 272 matches the input acoustic features for speaker recognition against the standard acoustic features for speaker recognition stored for each speaker in the standard speaker feature database 273, and calculates, for each speaker, the average similarity over the entire input acoustic feature sequence. The speaker whose average similarity is the highest and is equal to or greater than a predetermined threshold is identified as the speaker of the separated speech, and the speaker ID is output. If no average similarity equal to or greater than the threshold is obtained, a special ID indicating that the speaker of the separated speech is unknown is output.
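The decision rule of the speaker recognition unit 6 can be sketched in Python as follows. The function `identify`, the scalar features, the similarity measure, and the ID `"unknown"` are hypothetical stand-ins chosen for illustration; the description does not specify how similarity is computed.

```python
def identify(features, reliable, references, similarity, threshold,
             unknown_id="unknown"):
    """Return the ID whose average similarity is highest and >= threshold.

    features:   per-frame input acoustic features
    reliable:   per-frame reliability flags (True = reliable)
    references: speaker ID -> standard acoustic feature
    similarity: callable (feature, reference) -> similarity value
    """
    used = [f for f, ok in zip(features, reliable) if ok]  # reliable only
    if not used:
        return unknown_id
    best_id, best_avg = unknown_id, None
    for speaker_id, ref in references.items():
        avg = sum(similarity(f, ref) for f in used) / len(used)
        if avg >= threshold and (best_avg is None or avg > best_avg):
            best_id, best_avg = speaker_id, avg
    return best_id


def toy_similarity(f, ref):
    # Assumed similarity for scalar features: 1 when identical, falling
    # towards 0 as the values move apart.
    return 1.0 / (1.0 + abs(f - ref))


features = [100.0, 102.0, 500.0]
reliable = [True, True, False]        # the noisy third frame is ignored
references = {"A": 101.0, "B": 300.0}
speaker = identify(features, reliable, references, toy_similarity, 0.3)
```

The same rule, with object sound IDs and a standard object sound feature database in place of speakers, would apply to the object sound recognition unit 7.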

（物音認識部７）
物音認識部７は、例えば「ガラスの割れる音である」など、入力される音源ストリームの分離音声が何の物音であるかを認識する。そのための物音認識部７の内部構成を図１６に示す。音響特徴抽出部２８１は、入力される音源ストリームの分離音声のうち、信頼可フレームの音声のみから所定の音響特徴（例えばエンベロープや対数パワースペクトルなど：例えば、ガラスの割れる音は時間方向に見れば振幅が減少していく減衰性のエンベロープを示し、その対数パワースペクトルは白色に近い）の時系列データを抽出する。これを物音認識用の入力音響特徴と呼ぶことにする。音響特徴照合部２８２は、標準物音特徴データベース２８３に記憶されている物音毎の物音認識用の標準音響特徴と前記物音認識用の入力音響特徴とを照合し、物音認識用の入力音響特徴列全域にわたる類似度の平均を物音毎に計算する。この平均類似度が所定閾値以上で最大となる物音を当該分離音声の正体として認定し、その物音ＩＤを出力する。もし、閾値以上の平均類似度が得られなければ、当該分離音声の正体は不明であることを表す特別なＩＤを出力する。 (Sound recognition unit 7)
The object sound recognition unit 7 recognizes what kind of sound the separated speech of the input sound source stream is, for example "the sound of glass breaking". FIG. 16 shows the internal configuration of the object sound recognition unit 7 for this purpose. The acoustic feature extraction unit 281 extracts time-series data of predetermined acoustic features only from the reliable frames of the separated speech of the input sound source stream. Examples of such features are the envelope and the logarithmic power spectrum: the sound of glass breaking, for instance, shows a decaying envelope whose amplitude decreases over time, and its logarithmic power spectrum is close to white. These are called the input acoustic features for object sound recognition. The acoustic feature matching unit 282 matches the input acoustic features for object sound recognition against the standard acoustic features for object sound recognition stored for each object sound in the standard object sound feature database 283, and calculates, for each object sound, the average similarity over the entire input acoustic feature sequence. The object sound whose average similarity is the highest and is equal to or greater than a predetermined threshold is identified as the identity of the separated speech, and its object sound ID is output. If no average similarity equal to or greater than the threshold is obtained, a special ID indicating that the identity of the separated speech is unknown is output.

（出力部８）
出力部８は、音源の数、各音源の空間的な存在範囲（円錐面を決定させる存在角度φ）、前記音源を発した音声の時間的な存在期間（Ｔｓ、Ｔｅ）、前記音源毎の分離音声、前記分離音声の言語的内容、前記分離音声の話者の別、前記分離音声の物音の別、の少なくとも１つを含む音源情報を出力する手段である。 (Output unit 8)
The output unit 8 is a means for outputting sound source information that includes at least one of: the number of sound sources, the spatial existence range of each sound source (the existence angle φ that determines a conical surface), the temporal existence period (Ts, Te) of the sound emitted by each sound source, the separated speech of each sound source, the linguistic content of the separated speech, the identity of the speaker of the separated speech, and the identity of the object sound of the separated speech.

（ユーザインタフェース部９）
ユーザインタフェース部９は、上述した音声認識処理に必要な各種設定内容の利用者への呈示、利用者からの設定入力受理、設定内容の外部記憶装置への保存と外部記憶装置からの読み出しを実行したり、図８や図９に示した（１）マイク毎の周波数成分の表示、（２）散布図の表示、（３）各種得票分布の表示、（４）極大位置の表示、（５）散布図上の直線群の表示、図１０に示した（６）音源ストリーム候補データの表示、のように各種処理結果や中間結果を可視化して利用者に呈示したり、所望のデータを利用者に選択させてより詳細に可視化するための手段である。このようにすることで、利用者が本実施形態に係る音声認識装置の働きを確認したり、所望の動作を行ない得るように調整したり、以後は、調整済みの状態で本装置を利用したりすることが可能になる。 (User interface unit 9)
The user interface unit 9 is a means for presenting to the user the various settings required for the speech recognition processing described above, accepting setting input from the user, saving the settings to an external storage device and reading them back from it, and visualizing various processing results and intermediate results for the user, such as (1) the frequency components of each microphone, (2) the scatter diagram, (3) the various vote distributions, (4) the maximum positions, and (5) the straight line groups on the scatter diagram, shown in FIGS. 8 and 9, and (6) the sound source stream candidate data shown in FIG. 10. It also lets the user select desired data and visualizes that data in more detail. This allows the user to check the operation of the speech recognition apparatus according to the present embodiment, to adjust it so that it performs the desired operation, and thereafter to use the apparatus in the adjusted state.

（処理の流れ図）
図１７に本実施形態に係る音声認識装置における処理の流れを示す。本処理は、初期設定処理ステップＳ１と、音響信号入力処理ステップＳ２と、音源ストリーム抽出分類処理ステップＳ３と、音源分離処理ステップＳ４と、語彙認識処理ステップＳ５と、話者認識処理ステップＳ６と、物音認識処理ステップＳ７と、出力処理ステップＳ８と、終了判断処理ステップＳ９と、確認判断処理ステップＳ１０と、情報呈示・設定受理処理ステップＳ１１と、終了処理ステップＳ１２とから成る。 (Process flow diagram)
FIG. 17 shows the flow of processing in the speech recognition apparatus according to this embodiment. This processing includes initial setting processing step S1, acoustic signal input processing step S2, sound source stream extraction / classification processing step S3, sound source separation processing step S4, vocabulary recognition processing step S5, speaker recognition processing step S6, It consists of a sound recognition processing step S7, an output processing step S8, an end determination processing step S9, a confirmation determination processing step S10, an information presentation / setting acceptance processing step S11, and an end processing step S12.

初期設定処理ステップＳ１は、ユーザインタフェース部９における処理の一部を実行する処理ステップであり、音声認識処理に必要な各種設定内容を外部記憶装置から読み出して、装置を所定の設定状態に初期化する。 The initial setting processing step S1 is a processing step for executing a part of the processing in the user interface unit 9, and reads various setting contents necessary for the voice recognition processing from the external storage device, and initializes the device to a predetermined setting state. To do.

音響信号入力処理ステップＳ２は、音響信号入力部２における処理を実行する処理ステップであり、空間的に同一でない２つの位置で捉えられた２つの音響信号を入力する。 The acoustic signal input processing step S2 is a processing step for executing processing in the acoustic signal input unit 2, and inputs two acoustic signals captured at two positions that are not spatially identical.

音源ストリーム抽出分類処理ステップＳ３は、音源ストリーム抽出分類部３における処理を実行する処理ステップであり、（１）前記音響信号入力処理ステップＳ２による２つの入力音響信号をそれぞれ周波数分解し、（２）両入力音響信号の周波数毎の位相差値を算出し、該周波数毎の位相差値を、周波数をＹ軸、位相差値をＸ軸とするＸＹ座標系上の散布図データを生成し、（３）該散布図データから所定の直線を検出し、（４）検出された直線の情報に基づいて、前記音響信号の発生源たる音源の数、各音源の空間的な存在範囲、前記各音源を発した音声の時間的な存在期間をデータ化し、（５）該存在期間の各時刻を信頼可あるいは信頼不可と分類し、これらの情報を音源ストリーム情報として出力する。 The sound source stream extraction/classification processing step S3 is a processing step that executes the processing of the sound source stream extraction/classification unit 3. It (1) frequency-decomposes each of the two input acoustic signals from the acoustic signal input processing step S2, (2) calculates the phase difference value between the two input acoustic signals for each frequency and generates scatter diagram data in which the phase difference value of each frequency is plotted on an XY coordinate system with the frequency on the Y axis and the phase difference value on the X axis, (3) detects predetermined straight lines from the scatter diagram data, (4) based on the information of the detected straight lines, converts into data the number of sound sources generating the acoustic signals, the spatial existence range of each sound source, and the temporal existence period of the sound emitted by each sound source, and (5) classifies each time in the existence period as reliable or unreliable, and outputs this information as sound source stream information.

The sound source separation processing step S4 executes the processing of the sound source separation unit 4: it separates and extracts the sound of each sound source based on the sound source stream information.

The vocabulary recognition processing step S5 executes the processing of the vocabulary recognition unit 5: it recognizes the linguistic meaning of the separated and extracted sound of each sound source.

The speaker recognition processing step S6 executes the processing of the speaker recognition unit 6: it recognizes who the speaker of the separated and extracted sound of each sound source is.

The object sound recognition processing step S7 executes the processing of the object sound recognition unit 7: it recognizes what kind of object sound the separated and extracted sound of each sound source is.

The output processing step S8 executes the processing of the output unit 8: it outputs the sound source stream information and the results of the speech recognition.

The end determination processing step S9 executes part of the processing of the user interface unit 9. It checks whether the user has issued an end command and controls the flow of processing accordingly: if there is an end command, processing proceeds to the end processing step S12 (left branch); if there is none, processing proceeds to the confirmation determination processing step S10 (upper branch).

The confirmation determination processing step S10 executes part of the processing of the user interface unit 9. It checks whether the user has issued a confirmation command and controls the flow of processing accordingly: if there is a confirmation command, processing proceeds to the information presentation and setting acceptance processing step S11 (left branch); if there is none, processing proceeds to the acoustic signal input processing step S2 (upper branch).

The information presentation and setting acceptance processing step S11 is executed in response to a confirmation command from the user and executes part of the processing of the user interface unit 9. It presents the various settings required for speech recognition processing to the user, accepts setting input from the user, saves the settings to the external storage device in response to a save command, and reads the settings from the external storage device in response to a read command. It also visualizes various processing results and intermediate results for the user, and lets the user select desired data to visualize it in more detail. This allows the user to check the operation of the speech recognition processing, adjust it so that it performs the desired operation, and then continue processing in the adjusted state.

The end processing step S12 is executed in response to an end command from the user and executes part of the processing of the user interface unit 9: it automatically saves the various settings required for speech recognition processing to the external storage device.
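The control flow of steps S1 to S12 can be sketched as a simple loop. The following Python fragment only traces the order in which the steps would run for a given sequence of user commands; the command names and the `run` function are hypothetical, and the return to step S2 after step S11 is an assumption, since the text above states the destination only for the case where step S10 finds no confirmation command.

```python
def run(commands):
    """Trace the step order for a list of per-iteration user commands.

    Each entry is "end", "confirm", or None (no command). These names are
    illustrative only and do not come from the embodiment.
    """
    trace = ["S1"]                       # initial setting
    for command in commands:
        # Input, stream extraction/classification, separation,
        # the three recognition steps, and output.
        trace += ["S2", "S3", "S4", "S5", "S6", "S7", "S8"]
        trace.append("S9")               # end determination
        if command == "end":
            trace.append("S12")          # end processing: save settings
            break
        trace.append("S10")              # confirmation determination
        if command == "confirm":
            trace.append("S11")          # information presentation / setting acceptance
        # Assumption: control returns to S2 after S10 or S11.
    return trace

steps = run([None, "confirm", "end"])
```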

Several modifications of the embodiment described above are described below.

(Parallel implementation of multiple systems)
The example above was described using the simplest configuration, with two microphones. As shown in FIG. 18, however, it is also possible to provide N microphones (N ≥ 3) and to form up to M microphone pairs (1 ≤ M ≤ C(N, 2), where C(N, 2) = N(N-1)/2 is the number of ways of choosing 2 of the N microphones).

In the figure, 11 to 13 are the N microphones. 22 is means for inputting the N acoustic signals from the N microphones. 23 is means for frequency-decomposing each of the N input acoustic signals, generating scatter diagram data for each of M pairs (1 ≤ M ≤ C(N, 2)), each pair consisting of two of the N acoustic signals, detecting predetermined straight lines from each of the M sets of scatter diagram data, and generating, from each of the M sets of detected straight line information, sound source stream information that includes reliability information for each frame. 24 is means for extracting the separated sound of each sound source using the generated sound source stream information. 25 is means for recognizing the linguistic content of the extracted separated sound. 26 is means for recognizing the speaker of the extracted separated sound. 27 is means for recognizing the kind of object sound of the extracted separated sound. 28 is means for outputting the sound source stream information and the results of the speech recognition. 29 is means for presenting various setting values, including information on the microphones that make up each pair, to the user, accepting setting input from the user, saving setting values to the external storage device, reading setting values from the external storage device, and presenting various processing results to the user. The processing for each microphone pair is the same as in the embodiment described so far, and this processing is executed in parallel for the plurality of microphone pairs.
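The upper bound on the number of microphone pairs can be checked with a short Python sketch. The function name `microphone_pairs` and the use of the reference numerals 11 to 13 as microphone identifiers are choices made for this example only.

```python
from itertools import combinations

def microphone_pairs(mic_ids, max_pairs=None):
    """Enumerate unordered microphone pairs, optionally keeping only the first M."""
    pairs = list(combinations(mic_ids, 2))
    return pairs if max_pairs is None else pairs[:max_pairs]

# With N = 3 microphones there are at most C(3, 2) = 3 pairs.
all_pairs = microphone_pairs([11, 12, 13])
some_pairs = microphone_pairs([11, 12, 13], max_pairs=2)
```

Which M pairs to keep when M is smaller than C(N, 2) is a design decision that depends on the microphone layout; the sketch simply takes the first M in enumeration order.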

In this way, even if each microphone pair has sound source directions that it handles well and directions that it handles poorly, covering the surroundings with a plurality of microphone pairs makes it possible to detect, localize, and recognize target sound sources over a wide range of directions around the apparatus.

(Implementation using a computer: program)
The present invention can also be implemented using a computer, as shown in FIG. 19. In the figure, 31 to 33 are the N microphones. 40 is A/D conversion means for inputting the N acoustic signals from the N microphones, and 41 is a CPU that executes program instructions for processing the N input acoustic signals. 42 to 47 are standard devices that make up a computer: a RAM 42, a ROM 43, an HDD 44, a mouse/keyboard 45, a display 46, and a LAN 47, respectively. 50 to 52 are drives for supplying programs and data to the computer from outside via storage media: a CD-ROM 50, an FDD 51, and a CF/SD card 52, respectively. 48 is D/A conversion means for outputting an acoustic signal, and a speaker 49 is connected to its output. This computer functions as an acoustic signal processing apparatus by storing in the HDD 44 an acoustic signal processing program consisting of the processing steps shown in FIG. 17, reading it into the RAM 42, and executing it on the CPU 41. The functions of the user interface unit 9 described above are realized by using the HDD 44 as the external storage device, the mouse/keyboard 45 to accept operation input, and the display 46 and the speaker 49 as information presenting means. The sound source information obtained by the acoustic signal processing is saved to the RAM 42, the ROM 43, or the HDD 44, or is output by communication via the LAN 47.

(Recording medium)
The present invention can also be implemented as a recording medium, as shown in FIG. 20. In the figure, 61 is a recording medium, realized as a CD-ROM, CF card, SD card, floppy (registered trademark) disk, or the like, on which the signal processing program according to the present invention is recorded. Inserting the recording medium 61 into an electronic device 62 such as a television or a computer, into an electronic device 63, or into a robot 64 makes the program executable on that device. Alternatively, the electronic device 63 that has been supplied with the program can supply it by communication to another electronic device 65 or to the robot 64, making the program executable on the electronic device 65 or the robot 64.

According to the embodiment of the present invention described above, the following effects are obtained.

(1) The sound source stream extraction and classification means can detect, among all the frames of the target sound source stream to be recognized, the frames in which noise is dominant. By attaching to each frame reliability information (a reliable flag or an unreliable flag) that indicates whether noise is dominant, it makes this information available to the various speech recognition means that follow.

In this respect, and in particular in comparison with Non-Patent Document 1, the prior art judges the reliability of each element (frequency component) of the acoustic features based on the noise estimated by the sound source separation means, whereas the present invention judges reliability in the sound source detection process, ahead of the sound source separation process.
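The per-frame reliability flag can be pictured with the following Python sketch. It assumes the simplest possible rule, a single threshold on the straight line score of each frame, with `None` standing for a frame in which no straight line was detected. The function name, the threshold value, and the rule itself are illustrative assumptions; this section does not specify the determination methods actually used in the embodiment.

```python
def classify_frames(line_scores, reliable_threshold):
    """Return one flag per frame: True = reliable, False = unreliable.

    line_scores holds the straight line score of the stream in each frame,
    or None for frames in which no straight line was detected (assumed
    convention for this sketch).
    """
    return [score is not None and score >= reliable_threshold
            for score in line_scores]

# Hypothetical scores for four consecutive frames of one sound source stream.
flags = classify_frames([12.0, 3.0, None, 9.5], reliable_threshold=8.0)
```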

(2) When converting the input acoustic feature sequence into a phoneme symbol sequence, the vocabulary recognition means converts the acoustic features of unreliable frames directly into dummy phonemes without matching them against the standard acoustic features, which reduces the computational cost of acoustic feature matching.

(3) When searching for the sentence hypothesis that fits the input acoustic feature sequence (input phoneme symbol sequence), the vocabulary recognition means lets only reliable frames take part in the likelihood calculation and does not prune in unreliable frames. This prevents the search from breaking down and suppresses misrecognition in noisy environments.

(4) When converting the input acoustic feature sequence into a phoneme symbol sequence, the vocabulary recognition means converts the acoustic features of unreliable frames directly into dummy phonemes without matching them against the standard acoustic features, and, when searching for the sentence hypothesis that fits the input acoustic feature sequence (input phoneme symbol sequence), it evaluates sentence hypotheses using a search tree that takes the dummy phonemes into account. This prevents the search from breaking down and suppresses misrecognition in noisy environments.

Regarding these three points, and in particular in comparison with Patent Document 2: the prior art detects silent periods and narrows the beam width, but its purpose is to reduce the computational cost during silent periods. That operation is valid only when the silent period is a pause within the utterance, that is, only when no part of the utterance is lost during that period in a way that would break consistency with the sentence hypothesis being searched. The present invention, in contrast, assumes that an unreliable period is by no means silence like a pause, but a period in which the target speech may be missing because it was overwhelmed by noise. It therefore uses the reliability information to keep this loss from breaking the search, and widens the beam width during such periods. As a result, even if the target speech is overwhelmed by noise for some period in a noisy environment, recognition does not break down in that period and can continue using the speech from the periods that were not overwhelmed as clues.
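The effect of suppressing pruning in unreliable frames can be seen in a deliberately small Python sketch. It is a toy model, not the search used in the embodiment: each sentence hypothesis is reduced to one expected symbol per frame, the log-likelihood is 0 for a match and -5 for a mismatch, and all names (`beam_search`, `loglik`, the candidate words) are hypothetical.

```python
def beam_search(hypotheses, frame_loglik, reliable, beam_width):
    """Frame-synchronous beam search that skips unreliable frames.

    hypotheses: dict mapping a name to its expected symbol for each frame.
    frame_loglik(t, symbol): log-likelihood of `symbol` at frame t.
    reliable: one boolean per frame.
    """
    scores = {name: 0.0 for name in hypotheses}
    for t, ok in enumerate(reliable):
        if not ok:
            continue  # unreliable frame: no likelihood update and no pruning
        for name in scores:
            scores[name] += frame_loglik(t, hypotheses[name][t])
        best = max(scores.values())
        # Prune hypotheses that fall more than beam_width below the best one.
        scores = {n: s for n, s in scores.items() if s >= best - beam_width}
    return max(scores, key=scores.get)

# The speaker said "kai", but noise corrupted the middle frame into "o".
observed = ["k", "o", "i"]

def loglik(t, symbol):
    return 0.0 if observed[t] == symbol else -5.0

candidates = {"kai": ["k", "a", "i"], "koe": ["k", "o", "e"]}
best_with_flags = beam_search(candidates, loglik, [True, False, True], 3.0)
best_all_reliable = beam_search(candidates, loglik, [True, True, True], 3.0)
```

When the corrupted frame is treated as reliable, the correct hypothesis is pruned at that frame and cannot recover. When it is flagged unreliable, both hypotheses survive it and the later clean frame decides the result.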

(5) When matching input speech features against standard speaker features, the speaker recognition means matches only the speech features of reliable frames, which suppresses misrecognition in noisy environments.

(6) When matching input sound features against standard object sound features, the object sound recognition means matches only the sound features of reliable frames, which suppresses misrecognition in noisy environments.

Regarding these two points, the identity of the speaker and the kind of object sound are both regarded as the class to which a sound belongs. The standard acoustic features of each class are therefore matched against the input features, and the class that obtains the highest similarity is determined to be the class of that sound. This holds for recognition in general. If bad data is fed into recognition, the result is the failure known as misrecognition. If only good data can be selected for recognition, the result should improve. According to the present invention, the reliability information indicates whether the data is good or bad (how clear the sound is), so the recognition process can select and evaluate only the good data.
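The class recognition described above can be sketched in Python as follows. The sketch assumes that each class is represented by a single reference feature vector and that similarity is the negative squared distance between that vector and the mean of the reliable frames; both choices, as well as the function and class names, are illustrative assumptions rather than details of the embodiment.

```python
def recognize_class(features, reliable, class_features):
    """Pick the class whose reference feature is most similar to the input.

    Only frames flagged reliable contribute. Returns None if no frame is reliable.
    """
    kept = [f for f, ok in zip(features, reliable) if ok]
    if not kept:
        return None
    dim = len(kept[0])
    mean = [sum(f[i] for f in kept) / len(kept) for i in range(dim)]

    def similarity(ref):
        # Negative squared Euclidean distance: larger means more similar.
        return -sum((m - r) ** 2 for m, r in zip(mean, ref))

    return max(class_features, key=lambda c: similarity(class_features[c]))

# Hypothetical two-dimensional features; frames 1 and 2 are dominated by noise.
classes = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
frames = [[0.9, 0.1], [0.1, 1.0], [0.0, 1.5], [1.0, 0.0]]
with_flags = recognize_class(frames, [True, False, False, True], classes)
without_flags = recognize_class(frames, [True, True, True, True], classes)
```

In this example the noisy frames pull the average toward the wrong class when every frame is used, while restricting the match to reliable frames yields the class of the clean frames.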

The present invention is not limited to the embodiment exactly as described above; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

FIG. 1: Functional block diagram of a speech recognition apparatus according to an embodiment of the present invention
FIG. 2: Block diagram showing the internal configuration of the sound source stream extraction and classification unit
FIG. 3: Explanatory diagram of phase difference calculation
FIG. 4: Explanatory diagram of coordinate value calculation
FIG. 5: Explanatory diagram of the cyclic nature of the phase difference
FIG. 6: Explanatory diagram of the function value of the average power that is voted
FIG. 7: Diagram of the relationship between θ and Δρ
FIG. 8: Diagram showing the frequency components, scatter diagram, and Hough voting results during simultaneous speech
FIG. 9: Diagram showing the result of searching for maximum positions by summing the votes at several locations spaced Δρ apart
FIG. 10: Diagram for explaining the tracking of θ on the time axis
FIG. 11: Diagram for explaining reliability determination method 1
FIG. 12: Diagram for explaining reliability determination method 2
FIG. 13: Diagram for explaining the internal configuration of the sound source separation unit
FIG. 14: Block diagram showing the internal configuration of the vocabulary recognition unit
FIG. 15: Block diagram showing the internal configuration of the speaker recognition unit
FIG. 16: Block diagram showing the internal configuration of the object sound recognition unit
FIG. 17: Flowchart showing the flow of speech recognition processing according to an embodiment of the present invention
FIG. 18: Functional block diagram showing a modified embodiment using N microphones
FIG. 19: Functional block diagram showing an embodiment using a computer
FIG. 20: Diagram showing an embodiment using a recording medium

Explanation of symbols

1a, 1b: microphones; 2: acoustic signal input unit; 3: sound source stream extraction and classification unit; 4: sound source separation unit; 5: vocabulary recognition unit; 6: speaker recognition unit; 7: object sound recognition unit; 8: output unit; 9: user interface unit

Claims

Input means for inputting the first and second acoustic signals captured at two points;
Calculating means for frequency-decomposing each of the first and second acoustic signals to obtain a frequency component, and calculating a phase difference and power for each frequency component;
Generating means for generating a scatter diagram having the value of the frequency component and the value of the phase difference as coordinate values;
Detection means for detecting an arrangement of frequency components showing linearity on the scatter diagram together with a straight line score corresponding to the power, and detecting an arrangement of frequency components whose straight line score is equal to or greater than a threshold as a straight line indicating the presence of a sound source;
Extraction means for extracting a sound source stream in which at least one straight line detected by the detection means is grouped in the time axis direction while allowing straight line non-detection periods and straight line slope fluctuation within a certain range, the sound source stream including information on the slope of the straight line, the straight line score, and the time at which the straight line was detected;
Classifying means for assigning reliability information based on the level of the straight line score to the time of the sound source stream, and classifying each frame of the sound source stream;
Sound source separation means for extracting sound data of the sound source stream based on the sound source existence angle calculated from the information of the slope of the straight line included in the sound source stream, and separating the sound source;
Speech recognition means for expanding sentence hypotheses defined in grammatical information into a search tree of states and transitions, extracting predetermined acoustic features from the sound data of the sound source stream, calculating the likelihood of state transition paths of the search tree for the sequence of acoustic features, and recognizing the linguistic content of the sound source stream by searching for a state transition path with high likelihood,
A speech recognition apparatus that controls the search of the state transition path based on the reliability information.

The speech recognition apparatus according to claim 1, wherein the speech recognition means performs, in the search, a beam search with pruning based on likelihood,
calculates the likelihood for times classified as reliable, and
suppresses the pruning for times classified as unreliable.

The speech recognition apparatus according to claim 1, wherein the speech recognition means uses, in the search, a search tree in which a dummy state is added in parallel with each state of the search tree,
makes a transition to a state other than the dummy state for times classified as reliable, and
makes a transition to the dummy state for times classified as unreliable.

Input means for inputting the first and second acoustic signals captured at two points;
Calculating means for frequency-decomposing each of the first and second acoustic signals to obtain a frequency component, and calculating a phase difference and power for each frequency component;
Generating means for generating a scatter diagram having the value of the frequency component and the value of the phase difference as coordinate values;
Detection means for detecting an arrangement of frequency components showing linearity on the scatter diagram together with a straight line score corresponding to the power, and detecting an arrangement of frequency components whose straight line score is equal to or greater than a threshold as a straight line indicating the presence of a sound source;
Sound source stream extraction means for extracting a sound source stream in which at least one straight line detected by the detection means is grouped in the time axis direction while allowing straight line non-detection periods and straight line slope fluctuation within a certain range, the sound source stream including information on the slope of the straight line, the straight line score, and the time at which the straight line was detected;
Classifying means for assigning reliability information based on the level of the straight line score to the time of the sound source stream, and classifying each frame of the sound source stream;
Sound source separation means for extracting sound data of the sound source stream based on the sound source existence angle calculated from the information of the slope of the straight line included in the sound source stream, and separating the sound source;
Feature extraction means for extracting a predetermined feature from audio data at a time determined to be reliable by the reliability information among audio data of the sound source stream;
A calculation means for calculating a similarity between the feature and a class feature learned in advance for each class to be identified;
Recognizing means for recognizing the class of the class feature having the highest similarity as the class of the voice data.

An input step for inputting first and second acoustic signals captured at two points;
A calculation step of frequency-decomposing each of the first and second acoustic signals to obtain a frequency component, and calculating a phase difference and power for each frequency component;
A generation step of generating a scatter diagram having the value of the frequency component and the value of the phase difference as coordinate values;
A detection step of detecting an arrangement of frequency components showing linearity on the scatter diagram together with a straight line score corresponding to the power, and detecting an arrangement of frequency components whose straight line score is equal to or greater than a threshold as a straight line indicating the presence of a sound source;
An extraction step of extracting a sound source stream in which at least one straight line detected in the detection step is grouped in the time axis direction while allowing straight line non-detection periods and straight line slope fluctuation within a certain range, the sound source stream including information on the slope of the straight line, the straight line score, and the time at which the straight line was detected;
A classification step of assigning reliability information based on the level of the straight line score to the time of the sound source stream, and classifying each frame of the sound source stream;
A sound source separation step of extracting sound data of the sound source stream based on a sound source existing angle calculated from information on the slope of the straight line included in the sound source stream, and separating the sound source;
A speech recognition step of expanding sentence hypotheses defined in grammatical information into a search tree of states and transitions, extracting predetermined acoustic features from the sound data of the sound source stream, calculating the likelihood of state transition paths of the search tree for the sequence of acoustic features, and recognizing the linguistic content of the sound source stream by searching for a state transition path with high likelihood,
A speech recognition method, wherein the search of the state transition path is controlled based on the reliability information.

An input step for inputting first and second acoustic signals captured at two points;
A calculation step of frequency-decomposing each of the first and second acoustic signals to obtain a frequency component, and calculating a phase difference and power for each frequency component;
A generation step of generating a scatter diagram having the value of the frequency component and the value of the phase difference as coordinate values;
A detection step of detecting an arrangement of frequency components showing linearity on the scatter diagram together with a straight line score corresponding to the power, and detecting an arrangement of frequency components whose straight line score is equal to or greater than a threshold as a straight line indicating the presence of a sound source;
A sound source stream extraction step of extracting a sound source stream in which at least one straight line detected in the detection step is grouped in the time axis direction while allowing straight line non-detection periods and straight line slope fluctuation within a certain range, the sound source stream including information on the slope of the straight line, the straight line score, and the time at which the straight line was detected;
A classification step of assigning reliability information based on the level of the straight line score to the time of the sound source stream, and classifying each frame of the sound source stream;
A sound source separation step of extracting sound data of the sound source stream based on a sound source existing angle calculated from information on the slope of the straight line included in the sound source stream, and separating the sound source;
A feature extraction step of extracting a predetermined feature from audio data at a time determined to be reliable by the reliability information among the audio data of the sound source stream;
A calculation step for calculating a similarity between the feature and a class feature learned in advance for each class to be identified;
And a recognition step for recognizing the class of the class feature having the highest similarity as the class of the voice data.