JP2009276365A

JP2009276365A - Processor, voice recognition device, voice recognition system and voice recognition method

Info

Publication number: JP2009276365A
Application number: JP2008124497A
Authority: JP
Inventors: Seisho Watabe; 生聖渡部
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2008-05-12
Filing date: 2008-05-12
Publication date: 2009-11-26

Abstract

PROBLEM TO BE SOLVED: To favorably divide a reverberation pattern into an initial reflection component and a diffuse reverberation component. SOLUTION: The processing device measures a reverberation pattern from an impulse response and divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half, in order to perform speech recognition on input speech. The processing device includes: a decay time boundary calculation unit 6 that calculates a decay curve of the reverberation pattern and, based on the decay curve, calculates a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component; and an acoustic model parameter determination unit 7 that determines, based on the calculated decay time boundary, the analysis frame length, the frame shift, and the number of dynamic features used in the acoustic model. COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to speech recognition technology for speech uttered by a speaker, and in particular to a processing device, a speech recognition device, a speech recognition system, and a speech recognition method that can favorably divide a reverberation pattern into an initial reflection component and a diffuse reverberation component for processing.

In speech recognition in a real environment, speech uttered by a speaker becomes indistinct because of its reverberation, and recognition performance deteriorates. In particular, sound collection with a non-contact microphone such as a hands-free microphone is strongly affected by reverberation caused by the shape of the room and of the microphone's surroundings. Techniques that perform speech recognition processing in consideration of the reverberation pattern are known (for example, Patent Document 1).
Patent Document 1: JP 2007-65204 A

However, conventional speech recognition techniques learn the acoustic model in consideration of the entire reverberation pattern, and therefore cannot achieve sufficient recognition performance. Consider decomposing the reverberation into an initial reflection component, which is the first-order reflection component, and a diffuse reverberation component, which contains the subsequent second-order, third-order, and higher reflection components. The decomposed parts of the reverberation pattern differ greatly from each other in characteristics such as power and spectral shape. In an acoustic component into which the initial reflection component of the speaker's speech is mixed, the acoustic characteristics of the speech to be recognized still remain strongly, whereas in an acoustic component into which the diffuse reverberation component is mixed, the acoustic characteristics have already changed in nature. The acoustic characteristics of the speech to be recognized can therefore be regarded as lost in an acoustic component mixed with the diffuse reverberation component. For this reason, when reverberation is learned with an acoustic model, the diffuse reverberation component, whose influence extends beyond the frame length of the acoustic model, has hindered accurate phoneme learning.

A patent application by the present applicant (Japanese Patent Application No. 2008-122288) discloses a speech recognition method in which a reverberation pattern is measured in advance from an impulse response, and the measured reverberation pattern is divided, using the frame length used for learning the acoustic model as the reference, into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which follows the initial reflection component. After the reverberation pattern is divided in this way, the initial reflection component is absorbed through acoustic model learning when the acoustic model is learned in advance. When input speech is recognized, the diffuse reverberation component is removed from the input speech by spectral subtraction, and the input speech is recognized with reference to the acoustic model that takes the initial reflection component into account. In other words, the reverberation pattern is divided so that the initial reflection component is reflected in the acoustic model and the diffuse reverberation component is removed from the input speech. Performing speech recognition in this way makes more accurate recognition possible.

However, the characteristics of a reverberation pattern depend on the shape of the room and of the microphone's surroundings, so a boundary between the initial reflection component and the diffuse reverberation component that is determined from the frame length used for learning the acoustic model is not necessarily a good boundary for the environment. That is, when the reverberation pattern is divided on the basis of a predetermined frame length for acoustic model learning, it cannot always be divided well for the environment. Alternatively, when a user observes the measured reverberation pattern and determines the boundary, the determination has to be made after every reverberation measurement, which is laborious. A general-purpose method is therefore needed that determines a boundary well suited to the environment and allows the initial reflection component obtained by dividing the reverberation pattern to be learned well.

The present invention has been made to solve this problem, and its object is to provide a processing device, a speech recognition device, a speech recognition system, and a speech recognition method that can favorably divide a reverberation pattern into an initial reflection component and a diffuse reverberation component for processing.

A processing device according to the present invention measures a reverberation pattern from an impulse response and divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half, in order to perform speech recognition on input speech. The processing device includes: a decay time boundary calculation unit that calculates a decay curve of the reverberation pattern and, based on the decay curve, calculates a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component; and an acoustic model parameter determination unit that determines, based on the calculated decay time boundary, the analysis frame length, the frame shift, and the number of dynamic features used in the acoustic model.

With this configuration, the decay time boundary for dividing the reverberation pattern is calculated according to how the measured reverberation pattern decays, and the acoustic model parameters (the analysis frame length, the frame shift, and the number of dynamic features) are determined based on the calculated decay time boundary. The reverberation pattern can thus be favorably divided into the initial reflection component and the diffuse reverberation component.

The decay time boundary calculation unit may calculate the decay time boundary either as the time at which the reverberation pattern has decreased by a predetermined amount, or as the time at which the amount of change of the decay curve of the reverberation pattern falls below a predetermined threshold.

The processing device may further include a storage unit that stores in advance combinations of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features used in the acoustic model. In that case, the acoustic model parameter determination unit may select, for the decay time boundary Ta calculated by the decay time boundary calculation unit, a combination of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features that satisfies the following expression.
Ta ≈ Tf + N × (2 × Ts)

A speech recognition device according to the present invention measures a reverberation pattern from an impulse response, divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half, and recognizes input speech. The speech recognition device includes: a decay time boundary calculation unit that calculates a decay curve of the reverberation pattern and, based on the decay curve, calculates a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component; an acoustic model parameter determination unit that determines, based on the calculated decay time boundary, the analysis frame length, the frame shift, and the number of dynamic features used in the acoustic model; a reverberation component extraction unit that divides the measured reverberation pattern at the decay time boundary to extract the initial reflection component and the diffuse reverberation component; a learning unit that reflects the extracted initial reflection component in speech data for learning and learns the acoustic model based on the determined analysis frame length, frame shift, and number of dynamic features; and a recognition unit that removes the extracted diffuse reverberation component from the input speech and recognizes the input speech with reference to the learned acoustic model.

With this configuration, the decay time boundary for dividing the reverberation pattern is calculated according to how the measured reverberation pattern decays, and the acoustic model parameters (the analysis frame length, the frame shift, and the number of dynamic features) are determined based on the calculated decay time boundary, so the reverberation pattern can be favorably divided into the initial reflection component and the diffuse reverberation component. Reflecting the separated initial reflection component in the speech data for learning yields a better acoustic model, and removing the diffuse reverberation component from the input speech improves the recognition rate for the input speech.

A speech recognition system according to the present invention includes the speech recognition device described above and a microphone that receives sound generated in the environment and outputs a speech signal to the speech recognition device.

A speech recognition method according to the present invention measures a reverberation pattern from an impulse response, divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half, and recognizes input speech. The method includes the steps of: calculating a decay curve of the reverberation pattern and, based on the decay curve, calculating a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component; determining, based on the calculated decay time boundary, the analysis frame length, the frame shift, and the number of dynamic features used in the acoustic model; dividing the measured reverberation pattern at the decay time boundary to extract the initial reflection component and the diffuse reverberation component; reflecting the extracted initial reflection component in speech data for learning and learning the acoustic model based on the determined analysis frame length, frame shift, and number of dynamic features; and removing the extracted diffuse reverberation component from the input speech and recognizing the input speech with reference to the learned acoustic model.

With this method, the reverberation pattern can be favorably divided according to how the measured reverberation pattern decays. Determining the acoustic model parameters (the analysis frame length, the frame shift, and the number of dynamic features) based on the calculated decay time boundary gives parameters better suited to reflecting the separated initial reflection component in the speech data for learning. A better acoustic model can therefore be built, and removing the diffuse reverberation component from the input speech further improves the recognition rate for the input speech.

According to the present invention, it is possible to provide a processing device, a speech recognition device, a speech recognition system, and a speech recognition method that can favorably divide a reverberation pattern into an initial reflection component and a diffuse reverberation component for processing.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. For clarity, the following description and the drawings are abbreviated and simplified where appropriate. In the drawings, components and corresponding parts that have the same configuration or function are given the same reference numerals, and their description is not repeated.

Embodiment 1 of the Invention
The speech recognition system according to Embodiment 1 measures a reverberation pattern from an impulse response and, based on the decay curve of the measured reverberation pattern, calculates a decay time boundary indicating the temporal boundary between the initial reflection component, which is the first half of the measured reverberation pattern, and the diffuse reverberation component, which is the second half. The speech recognition system then determines, based on the calculated decay time boundary, the analysis frame length, the frame shift, and the number of dynamic features used in the acoustic model.

First, the characteristic components of the speech recognition system according to Embodiment 1 are described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the speech recognition system according to Embodiment 1. The speech recognition system includes a microphone 1 and a speech recognition device 2. The speech recognition device 2 includes a reverberation processing unit 3 that processes the reverberation pattern, a learning unit 4 that learns the acoustic model, and a recognition unit 5 that recognizes input speech. The reverberation component extraction unit 8, the learning unit 4, and the recognition unit 5 are described in detail later.

The microphone 1 is placed in the environment and receives sound generated in the environment. The microphone 1 thus picks up the speech of a speaker and outputs a speech signal corresponding to the received speech to the speech recognition device 2. The microphone 1 is installed, for example, in a room of a building, at a predetermined location in the environment.

The speech recognition device 2 performs data processing on the speech signal from the microphone 1 to carry out speech recognition. The speech recognition device 2 is a processing device that has a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), a communication interface, and the like, and performs the data processing required for speech recognition. The speech recognition device 2 also has a removable HDD, an optical disk, a magneto-optical disk, or the like, which stores various programs and control parameters and supplies those programs and data to a memory (not shown) as needed. For example, the speech recognition device 2 converts the signal from the microphone 1 into a digital signal and performs arithmetic processing on it. The speech recognition device 2 executes speech recognition processing according to a program stored in the ROM or the HDD. That is, the speech recognition device 2 stores a program for speech recognition, and under that program it performs various kinds of processing on the digital signal.

The reverberation processing unit 3 includes a decay time boundary calculation unit 6, an acoustic model parameter determination unit 7, and a reverberation component extraction unit 8. An impulse response is input to the reverberation processing unit 3. The reverberation processing unit 3 outputs the acoustic model parameters (the analysis frame length, the frame shift, and the number of dynamic features) to the learning unit 4. The reverberation processing unit 3 also extracts the initial reflection component and the diffuse reverberation component from the reverberation pattern and outputs them: the extracted initial reflection component is output to the learning unit 4, and the extracted diffuse reverberation component is output to the recognition unit 5.

The decay time boundary calculation unit 6 measures the reverberation pattern from the input impulse response. It then calculates the decay curve of the reverberation pattern and, based on the calculated decay curve, calculates the decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component.

The initial reflection component and the diffuse reverberation component contained in the reverberation pattern are now described with reference to FIGS. 2 and 3. FIG. 2 schematically shows how sound generated in a room is reflected. FIG. 3 shows an example of a signal detected by the microphone 1 installed in the environment. In FIG. 3, the horizontal axis represents time and the vertical axis represents signal power. FIG. 3 shows, in discrete form, the waveform of the signal measured when an impulse response is measured in the environment.

FIG. 2 shows an example in which the speech recognition system shown in FIG. 1 is mounted on a robot 44. As shown in FIG. 2, sound uttered by a speaker 45 in the room reaches the microphone 1 mounted on the robot 44 and is picked up. The uttered sound may propagate directly to the microphone 1, or it may be reflected by a wall surface 43 before reaching the microphone 1. Sound is reflected not only by the wall surface 43 but also by the ceiling, the floor, desks, and so on. Sound reflected by the wall surface 43 or the like arrives later than sound that reaches the microphone 1 directly. That is, the direct sound and the reflected sound are picked up by the microphone 1 at different times. Sound that has been reflected repeatedly is delayed even further. The time at which a sound is picked up thus depends on its propagation distance and similar factors.

When an impulse consisting of a single, very narrow pulse is generated in a room such as the one shown in FIG. 2, the measured signal has the waveform shown in FIG. 3. In the time response to the impulse, the direct sound, which reaches the microphone 1 without being reflected by the wall surface 43, is picked up at the earliest time (t = 0). The sound reflected by the wall surface 43 is picked up after the direct sound. Because some of its energy is absorbed by the wall surface 43 and other surfaces, the reflected sound has lower power than the direct sound. Sound that has been reflected repeatedly is then picked up as time passes.

The reverberation pattern of the impulse response is divided into an initial reflection component and a diffuse reverberation component. To do this, the reverberation component is measured from the impulse response, and the measured reverberation component is divided into the two parts: the first half of the reverberation pattern is the initial reflection component and the second half is the diffuse reverberation component. The diffuse reverberation component therefore follows the initial reflection component. The initial reflection component contains low-order reflections such as first-order and second-order reflections, and the diffuse reverberation component contains high-order reflections.

The temporal boundary that separates the initial reflection component from the diffuse reverberation component is called the decay time boundary. The component from the time at which the direct sound is picked up by the microphone 1 up to the decay time boundary is the initial reflection component, and the component from the decay time boundary onward is the diffuse reverberation component. For example, if the decay time boundary is 65 msec, the data at t = 0 is the direct sound, the data in the range from 0 to 65 msec (excluding t = 0 and t = 65) is the initial reflection component, and the data from 65 msec onward is the diffuse reverberation component.

As shown in FIG. 4, the decay time boundary calculation unit 6 can calculate, for example, the time t at which the amplitude level (power P) of the reverberation pattern has decreased by a predetermined amount x as the decay time boundary Ta. In the figure, Ta is the time t at which the power P(t) has decayed by x [dB] from the power P(0) at time 0 (that is, the time t satisfying P(0) - P(t) > x is treated as Ta). Here the predetermined amount x is, for example, 20 dB. The decay time boundary calculation unit 6 may calculate the decay time boundary from the decrease in the amplitude level (power P) of the reverberation pattern as described above, or it may instead use the time at which the amount of change of the decay curve of the reverberation pattern falls below a predetermined threshold.
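The level-drop criterion above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the decay curve is computed, so the sketch uses Schroeder backward integration of the squared impulse response, and the function names `decay_curve_db` and `decay_time_boundary` are hypothetical. The alternative slope-threshold criterion is not shown.

```python
import math


def decay_curve_db(impulse_response):
    """Energy remaining from each sample onward, in dB relative to the total.

    Assumption: the decay curve is obtained by Schroeder backward integration;
    the source text only says that a decay curve is calculated.
    """
    remaining = []
    total = 0.0
    for sample in reversed(impulse_response):
        total += sample * sample
        remaining.append(total)
    remaining.reverse()
    if not remaining or remaining[0] == 0.0:
        raise ValueError("impulse response has no energy")
    ref = remaining[0]
    return [10.0 * math.log10(r / ref) if r > 0.0 else float("-inf")
            for r in remaining]


def decay_time_boundary(impulse_response, sample_rate_hz, drop_db=20.0):
    """Return the first time in msec at which P(0) - P(t) > drop_db, else None."""
    curve = decay_curve_db(impulse_response)
    for n, level in enumerate(curve):
        if curve[0] - level > drop_db:
            return 1000.0 * n / sample_rate_hz
    return None


# Synthetic, exponentially decaying impulse response sampled at 1 kHz.
h = [math.exp(-n / 10.0) for n in range(1000)]
ta = decay_time_boundary(h, sample_rate_hz=1000, drop_db=20.0)
```

The default `drop_db=20.0` mirrors the example value x = 20 dB given in the text; a measured impulse response would replace the synthetic `h`.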

Based on the decay time boundary Ta calculated by the decay time boundary calculation unit 6, the acoustic model parameter determination unit 7 determines the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features Δ used in the acoustic model. More specifically, for the decay time boundary Ta calculated by the decay time boundary calculation unit 6, the acoustic model parameter determination unit 7 can determine these parameters by calculating a combination of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features that satisfies the following expression.
Ta ≈ Tf + N × (2 × Ts)

The acoustic model parameter determination unit 7 may calculate a combination of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features that satisfies the above expression, as just described. Alternatively, combinations of values of Tf, Ts, and N may be stored in advance, and the unit may select from the stored combinations one that satisfies the above expression. Examples of combinations {Tf, N, Ts} that can be stored in advance include {25, 2, 10}, {32, 3, 16}, and {20, 2, 5}.
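Selecting from the stored combinations can be sketched as below. Two points are assumptions rather than statements from the text: "approximately equal" is interpreted as the smallest absolute difference between Ta and Tf + N × (2 × Ts), and all times are taken to be in msec. The names `covered_span` and `select_parameters` are hypothetical.

```python
# Stored combinations in the order used by the text: (Tf, N, Ts).
CANDIDATES = [(25, 2, 10), (32, 3, 16), (20, 2, 5)]


def covered_span(tf, n, ts):
    """Right-hand side of the expression Ta ≈ Tf + N × (2 × Ts)."""
    return tf + n * (2 * ts)


def select_parameters(ta, candidates=CANDIDATES):
    """Pick the stored (Tf, N, Ts) whose span is closest to Ta.

    Assumption: closeness is measured by absolute difference.
    """
    return min(candidates, key=lambda c: abs(ta - covered_span(*c)))


tf, n, ts = select_parameters(65)  # Ta = 65 msec, as in the earlier example
```

For the three stored combinations the right-hand side evaluates to 65, 128, and 40 respectively, so a decay time boundary of 65 msec matches the first combination exactly.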

In this way, the decay time boundary for dividing the reverberation pattern is calculated according to how the measured reverberation pattern decays, and the acoustic model parameters (the analysis frame length, the frame shift, and the number of dynamic features) are determined based on the calculated decay time boundary. The reverberation pattern can therefore be favorably divided into the initial reflection component and the diffuse reverberation component.

The reverberation component extraction unit 8 divides the reverberation pattern into the initial reflection component and the diffuse reverberation component, using the decay time boundary Ta calculated in this way as the boundary, and extracts both components. The learning unit 4, described later, then convolves the extracted initial reflection component with the learning speech data. The learning unit 4 learns from the convolved data according to the acoustic model parameters (the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features) determined by the acoustic model parameter determination unit 7. The acoustic model is thereby learned so that it reflects the initial reflection component. Further, the recognition unit 5 removes the extracted diffuse reverberation component from the input speech and recognizes the input speech by referring to the acoustic model learned with the initial reflection component reflected.
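The split at Ta can be sketched as below, assuming the impulse response is a list of samples at a known sampling rate. The function name and the decision to zero-pad the late part so that it keeps its original delay are assumptions for illustration; the specification does not state how the late part is stored.

```python
def split_impulse_response(h, sample_rate_hz, ta_seconds):
    """Split impulse response h at the decay time boundary Ta.

    Returns (h_early, h_late). h_late is zero-padded at the front so that it
    stays time-aligned with h (an assumed convention).
    """
    boundary = int(round(ta_seconds * sample_rate_hz))
    boundary = max(0, min(boundary, len(h)))  # clamp to the valid range
    h_early = h[:boundary]
    h_late = [0.0] * boundary + h[boundary:]
    return h_early, h_late

h = [1.0, 0.5, 0.25, 0.125]
h_early, h_late = split_impulse_response(h, sample_rate_hz=1000, ta_seconds=0.002)
```

With this convention, padding `h_early` with zeros and adding it to `h_late` sample by sample recovers the original response.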

Next, the processing performed by the reverberation component extraction unit 8, the learning unit 4, and the recognition unit 5 is described in detail. When the speech recognition device 2 recognizes input speech, the recognition unit 5 removes the diffuse reverberation component separated by the reverberation processing unit 3 from the input speech uttered by the speaker, and recognizes the input speech by referring to the acoustic model learned in advance by the learning unit 4.

FIG. 5 is a block diagram showing the detailed configuration of the speech recognition system. The speech recognition system includes a microphone 1 and a speech recognition device 2. The speech recognition device 2 includes a reverberation component extraction unit 8, a learning unit 4, and a recognition unit 5. In FIG. 5, the reverberation processing unit 3, the decay time boundary calculation unit 6, and the acoustic model parameter determination unit 7 are omitted.

The reverberation component extraction unit 8 includes an initial reflection component extraction processing unit 11 and a diffuse reverberation component extraction processing unit 21. The learning unit 4 includes a convolution processing unit 12, a learning speech database 13, an acoustic model learning processing unit 14, and an acoustic model 15. The recognition unit 5 includes a spectrum conversion processing unit 22, a filter creation unit 23, a spectrum conversion processing unit 31, a spectrum subtraction processing unit 32, a speech recognition feature conversion unit 33, and a pattern matching processing unit 34.

The initial reflection component extraction processing unit 11, the convolution processing unit 12, the learning speech database 13, and the acoustic model learning processing unit 14 perform the processing for creating the acoustic model 15 required for speech recognition. An acoustic model 15 that reflects the initial reflection component of the reverberation pattern of the speech signal is thereby created. Here, a hidden Markov model (HMM) is used as the acoustic model 15. This processing is performed offline in advance; that is, the acoustic model 15 is created before the speech signal to be recognized is detected.

The diffuse reverberation component extraction processing unit 21, the spectrum conversion processing unit 22, and the filter creation unit 23 perform the processing for removing the diffuse reverberation component. A subtraction filter for subtracting the diffuse reverberation component is thereby created. This processing is performed offline in advance; that is, the subtraction filter is created before the speech signal to be recognized is detected.

The spectrum conversion processing unit 31, the spectrum subtraction processing unit 32, the speech recognition feature conversion unit 33, and the pattern matching processing unit 34 perform speech recognition processing on the input speech. The speech recognition processing uses the subtraction filter and the acoustic model 15 described above. These processes are performed online on the input speech, so that speech is recognized as it arrives.

Next, learning of the acoustic model 15 using the initial reflection component is described with reference to FIGS. 5 and 6. FIG. 6 is a diagram showing the learning flow of the acoustic model. The processing shown in FIG. 6 is performed offline; that is, the acoustic model 15 is created by the processing flow shown in FIG. 6 before the speech signal to be recognized is acquired.

As shown in FIG. 5, the initial reflection component extraction processing unit 11 extracts, from the impulse response input, the initial reflection component with the diffuse reverberation component removed. That is, as described above, the impulse response is measured with the microphone 1, and, of the reverberation components of the measured impulse response, the data before the decay time boundary is extracted as the initial reflection component. As shown in FIG. 6, the initial reflection component is denoted h_E. The convolution processing unit 12 performs convolution using the initial reflection component h_E.

The learning speech database 13 stores clean speech data for learning. For example, the learning speech database 13 stores speech data in units of phonemes as a database. This speech data was recorded in a place free of noise and reverberation and uses, for example, one hour of conversation as a corpus. Each phoneme contained in the corpus is given a label such as "あ" (a) or "い" (i). The learning speech database 13 thus stores clean speech data for the phonemes. The convolution processing unit 12 convolves the initial reflection component h_E with the clean speech data s stored in the learning speech database 13. This generates convolved data x_E in which the initial reflection component h_E is reflected. By convolving the initial reflection component h_E with each item of speech data s in units of phonemes, the convolved data x_E for each phoneme is calculated.
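The convolution of clean speech data s with the initial reflection component h_E can be sketched as a direct-form discrete convolution. This is only an illustration with short toy sequences; a practical system would likely use an FFT-based routine, and the function name `convolve` is hypothetical.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two sample sequences (direct form)."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s_val in enumerate(signal):
        for j, k_val in enumerate(kernel):
            out[i + j] += s_val * k_val
    return out

s = [1.0, 2.0, 3.0]    # toy stand-in for clean speech data s
h_e = [1.0, 0.5]       # toy stand-in for the initial reflection component h_E
x_e = convolve(s, h_e) # convolved data x_E
```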

The acoustic model learning processing unit 14 performs acoustic model learning based on the convolved data x_E, in which the initial reflection component is reflected. When the acoustic model 15 is an HMM, the acoustic model learning processing unit 14 performs HMM learning. The HMM learning is based on the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features Δ described above. More specifically, features are extracted from the convolved data x_E, and the features for each phoneme are stored as a database. That is, the feature vector for each phoneme serves as a template model. The feature vector is extracted, for example, for each analysis length.

Specifically, the convolved data x_E is converted into spectral data by FFT (fast Fourier transform) or the like. The spectral data is then log-transformed using a filter matched to human auditory characteristics, and further converted into time data by IFFT (inverse fast Fourier transform). The mel cepstrum is obtained in this way. In the mel cepstrum domain, the spectral envelope appears in the low-order terms and fine oscillations appear in the high-order terms. The low-order part is then taken out to calculate the MFCCs. Here, 12-dimensional MFCCs are calculated. In addition, their first-order differences and the first-order difference of the power are extracted as features. In this case, the feature vector has 25 dimensions (12 + 12 + 1). Of course, the processing for extracting the features is not limited to this.
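How the 25-dimensional vector (12 + 12 + 1) might be assembled is sketched below. The text says only that first-order differences are used; the central difference over the neighbouring frames, the edge handling, and the names `delta` and `build_features` are assumptions for illustration.

```python
def delta(frames):
    """First-order difference across neighbouring frames.

    Uses (next - previous) / 2 and repeats the edge frames; both choices are
    assumptions, not taken from the specification.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def build_features(mfcc, log_power):
    """Concatenate 12 MFCCs, their 12 differences, and 1 power difference."""
    d_mfcc = delta(mfcc)
    d_pow = delta([[p] for p in log_power])
    return [c + dc + dp for c, dc, dp in zip(mfcc, d_mfcc, d_pow)]

# Toy input: 4 frames of 12 coefficients plus one log-power value per frame.
mfcc = [[float(t)] * 12 for t in range(4)]
log_power = [0.0, 2.0, 4.0, 6.0]
features = build_features(mfcc, log_power)
```

A difference that looks one frame back and one frame ahead spans two frame shifts, which is consistent with the 2 × Ts term in the expression for Ta.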

Learning is then performed using the MFCC data. Because the processing is applied to the speech data s contained in a large corpus, the features for each phoneme have a mean and a variance. The acoustic model 15 holds these mean and variance values. The acoustic model learning processing unit 14 determines the state transition probabilities, output probabilities, and other parameters of the HMM according to the mean and variance of the features. The acoustic model learning processing unit 14 learns the HMM using, for example, the EM algorithm. Of course, a known algorithm other than the EM algorithm may be used. The acoustic model 15 is learned in this way.

The acoustic model 15 learned by the acoustic model learning processing unit 14 is stored as a database. This acoustic model 15 takes the initial reflection component of the reverberation pattern into account. That is, the initial reflection component is modeled and estimated by the HMM. An acoustic model 15 that has already learned the initial reflection component is thereby constructed. Using this acoustic model 15 reduces the influence of the initial reflection component contained in the speech signal and improves the recognition rate.

Next, the filter creation processing using the diffuse reverberation component is described with reference to FIGS. 5, 7, and 8. FIG. 7 is a conceptual diagram explaining the approximate calculation used to create the filter. FIG. 8 is a diagram showing the processing flow of filter creation.

As shown in FIG. 5, the diffuse reverberation component extraction processing unit 21 performs diffuse reverberation component extraction on the impulse response input. The diffuse reverberation component, that is, the reverberation pattern of the impulse response with the initial reflection component removed, is thereby extracted. In other words, of the reverberation components of the impulse response measured with the microphone 1, the data after the decay time boundary is extracted as the diffuse reverberation component. The spectrum conversion processing unit 22 converts the time data of the impulse response into spectral data. That is, the time-domain data of the diffuse reverberation component is converted into frequency-domain data, here by a Fourier transform such as FFT (fast Fourier transform). Before the conversion into spectral data, the spectrum conversion processing unit 22 performs framing according to the analysis frame length and frame shift described above.

The filter creation unit 23 uses the data of the diffuse reverberation component to create a subtraction filter for removing diffuse reverberation. First, the approximate calculation used to create the filter is described with reference to FIG. 7. FIG. 7 shows the online processing for speech recognition.

As shown in FIG. 7, the speech signal produced by the speaker is denoted as the input x, and the diffuse reverberation component of the impulse response is denoted as the rear impulse response h_L. Spectral subtraction is performed to remove from the input x the rear diffuse reverberation x_L contained in it. After the spectral subtraction, the data is converted into features, and speech recognition is performed by pattern matching.

However, the rear diffuse reverberation x_L for the input x cannot be observed directly; that is, it is impossible to observe x_L alone. The rear diffuse reverberation x_L is therefore approximated using the rear impulse response h_L observed in advance. If x'_L (= x * h_L) can be made to approximate x_L, the spectral component of the diffuse reverberation can be subtracted. Accordingly, a filter is created such that the input x convolved with the rear impulse response approximates the rear diffuse reverberation x_L.

The offline processing for this approximation is described with reference to FIG. 8. Here, the impulse response is measured, and the filter δ is created from the clean learning speech data s. The rear impulse response h_L(t) is convolved with the speech data s stored in the learning speech database 13. This produces the rear diffuse reverberation x_L. The impulse response h is also convolved with the speech data s stored in the learning speech database 13; that is, the entire impulse response h is convolved with the speech data s. This generates the input x that would be obtained when the clean speech is uttered. Further, the rear impulse response h_L(t) is convolved with the input x, which yields x'_L. In other words, after the impulse response h is convolved with the speech data s, the rear impulse response h_L(t) is further convolved with the result. This rear impulse response h_L(t) is the same as the one convolved with the clean speech data.
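The three signals built in this offline step can be summarized compactly. The block below only restates the paragraph above in formula form, where * denotes convolution and x'_L is the quantity defined earlier as x * h_L:

```latex
\begin{aligned}
x_L  &= s * h_L \\
x    &= s * h \\
x'_L &= x * h_L = (s * h) * h_L
\end{aligned}
```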

The above processing is performed for each item of speech data s contained in the learning speech database 13. A filter δ is then estimated such that the calculated rear diffuse reverberation x_L and x'_L become close; that is, a coefficient satisfying x_L ≈ δx'_L is calculated. Here, the filter δ is estimated by a least-squares error calculation, in which the error function between x_L and δx'_L is minimized. This gives the δ for which δx'_L is closest to x_L. The optimum coefficient differs from one frequency band to another, so the filter δ is estimated separately for each frequency band. As shown in the upper right of FIG. 8, the optimum coefficient is calculated for each frequency band. Specifically, a 12-dimensional filter δ (δ_1, δ_2, δ_3, δ_4, ..., δ_12) is estimated. Performing spectral subtraction with this filter δ removes the diffuse reverberation component from the speech signal. That is, the filter δ is a subtraction filter capable of subtracting the diffuse reverberation component.
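A per-band least-squares fit of this kind has a closed form: minimizing the sum over frames of (x_L − δ·x'_L)² for one band gives δ = Σ x_L·x'_L / Σ x'_L². The sketch below assumes that x_L and x'_L are available as per-frame lists of band magnitudes; that data layout and the name `estimate_filter` are hypothetical, and the specification does not spell out the exact error function.

```python
def estimate_filter(x_late, x_approx):
    """Least-squares scale factor per frequency band.

    x_late:   frames of band magnitudes for x_L
    x_approx: frames of band magnitudes for x'_L (same shape as x_late)
    Returns one coefficient per band.
    """
    n_bands = len(x_late[0])
    coeffs = []
    for k in range(n_bands):
        num = sum(a[k] * b[k] for a, b in zip(x_late, x_approx))
        den = sum(b[k] * b[k] for b in x_approx)
        coeffs.append(num / den if den > 0.0 else 0.0)
    return coeffs

# Toy data with two bands and two frames; x_L is an exact multiple of x'_L.
x_approx = [[1.0, 2.0], [3.0, 4.0]]
x_late = [[0.5, 4.0], [1.5, 8.0]]
filter_delta = estimate_filter(x_late, x_approx)
```

In the 12-band case described in the text, the same function would return twelve coefficients.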

Next, the online speech recognition processing is described with reference to FIGS. 5 and 9. FIG. 9 is a diagram showing the processing flow of speech recognition. First, the input speech detected by the microphone 1 is input to the speech recognition device 2. In FIG. 9, the input speech is denoted as the input x. The spectrum conversion processing unit 31 converts the input x into spectral data; that is, time-domain data is converted into frequency-domain data by FFT or the like. Before the conversion into spectral data, the spectrum conversion processing unit 31 performs framing according to the analysis frame length and frame shift described above.
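Framing by analysis frame length and frame shift can be sketched as below. Lengths are given in samples, and trailing samples that do not fill a whole frame are dropped; both points are assumptions for illustration, as is the name `frame_signal`.

```python
def frame_signal(samples, frame_len, frame_shift):
    """Cut a sample sequence into overlapping frames.

    frame_len and frame_shift are in samples; an incomplete last frame is
    discarded (assumed behaviour).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

frames = frame_signal(list(range(10)), frame_len=4, frame_shift=2)
```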

The spectrum subtraction processing unit 32 uses the filter δ to subtract the diffuse reverberation component from the spectral data. Performing spectral subtraction with the filter δ in this way removes the influence of the diffuse reverberation component from the speech signal. Speech is then recognized as follows, based on the subtracted data from which the spectrum of the diffuse reverberation component has been removed.
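One possible form of the subtraction for a single frame is sketched below. It assumes a per-band spectrum of the input and a per-band estimate of x'_L are already available, and it clamps the result at a floor so that no band goes negative. The flooring step is a common practice in spectral subtraction and is an assumption here, not something stated in the specification; the function and parameter names are hypothetical.

```python
def spectral_subtract(spectrum, late_estimate, filter_delta, floor=0.0):
    """Subtract delta-weighted late reverberation from one frame, band by band.

    The result for each band is max(x - delta * x'_L, floor * x).
    """
    return [
        max(x - d * late, floor * x)
        for x, late, d in zip(spectrum, late_estimate, filter_delta)
    ]

cleaned = spectral_subtract([1.0, 2.0], [0.5, 8.0], [0.5, 0.5])
```

In the second band the weighted estimate exceeds the observed value, so the output is clamped to the floor instead of becoming negative.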

The speech recognition feature conversion unit 33 converts the spectral data into features for speech recognition. It extracts the features based on the subtracted data from which the diffuse reverberation component has been removed. As the features, for example, 12-dimensional mel-frequency cepstrum coefficients (MFCC: Mel Frequency Cepstrum Coefficient) can be used. For this purpose, filter bank analysis with a mel filter is performed. A logarithmic transformation and then a discrete cosine transform (DCT) are applied to calculate the MFCCs. Here, as described above, a 25-dimensional feature vector including the first-order differences of the MFCCs and the first-order difference of the power is calculated.
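The last two steps of this chain (logarithm, then DCT) can be sketched as follows, taking the mel filter bank energies as given. The unnormalized DCT-II, the small constant that guards the logarithm, and the omission of the zeroth coefficient are assumptions for illustration; the specification states only that 12 coefficients are used.

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, returning coefficients 0 .. n_coeffs - 1."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]

def mfcc_from_filterbank(energies, n_coeffs=12):
    """Log-compress mel filter bank energies and apply the DCT.

    Coefficient 0 is dropped and coefficients 1 .. n_coeffs are kept
    (an assumed convention).
    """
    logs = [math.log(max(e, 1e-10)) for e in energies]
    return dct_ii(logs, n_coeffs + 1)[1:]

# A flat filter bank output has no spectral shape, so every kept coefficient
# should be numerically close to zero.
coeffs = mfcc_from_filterbank([2.0] * 24)
```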

Using MFCCs as the features for speech recognition further improves the recognition rate. Nonlinear processing such as spectral subtraction causes distortion when the data is converted back into a speech signal, but poses no problem at all when the data is converted into MFCCs. Because the spectral data from which the diffuse reverberation component has been removed is converted directly into MFCCs without being returned to a speech signal, distortion can be prevented.

The pattern matching processing unit 34 then performs pattern matching using the feature vectors of the acoustic model 15. The phoneme whose pattern is closest to the feature vector of the detected speech signal is thereby recognized. That is, the pattern matching processing unit 34 serves as a recognition processing unit that performs speech recognition by referring to the acoustic model 15.

As described above, the acoustic model 15 used when executing speech recognition reflects the initial reflection component, so a better acoustic model 15 can be constructed. Learning uses the initial reflection component, from which the diffuse reverberation component (higher-order reflection component) whose influence extends beyond the analysis length used for learning has been removed, so accurate phoneme learning can be performed. Because the influence of the initial reflection component can be absorbed by HMM learning, the speech recognition rate can be improved.

Further, the diffuse reverberation component is used for the spectral subtraction filter δ. The diffuse reverberation component of the input speech can therefore be removed. This reduces the influence of the diffuse reverberation component and improves the speech recognition rate.

In the present embodiment, the impulse response is measured in the same environment as that in which the speech signals to be recognized are actually acquired, and the initial reflection component and the diffuse reverberation component are extracted from the reverberation pattern of the measured impulse response. Here, the impulse response is measured in the room where the microphone 1 is installed. The reverberation of the room and the shape of the surroundings of the microphone can be regarded as almost unchanged unless there is a large change such as moving to another room. Therefore, as long as the environment is the same, the diffuse reverberation component can be regarded as almost constant regardless of the direct sound. That is, the diffuse reverberation component is almost constant regardless of the uttered speech. Once the way the microphone is installed has been decided, measuring the reverberation for the impulse response of the room just once makes it possible to separate and estimate the initial reflection component and the diffuse reverberation component.

That is, the impulse response is measured in the environment in advance, and the initial reflection component and the diffuse reverberation component are extracted. The acoustic model 15 reflecting the initial reflection component and the filter δ created from the diffuse reverberation component are then used repeatedly for speech recognition in that environment. In other words, the same filter δ and acoustic model 15 are used for speech signals detected in the same environment. Because the impulse response needs to be measured only once in advance, learning the acoustic model 15 and creating the filter δ are simple. In addition, because the acoustic model 15 and the filter δ are created in advance, the amount of online processing can be reduced. Speech recognition with a high recognition rate can thus be performed with simple processing.

When the environment changes, for example because the speaker 5 moves to another room, the impulse response is measured once in the new environment. The acoustic model 15 is then learned and the filter δ is created by the same processing. Performing model learning and filter creation for each environment improves the recognition rate. Likewise, when the microphone 1 is replaced, the impulse response is measured with the new microphone 1 and the same processing is performed. Of course, the environment is not limited to a room and may be the inside of a vehicle or outdoors. For example, the speech recognition system may be installed in a car navigation system.

The acoustic model 15 may be an acoustic model other than an HMM. That is, the initial reflection component may be used for learning an acoustic model 15 other than an HMM. In addition, because reverberation can be removed with a single microphone 1, the system configuration can be simplified.

Furthermore, the individual processes may be performed by different computers. For example, the computer that performs acoustic model learning and filter creation may be physically separate from the computer that performs speech recognition. In this case, the online processing and the offline processing are performed by different devices.

Specifically, the acoustic model 15 and the filter δ are created in advance by a processing device that includes the initial reflection component extraction processing unit 11, the convolution processing unit 12, the learning speech database 13, the acoustic model learning processing unit 14, the diffuse reverberation component extraction processing unit 21, the spectrum conversion processing unit 22, and the filter creation unit 23. The created acoustic model 15 and filter δ are then stored in advance in a speech recognition device that includes the spectrum conversion processing unit 31, the spectrum subtraction processing unit 32, the speech recognition feature conversion unit 33, and the pattern matching processing unit 34. A speech signal is detected by the microphone 1 connected to the speech recognition device 2, and the above processing is performed on that speech signal. In this way as well, speech recognition with a high recognition rate can be performed easily. Alternatively, the computer that performs speech recognition may carry out its processing by referring to the acoustic model 15 and the filter δ stored in another computer, such as the processing device.

Furthermore, the computer that performs acoustic model learning and the computer that performs filter creation may be physically separate. Different impulse response measurements may also be used for filter creation and for acoustic model learning. That is, the initial reflection component and the diffuse reverberation component may be extracted from different impulse response measurements. For example, the impulse response may be measured twice, with the initial reflection component extracted from one measurement and the diffuse reverberation component extracted from the other. Installing the above speech recognition system in a voice-response robot enables accurate voice responses. When a speech signal of continuous speech is input, a language model may additionally be used to recognize the speech.

FIG. 1 is a diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram showing how sound generated in an environment is reflected.
FIG. 3 is a diagram schematically showing a speech signal detected by the speech recognition system according to the embodiment of the present invention.
FIG. 4 is a diagram schematically showing a speech signal detected by the speech recognition system according to the embodiment of the present invention.
FIG. 5 is a diagram showing the detailed configuration of the speech recognition system according to the embodiment of the present invention.
FIG. 6 is a diagram showing the learning processing flow in the speech recognition system according to the embodiment of the present invention.
FIG. 7 is a diagram showing the approximate calculation of the filter creation processing in the speech recognition system according to the embodiment of the present invention.
FIG. 8 is a diagram showing the processing flow of filter creation in the speech recognition system according to the embodiment of the present invention.
FIG. 9 is a diagram showing the processing flow in the speech recognition system according to the embodiment of the present invention.

Explanation of symbols

1 microphone, 2 speech recognition device,
3 reverberation processing unit, 4 learning unit, 5 recognition unit,
6 decay time boundary calculation unit, 7 acoustic model parameter determination unit,
8 reverberation component extraction unit, 11 initial reflection component extraction processing unit,
12 convolution processing unit, 13 learning speech database,
14 acoustic model learning processing unit, 15 acoustic model database,
21 diffuse reverberation component extraction processing unit, 22 spectrum conversion processing unit,
23 filter creation unit, 31 spectrum conversion processing unit,
32 spectrum subtraction processing unit, 33 speech recognition feature conversion unit,
34 pattern matching processing unit

Claims

A processing device that measures a reverberation pattern from an impulse response and divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half of the reverberation pattern, in order to perform speech recognition on input speech, the processing device comprising:
a decay time boundary calculation unit that calculates a decay curve of the reverberation pattern and, based on the decay curve, calculates a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component; and
an acoustic model parameter determination unit that determines, based on the calculated decay time boundary, an analysis frame length, a frame shift, and the number of dynamic features used in the acoustic model.

The processing device according to claim 1, wherein the decay time boundary calculation unit sets, as the decay time boundary, the time at which the reverberation pattern has decreased by a predetermined amount.
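
As a rough illustration of this criterion, the Python sketch below computes a decay curve and returns the first time at which it has dropped by a fixed amount. The claim does not state how the decay curve is obtained; backward integration of the squared impulse response (Schroeder integration) is used here purely as an assumption. The function names, the dB scale, and the 10 dB default are likewise hypothetical.

```python
import math


def decay_curve_db(impulse_response):
    """Energy decay curve in dB, normalized to 0 dB at time zero.

    Assumption: the curve is the backward-integrated energy of the
    impulse response (Schroeder integration); the source does not
    specify the method.
    """
    remaining = []
    total = 0.0
    # Accumulate squared samples from the tail towards the start.
    for sample in reversed(impulse_response):
        total += sample * sample
        remaining.append(total)
    remaining.reverse()
    if remaining[0] <= 0.0:
        raise ValueError("impulse response has no energy")
    floor = 1e-12  # avoids log10(0) at the very end of the tail
    return [10.0 * math.log10(max(r / remaining[0], floor)) for r in remaining]


def boundary_by_drop(curve_db, sample_rate, drop_db=10.0):
    """Time in seconds at which the curve first falls drop_db below its start.

    Returns None if the curve never decays that far.
    """
    for index, level in enumerate(curve_db):
        if level <= -drop_db:
            return index / sample_rate
    return None


# Synthetic exponentially decaying impulse response (illustrative only).
SAMPLE_RATE = 1000
h = [math.exp(-n / 100.0) for n in range(2000)]
ta = boundary_by_drop(decay_curve_db(h), SAMPLE_RATE)
```

Here `ta` plays the role of the decay time boundary Ta; in practice the drop amount would be tuned to the room rather than fixed at the placeholder value used above.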

The processing device according to claim 1, wherein the decay time boundary calculation unit sets, as the decay time boundary, the time at which the amount of change in the decay curve of the reverberation pattern falls below a predetermined threshold.
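
The change-based criterion can be sketched in Python as follows. This is only one possible reading: the "amount of change" is taken here to be the absolute change of the decay curve, in dB, over a fixed number of samples. The function name, the window length `step`, and the default threshold are hypothetical and do not come from the source.

```python
def boundary_by_slope(curve_db, sample_rate, step=10, threshold_db=2.0):
    """First time (seconds) at which the decay curve flattens out.

    The curve is considered flat once it changes by less than
    threshold_db over `step` samples. Returns None if that never happens.
    """
    for index in range(len(curve_db) - step):
        change = abs(curve_db[index + step] - curve_db[index])
        if change < threshold_db:
            return index / sample_rate
    return None


# Synthetic decay curve: a steep early part followed by a shallow tail.
steep = [-1.0 * i for i in range(50)]
shallow = [-50.0 - 0.1 * (i - 50) for i in range(50, 200)]
ta = boundary_by_slope(steep + shallow, sample_rate=1000)
```

With a curve of this shape, the returned time falls near the knee where the steep early decay gives way to the shallow tail, which is the intuition behind using the change in slope as the boundary.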

The processing device according to any one of claims 1 to 3, further comprising a storage unit that stores in advance combinations of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features used in the acoustic model,
wherein the acoustic model parameter determination unit selects, from the stored combinations, a combination of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features that satisfies the following expression with respect to the decay time boundary Ta calculated by the decay time boundary calculation unit:
Ta ≈ Tf + N × (2 × Ts)
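
A minimal Python sketch of this selection step is given below. The claim only requires that Ta ≈ Tf + N × (2 × Ts) hold; interpreting "≈" as "smallest absolute difference" is an assumption, and the candidate table, its millisecond values, and the function names are hypothetical.

```python
def covered_span(tf, ts, n):
    """Time span covered by one feature vector: Tf + N * (2 * Ts)."""
    return tf + n * (2 * ts)


def select_parameters(ta, candidates):
    """Pick the stored (Tf, Ts, N) whose covered span is closest to Ta.

    Assumption: "closest" means the smallest absolute difference.
    """
    return min(candidates, key=lambda c: abs(ta - covered_span(*c)))


# Hypothetical pre-stored combinations (Tf in ms, Ts in ms, N).
CANDIDATES = [
    (25, 10, 1),  # span 45 ms
    (25, 10, 2),  # span 65 ms
    (32, 8, 2),   # span 64 ms
    (20, 10, 3),  # span 80 ms
]

best = select_parameters(70, CANDIDATES)
```

For a decay time boundary of 70 ms, this table yields the combination whose span is 65 ms. A real system might also cap the allowed mismatch and fall back to a default when no stored combination is close enough.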

A speech recognition device that measures a reverberation pattern from an impulse response and divides the measured reverberation pattern into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half of the reverberation pattern, in order to recognize input speech, the speech recognition device comprising:
a decay time boundary calculation unit that calculates a decay curve of the reverberation pattern and, based on the decay curve, calculates a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component;
an acoustic model parameter determination unit that determines, based on the calculated decay time boundary, an analysis frame length, a frame shift, and the number of dynamic features used in the acoustic model;
a reverberation component extraction unit that divides the measured reverberation pattern at the decay time boundary to extract the initial reflection component and the diffuse reverberation component;
a learning unit that reflects the extracted initial reflection component in learning speech data and learns an acoustic model based on the determined analysis frame length, frame shift, and number of dynamic features; and
a recognition unit that removes the extracted diffuse reverberation component from the input speech and recognizes the input speech with reference to the learned acoustic model.
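
The extraction and learning-side steps can be pictured with the short Python sketch below. It splits an impulse response at the decay time boundary and reflects the early part in clean speech by convolution, in line with the convolution processing unit 12 listed in the explanation of symbols. The function names, the toy signals, and the direct (non-FFT) convolution are illustrative assumptions, not the patented implementation.

```python
def split_impulse_response(impulse_response, sample_rate, boundary_s):
    """Split at the decay time boundary into (early, late) parts."""
    cut = int(round(boundary_s * sample_rate))
    cut = max(0, min(cut, len(impulse_response)))
    return impulse_response[:cut], impulse_response[cut:]


def convolve(signal, kernel):
    """Direct linear convolution; adequate for short illustrative signals."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Toy impulse response and "clean speech" samples (hypothetical values).
h = [1.0, 0.5, 0.25, 0.125]
early, late = split_impulse_response(h, sample_rate=1000, boundary_s=0.002)

# Learning data with the initial reflections applied.
clean = [1.0, 2.0]
training_sample = convolve(clean, early)
```

The `early` part shapes the training data, while the `late` part is what the recognition side would try to remove from the input speech. For real recordings an FFT-based convolution would be the more practical choice.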

A speech recognition system comprising:
the speech recognition device according to claim 5; and
a microphone that receives sound generated in the environment and outputs a speech signal to the speech recognition device.

A speech recognition method in which a reverberation pattern is measured from an impulse response and the measured reverberation pattern is divided into an initial reflection component, which is the first half of the reverberation pattern, and a diffuse reverberation component, which is the second half of the reverberation pattern, in order to recognize input speech, the method comprising:
calculating a decay curve of the reverberation pattern and, based on the decay curve, calculating a decay time boundary indicating the temporal boundary between the initial reflection component and the diffuse reverberation component;
determining, based on the calculated decay time boundary, an analysis frame length, a frame shift, and the number of dynamic features to be used in the acoustic model;
dividing the measured reverberation pattern at the decay time boundary to extract the initial reflection component and the diffuse reverberation component;
reflecting the extracted initial reflection component in learning speech data, and learning an acoustic model based on the determined analysis frame length, frame shift, and number of dynamic features; and
removing the extracted diffuse reverberation component from the input speech and recognizing the input speech with reference to the learned acoustic model.
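
The final step, removing the diffuse reverberation component from the input speech, might be sketched as a per-bin subtraction of power spectra, in line with the spectrum subtraction processing unit 32 listed in the explanation of symbols. The Python sketch below assumes that a power spectrum of the diffuse reverberation has already been estimated for the frame; how that estimate is produced is outside this sketch. The flooring rule, the parameter `floor_ratio`, and the function name are hypothetical.

```python
def spectral_subtract(observed_power, reverb_power, floor_ratio=0.01):
    """Subtract the estimated diffuse-reverberation power in each bin.

    Bins that would drop below floor_ratio times the observed power are
    clamped to that floor so that no bin becomes negative.
    """
    if len(observed_power) != len(reverb_power):
        raise ValueError("spectra must have the same number of bins")
    return [
        max(obs - rev, floor_ratio * obs)
        for obs, rev in zip(observed_power, reverb_power)
    ]


# One frame of hypothetical power spectra (three frequency bins).
cleaned = spectral_subtract([4.0, 1.0, 2.0], [1.0, 2.0, 0.5])
```

The floor matters in the second bin, where the estimated reverberation exceeds the observed power and a plain subtraction would give a negative value.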

The speech recognition method according to claim 7, wherein a decay curve of the reverberation pattern is calculated, and the time at which the calculated decay curve has decreased by a predetermined amount is set as the decay time boundary.

The speech recognition method according to claim 7 or 8, wherein a decay curve of the reverberation pattern is calculated, and the time at which the amount of change in the calculated decay curve falls below a predetermined threshold is set as the decay time boundary.

The speech recognition method according to claim 7, wherein the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features are determined by selecting, from combinations of values of the analysis frame length Tf, the frame shift Ts, and the number N of dynamic features stored in advance, a combination that satisfies the following expression with respect to the calculated decay time boundary Ta:
Ta ≈ Tf + N × (2 × Ts)