JP6067760B2

JP6067760B2 - Parameter determining apparatus, parameter determining method, and program

Info

Publication number: JP6067760B2
Application number: JP2015014188A
Authority: JP
Inventors: 智子川瀬; 和則小林; 仲大室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-01-28
Filing date: 2015-01-28
Publication date: 2017-01-25
Anticipated expiration: 2035-01-28
Also published as: JP2016139025A

Description

The present invention relates to speech recognition technology, and more particularly to a technique for determining the parameter set used in preprocessing for speech recognition.

When the sound collection environment has a strong influence, for example through noise, reverberation, speech level, or microphone distortion, it becomes difficult to capture speech clearly. When speech recognition is performed in such conditions, it is effective to apply preprocessing such as speech enhancement to the input acoustic signal.

In single-channel speech enhancement, one known method divides the input signal into a plurality of bands and reduces noise based on the proportion of noise in the signal of each band (see Patent Document 1). For sound collection with a microphone array, there is a method that applies speech enhancement by post-filtering based on a Wiener filter after beamforming (see Patent Document 2 and Non-Patent Document 1).

Patent Document 1: JP-A-9-258792
Patent Document 2: JP 2007-336232 A

Non-Patent Document 1: K. Niwa, Y. Hioka, K. Kobayashi, “Post-filter design for speech enhancement in various noisy environments”, 14th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 35-39, 2014.

However, the parameter sets used for preprocessing such as speech enhancement are set to fixed values. Elements of such a parameter set include, for example, the averaging time used when time-averaging the signal power level, the weighting coefficient used in that time average, the smoothing coefficient used for time smoothing of the acoustic signal level when calculating the noise level, the rise coefficient of the estimated noise level during dip hold, and the processing strength coefficient of noise suppression. Consequently, when the technique is used where the sound collection environment fluctuates, the fixed values differ from the parameter set values that are optimal for the environment, and speech recognition accuracy degrades.

In view of the above, an object of the present invention is to provide a parameter determination technique that can select an optimal preprocessing parameter set in accordance with changes in the sound collection environment.

To solve the above problem, the parameter determination device of the present invention includes: a parameter set storage unit that stores a plurality of preprocessing parameter sets; a recognition result storage unit that stores recognition results obtained by performing speech recognition on a plurality of acoustic signals using each of the plurality of preprocessing parameter sets; a noise level estimation unit that estimates a noise level for each band from an acoustic signal and generates noise level information; a boundary surface optimization unit that divides the plurality of acoustic signals into groups based on the noise level information and optimizes the boundary surface between the groups so that the recognition accuracy calculated for each group from the recognition results is maximized; and a parameter set selection unit that selects, from the plurality of preprocessing parameter sets, the optimal preprocessing parameter set that maximizes the recognition accuracy for each group.

According to the present invention, when speech recognition is performed in a sound collection environment where the noise level varies, an optimal parameter set can be selected based on the noise level in each band of the acoustic signal. This prevents preprocessing from being performed with a parameter set that does not suit the sound collection environment, and makes it possible to output the processed signal that is best for speech recognition within the range of the parameter sets prepared in advance. In addition, once a large number of candidate parameter sets have been prepared for learning, the optimal parameter set values are selected automatically according to the noise level, so the cost of parameter tuning can be reduced.

FIG. 1 is a diagram illustrating the functional configuration of the parameter determination device.
FIG. 2 is a diagram illustrating the processing flow of the parameter determination method.
FIG. 3 is a diagram illustrating the functional configuration of the noise level estimation unit.
FIG. 4 is a diagram illustrating the processing flow of group boundary surface optimization.
FIG. 5 is a diagram illustrating initial values of the group boundary surface.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

The parameter determination device and method of the embodiment estimate the noise level for each band from the input acoustic signal and perform parameter adaptation for multichannel noise suppression. In parameter adaptation, acoustic signals are classified into a plurality of groups according to the range of their noise levels, and an optimal parameter set is selected for each group. The boundary surfaces used for grouping and the parameter set values for each group are prepared by prior learning.

Speech is recorded in noisy environments with one utterance per file, and the resulting data set is used as learning data. The learning data is subjected to multichannel noise suppression using parameter sets with various values, speech recognition is then performed, and the recognition accuracy J is evaluated. The parameter set values are chosen by assuming various sound collection environments and setting values suited to each. The recognition accuracy J is the character accuracy and is calculated by Equation (1).

Here, n is the number of sentences in the learning data, C is the number of characters in the correct character string, S is the number of substitution error characters, D is the number of deletion error characters, and I is the number of insertion error characters. A substitution error is a recognition error in which a word or syllable is recognized as a different one. A deletion error is a recognition error in which something that was actually spoken is not recognized. An insertion error is a recognition error in which a word or syllable that was not actually spoken appears in the recognition result. C is a value intrinsic to the learning data and depends neither on the recognition result nor on the parameters. S, D, and I, on the other hand, vary with the recognition result and may therefore change when the parameters change. The acoustic likelihood of speech recognition may be used in place of J.
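The character accuracy described above can be sketched as follows. Equation (1) itself is not reproduced in this text, so the form used here, J = (ΣC − ΣS − ΣD − ΣI) / ΣC summed over the n sentences, is an assumption based on the definitions of C, S, D, and I; the function name and the tuple layout are hypothetical.

```python
def character_accuracy(utterances):
    """Character accuracy J over a set of recognized sentences.

    `utterances` is a list of (C, S, D, I) tuples, one per sentence.
    Assumed form of Equation (1): (sum C - sum S - sum D - sum I) / sum C.
    """
    total_chars = sum(c for c, _, _, _ in utterances)
    total_errors = sum(s + d + i for _, s, d, i in utterances)
    return (total_chars - total_errors) / total_chars


# Two sentences: 10 characters with 1 substitution, 20 with 2 deletions and 1 insertion.
corpus = [(10, 1, 0, 0), (20, 0, 2, 1)]
print(character_accuracy(corpus))
```

Because insertions are subtracted, this quantity can become negative when a recognizer inserts many spurious characters.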

The noise level for each band is estimated for each file of the learning data and used as the input for grouping the learning data. The values of S, D, and I from the recognition results are also stored as additional information for each item of learning data. The group boundary surface is determined in the space of per-band noise levels so as to maximize the recognition accuracy J over the entire learning data set.

As shown in FIG. 1, the parameter determination device of the embodiment includes, for example, a parameter set storage unit 1, a recognition result storage unit 2, an FFT unit 3, a noise level estimation unit 4, a boundary surface optimization unit 5, a parameter set selection unit 6, a sound collection processing unit 7, and a speech recognition unit 8.

The parameter determination device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The parameter determination device executes each process under the control of the central processing unit, for example. Data input to the parameter determination device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed and used in other processes. At least part of each processing unit of the parameter determination device may be implemented in hardware such as an integrated circuit.

Each storage unit of the parameter determination device can be implemented by, for example, a main storage device such as RAM (Random Access Memory), an auxiliary storage device such as a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store. The storage units only need to be logically separated from one another and may reside on a single physical storage device.

The parameter set storage unit 1 stores a plurality of preprocessing parameter sets in which various values are set, assuming a variety of sound collection environments. In this embodiment, L (≥ 2) different parameter sets P₁, …, P_L are assumed to have been created and stored.

The recognition result storage unit 2 stores the recognition results obtained by performing speech recognition on each item of learning data using each of the preprocessing parameter sets P₁, …, P_L stored in the parameter set storage unit 1. For each recognition result, the number of substitution error characters S, the number of deletion error characters D, and the number of insertion error characters I are calculated in advance and stored in association with the result as additional information.

The processing procedure of the parameter determination method of the embodiment is described with reference to FIG. 2.

In step S1, the FFT unit 3 converts the acoustic signal of the learning data into a frequency domain signal. The acoustic signal is, for example, a digital signal sampled at 16 kHz. If the input acoustic signal is analog, an A/D converter that digitizes it may be placed before the FFT unit 3. The FFT unit 3 converts the discretized acoustic signal into a frequency domain signal by a short-time fast Fourier transform (FFT), for example in frames spaced 128 samples apart (t = 8 ms) with a window size of 16 ms. The window is, for example, the square root of a Hanning window. The frequency domain signal is sent to the noise level estimation unit 4.
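The framing in step S1 can be illustrated with a short sketch. At 16 kHz a 16 ms window is 256 samples and an 8 ms hop is 128 samples. The code below uses a naive DFT from the standard library instead of a fast transform to stay self-contained; the function names and the choice of a periodic Hann definition are assumptions made for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16000
WINDOW = 256  # 16 ms at 16 kHz
HOP = 128     # 8 ms at 16 kHz


def sqrt_hann(length):
    # Square root of a periodic Hann window (assumed window definition).
    return [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * k / length))
            for k in range(length)]


def stft(signal, window=WINDOW, hop=HOP):
    """Return one list of window // 2 + 1 complex bins per frame (naive DFT)."""
    win = sqrt_hann(window)
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        seg = [signal[start + k] * win[k] for k in range(window)]
        bins = [sum(seg[k] * cmath.exp(-2j * math.pi * w * k / window)
                    for k in range(window))
                for w in range(window // 2 + 1)]
        frames.append(bins)
    return frames


# A 500 Hz tone falls exactly on bin 8 (500 * 256 / 16000).
tone = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(512)]
spectrum = stft(tone)
```

In practice a library FFT would replace the inner sum; the naive form is only meant to make the window and hop arithmetic visible.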

In step S2, the noise level estimation unit 4 estimates the noise level at each frequency from the frequency domain signal output by the FFT unit 3 and generates noise level information. In this embodiment, the noise level information is a vector whose elements are the noise levels of the individual bands. The noise level information is sent to the boundary surface optimization unit 5.

As shown in FIG. 3, the noise level estimation unit 4 includes a level calculation unit 41, a time smoothing unit 42, a dip hold unit 43, and a band aggregation unit 44. The noise level estimation unit 4 receives the frequency domain signal X(ω, n) and outputs the estimated noise level N(ω, n), where ω denotes frequency and n denotes the frame number. The level calculation unit 41 calculates the absolute value |X(ω, n)| of the frequency domain signal X(ω, n) output by the FFT unit 3. The time smoothing unit 42 obtains the time-smoothed level |X(ω, n)|' from the level |X(ω, n)| by Equation (2).

Here, α is a smoothing coefficient that takes a value of at least 0 and less than 1. The closer α is to 1, the longer the time over which the level is smoothed.

The dip hold unit 43 applies dip hold processing to the time-smoothed level |X(ω, n)|' according to Equation (3) to obtain the estimated noise level N(ω, n).

That is, if the estimated noise level of the previous frame, N(ω, n-1), is greater than or equal to the time-smoothed level |X(ω, n)|', the time-smoothed level |X(ω, n)|' is assigned to the estimated noise level. Otherwise, the previous estimate N(ω, n-1) is multiplied by the rise coefficient u, which raises the noise level slightly. Here, u is a constant of 1 or more that is set in advance. It is the rise coefficient of the estimated noise level: the closer u is to 1, the more gradually the noise level rises, which produces the dip hold effect.
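The smoothing and dip hold steps can be sketched for a single frequency bin as below. Equations (2) and (3) are not reproduced in this text; the smoothing is assumed to be the first-order recursion |X(ω, n)|' = α·|X(ω, n-1)|' + (1 − α)·|X(ω, n)|, and the dip hold follows the verbal description. The default values of `alpha` and `u` and the initialization with the first frame are hypothetical.

```python
def estimate_noise_level(levels, alpha=0.9, u=1.01):
    """Track the noise floor of one frequency bin.

    `levels` holds |X(w, n)| for n = 0, 1, ...; returns N(w, n) per frame.
    """
    smoothed = levels[0]  # assumption: start from the first frame's level
    noise = levels[0]
    estimates = []
    for level in levels:
        # Assumed form of Equation (2): first-order recursive smoothing.
        smoothed = alpha * smoothed + (1.0 - alpha) * level
        if noise >= smoothed:
            noise = smoothed   # follow dips immediately
        else:
            noise = u * noise  # otherwise let the estimate rise slowly
        estimates.append(noise)
    return estimates


# Quiet, a loud burst, then quiet again: the estimate barely reacts to the burst.
track = estimate_noise_level([1.0] * 5 + [10.0] * 5 + [1.0] * 50, alpha=0.5)
```

One practical caveat of the multiplicative rise: if the estimate ever reaches exactly zero it stays at zero, so an implementation may want a small floor.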

The band aggregation unit 44 generates noise level information by aggregating the estimated noise level N(ω, n) over predetermined bands. This embodiment is described with three bands, but the number of bands is not limited. For example, the estimate is averaged over frequency bins 0 to 7 (band 1), 8 to 21 (band 2), and 22 to 65 (band 3), giving N₁(n), N₂(n), and N₃(n). Each of these is then averaged over, for example, the first second of a file to give a single value per band, which is associated with the corresponding item of learning data. This is written N_i = (N₁, N₂, N₃)_i, where the subscript i is the index of the learning data.
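A minimal sketch of the band aggregation, using the bin ranges given above and assuming the 8 ms frame spacing of step S1, so that the first second corresponds to 125 frames. The function name and the nested-list layout `noise[n][w]` are assumptions.

```python
BANDS = [(0, 7), (8, 21), (22, 65)]  # inclusive frequency-bin ranges
FRAMES_PER_SECOND = 125              # 1 s / 8 ms frame spacing


def band_noise_vector(noise, bands=BANDS, n_frames=FRAMES_PER_SECOND):
    """Reduce N(w, n), stored as noise[n][w], to one value per band for a file."""
    head = noise[:n_frames]  # only the first second of the file
    vector = []
    for lo, hi in bands:
        per_frame = [sum(frame[lo:hi + 1]) / (hi - lo + 1) for frame in head]
        vector.append(sum(per_frame) / len(per_frame))
    return tuple(vector)


# 200 frames whose bins carry the band number as their level.
flat = [[1.0] * 8 + [2.0] * 14 + [3.0] * 44 for _ in range(200)]
n_i = band_noise_vector(flat)
```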

In step S3, the boundary surface optimization unit 5 divides the acoustic signals of the learning data into groups based on the noise level information, calculates the recognition accuracy for each group from the recognition results stored in the recognition result storage unit 2, and optimizes the boundary surface between the groups so that the recognition accuracy is maximized. For convenience, the following description uses two groups, but the number of groups is not limited. To obtain three or more groups, a group of learning data produced by one boundary surface can be divided again by a new boundary surface, and this division and optimization can be repeated.

The boundary surface optimization unit 5 of this embodiment finds the boundary surface μ that divides the learning data into a group G_A, for which speech recognition accuracy is high with parameter set P_A, and a group G_B, for which speech recognition accuracy is high with parameter set P_B. The parameter sets P_A and P_B are determined by the parameter set selection unit 6. On the first execution, the parameter set selection unit 6 selects the optimal parameter sets for the initial value of the boundary surface. On the second and subsequent executions, it selects the optimal parameter sets for the optimized boundary surface. First, Equation (4) is given as the initial value of the equation representing the boundary surface μ.

The boundary surface μ is a plane that divides the space whose axes are the noise levels of the individual bands. In this embodiment, N₁ is the x axis, N₂ is the y axis, and N₃ is the z axis. Learning data in the region satisfying Equation (5) belongs to group G_A, and learning data in the region satisfying Equation (6) belongs to group G_B.
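The grouping rule can be sketched as a sign test on the plane equation. Equations (4) to (6) are not reproduced in this text, so the plane form a·N₁ + b·N₂ + c·N₃ + d = 0 and the choice of which side maps to G_A are assumptions; the function names are hypothetical.

```python
def plane_value(mu, noise_vector):
    """Evaluate a*N1 + b*N2 + c*N3 + d for mu = (a, b, c, d)."""
    a, b, c, d = mu
    n1, n2, n3 = noise_vector
    return a * n1 + b * n2 + c * n3 + d


def assign_group(mu, noise_vector):
    # Assumption: the non-negative side of the plane is group G_A.
    return "A" if plane_value(mu, noise_vector) >= 0 else "B"


# A plane at N1 = 30: files with a louder low band fall in G_A.
mu = (1.0, 0.0, 0.0, -30.0)
print(assign_group(mu, (40.0, 5.0, 5.0)), assign_group(mu, (10.0, 5.0, 5.0)))
```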

The boundary surface optimization unit 5 varies the boundary surface μ from its initial value and searches for the boundary surface that maximizes the recognition accuracy of each group. The evaluation value for boundary surface optimization is the recognition accuracy J of the learning data. The recognition accuracy J is calculated by Equation (7), and its value changes as the boundary surface moves.

Here, S_PA, D_PA, and I_PA (the subscript PA stands for P_A) are the numbers of substitution, deletion, and insertion error characters when speech recognition is performed with parameter set P_A. Similarly, S_PB, D_PB, and I_PB (the subscript PB stands for P_B) are the numbers of substitution, deletion, and insertion error characters when speech recognition is performed with parameter set P_B. C is not split by group, because it is the number of characters in the correct character strings.
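The evaluation of J for a given split can be sketched as a tally over stored error counts. Equation (7) is not reproduced in this text; the form assumed here subtracts the P_A errors of the G_A files and the P_B errors of the G_B files from the total number of correct characters and divides by that total. The dictionary layout and names are hypothetical.

```python
def split_accuracy(samples, groups, p_a, p_b):
    """Recognition accuracy J when G_A uses parameter set p_a and G_B uses p_b.

    samples[i] = {"C": chars, "errors": {param_id: (S, D, I)}}
    groups[i]  = "A" or "B"
    """
    total_chars = 0
    total_errors = 0
    for sample, group in zip(samples, groups):
        s, d, i = sample["errors"][p_a if group == "A" else p_b]
        total_chars += sample["C"]
        total_errors += s + d + i
    return (total_chars - total_errors) / total_chars


data = [
    {"C": 10, "errors": {1: (0, 0, 0), 2: (3, 0, 0)}},
    {"C": 10, "errors": {1: (4, 0, 0), 2: (1, 0, 0)}},
]
print(split_accuracy(data, ["A", "B"], p_a=1, p_b=2))
```

No recognizer is run here: the counts were stored beforehand, so evaluating a candidate boundary is only a summation.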

The value of the recognition accuracy J is uniquely determined for a given boundary surface, but it cannot be written as a closed-form function of the noise level, and it changes discontinuously. In this embodiment, the search is therefore performed with the hill climbing method. The processing procedure for boundary surface optimization is shown in FIG. 4. In this embodiment, the boundary surface μ is expressed in three-dimensional space as in Equation (4), so it can be treated as a four-dimensional vector as shown in Equation (8).

For optimization, the boundary surface μ is moved by a small amount (step S51). First, each component of a small vector (Δa, Δb, Δc, Δd) is generated as a random number. The magnitude of this small vector determines how finely the maximum of the recognition accuracy J is searched. The range of the generated random numbers is therefore restricted so as to satisfy Expression (9).

Here, the values of a_min, a_max, b_min, b_max, c_min, c_max, d_min, and d_max depend, for example, on the maximum of the average noise level in each band. For example, if N₁, N₂, and N₃ each range from about 0 to 60, one may set a_min, b_min, c_min, d_min = 0, a_max, b_max, c_max = 3, and d_max to an arbitrary positive value. In addition, the signs of the elements of the small vector (Δa, Δb, Δc, Δd) are inverted to generate all 2⁴ = 16 combinations, as shown in Equation (10).

The purpose is to choose the direction of movement from among the 2⁴ candidates ε instead of computing the direction in which the gradient of the recognition accuracy J is steepest. The boundary surface after the move, μ'_k (k = 1, 2, 3, …, 2⁴), can be expressed by Equation (11).
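The candidate moves can be sketched as follows. Expressions (9) to (11) are not reproduced in this text; the sketch assumes that each magnitude is drawn uniformly between 0 and an upper limit, that the 16 candidates ε are all sign patterns of those magnitudes, and that the moved plane is μ'_k = μ + ε_k. The function names and default limits are hypothetical.

```python
import itertools
import random


def candidate_steps(max_step=(3.0, 3.0, 3.0, 1.0), rng=random):
    """Draw one magnitude per component, then return all 16 sign patterns."""
    magnitudes = [rng.uniform(0.0, limit) for limit in max_step]
    return [tuple(sign * mag for sign, mag in zip(signs, magnitudes))
            for signs in itertools.product((1, -1), repeat=4)]


def moved_planes(mu, steps):
    # Assumed form of Equation (11): add each step to (a, b, c, d).
    return [tuple(m + e for m, e in zip(mu, step)) for step in steps]


steps = candidate_steps(rng=random.Random(0))
planes = moved_planes((1.0, 0.0, 0.0, -30.0), steps)
```

Trying a fixed set of directions avoids differentiating J, which is piecewise constant in μ.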

When the boundary surface moves, some of the learning data that belonged to group G_A before the move now satisfies Equation (6) and therefore moves to group G_B. Likewise, some of the learning data that belonged to group G_B before the move now satisfies Equation (5) and moves to group G_A. The group membership of the moved learning data is updated, and parameter set P_A is taken to be used for the learning data in group G_A and parameter set P_B for the learning data in group G_B. The recognition accuracy over all the learning data is recalculated by Equation (7) and denoted J'_k (step S52). Because speech recognition itself has already been carried out, Equation (7) only requires tallying the resulting character counts. In this embodiment there are 2⁴ values of J'_k, and the one that maximizes Equation (7) is taken as J_max (step S53).

The maximum recognition accuracy J_max is compared with the previously obtained recognition accuracy J to check whether moving the boundary surface is still improving the recognition accuracy (steps S54 and S56). If it is (that is, if J_max is greater than J), J_max is assigned to J, the boundary surface μ is moved according to Equation (11), and the process of recalculating the recognition accuracy J continues (step S55). If J_max is equal to the previous J, the constraint of Expression (9) is removed, ε is regenerated, and the process of recalculating J continues (step S57). It is unrealistic for J to remain unchanged no matter how the boundary surface is moved, so repeatedly regenerating ε and moving the boundary surface can be expected to change J eventually. If J_max is less than the previous J, the process ends and the boundary surface μ is fixed.
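The loop of steps S51 to S57 can be sketched as below, with the accuracy evaluation passed in as a callable. Several details are assumptions rather than the patent's method: "removing the constraint" is modeled by scaling the step limits by a hypothetical `wide_factor`, the limits are reset after an improvement, the plane stays in place on a tie, and an iteration cap guards against endless ties.

```python
import itertools
import random


def hill_climb(mu, accuracy, max_step, rng, max_iter=1000, wide_factor=10.0):
    """Move the plane mu = (a, b, c, d) while accuracy(mu) keeps improving."""
    best = accuracy(mu)
    limits = list(max_step)
    for _ in range(max_iter):  # assumption: cap on the number of iterations
        mags = [rng.uniform(0.0, limit) for limit in limits]
        candidates = [tuple(p + s * g for p, s, g in zip(mu, signs, mags))
                      for signs in itertools.product((1, -1), repeat=4)]
        j_max, mu_max = max(((accuracy(c), c) for c in candidates),
                            key=lambda pair: pair[0])
        if j_max > best:       # improvement: accept the move (S55)
            mu, best, limits = mu_max, j_max, list(max_step)
        elif j_max == best:    # tie: retry with a wider step range (S57)
            limits = [limit * wide_factor for limit in max_step]
        else:                  # every candidate is worse: stop
            break
    return mu, best


# Toy objective standing in for J: best when d equals 5.
toy = lambda m: -abs(m[3] - 5.0)
plane, score = hill_climb((1.0, 0.0, 0.0, 0.0), toy, (3.0, 3.0, 3.0, 1.0),
                          random.Random(0))
```

With a real J built from error counts, ties are common because small moves often leave every file in its group, which is why the tie branch matters.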

The iterative process described above finds an optimum near the initial value. However, the recognition accuracy J has multiple local maxima, so a plurality of boundary surfaces are given as initial values to avoid being trapped in a local optimum. To search a wide range efficiently, the initial boundary surfaces are arranged in a grid. When the noise levels of three bands are used as axes, the boundary surface is a plane in three-dimensional space represented by Equation (4). As initial values, for example, planes parallel to each axis and planes that intersect each axis at 45° or 135° are prepared. FIG. 5 shows the initial boundary surfaces parallel to the z axis as dash-dot lines. These planes can be created from all combinations obtained by assigning 0, 1, or -1 to a, b, and c. Because d represents the distance between the origin and the plane, it is varied according to the range of the noise level values in the learning data. The increment of d corresponds to the grid spacing.
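The grid of initial planes can be enumerated as in the sketch below. The all-zero combination of (a, b, c) is skipped because it does not define a plane, which is an assumption about what the text intends; the offsets passed as `d_values` are hypothetical and would be chosen from the noise level range of the learning data.

```python
import itertools


def initial_planes(d_values):
    """All (a, b, c, d) with a, b, c in {0, 1, -1}, excluding a = b = c = 0."""
    normals = [n for n in itertools.product((0, 1, -1), repeat=3) if any(n)]
    return [(a, b, c, d) for (a, b, c) in normals for d in d_values]


# 26 normal directions times 3 offsets.
starts = initial_planes([-30, 0, 30])
```

Each start would be refined by the hill climbing search, and the result with the highest J kept.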

For every initial boundary surface, the optimal boundary surface near that initial value is found, and the one with the highest recognition accuracy J is chosen. Once the optimal boundary surface has been determined by this series of processes, step S3 ends and processing proceeds to step S4. Note that the contents of parameter sets P_A and P_B are fixed from the time an initial boundary surface is given until the optimal boundary surface is found.

In step S4, the parameter set selection unit 6 determines the contents of parameter sets P_A and P_B from the plurality of parameter sets stored in the parameter set storage unit 1. Because the recognition result storage unit 2 holds the results of recognizing all the learning data with each parameter set, calculating the recognition accuracy J only requires tallying the numbers of substitution, deletion, and insertion error characters.

The parameter set selection unit 6 receives the value of the boundary surface μ from the boundary surface optimization unit 5. Learning data on one side of the boundary surface μ belongs to group G_A, and learning data on the other side belongs to group G_B. The parameter set selection unit 6 calculates, according to Equation (12), the recognition accuracies J_A,λ and J_B,λ (λ = 1, 2, …, L) of each group for each of the parameter sets P₁, …, P_L.

Here, λ is the index of the parameter set, and S_Pλ, D_Pλ, and I_Pλ (the subscript Pλ stands for P_λ) are the numbers of substitution, deletion, and insertion error characters when speech recognition is performed with parameter set P_λ. The parameter sets P_A and P_B are determined according to Equation (13).
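The selection in step S4 can be sketched as an argmax per group over the stored error counts. Equations (12) and (13) are not reproduced in this text; the sketch assumes that the per-group accuracy has the same form as the character accuracy and that P_A and P_B are the parameter sets maximizing J_A,λ and J_B,λ. The data layout and names are hypothetical.

```python
def select_parameter_sets(samples, groups, parameter_ids):
    """Return (P_A, P_B): the best parameter set id for each group.

    samples[i] = {"C": chars, "errors": {param_id: (S, D, I)}}
    groups[i]  = "A" or "B"
    """
    def group_accuracy(group, lam):
        members = [s for s, g in zip(samples, groups) if g == group]
        total_chars = sum(s["C"] for s in members)
        if total_chars == 0:  # empty group: no parameter set is preferable
            return float("-inf")
        total_errors = sum(sum(s["errors"][lam]) for s in members)
        return (total_chars - total_errors) / total_chars

    p_a = max(parameter_ids, key=lambda lam: group_accuracy("A", lam))
    p_b = max(parameter_ids, key=lambda lam: group_accuracy("B", lam))
    return p_a, p_b


data = [
    {"C": 10, "errors": {1: (0, 0, 0), 2: (3, 0, 0)}},
    {"C": 10, "errors": {1: (4, 0, 0), 2: (1, 0, 0)}},
]
print(select_parameter_sets(data, ["A", "B"], [1, 2]))
```

If both groups return the same id, the split brought no benefit for that boundary.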

If Equation (13) selects the same parameter set for both P_A and P_B, the learning data set did not need to be divided at that boundary surface, so the boundary surface is deleted.

In step S5, the parameter determination device determines whether the optimization of the boundary surface and the parameter sets has converged. The convergence condition is that the contents of parameter sets P_A and P_B are no longer updated in step S4. If the optimization has not converged, step S3 is executed again. If it has converged, the parameter sets are stored as the optimal parameter sets in association with the boundary surface μ, and the process ends.

The sound collection processing unit 7 and the speech recognition unit 8 perform speech recognition using the optimal parameter set selected as described above. The sound collection processing unit 7 applies preprocessing such as speech enhancement to the input acoustic signal, using the values of the optimal parameter set output by the parameter set selection unit 6. The speech recognition unit 8 performs speech recognition on the processed acoustic signal and outputs the word string obtained as the recognition result.
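At recognition time, the stored boundary surface and its associated parameter sets could be used roughly as follows. This sketch assumes that the boundary surface is a linear expression in the band-wise noise levels and that a non-negative value corresponds to group G_A; the parameter name `suppression_strength` and all numeric values are hypothetical.

```python
def choose_parameter_set(noise_levels, offset, coeffs, params_a, params_b):
    """Return the parameter set for the side of the linear boundary on which noise_levels falls."""
    value = offset + sum(c * x for c, x in zip(coeffs, noise_levels))
    # Assumption: a non-negative value means group G_A, a negative value means group G_B.
    return params_a if value >= 0 else params_b


# Hypothetical result of the learning stage (illustrative values only).
boundary_offset = -25.0
boundary_coeffs = [0.5, 0.5]
params_a = {"suppression_strength": 0.9}  # hypothetical parameter name
params_b = {"suppression_strength": 0.3}

# Band-wise noise levels estimated from the input acoustic signal (illustrative values).
chosen = choose_parameter_set([30.0, 40.0], boundary_offset, boundary_coeffs, params_a, params_b)
```

The chosen parameter set would then be passed to the preprocessing performed by the sound collection processing unit 7.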

As described above, the parameter determination technique of the present invention focuses on the tendency for the appropriate values of a parameter set, such as that of a speech enhancement technique, to change with the noise level in each band. It therefore prepares multiple parameter sets rather than a single one and selects the optimal parameter set according to the noise level of the acoustic signal. To make this selection, the acoustic signals of the learning data are preprocessed with every parameter set and recognition results are generated, and the noise-level boundary surface at which the parameter set should be switched is then searched for by the hill climbing method. With this configuration, an optimal parameter set can be selected for speech recognition in environments where the noise level fluctuates. In addition, because the optimal parameter set values for each noise level are selected automatically, the cost of parameter adjustment can be reduced.
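The hill-climbing search for the boundary surface can be sketched as below. This is one possible reading, not the method as specified: it assumes a linear boundary with weights [offset, c_1, ..., c_K], scores a boundary by the pooled accuracy obtained when each side uses its best parameter set, and perturbs one weight at a time with a step that is halved when no move helps. The function names, the step schedule, and the data layout are assumptions.

```python
def total_accuracy(weights, noise_vectors, errors, n_chars):
    """Pooled accuracy when each side of the boundary uses its best parameter set.

    weights = [offset, c_1, ..., c_K]; errors[lam][i] = error characters of
    utterance i under parameter set lam; n_chars[i] = characters in its transcript.
    """
    side_a = [weights[0] + sum(c * x for c, x in zip(weights[1:], v)) >= 0
              for v in noise_vectors]
    total_errors = 0
    for in_a in (True, False):
        members = [i for i, a in enumerate(side_a) if a == in_a]
        if members:
            total_errors += min(sum(e[i] for i in members) for e in errors)
    return (sum(n_chars) - total_errors) / sum(n_chars)


def hill_climb(weights, noise_vectors, errors, n_chars, step=1.0, min_step=1e-3):
    """Accept any single-weight change that improves the score; halve the step otherwise."""
    weights = list(weights)
    best = total_accuracy(weights, noise_vectors, errors, n_chars)
    while step >= min_step:
        improved = False
        for k in range(len(weights)):
            for delta in (step, -step):
                trial = weights.copy()
                trial[k] += delta
                score = total_accuracy(trial, noise_vectors, errors, n_chars)
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            step /= 2
    return weights, best


# Illustrative data: one band, four utterances, two parameter sets.
noise_vectors = [[10.0], [20.0], [30.0], [40.0]]
n_chars = [10, 10, 10, 10]
errors = [[0, 0, 5, 5],   # parameter set 0 works well at low noise
          [5, 5, 0, 0]]   # parameter set 1 works well at high noise
final_weights, final_score = hill_climb([-15.0, 1.0], noise_vectors, errors, n_chars)
```

As with any hill-climbing search, the result depends on the starting point: a start from which no small change alters the grouping leaves the search where it began.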

The present invention is not limited to the above-described embodiment, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as otherwise required.
[Program, recording medium]
When the various processing functions of each device described in the above embodiment are realized by a computer, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first, for example, temporarily stores in its own storage device the program recorded on the portable recording medium or transferred from the server computer. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred from the server computer to this computer, it may execute processing according to the received program in turn. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions solely through execution instructions and result acquisition, without transferring the program from the server computer to this computer. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least a part of these processing contents may instead be realized in hardware.

Description of reference numerals
1 Parameter set storage unit
2 Recognition result storage unit
3 FFT unit
4 Noise level estimation unit
5 Boundary surface optimization unit
6 Parameter set selection unit
7 Sound collection processing unit
8 Speech recognition unit

Claims

A parameter determination device comprising:
a parameter set storage unit that stores a plurality of preprocessing parameter sets;
a recognition result storage unit that stores recognition results obtained by recognizing a plurality of acoustic signals using each of the plurality of preprocessing parameter sets;
a noise level estimation unit that estimates a noise level for each band from the acoustic signals and generates noise level information;
a boundary surface optimization unit that divides the plurality of acoustic signals into groups based on the noise level information and optimizes the boundary surface between the groups so that the recognition accuracy calculated for each group from the recognition results is maximized; and
a parameter set selection unit that selects, for each group, an optimal preprocessing parameter set that maximizes the recognition accuracy from among the plurality of preprocessing parameter sets.

The parameter determination device according to claim 1, wherein
the noise level estimation unit generates, as the noise level information, a vector having a plurality of values obtained by aggregating the noise levels estimated for each frequency from the acoustic signal into predetermined frequency bands,
the boundary surface optimization unit represents the boundary surface by a linear equation having the value of each frequency band as a variable, groups the acoustic signals according to the result of substituting each value of the noise level information into the corresponding variable of the linear equation, and optimizes the boundary surface by comparing the recognition accuracy for each group before and after changing the coefficients of the linear equation, and
the boundary surface optimization unit and the parameter set selection unit are executed repeatedly until the optimal preprocessing parameter set selected by the parameter set selection unit is no longer updated.

The parameter determination device according to claim 1 or 2, wherein
the recognition accuracy is the value obtained by subtracting the number of recognition error characters in the recognition result from the number of characters in the correct character string for the acoustic signal, and dividing the difference by the number of characters in the correct character string.

A parameter determination method in which
a plurality of preprocessing parameter sets are stored in a parameter set storage unit, and
recognition results obtained by recognizing a plurality of acoustic signals using each of the plurality of preprocessing parameter sets are stored in a recognition result storage unit,
the method comprising:
a noise level estimation step in which a noise level estimation unit estimates a noise level for each band from the acoustic signals and generates noise level information;
a boundary surface optimization step in which a boundary surface optimization unit divides the plurality of acoustic signals into groups based on the noise level information and optimizes the boundary surface between the groups so that the recognition accuracy calculated for each group from the recognition results is maximized; and
a parameter set selection step in which a parameter set selection unit selects, for each group, an optimal preprocessing parameter set that maximizes the recognition accuracy from among the plurality of preprocessing parameter sets.

A program for causing a computer to function as the parameter determination device according to any one of claims 1 to 3.