JP3129164B2

JP3129164B2 - Voice recognition method

Info

Publication number: JP3129164B2
Application number: JP07226173A
Authority: JP
Inventors: 麻紀山田; 昌克星見; 知浩小沼; 勝行二矢田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1995-09-04
Filing date: 1995-09-04
Publication date: 2001-01-29
Anticipated expiration: 2015-09-04
Also published as: JPH0968995A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method for having a machine recognize the human voice.

[0002]

[Prior Art] In recent years, speaker-independent speech recognition apparatuses, which can recognize anyone's voice without requiring users to register their voices, have come into practical use. As a practical speaker-independent method, the method of Japanese Unexamined Patent Publication No. S61-188599 (JP-A-61-188599) is described below as a conventional example.

[0003] The conventional method finds the start and end of the input speech to determine the speech section, linearly expands or contracts the speech section to a fixed length of J frames, and recognizes the word by computing the similarity between this normalized pattern and each word standard pattern through pattern matching with a statistical distance measure.

[0004] The conventional example is described in detail below with reference to FIGS. 7 and 8. FIG. 7 is a flowchart showing the processing flow of the conventional speech recognition method. In FIG. 7, 1 is an acoustic analysis unit, 2 is a feature parameter extraction unit, 3 is a speech section detection unit, 10 is a time-axis linear normalization unit, 4 is a standard pattern storage unit, 11 is a distance calculation unit, and 6 is a distance comparison unit.

[0005] In FIG. 7, when speech is input, the acoustic analysis unit 1 performs linear prediction (LPC) analysis at every analysis interval (called a frame; 1 frame = 10 ms in this conventional example). Next, the feature parameter extraction unit 2 obtains P feature parameters for each frame. The feature parameters are the LPC mel-cepstrum coefficients (nine in this example, C1 to C9), the normalized residual C0, and the time difference V0 of the logarithmic speech power. Next, the speech section detection unit 3 detects the start frame and the end frame of the input speech. Using speech power is the simplest way to detect the speech section, but any method may be used. For the detected speech section, the time-axis linear normalization unit 10 linearly expands or contracts the feature parameter time series of the input speech to J frames. FIG. 8 illustrates this conceptually. Normally, J is chosen smaller than the actual number of frames in a word in order to reduce the amount of computation and the number of parameters to be estimated for the standard pattern. This corresponds to thinning out frames at equal intervals over the entire speech section of the word. Taking the start frame of the detected input speech section as frame 1 and the end frame as frame I, the relationship between the j-th frame after normalization and the i-th frame of the input speech is

[0006]

(Equation 1)

[0007] Here, [ ] denotes the largest integer not exceeding the enclosed value. The feature parameters of the J frames after normalization are arranged in time order to form the input time-series pattern X.

[0008]

(Equation 2)
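The linear normalization described above can be sketched in Python as follows. The drawings for (Equation 1) and (Equation 2) are not reproduced in this text, so the index formula used here, i = 1 + [(I - 1)(j - 1)/(J - 1)], is only an assumed form, chosen so that j = 1 maps to frame 1 and j = J maps to frame I. The function names `linear_frame_indices` and `input_pattern` are illustrative and do not appear in the original.

```python
def linear_frame_indices(num_input_frames, num_output_frames):
    """Return the 1-based input frame i chosen for each output frame j = 1..J.

    Assumed form (not taken from the source): i = 1 + floor((I-1)(j-1)/(J-1)).
    """
    I, J = num_input_frames, num_output_frames
    if I < 1 or J < 2:
        raise ValueError("need I >= 1 and J >= 2")
    return [1 + ((I - 1) * (j - 1)) // (J - 1) for j in range(1, J + 1)]


def input_pattern(frames, num_output_frames):
    """Concatenate the selected frames' feature vectors into one pattern X.

    frames: list of I per-frame feature lists (P values each),
            running from the start frame to the end frame of the speech section.
    """
    indices = linear_frame_indices(len(frames), num_output_frames)
    # Frame numbers are 1-based in the text, list positions are 0-based.
    return [value for i in indices for value in frames[i - 1]]
```

With I = 10 input frames and J = 4, this sketch keeps frames 1, 4, 7, and 10, which matches the description of thinning out frames at equal intervals.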

[0009] The distance calculation unit 11 computes the distance between this input time-series pattern X and the standard pattern of each word in the recognition vocabulary stored in the standard pattern storage unit 4. How the standard patterns are created and how the distance is computed are described later. Finally, from the distances to the standard patterns computed by the distance calculation unit 11, the distance comparison unit 6 selects the standard pattern with the smallest distance (largest similarity) and outputs the corresponding speech name as the recognition result.

[0010] The following describes how a word standard pattern is created and how the distance between the input time-series pattern and a word standard pattern is calculated.

[0011] The standard pattern of a word ωn is created by the following procedure.
(1) Prepare M training utterances of the word ωn spoken by many people (100 here).
(2) Linearly expand or contract each utterance to J frames using (Equation 1).
(3) For the m-th utterance, arrange the normalized feature parameters in time order to obtain the time-series pattern Cm (m = 1, ..., M).
(4) Create the standard pattern by computing the statistics (mean and covariance) of the M time-series patterns Cm (m = 1, ..., M).

[0012] This is done for each of the N words in the recognition vocabulary. The time-series pattern Cm, in which the normalized feature parameters of the m-th utterance are arranged in time order, is expressed as follows.

[0013]

(Equation 3)

[0014] This is obtained for each of the M training utterances. Treating the time pattern Cm as a single vector means that the correlation of the parameters between frames is taken into account. The mean vector μ and the covariance matrix W are computed from the M vectors Cm (m = 1, ..., M), each of dimension J × P. Below, the mean vector for the n-th word ωn is written μn and its covariance matrix Wn.
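As a rough illustration of step (4), the mean vector and covariance matrix of the M pattern vectors could be computed as in the Python sketch below. The text does not say whether the covariance is divided by M or by M - 1; dividing by M here is an assumption, and the function name `mean_and_covariance` is illustrative.

```python
def mean_and_covariance(patterns):
    """Return (mu, W) for a list of M pattern vectors of equal length J*P.

    Assumption: the covariance is normalized by M (not M - 1).
    """
    M = len(patterns)
    dim = len(patterns[0])
    # Mean vector: element-wise average over the M training patterns.
    mu = [sum(c[d] for c in patterns) / M for d in range(dim)]
    # Covariance matrix: average outer product of the deviations from mu.
    W = [
        [sum((c[a] - mu[a]) * (c[b] - mu[b]) for c in patterns) / M
         for b in range(dim)]
        for a in range(dim)
    ]
    return mu, W
```

Because each Cm spans all J frames, the off-diagonal entries of W include covariances between parameters of different frames, which is the inter-frame correlation the paragraph refers to.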

[0015] The distance between the input time-series pattern X and a word standard pattern is calculated using a distance based on Bayes decision with a common covariance matrix.

[0016] The distance based on Bayes decision is obtained as follows. When the input vector X of (Equation 2) is observed, the probability P(ωn|X) that it is the word ωn is, by Bayes' theorem,

[0017]

(Equation 4)
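The drawing for (Equation 4) is not reproduced in this text. Bayes' theorem, which the preceding sentence invokes, takes the standard form below when written in the symbols used here:

```latex
P(\omega_n \mid X) = \frac{P(X \mid \omega_n)\, P(\omega_n)}{P(X)}
```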

[0018] Here P(X|ωn) is the prior probability, that is, the probability that the input vector X is observed when the input belongs to category ωn, and P(X) is the probability that the vector X is observed over all possible inputs. The occurrence probability P(ωn) of the word ωn is assumed to be the same for every word and is therefore a constant, and P(X) is a constant for a given input X, so the category ωn that maximizes the prior probability P(X|ωn) can be taken as the decision result.

[0019] Assuming that the parameters follow a normal distribution, the prior probability P(X|ωn) is given by (Equation 5).

[0020]

(Equation 5)

[0021] Here t denotes transposition. Taking the logarithm of both sides, omitting the constant terms that are unnecessary for discrimination, and multiplying by -2 gives the following equation.

[0022]

(Equation 6)
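The drawings for (Equation 5) and (Equation 6) are not reproduced in this text. Under the stated normality assumption, they would plausibly take the usual multivariate Gaussian form shown below; this is a reconstruction from the surrounding description, not a copy of the original equations. Here d = J × P is the dimension of X, a symbol introduced only for this sketch. Note that P(X|ωn), which the text calls the prior probability, is more commonly called the class-conditional likelihood.

```latex
% Assumed form of (Equation 5)
P(X \mid \omega_n) = (2\pi)^{-d/2}\, |W_n|^{-1/2}
  \exp\!\left\{ -\tfrac{1}{2}\, (X - \mu_n)^{t}\, W_n^{-1}\, (X - \mu_n) \right\}

% Taking -2 \log of both sides and dropping the constant d \log 2\pi
% gives the assumed form of (Equation 6)
L_n = (X - \mu_n)^{t}\, W_n^{-1}\, (X - \mu_n) + \log |W_n|
```

Maximizing P(X|ωn) is then equivalent to minimizing Ln, which is why the smallest distance corresponds to the largest similarity.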

[0023] This expression is the distance based on Bayes decision for the word ωn. To reduce the amount of computation and the number of parameters to be estimated, the covariance matrix is made common and this expression is expanded into a linear (first-order) discriminant. The covariance matrices Wn of the standard patterns of the recognition vocabulary are replaced by a common matrix W, which is obtained as follows.

[0024]

(Equation 7)

[0025] Therefore, we can set

[0026]

(Equation 8)

[0027] Substituting this into (Equation 6) and omitting the constant terms that are unnecessary for discrimination gives

[0028]

(Equation 9)

[0029] and, by setting

[0030]

(Equation 10)

[0031]

(Equation 11)

[0032] the expression becomes

[0033]

(Equation 12)
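The drawings for (Equation 8) through (Equation 12) are not reproduced in this text. The following is a plausible reconstruction of the derivation, consistent with the description of a common covariance matrix W and coefficients An and Bn; the exact form in the original may differ.

```latex
% With W_n = W for every word, \log|W| is common to all words and is dropped:
L_n = (X - \mu_n)^{t}\, W^{-1}\, (X - \mu_n)
    = X^{t} W^{-1} X \;-\; 2\, \mu_n^{t} W^{-1} X \;+\; \mu_n^{t} W^{-1} \mu_n

% X^{t} W^{-1} X does not depend on n either, so with
A_n = 2\, W^{-1} \mu_n, \qquad B_n = \mu_n^{t}\, W^{-1}\, \mu_n

% the distance used for comparison reduces to a linear function of X:
L_n = B_n - A_n^{t} X
```

Since W is symmetric, A_n^t X equals 2 μ_n^t W^{-1} X, so the last line agrees term by term with the expansion above once the common term is removed.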

[0034] The result is thus a linear discriminant. An and Bn are obtained in this way for each word in the recognition vocabulary and stored in the standard pattern storage unit. The distance calculation unit uses the above expression to obtain the distance Ln between the input time-series pattern X and the standard pattern of the word ωn.
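A minimal Python sketch of this storage and distance step is given below. It assumes the coefficients take the form An = 2 W⁻¹ μn and Bn = μnᵗ W⁻¹ μn with Ln = Bn - Anᵗ X, which is consistent with the text but is not confirmed by it, since the equation drawings are missing. The helper `solve` and the names `make_discriminant`, `distance`, and `recognize` are illustrative.

```python
def solve(W, b):
    """Solve W a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in W[r]] + [float(b[r])] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * a[k] for k in range(r + 1, n))
        a[r] = (aug[r][n] - tail) / aug[r][r]
    return a


def make_discriminant(mu_n, W):
    """Return (A_n, B_n) for one word; assumed forms, see lead-in."""
    w_inv_mu = solve(W, mu_n)                       # W^{-1} mu_n
    A_n = [2.0 * v for v in w_inv_mu]
    B_n = sum(m * v for m, v in zip(mu_n, w_inv_mu))
    return A_n, B_n


def distance(X, A_n, B_n):
    """Linear discriminant distance L_n = B_n - A_n^t X."""
    return B_n - sum(a * x for a, x in zip(A_n, X))


def recognize(X, patterns):
    """patterns maps a word name to (A_n, B_n); return the name with minimum L_n."""
    return min(patterns, key=lambda name: distance(X, *patterns[name]))
```

Precomputing An and Bn means that each distance at recognition time costs one inner product of length J × P, which is the low computation the text attributes to the conventional method.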

[0035]

[Problems to be Solved by the Invention] The conventional method is practical because it requires little computation. However, in the conventional method the number of frames J of the standard pattern cannot be made large because of the accuracy of parameter estimation, so recognition is performed with frames thinned out at equal intervals over the entire speech section. As a result, information is lost in portions such as consonants, which are short in duration and need to be matched in detail, and a sufficient recognition rate cannot be obtained. Meanwhile, the information in portions such as vowels, which are temporally stationary and long in duration, becomes redundant.

[0036] In addition, as the distance measure for matching the input speech against the standard pattern, the conventional method treats the whole utterance as a single vector and uses a statistical distance measure expressed by a linear discriminant function. This allowed recognition with little computation, but with the rapid increase in computer speed in recent years it has become necessary to improve recognition performance even at the cost of more computation.

[0037] Furthermore, the conventional method needs training speech data uttered by many people to create each word standard pattern, so the recognition vocabulary cannot be changed easily.

[0038] The present invention solves the above conventional problems. Its first object is to provide a speech recognition method with a higher recognition rate than the conventional example.

[0039] The second object is to provide a speech recognition method that further improves the recognition rate by using a distance measure with high discrimination performance.

[0040] The third object is to provide a highly accurate speech recognition method in which word standard patterns can be created from Japanese kana spelling, so that the recognition vocabulary can be changed easily.

[0041]

[Means for Solving the Problems] First, the present invention solves the above problems by the following means.

[0042] For the consonant part of a word utterance, the standard pattern is created by taking consecutive frames centered on a reference frame; for the vowel part, the standard pattern is created by linearly expanding or contracting the frames. During recognition, the consonant part is matched using consecutive frames and the vowel part is matched with the frames expanded or contracted. Selecting frames in this way improves speech recognition performance.

[0043] So as not to increase the amount of computation or the number of parameters to be estimated for the standard pattern, the input speech is matched against the standard pattern using a statistical distance measure expressed by a linear discriminant function that treats the whole utterance as a single vector and thereby takes inter-frame correlation into account. Alternatively, at the cost of doubling the computation, the frames are treated independently and dynamic feature parameters, which are the temporal changes of the feature parameters, are used together with them in a statistical distance measure expressed by a linear discriminant function.

[0044] Second, the present invention solves the above problems by the following means. A statistical distance measure expressed by a quadratic discriminant function is used as the distance measure for matching the input speech against the standard pattern in the first means. However, creating a standard pattern in which the time-series pattern of the feature parameters over the whole word is a single vector would require an enormous number of training samples to estimate the covariance, so the time pattern is treated as an independent vector for each frame. Using a statistical distance measure expressed by a quadratic discriminant function further improves speech recognition performance. Using dynamic feature parameters, which are the temporal changes of the feature parameters, in combination improves speech recognition performance still further.

[0045] Third, the present invention solves the above problems by the following means. Standard patterns are created in the same manner as in the first and second means for each unit such as a syllable, CV (consonant + vowel), VC (vowel + consonant), VCV (vowel + consonant + vowel), or CVC (consonant + vowel + consonant). These are concatenated to create the standard pattern of an arbitrary word, and recognition is performed in the same manner as in the first and second means. Because word standard patterns can be created from Japanese kana spelling, the recognition vocabulary can be changed easily.

[0046]

[Operation] Japanese is composed of consonants and vowels. In general, the spectrum of a vowel part changes little over time and is stationary, and its duration readily expands or contracts with differences in speaking rate. In a consonant part, by contrast, the information that identifies the phoneme lies in the temporal change of the spectrum, and its duration is relatively short and hardly expands or contracts even when the speaking rate differs.

[0047] First, in the present invention the consonant part is matched using consecutive frames centered on the reference frame, without expansion or contraction, and the vowel part is matched with the frames expanded or contracted. The local temporal changes of the spectrum in the consonant part and the global spectral characteristics of the vowel part can thus be captured appropriately regardless of speaking rate, which improves recognition performance. Because fewer frames are used for the vowel part in exchange for taking the consonant part consecutively, the number of frames in the standard pattern does not increase.

[0048] Using a statistical distance measure expressed by a linear discriminant function that treats the whole utterance as a single vector and takes inter-frame correlation into account improves the recognition rate without increasing the amount of computation or the number of parameters to be estimated. Treating the frames independently and instead combining dynamic feature parameters, which are the temporal changes of the feature parameters, in a statistical distance measure expressed by a linear discriminant function doubles the computation but also improves the recognition rate.

[0049] Second, when matching the input speech against the standard pattern, the present invention treats the frames independently and uses a statistical distance measure expressed by a quadratic discriminant function, which further improves speech recognition performance. When dynamic feature parameters, which are the temporal changes of the feature parameters, are used in combination, the temporal-change features that are lost by treating the frames independently can be captured, so speech recognition performance improves still further.
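One way to picture a frame-independent quadratic distance is the Python sketch below. It assumes that each of the J standard-pattern frames has its own mean vector and covariance matrix and that the word distance is the sum over frames of the per-frame Bayes distance (x - μ)ᵗ W⁻¹ (x - μ) + log|W|. The text in this section does not give the exact expression, so this form and all names (`solve_and_logdet`, `frame_distance`, `quadratic_distance`) are assumptions for illustration only.

```python
import math


def solve_and_logdet(W, b):
    """Solve W y = b by Gaussian elimination; also return log|det W|."""
    n = len(b)
    aug = [[float(v) for v in W[r]] + [float(b[r])] for r in range(n)]
    logdet = 0.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        logdet += math.log(abs(aug[c][c]))   # |det| is the product of pivots
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y, logdet


def frame_distance(x, mu, W):
    """Quadratic Bayes distance of one frame: (x-mu)^t W^-1 (x-mu) + log|W|."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    y, logdet = solve_and_logdet(W, d)
    return sum(di * yi for di, yi in zip(d, y)) + logdet


def quadratic_distance(frames, means, covariances):
    """Sum the per-frame distances over the J frames (frames treated independently)."""
    return sum(frame_distance(x, mu, W)
               for x, mu, W in zip(frames, means, covariances))
```

In such a scheme each covariance matrix is only P × P (or 2P × 2P if dynamic feature parameters are appended to each frame vector), which illustrates why far fewer training samples are needed than for a single (J × P)-dimensional covariance.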

[0050] Third, the present invention creates the standard pattern of an arbitrary word by concatenating standard patterns of units such as syllables, CV (consonant + vowel), VC (vowel + consonant), VCV (vowel + consonant + vowel), or CVC (consonant + vowel + consonant), and performs recognition with it. Because word standard patterns can be created from Japanese kana spelling, the recognition vocabulary can be changed easily.

[0051] In addition, introducing a word spotting function makes it possible to realize a highly practical recognition apparatus that is robust against noise.

[0052]

[Embodiments]

(Embodiment 1) The first embodiment of the present invention is described below.

[0053] The first embodiment describes a speech recognition method in which the recognition targets are monosyllables, that is, syllables (the smallest unit of Japanese utterance) spoken in isolation. The input speech is matched against monosyllable standard patterns using a statistical distance measure expressed by a linear discriminant function based on Bayes decision with a common covariance matrix, treating the whole utterance as a single vector.

[0054] In the first embodiment, the monosyllable section of the unknown input speech is detected and matched against monosyllable standard patterns created in advance, thereby recognizing the monosyllable.

[0055] A Japanese monosyllable consists of a consonant part followed by a vowel part. In general, the spectrum of a vowel part changes little over time and is stationary, and its duration readily expands or contracts with differences in speaking rate. In a consonant part, by contrast, the information that identifies the phoneme lies in the temporal change of the spectrum, and its duration is relatively short and hardly expands or contracts even when the speaking rate differs. Therefore, the consonant part is matched against the standard pattern using consecutive frames (the unit of analysis time; 1 frame = 10 ms in this embodiment) without expansion or contraction, and the vowel part is matched with the frames expanded or contracted. Because the spectrum of the vowel part is stationary, combining several adjacent frames into one frame of the standard pattern causes little loss of discrimination performance. By taking the consonant part densely with consecutive frames while taking the vowel part sparsely with thinned-out frames, the recognition rate can be improved without increasing the total number of frames in the monosyllable standard pattern.

[0056] The first embodiment is described with reference to FIGS. 1, 2, and 3. FIG. 1 is a flowchart showing the processing flow of the speech recognition method of the first embodiment. In FIG. 1, 1 is an acoustic analysis unit that performs linear prediction (LPC) analysis on the unknown input speech at every analysis interval (frame), 2 is a feature parameter extraction unit that obtains the feature parameters for each frame, 3 is a speech section detection unit that detects the start frame and end frame of the input speech, 4 is a standard pattern storage unit that stores the monosyllable standard patterns, 5 is a DP matching unit that obtains the distance between the input speech and each monosyllable standard pattern, and 6 is a distance comparison unit that takes as the recognition result the speech name corresponding to the standard pattern with the smallest distance (largest similarity) among the distances obtained by the DP matching unit 5.

[0057] The operation is as follows. The monosyllable standard patterns are created in advance and stored in the standard pattern storage unit 4; how they are created is described later. When unknown speech is input, the acoustic analysis unit 1 performs linear prediction (LPC) analysis at every analysis interval (frame). Next, the feature parameter extraction unit 2 obtains P feature parameters (P is a positive integer) for each frame. The feature parameters are the LPC mel-cepstrum coefficients (nine in this example, C1 to C9), the normalized residual C0, and the time difference V0 of the logarithmic speech power. Next, the speech section detection unit 3 detects the start frame and end frame of the input speech using speech power information or the like. The first embodiment uses speech power to detect the speech section, but any method may be used. Next, the DP matching unit 5 dynamically matches the feature parameter time series of the input speech against one of the monosyllable standard patterns stored in the standard pattern storage unit 4 by the DP method and obtains the distance to that monosyllable standard pattern. This is done for every monosyllable in the recognition vocabulary. The DP matching and distance calculation methods are described later. Finally, from the distances to the standard patterns obtained by the DP matching unit 5, the distance comparison unit 6 selects the standard pattern with the smallest distance (largest similarity) and outputs the corresponding speech name as the recognition result.

[0058] The method of creating the monosyllable standard patterns is described below. A speech standard pattern for speaker-independent recognition is created by computing the statistics (mean and covariance) of training speech data uttered by many people.

[0059] A Japanese monosyllable consists of a consonant part followed by a vowel part. A monosyllable standard pattern is created by extracting frames nonlinearly from each training utterance of the same category (monosyllable), forming a vector in which the feature parameters of these frames are arranged in time order, and computing the statistics of the resulting set of vectors. The nonlinear frame extraction is performed as follows.

[0060] In a consonant, the information that identifies the phoneme lies in the temporal change of the spectrum, and its duration is relatively short and hardly expands or contracts even when the speaking rate differs. For the consonant part, therefore, the temporal position that best represents the characteristics of the consonant is taken as the reference frame, and several consecutive frames before and after the reference frame are extracted from each training utterance. The vowel part is extracted by linearly expanding or contracting the frames between the end of this consecutive time pattern and the end frame of the speech. FIG. 2 shows this conceptually.

[0061] In FIG. 2, the reference frame of each consonant is labeled on the training speech data as a phoneme label 21 by visual inspection, according to fixed criteria defined for each consonant. In this embodiment, the reference frame is the burst frame for voiceless plosives (/c/, /p/, /t/, /k/), the transition into the vowel for nasals (/m/, /n/) and voiceless fricatives (/h/, /s/), the burst frame (the end of the buzz bar) for voiced plosives (/b/, /d/, /g/, /r/), and the point where voicing changes to voicelessness for /z/. For the single vowels (「あ」, 「い」, 「う」, 「え」, 「お」) and the semivowels (「や」, 「ゆ」, 「よ」, 「わ」), the frame at which the speech power 22 rises at the beginning of the utterance is defined as the reference frame. Then, in the feature parameter time series 23, L1 frames before and L2 frames after this reference frame are extracted consecutively. The values of L1 and L2 differ from consonant to consonant; they were determined through preliminary experiments examining which frames are effective for discriminating each consonant. Further, the vowel part from the last frame of this consecutive time pattern to the end frame of the syllable is extracted with linear expansion or contraction, thereby creating the time-series pattern Cm 24. The /j/ of a contracted sound (yōon) is characterized by a slow spectral transition from the consonant to the following vowel and readily expands or contracts with speaking rate, so it is expanded or contracted linearly in the same way as the vowel part.

[0062] The standard pattern of a monosyllable ωn is created by the following procedure.
(1) Prepare M training utterances of the monosyllable ωn spoken by many people (100 here).
(2) Nonlinearly expand or contract each utterance to normalize it to J frames.
(3) For the m-th utterance, arrange the normalized feature parameters in time order to obtain the time-series pattern Cm (m = 1, ..., M).
(4) Create the standard pattern by computing the statistics (mean and covariance) of the M time-series patterns Cm (m = 1, ..., M).

【００６３】The following describes how the time-series pattern Cm is obtained from the m-th item of learning speech data.

【００６４】Let the standard pattern have J frames, of which L frames (L = L1 + L2 + 1) are taken contiguously. Let frame {reference frame − L1} of the m-th item of learning speech data be frame 1 and the end frame of the speech section be frame I. The relationship between the i-th frame of this data and the j-th frame after expansion/contraction is then given by (Equation 13), where [ ] denotes the largest integer not exceeding the enclosed value. In the first embodiment, J = 20 and L = 10. J must be the same for all monosyllables, but L may differ from one monosyllable to another.

【００６５】[0065]

【数１３】 (Equation 13)
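Equation 13 is an image in the original and is not reproduced in this text. As a rough illustration only, the sketch below shows one frame mapping that is consistent with the description in [0064]: the first L pattern frames copy input frames 1 to L unchanged, and the remaining J − L pattern frames are spread linearly over input frames L + 1 to I. The rounding rule and the function name `frame_map` are assumptions, not taken from the patent.

```python
from typing import List


def frame_map(num_input: int, num_pattern: int, num_continuous: int) -> List[int]:
    """Return, for each pattern frame j = 1..J, a 1-based input frame index i.

    num_input      : I, number of input frames (frame 1 = reference frame - L1)
    num_pattern    : J, number of frames after normalization
    num_continuous : L = L1 + L2 + 1, frames copied without stretching
    Assumed rule for j > L: i = L + [ (I - L) / (J - L) * (j - L) + 0.5 ].
    """
    I, J, L = num_input, num_pattern, num_continuous
    if not (0 < L < J and L <= I):
        raise ValueError("expected 0 < L < J and L <= I")
    mapping = []
    for j in range(1, J + 1):
        if j <= L:
            mapping.append(j)  # continuous part: no expansion or contraction
        else:
            # integer form of floor(x + 0.5), avoids floating-point rounding
            step = (2 * (I - L) * (j - L) + (J - L)) // (2 * (J - L))
            mapping.append(L + step)
    return mapping


if __name__ == "__main__":
    # J = 20 and L = 10 as in the first embodiment; I = 15 is an arbitrary example
    print(frame_map(15, 20, 10))
```

With this rule the last pattern frame always lands on the last input frame I, and the mapping never moves backward in time.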

【００６６】The feature parameters of the J frames after expansion/contraction are arranged in time order to create the time pattern Cm.

【００６７】[0067]

【数１４】 [Equation 14]

【００６８】This is done for each of the M items of learning speech data. Treating the time pattern Cm as a single vector means that correlations of the parameters between frames are taken into account. The mean vector μ and the covariance matrix W are obtained from the M vectors Cm (m = 1, ..., M), each of J × P dimensions.
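The statistics in [0068] can be sketched as follows: each utterance is flattened into one J × P-dimensional vector, and the mean vector and covariance matrix are taken over the M vectors. The function names are hypothetical, and dividing by M (rather than M − 1) is an assumption, since the text does not state the normalization.

```python
from typing import List

Vector = List[float]


def flatten(frames: List[Vector]) -> Vector:
    """Arrange J frames of P parameters into one J*P-dimensional vector Cm."""
    return [value for frame in frames for value in frame]


def mean_vector(patterns: List[Vector]) -> Vector:
    """Mean of the M time-series patterns, dimension by dimension."""
    m = len(patterns)
    dim = len(patterns[0])
    return [sum(c[d] for c in patterns) / m for d in range(dim)]


def covariance_matrix(patterns: List[Vector]) -> List[Vector]:
    """Covariance of the M patterns (assumption: normalized by M)."""
    m = len(patterns)
    mu = mean_vector(patterns)
    dim = len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for c in patterns:
        diff = [c[d] - mu[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += diff[a] * diff[b] / m
    return cov


if __name__ == "__main__":
    # Two toy utterances, J = 1 frame, P = 2 parameters
    data = [flatten([[1.0, 2.0]]), flatten([[3.0, 6.0]])]
    print(mean_vector(data))
    print(covariance_matrix(data))
```

Because the whole pattern is one vector, the off-diagonal entries of the covariance include correlations between parameters in different frames, which is the point made in the paragraph above.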

【００６９】This is done for each of the N monosyllables to be recognized. Hereinafter, the mean vector for the n-th monosyllable ωn is written μn and its covariance matrix Wn.

【００７０】The distance between the time-series pattern of the feature parameters of the unknown input speech and each monosyllable standard pattern is calculated using a distance based on Bayes decision with a common covariance matrix.

【００７１】The distance based on Bayes decision is obtained as follows. Let the input vector X, formed by arranging J frames of the feature parameters of the unknown input speech after expansion/contraction, be

【００７２】[0072]

【数１５】 (Equation 15)

【００７３】The probability P(ωn|X) that the input is the monosyllable ωn when the input vector X is observed is obtained in the same way as in the conventional example. By Bayes' theorem, P(ωn|X) is

【００７４】[0074]

【数１６】 (Equation 16)

【００７５】is obtained. Here, P(X|ωn) is the prior probability, that is, the probability that the vector X is observed when the input belongs to category ωn, and P(X) is the probability that the vector X is observed when all possible inputs are considered. The occurrence probability P(ωn) of the word ωn is assumed to be the same for every word and is treated as a constant. If the input X is fixed, P(X) is also a constant, so the category ωn that maximizes the prior probability P(X|ωn) can be taken as the decision result.
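Equation 16 is an image in the original and is not reproduced in this text. Under the definitions given in [0073] and [0075], Bayes' theorem takes the standard form below; this is a reconstruction from the surrounding wording rather than a copy of the patent's typeset equation, and the symbol for the decided category is introduced here only for illustration.

```latex
\begin{align}
P(\omega_n \mid X) &= \frac{P(X \mid \omega_n)\, P(\omega_n)}{P(X)} \\
\hat{n} &= \arg\max_{n} P(X \mid \omega_n)
  \qquad \text{when } P(\omega_n) \text{ and } P(X) \text{ are constant}
\end{align}
```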

【００７６】Assuming that the parameters follow a normal distribution, the prior probability P(X|ωn) is given by (Equation 17).

【００７７】[0077]

【数１７】 [Equation 17]

【００７８】Here, t denotes transposition. Taking the logarithm of both sides, omitting the constant terms unnecessary for discrimination, and multiplying by −2 gives the following equation.

【００７９】[0079]

【数１８】 (Equation 18)

【００８０】This expression is the distance based on Bayes decision for the monosyllable ωn. To reduce the amount of computation and the number of parameters to be estimated, the covariance matrices are made common, as in the conventional example, and the expression is expanded into a linear discriminant. The covariance matrices Wn of the individual monosyllable standard patterns are replaced by a common matrix W, which is obtained by the following equation.

【００８１】[0081]

【数１９】 [Equation 19]

【００８２】Therefore,

【００８３】[0083]

【数２０】 (Equation 20)

【００８４】can be set. Substituting this into (Equation 18) and omitting the constant terms unnecessary for discrimination gives

【００８５】[0085]

【数２１】 (Equation 21)

【００８６】Then, with the definitions

【００８７】[0087]

【数２２】 (Equation 22)

【００８８】[0088]

【数２３】 (Equation 23)

【００８９】the distance becomes

【００９０】[0090]

【数２４】 (Equation 24)

【００９１】which is a first-order linear discriminant. In this way, An and Bn are obtained for each monosyllable to be recognized and stored in the standard pattern storage unit 4.
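Equations 17 to 24 are images in the original and are not reproduced in this text. The derivation described in [0076] to [0091] can be sketched as follows, assuming the standard multivariate normal density and writing d = JP for the dimension of X (the symbol d is introduced here). The sign convention for An and Bn is chosen to agree with the statement in [0110] that the cumulative value is subtracted from Bn; the patent's own typeset forms may differ in constant factors.

```latex
\begin{align}
P(X \mid \omega_n) &= (2\pi)^{-d/2}\, |W_n|^{-1/2}
  \exp\!\Big\{ -\tfrac{1}{2}\,(X-\mu_n)^{t}\, W_n^{-1}\, (X-\mu_n) \Big\} \\
L_n &= (X-\mu_n)^{t}\, W_n^{-1}\, (X-\mu_n) + \ln |W_n|
  && \text{($-2\ln$, constant dropped)} \\
L_n &= X^{t} W^{-1} X - 2\,\mu_n^{t} W^{-1} X + \mu_n^{t} W^{-1} \mu_n + \ln |W|
  && \text{(common } W_1 = \dots = W_N = W\text{)} \\
L_n &= B_n - A_n^{t} X,
  \qquad A_n = 2\, W^{-1} \mu_n, \quad B_n = \mu_n^{t} W^{-1} \mu_n
\end{align}
```

The terms X^t W^{-1} X and ln|W| take the same value for every category once W is common, so dropping them does not change which ωn gives the smallest distance.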

【００９２】The following describes in detail how the DP matching unit 5 matches the input speech against the monosyllable standard patterns, performing dynamic time alignment by the DP method, and obtains the distances.

【００９３】Let the start frame of the speech section detected by the speech section detection unit be frame 1 and its end frame be frame I. Let xi, the vector of the P feature parameters of the i-th frame of the input speech, be

【００９４】[0094]

【数２５】 (Equation 25)

【００９５】Then the vectors x of the r(1), r(2), ..., r(j), ..., r(J)-th frames of the input speech are arranged to create a time pattern X of J frames, which becomes the input vector.

【００９６】[0096]

【数２６】 (Equation 26)

【００９７】Let the standard pattern of the monosyllable ωn be An, Bn. When An is written as

【００９８】[0098]

【数２７】 [Equation 27]

【００９９】the distance Ln between the input vector X and the standard pattern of the monosyllable ωn is

【０１００】[0100]

【数２８】 [Equation 28]

【０１０１】and therefore

【０１０２】[0102]

【数２９】 (Equation 29)

【０１０３】holds. It is therefore sufficient to find, by the DP method, the r(j) that minimizes Ln. The minimum value of Ln is obtained by the DP method using the following recurrence formula.

【０１０４】[0104]

【数３０】 [Equation 30]

【０１０５】Here m is an integer from ms to me, and the values of ms and me differ for each monosyllable and for each frame of the standard pattern. In the continuous part from j = 1 to j = L,

【０１０６】[0106]

【数３１】 (Equation 31)

【０１０７】is used, so that the input speech is matched against the standard pattern contiguously, without expansion or contraction. In this embodiment, the values of ms and me in the expansion/contraction part are set for the standard pattern of each monosyllable within the range

【０１０８】[0108]

【数３２】 (Equation 32)

【０１０９】They were chosen so that the pattern expands and contracts within this range. These DP paths are shown in FIG. 3(a) for the continuous part and in FIG. 3(b) for the expansion/contraction part.

【０１１０】The distance Ln between the input vector X and the standard pattern of the monosyllable ωn is Bn minus the cumulative value g(I, J) at the final frame of the monosyllable standard pattern and the end frame of the input speech.
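Equations 30 to 33 are images in the original and are not reproduced in this text. The sketch below illustrates the kind of DP matching that [0103] to [0110] describe, under stated assumptions: the cumulative value g(i, j) accumulates the inner product of the j-th block of An with input frame xi, g is maximized so that Ln = Bn − g(I, J) is minimized, the continuous part advances exactly one input frame per pattern frame, and the expansion/contraction part allows steps m from ms to me. The function names and the default step range (0 to 2) are hypothetical.

```python
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def dot(a: Sequence[float], x: Sequence[float]) -> float:
    return sum(ai * xi for ai, xi in zip(a, x))


def dp_distance(
    x: List[List[float]],          # input frames x_1..x_I (P values each)
    a: List[List[float]],          # per-frame blocks a_n1..a_nJ of A_n
    b: float,                      # B_n
    num_continuous: int,           # L: frames matched without stretching
    stretch_steps: Tuple[int, int] = (0, 2),  # assumed (ms, me) for j > L
) -> float:
    """Return L_n = B_n - g(I, J); infinity if no admissible path exists."""
    num_in, num_ref = len(x), len(a)
    # g[i][j]: best cumulative value with input frame i+1 on pattern frame j+1
    g = [[NEG_INF] * num_ref for _ in range(num_in)]
    g[0][0] = dot(a[0], x[0])  # path is assumed to start at (1, 1)
    for j in range(1, num_ref):
        ms, me = (1, 1) if j < num_continuous else stretch_steps
        for i in range(num_in):
            best = max(
                (g[i - m][j - 1] for m in range(ms, me + 1) if i - m >= 0),
                default=NEG_INF,
            )
            if best > NEG_INF:
                g[i][j] = best + dot(a[j], x[i])
    return b - g[num_in - 1][num_ref - 1]


if __name__ == "__main__":
    frames = [[1.0], [5.0], [2.0]]
    blocks = [[1.0], [1.0], [1.0]]
    print(dp_distance(frames, blocks, 10.0, num_continuous=1))
```

Running this for every monosyllable and picking the smallest returned value corresponds to the role of the distance comparison unit 6.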

【０１１１】[0111]

【数３３】 [Equation 33]

【０１１２】This is obtained for all monosyllable standard patterns. The first embodiment has described matching after the speech section of the input speech is detected. Alternatively, without detecting the speech section, for the entire input speech interval including noise,

【０１１３】[0113]

【数３４】 (Equation 34)

【０１１４】continuous DP matching is performed using the recurrence formula above, the input frame i at which g(i, J) is minimum is found, and, letting that frame be Imin,

【０１１５】[0115]

【数３５】 (Equation 35)

【０１１６】is taken as the distance from the standard pattern of the monosyllable ωn. Recognition can then be performed without detecting the speech section. This is called word spotting.
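Equations 34 and 35 are images in the original and are not reproduced in this text. As an illustration of the idea in [0112] to [0116] only, the sketch below runs continuous DP over a table of local distances, lets a path start at any input frame, and reports the end frame Imin at which the cumulative distance g(i, J) is smallest. The local distances are taken as given, and the function name and the allowed step set are assumptions.

```python
from typing import List, Sequence, Tuple

INF = float("inf")


def spot(local: List[List[float]], steps: Sequence[int] = (0, 1, 2)) -> Tuple[int, float]:
    """Continuous DP over local[i][j] (input frame i, pattern frame j).

    Returns (Imin, g(Imin, J)) with Imin as a 1-based input frame index.
    The start is free: any input frame may align with pattern frame 1.
    """
    num_in = len(local)
    num_ref = len(local[0])
    prev = [local[i][0] for i in range(num_in)]  # g(i, 1) = d(i, 1)
    for j in range(1, num_ref):
        cur = [INF] * num_in
        for i in range(num_in):
            best = min((prev[i - m] for m in steps if i - m >= 0), default=INF)
            cur[i] = best + local[i][j]
        prev = cur
    i_min = min(range(num_in), key=prev.__getitem__)
    return i_min + 1, prev[i_min]


if __name__ == "__main__":
    table = [[5.0, 5.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]]
    print(spot(table, steps=(1,)))
```

Because cumulative values from different end frames are compared with each other, the local distances need to be comparable across input sections, which is why the text goes on to require a posterior-probability-based measure.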

【０１１７】However, word spotting requires a distance measure converted to a posterior probability. The method is as follows. In (Equation 16), word spotting must compare inputs X taken from different input sections, so the input X is no longer fixed. The decision result must therefore be the category ωn that maximizes the posterior probability P(ωn|X), with the P(X) term taken into account.

【０１１８】P(X) is the probability that the vector X is observed when all possible inputs are considered. A mean vector and a covariance matrix over all possible inputs are therefore obtained in advance as a surrounding-information pattern for the posterior-probability conversion. Specifically, a time window of J frames is shifted one frame at a time over the feature-parameter time series of the learning speech data of all monosyllables to be recognized, and the mean vector μe and the covariance matrix We are obtained from the resulting J-frame time-series patterns. To spot uttered speech in a section that contains noise, however, the surrounding-information pattern must be created with noise sections included. P(X) is determined from the mean vector μe and the covariance matrix We of the surrounding-information pattern.

【０１１９】Assuming that the parameters follow a normal distribution, the posterior probability P(ωn|X) is given by (Equation 36).

【０１２０】[0120]

【数３６】 [Equation 36]

【０１２１】Here, t denotes transposition. Taking the logarithm of both sides and multiplying by −2 gives the following equation.

【０１２２】[0122]

【数３７】 (Equation 37)

【０１２３】This expression is the posterior-probability-based Bayes decision distance for the monosyllable ωn. To reduce the amount of computation and the number of parameters to be estimated, the covariance matrices are made common and the expression is expanded into a linear discriminant. The covariance matrix Wn of the standard pattern of each vocabulary item to be recognized and the covariance matrix We of the surrounding-information pattern are replaced by a common matrix W, which is obtained by the following equation. Here g is the proportion in which the surrounding-information pattern is mixed in; g = N is used here.

【０１２４】[0124]

【数３８】 (Equation 38)

【０１２５】Therefore,

【０１２６】[0126]

【数３９】 [Equation 39]

【０１２７】can be set. Substituting this into (Equation 37) gives

【０１２８】[0128]

【数４０】 (Equation 40)

【０１２９】Then, with the definitions

【０１３０】[0130]

【数４１】 [Equation 41]

【０１３１】[0131]

【数４２】 (Equation 42)

【０１３２】the distance becomes

【０１３３】[0133]

【数４３】 [Equation 43]

【０１３４】which is a first-order linear discriminant. When word spotting is performed, An and Bn are obtained in this way for each monosyllable to be recognized and stored in the standard pattern storage unit 4.

【０１３５】For phonemes whose spectrum is stationary and whose duration varies greatly from utterance to utterance, such as unvoiced fricatives and word-initial buzz bars, a pattern that is linearly expanded and contracted, like the vowel part, may be provided in the part temporally preceding the continuous pattern centered on the reference frame.

【０１３６】The first embodiment has described the recognition of monosyllables, but words can be recognized in the same way. In that case too, the standard pattern is created so that consonant parts are taken contiguously around the reference frame and vowel parts are linearly expanded or contracted, giving J frames in total. During recognition, matching is performed by the DP method as in the first embodiment, with the continuous parts kept from expanding or contracting.

【０１３７】(Embodiment 2) A second embodiment of the present invention is described below.

【０１３８】The second embodiment describes a speech recognition method in which Japanese monosyllables are the recognition targets and recognition is performed by matching the feature parameter vectors and dynamic feature parameter vectors obtained for each frame of the input speech and of the monosyllable standard patterns, using a statistical distance measure expressed by a quadratic discriminant function based on Bayes decision.

【０１３９】As in the first embodiment, the second embodiment detects the monosyllable section of the unknown input speech and recognizes the monosyllable by matching it against monosyllable standard patterns created in advance.

【０１４０】The second embodiment is described with reference to FIG. 4, which is a flowchart showing its processing flow.

【０１４１】In FIG. 4, 1 is an acoustic analysis unit that performs LPC analysis on the unknown input speech frame by frame; 2 is a feature parameter extraction unit that obtains feature parameters for each frame; 7 is a dynamic feature parameter extraction unit that obtains the amount of change of the feature parameters over time; 3 is a speech section detection unit that detects the start and end frames of the input speech; 4 is a standard pattern storage unit that stores the monosyllable standard patterns; 5 is a DP matching unit that obtains the distance between the input speech and each monosyllable standard pattern; and 6 is a distance comparison unit that takes, as the recognition result, the speech name corresponding to the standard pattern with the smallest of the distances obtained by the DP matching unit 5.

【０１４２】The operation is as follows. The monosyllable standard patterns are created in advance and stored in the standard pattern storage unit 4; how they are created is described later. When unknown input speech is received, the acoustic analysis unit 1 performs LPC analysis frame by frame, and the feature parameter extraction unit 2 obtains P feature parameters for each frame. The feature parameters are the same as in the first embodiment. The dynamic feature parameter extraction unit 7 then obtains, for each frame, P regression coefficients, one for each dimension of the feature parameters, each representing the amount of change over time. Next, the speech section detection unit 3 detects the start and end frames of the input speech, and the DP matching unit 5 dynamically matches the feature-parameter time series of the input speech against the monosyllable standard patterns by the DP method, using a statistical distance measure expressed by a quadratic discriminant function, and obtains the distance to each monosyllable standard pattern. Finally, the distance comparison unit 6 selects and outputs, as the recognition result, the speech name corresponding to the standard pattern with the smallest of the distances obtained by the DP matching unit 5.

【０１４３】The distance between the time-series pattern of the feature parameters of the unknown input speech and each monosyllable standard pattern is calculated using a distance based on Bayes decision.

【０１４４】The distance based on Bayes decision is a quadratic discriminant function whose computational cost is proportional to the square of the dimension of the vector, so the cost grows explosively when the dimension is large. Estimating the covariance also requires an enormous number of learning samples. The dimension of the vector must therefore be reduced. The first embodiment treated the time-series pattern of the feature parameters over the whole monosyllable as a single vector when computing the distance between the input speech and the monosyllable standard pattern; the second embodiment splits it into frames. That is, the standard pattern is a sequence of J frames of P-dimensional vectors, each consisting of P feature parameters; the distance between each frame and the corresponding frame of the input speech is computed as a Bayes decision distance; and the sum of these distances is taken as the distance between the input speech and the monosyllable standard pattern. Treating frames independently in this way, however, makes it impossible to capture dynamic changes in the feature parameters. The amount of change of the feature parameters over time is therefore introduced as dynamic feature parameters. In this embodiment, the regression coefficient of the p-th feature parameter over a frame and the two frames before and after it (five frames in total) is used as the p-th dynamic feature parameter of that frame. The dynamic feature parameter extraction unit 7 obtains P dynamic feature parameters for each frame.
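The five-frame regression coefficient described in [0144] can be sketched as the least-squares slope over frames i − 2 to i + 2, which for equally spaced frames reduces to the sum of k·x(i + k) divided by the sum of k² (10 for a half width of 2). The text does not say how frames near the ends of the utterance are handled; repeating the first and last frames is an assumption here, and the function name is hypothetical.

```python
from typing import List


def regression_coefficients(
    frames: List[List[float]], half_width: int = 2
) -> List[List[float]]:
    """Return P dynamic feature parameters (slopes) for every frame.

    frames[i][p] is the p-th feature parameter of frame i.
    Slope over 2*half_width + 1 frames: sum_k k * x[i+k] / sum_k k^2.
    """
    num_frames = len(frames)
    norm = sum(k * k for k in range(-half_width, half_width + 1))
    deltas = []
    for i in range(num_frames):
        row = []
        for p in range(len(frames[i])):
            acc = 0.0
            for k in range(-half_width, half_width + 1):
                # assumption: clamp indices so edge frames are repeated
                idx = min(max(i + k, 0), num_frames - 1)
                acc += k * frames[idx][p]
            row.append(acc / norm)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # A parameter rising by 3 per frame has slope 3 away from the edges
    ramp = [[3.0 * i] for i in range(5)]
    print(regression_coefficients(ramp)[2])
```

Each frame then carries 2P values in total: the P static feature parameters and the P slopes computed this way.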

【０１４５】Let the vector consisting of the P feature parameters of the i-th frame of the unknown input speech be

【０１４６】[0146]

【数４４】 [Equation 44]

【０１４７】and let the vector consisting of the P dynamic feature parameters be

【０１４８】[0148]

【数４５】 [Equation 45]

【０１４９】As in the first embodiment, each monosyllable standard pattern is created by nonlinearly expanding or contracting each item of learning speech data and normalizing it to J frames. For the j-th frame of the n-th monosyllable ωn, the mean vector μnj and covariance matrix Wnj of the feature parameters, and the mean vector of the dynamic feature parameters

【０１５０】[0150]

【外１】 [Outside 1]

【０１５１】and their covariance matrix

【０１５２】[0152]

【外２】 [Outside 2]

【０１５３】are obtained for the J frames j = 1 to J and stored in the standard pattern storage unit 4.

【０１５４】The distance based on Bayes decision between the i-th frame of the input and the j-th frame of the monosyllable ωn is then given by (Equation 46).

【０１５５】[0155]

【数４６】 [Equation 46]

【０１５６】Here, t denotes transposition. When frames 1, 2, ..., j, ..., J of the standard pattern for the monosyllable ωn correspond to frames r(1), r(2), ..., r(j), ..., r(J) of the input speech, respectively, the distance Ln between the input speech and the monosyllable ωn is defined as

【０１５７】[0157]

【数４７】 [Equation 47]

【０１５８】Therefore, from (Equation 46) and (Equation 47),

【０１５９】[0159]

【数４８】 [Equation 48]

【０１６０】holds. It is therefore sufficient to find, by the DP method, the r(j) that minimizes Ln. As in the first embodiment, the minimum value of Ln is obtained by the DP method using the following recurrence formula.

【０１６１】[0161]

【数４９】 [Equation 49]

【０１６２】Here m is an integer from ms to me, and the values of ms and me are the same as in the first embodiment. In the continuous part, (Equation 31) applies and matching is performed without expansion or contraction.

【０１６３】The cumulative distance g(I, J) at the final frame of the monosyllable standard pattern and the end frame of the input speech is the distance Ln between the input speech and the standard pattern of the monosyllable ωn.
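Equations 44 to 50 are images in the original and are not reproduced in this text. Under the normal-distribution assumption used in the first embodiment, the frame-wise distance and the DP recursion described in [0154] to [0163] can be sketched as below. The symbols y_i for the dynamic feature vector, the tilde for its mean and covariance, and d_n(i, j) for the frame distance are notation introduced here, not taken from the patent.

```latex
\begin{align}
d_n(i,j) &= (x_i-\mu_{nj})^{t}\, W_{nj}^{-1}\, (x_i-\mu_{nj}) + \ln |W_{nj}|
  + (y_i-\tilde{\mu}_{nj})^{t}\, \tilde{W}_{nj}^{-1}\, (y_i-\tilde{\mu}_{nj})
  + \ln |\tilde{W}_{nj}| \\
L_n &= \sum_{j=1}^{J} d_n\big(r(j),\, j\big) \\
g(i,j) &= \min_{m_s \le m \le m_e} g(i-m,\, j-1) + d_n(i,j) \\
L_n &= g(I, J)
\end{align}
```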

【０１６４】[0164]

【数５０】 [Equation 50]

【０１６５】This is obtained for all monosyllable standard patterns. In the second embodiment the distance is calculated independently for each frame, so the number of frames of the standard pattern may differ from one monosyllable to another. In that case, the distance Ln between the input speech and the monosyllable ωn is given, instead of by (Equation 47), by

【０１６６】[0166]

【数５１】 (Equation 51)

【０１６７】where Jn is the number of frames of the monosyllable ωn. Because the second embodiment uses the distance based on Bayes decision, it requires more computation than the conventional example. The conventional example and the first embodiment treat the whole utterance as one vector and use the Bayes decision distance with a common covariance matrix, so with J frames and P parameters per frame the number of multiply-accumulate operations per monosyllable is JP. For J = 20 and P = 11 this is 220. For the Bayes decision distance, on the other hand, a vector of P dimensions requires P(P + 3)/2 multiply-accumulate operations. When frames are treated independently and both the feature parameter vector and the dynamic feature parameter vector are used, each frame requires P(P + 3)/2 × 2 operations, so J frames require JP(P + 3). For J = 20 and P = 11 this is 3080. That is, the multiply-accumulate count of the second embodiment is 14 times that of the conventional example.
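The operation counts quoted in [0167] can be checked with a few lines of arithmetic. The function names are hypothetical; the formulas JP and P(P + 3)/2 × 2 per frame are the ones stated in the text.

```python
def linear_ops(num_frames: int, num_params: int) -> int:
    """Multiply-accumulates per monosyllable for the linear discriminant: J * P."""
    return num_frames * num_params


def quadratic_ops_per_vector(num_params: int) -> int:
    """Multiply-accumulates for one P-dimensional Bayes distance: P(P + 3) / 2."""
    return num_params * (num_params + 3) // 2


def framewise_quadratic_ops(num_frames: int, num_params: int) -> int:
    """Static plus dynamic vector per frame, over J frames: J * P * (P + 3)."""
    return num_frames * 2 * quadratic_ops_per_vector(num_params)


if __name__ == "__main__":
    J, P = 20, 11
    conventional = linear_ops(J, P)
    second = framewise_quadratic_ops(J, P)
    print(conventional, second, second // conventional)
```

For J = 20 and P = 11 this reproduces the figures in the text: 220 operations, 3080 operations, and a ratio of 14.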

【０１６８】The second embodiment used, as the matching distance measure, a statistical distance measure expressed by a quadratic discriminant function based on Bayes decision, but a statistical distance measure expressed by a linear discriminant function based on Bayes decision with a common covariance matrix can also be used. In that case a higher recognition rate than that of the conventional example is obtained with only about twice its amount of computation.

【０１６９】The second embodiment performed recognition by matching both the feature parameter vectors and the dynamic feature parameter vectors obtained for each frame of the input speech and of the monosyllable standard patterns, but the feature parameter vectors alone may be used. The recognition rate then drops slightly, but the amount of computation is halved.

【０１７０】As in the first embodiment, word spotting is also possible by performing continuous DP matching. Because word spotting compares different input sections, the distance measure must be one converted to a posterior probability. The method is as follows.

【０１７１】As the surrounding-information pattern for the posterior-probability conversion, a mean vector and a covariance matrix over all possible inputs must be obtained. From all speech sections of the learning speech data of all monosyllables to be recognized, the mean vector μe and covariance matrix We of the one-frame feature parameters, and the mean vector of the dynamic feature parameters

【０１７２】[0172]

【外３】 [Outside 3]

【０１７３】and their covariance matrix

【０１７４】[0174]

【外４】 [Outside 4]

【０１７５】are obtained, and these are also stored in the standard pattern storage unit 4 as standard patterns. To spot uttered speech in a section that contains noise, however, the surrounding-information pattern must be created with noise sections included.

【０１７６】The posterior-probability-based Bayes decision distance is given by (Equation 52).

【０１７７】[0177]

【数５２】 (Equation 52)

【０１７８】Accordingly, the distance Ln between the input speech and the monosyllable ωn is given by (Equation 53) instead of (Equation 48), and the DP recurrence formula by (Equation 54) instead of (Equation 49).

【０１７９】[0179]

【数５３】 (Equation 53)

【０１８０】[0180]

【数５４】 (Equation 54)

【０１８１】(Embodiment 3) A third embodiment of the present invention is described below.

【０１８２】The third embodiment describes a method of recognizing words in which syllables are cut out of learning word speech data, syllable standard patterns are created from the per-frame feature parameter vectors and dynamic feature parameter vectors as in the second embodiment, the syllable standard patterns are concatenated to create word standard patterns, and matching is performed, as in the second embodiment, using a statistical distance measure expressed by a quadratic discriminant function based on Bayes decision.

【０１８３】The third embodiment is described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart showing its processing flow.

【０１８４】In FIG. 5, 1 is an acoustic analysis unit that performs LPC analysis on the unknown input speech frame by frame; 2 is a feature parameter extraction unit that obtains feature parameters for each frame; 7 is a dynamic feature parameter extraction unit that obtains the amount of change of the feature parameters over time; 3 is a speech section detection unit that detects the start and end frames of the input speech; 8 is a kana-notation word dictionary; 9 is a syllable standard pattern storage unit that stores the syllable standard patterns; 5 is a DP matching unit that obtains the distance between the input speech and each word standard pattern; and 6 is a distance comparison unit that takes, as the recognition result, the speech name corresponding to the standard pattern with the smallest distance (largest similarity) among those obtained by the DP matching unit 5.

[0185] Next, the operation will be described. The syllable standard patterns are created in advance and stored in the syllable standard pattern storage unit 9. The method for creating the syllable standard patterns is described later. When unknown input speech is input, the acoustic analysis unit 1 performs LPC analysis for each frame, and the feature parameter extraction unit 2 obtains P feature parameters for each frame. The feature parameters are the same as in the first embodiment. The dynamic feature parameter extraction unit 7 then obtains, for each dimension of the feature parameters, a regression coefficient representing its temporal change, giving P regression coefficients per frame. Next, the speech section detection unit 3 detects the start and end frames of the input speech. The syllable standard patterns stored in the syllable standard pattern storage unit 9 are then concatenated according to the kana notation of each word written in the kana-notation word dictionary 8, to create the word standard patterns. As in the second embodiment, the DP matching unit 5 dynamically matches the feature parameter time series of the input speech against each word standard pattern by the DP method and obtains the distance to each word standard pattern. Finally, the distance comparison unit 6 selects and outputs, as the recognition result, the speech name corresponding to the standard pattern with the smallest distance (highest similarity) among the distances obtained by the DP matching unit 5.
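
As an illustration only, the concatenation of syllable standard patterns according to the kana notation, and the final selection of the word with the smallest distance, might be sketched in Python as follows. The names `build_word_pattern` and `recognize`, the toy syllable inventory, and the representation of a standard pattern as a plain array of per-frame mean vectors are all assumptions made for this sketch; the covariance information used by the statistical distance measure is omitted.

```python
import numpy as np

def build_word_pattern(kana_syllables, syllable_patterns):
    """Concatenate per-syllable patterns (frames x P) along the time axis."""
    return np.concatenate([syllable_patterns[s] for s in kana_syllables], axis=0)

def recognize(distances):
    """Return the word whose standard pattern has the smallest distance."""
    return min(distances, key=distances.get)

# Hypothetical toy data: P = 2 feature parameters, 3 frames per syllable.
syllable_patterns = {
    "to": np.full((3, 2), 1.0),
    "u": np.full((3, 2), 2.0),
    "kyo": np.full((3, 2), 3.0),
}

# Hypothetical kana-notation word dictionary (word -> syllable sequence).
dictionary = {
    "tokyo": ["to", "u", "kyo", "u"],
    "kyoto": ["kyo", "u", "to"],
}

# Word standard patterns are derived from the dictionary, not trained per word.
word_patterns = {
    word: build_word_pattern(syllables, syllable_patterns)
    for word, syllables in dictionary.items()
}
```

Because the word patterns are generated from the dictionary, adding a word in this sketch only requires a new dictionary entry, which is the property the third embodiment relies on for easy vocabulary changes.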

[0186] The method for creating the syllable standard patterns is described below with reference to FIG. 6(a). Taking the phonemic environment into account, speech data in which many speakers utter various phonemically balanced word sets are prepared as training speech data. The training speech data are labeled in advance, by visual inspection, with the start and end positions of each syllable 64 and the reference frame of each consonant, as phoneme labels 61. The speech data from the start to the end of each syllable are then cut out and, for each syllable, as in the second embodiment, the consonant part is taken as consecutive frames centered on the reference frame and the vowel part is linearly expanded or contracted, to create the feature parameter time series 63 of the syllable standard pattern. For phonemes whose spectrum is stationary and whose duration varies greatly from utterance to utterance, such as voiceless fricatives and word-initial buzz bars, a linearly stretched pattern like that of the vowel part may be provided immediately before the continuous pattern centered on the reference frame.
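
The normalization of one labeled syllable can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the function names, the numbers of frames taken before and after the reference frame, and the nearest-index form of linear stretching are hypothetical choices, not values given in this description.

```python
import numpy as np

def linear_resample(frames, n_out):
    """Linearly stretch or shrink a (T, P) frame sequence to n_out frames."""
    t = frames.shape[0]
    idx = np.round(np.linspace(0, t - 1, n_out)).astype(int)
    return frames[idx]

def normalize_syllable(frames, ref, n_before, n_after, vowel_start, n_vowel):
    """Build a fixed-length pattern for one syllable.

    Consonant part: consecutive frames around the labeled reference frame,
    copied without any stretching.
    Vowel part: frames from vowel_start to the syllable end, linearly
    resampled to n_vowel frames.
    """
    if ref - n_before < 0 or ref + n_after >= vowel_start:
        raise ValueError("consonant window does not fit before the vowel")
    consonant = frames[ref - n_before: ref + n_after + 1]
    vowel = linear_resample(frames[vowel_start:], n_vowel)
    return np.concatenate([consonant, vowel], axis=0)
```

With a 12-frame syllable, a reference frame at index 3, two frames on each side, and a vowel starting at index 6 that is resampled to 4 frames, the result has 5 consonant frames followed by 4 vowel frames regardless of how long the vowel was spoken.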

[0187] When the input speech is matched against a word standard pattern with time expansion and contraction by the DP method, matching is likewise performed from the start to the end of the word, as in the second embodiment, with the consonant parts kept continuous and not stretched. The DP path may be varied frame by frame so that, for each syllable, it reaches the range given by (Equation 32). Alternatively, if the length of each syllable standard pattern is set per syllable, for example to 1/2 of the average duration of that syllable, the DP path may be made uniform over the stretchable parts.
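
One way to picture matching in which consonant frames are not stretched while vowel frames are is a DP recursion whose allowed transitions depend on the reference frame. The Python sketch below is an assumption-laden illustration, not the DP path of (Equation 32): the function name `constrained_dp`, the per-frame `stretch` flags, and the particular transition set are hypothetical.

```python
import numpy as np

def constrained_dp(local_dist, stretch):
    """Accumulated distance of the best path through local_dist.

    local_dist[i, j]: distance between input frame i and reference frame j.
    stretch[j]: True if reference frame j (a vowel frame) may be stretched
    or compressed, False if it (a consonant frame) must be matched one to one.
    """
    n_in, n_ref = local_dist.shape
    acc = np.full((n_in, n_ref), np.inf)
    acc[0, 0] = local_dist[0, 0]
    for i in range(n_in):
        for j in range(n_ref):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0 and j > 0:
                best = acc[i - 1, j - 1]           # one-to-one step, always allowed
            if stretch[j]:
                if i > 0:
                    best = min(best, acc[i - 1, j])  # reference frame repeated
                if j > 0:
                    best = min(best, acc[i, j - 1])  # input frame repeated
            acc[i, j] = local_dist[i, j] + best
    return acc[-1, -1]
```

In this sketch a lengthened vowel costs nothing extra, whereas a lengthened consonant cannot be absorbed by the path and shows up as a larger distance.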

[0188] In the third embodiment, recognition is performed in units of syllables, but speech segments such as CV (consonant + vowel), VC (vowel + consonant), VCV (vowel + consonant + vowel), or CVC (consonant + vowel + consonant) may be used as the unit instead. In that case too, the consonant part is matched continuously, centered on the reference frame. FIG. 6(b) shows an example of segmentation when the recognition units are CV and VC.

[0189] In the third embodiment, a statistical distance measure expressed as a quadratic discriminant function based on Bayes decision was used as the distance measure for matching, but a statistical distance measure expressed as a linear discriminant function based on Bayes decision with a shared covariance matrix can also be used. This realizes a speech recognition method that requires little computation and allows the recognition vocabulary to be changed easily.
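
The relation between the two distance measures can be illustrated with the textbook Gaussian forms. The Python sketch below assumes equal prior probabilities and uses hypothetical function names; the exact discriminant functions of this description are defined by its earlier equations and may differ in constants.

```python
import numpy as np

def quadratic_distance(x, mean, cov):
    """Bayes-decision distance for a Gaussian class with its own covariance."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return diff @ np.linalg.solve(cov, diff) + logdet

def linear_weights(mean, shared_cov):
    """Precompute weights of the linear discriminant for a shared covariance."""
    a = 2.0 * np.linalg.solve(shared_cov, mean)
    b = mean @ np.linalg.solve(shared_cov, mean)
    return a, b

def linear_score(x, a, b):
    """Larger score means higher similarity; one dot product per class."""
    return a @ x - b
```

When every class shares one covariance matrix, the quadratic term in x and the log-determinant are the same for all classes, so the quadratic distance equals a class-independent constant minus the linear score. Ranking classes by the largest linear score then gives the same decision as ranking by the smallest quadratic distance, at the cost of a single inner product per class.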

[0190] In the third embodiment, recognition was performed by matching both the feature parameter vectors and the dynamic feature parameter vectors obtained for each frame of the input speech and of the monosyllable standard patterns, but the feature parameter vectors alone may be used. In that case the recognition rate drops slightly, but the computation amount is halved.
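
The halving follows from the vector size: appending P regression coefficients to P feature parameters doubles the per-frame dimension. A minimal Python sketch of per-frame regression coefficients is given below; the window half-width of 2 frames and the edge padding are assumptions, since the window length is not stated in this passage.

```python
import numpy as np

def regression_coefficients(features, half_window=2):
    """Slope of a least-squares line fitted over 2*half_window+1 frames.

    features: (T, P) array of feature parameters.
    Returns a (T, P) array of dynamic feature parameters.
    """
    offsets = np.arange(-half_window, half_window + 1)
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    t = features.shape[0]
    # Sum of offset * feature[t + offset] over the window, for every frame t.
    num = sum(w * padded[half_window + w: half_window + w + t] for w in offsets)
    return num / np.sum(offsets ** 2)
```

Stacking the two with `np.hstack([features, regression_coefficients(features)])` gives 2P values per frame, so dropping the dynamic part leaves half as many terms in each per-frame distance calculation.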

[0191] Word spotting can also be performed, in the same manner as in the first and second embodiments, by carrying out continuous DP matching.
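
The idea of continuous DP matching can be sketched as a DP in which a word may start at any input frame and an end score is produced for every input frame. The following Python fragment is a simplified illustration with a hypothetical function name and a generic local distance matrix; it does not reproduce the posterior-probability-based distance measure or the consonant constraints of the embodiments.

```python
import numpy as np

def continuous_dp(local_dist):
    """End-point scores for spotting one reference pattern in a long input.

    local_dist[i, j]: distance between input frame i and reference frame j.
    Returns, for each input frame i, the best accumulated distance of a path
    that ends at frame i and may start at any earlier input frame.
    """
    n_in, n_ref = local_dist.shape
    acc = np.full((n_in, n_ref), np.inf)
    acc[:, 0] = local_dist[:, 0]  # free starting point in the input
    for i in range(n_in):
        for j in range(1, n_ref):
            best = acc[i, j - 1]
            if i > 0:
                best = min(best, acc[i - 1, j - 1], acc[i - 1, j])
            acc[i, j] = local_dist[i, j] + best
    return acc[:, -1]
```

A word would be reported where the end-point score has a sufficiently low minimum, so no separate detection of the start and end of the utterance is needed in this sketch.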

[0192] To confirm the effects of the first, second, and third embodiments, recognition experiments were conducted using 110 monosyllable utterances and 100 place-name word utterances spoken by a total of 150 male and female speakers. Data from 100 of these speakers (50 male and 50 female) were used to create the speech standard patterns, and data from the remaining 50 speakers were used for evaluation.

[0193] (Table 1) shows the evaluation conditions. (Table 2) shows the 110-monosyllable recognition rate and the 100-place-name word recognition rate of the conventional method, the 110-monosyllable recognition rate of the first embodiment, the 110-monosyllable recognition rate of the second embodiment, and the 100-place-name word recognition rate of the third embodiment.

[0194]

[Table 1]

[0195]

[Table 2]

[0196] In (Table 2), the computation amount is the number of multiply-accumulate operations needed to obtain the distance between the input speech and a standard pattern when the number of frames J of the standard pattern is 20 and the number of feature parameters P per frame is 11, expressed as a ratio to the conventional method taken as 1. In the method of the third embodiment, the distance calculation need only be performed for the total number of frames of the syllables that appear in the 100 place-name words, so the computation amount does not increase very much.

[0197] Thus, with the method of the first embodiment, the monosyllable recognition rate is 68.0%, compared with 47.2% for the conventional method, so the recognition rate is improved without increasing the computation amount or the number of parameters to be estimated.

[0198] With the method of the second embodiment, the monosyllable recognition rate is 75.4%, a further large improvement over the method of the first embodiment.

[0199] In the conventional method it was difficult to change the recognition vocabulary, but in the method of the third embodiment the word standard patterns can be created from kana notation, which makes changing the recognition vocabulary easy. The word recognition rate also improved, from 97.3% for the conventional method to 98.9%.

[0200] Each of the embodiments is a method capable of word spotting, and by introducing word spotting, a highly practical recognition device that is robust against noise can be realized.

[0201]

[Effects of the Invention] First, in the present invention, standard patterns are created by taking consecutive frames centered on the reference frame for consonant parts and linearly expanding or contracting vowel parts; at recognition time, consonant parts are matched without stretching and vowel parts are matched with frame expansion and contraction. This allows the local temporal spectral changes of consonants and the global spectral characteristics of vowels to be captured appropriately regardless of speaking rate, so a speech recognition method with high recognition performance can be realized. By matching the input speech with the standard patterns using a statistical distance measure expressed as a linear discriminant function that treats the whole utterance as a single vector, and thereby takes inter-frame correlation into account, the recognition rate can be improved without increasing the computation amount or the number of estimated parameters of the standard patterns. Alternatively, although the computation amount doubles, the recognition rate can also be improved by treating the frames independently and instead using dynamic feature parameters, which represent the temporal change of the feature parameters, together with a statistical distance measure expressed as a linear discriminant function.

[0202] Second, the present invention can further improve speech recognition performance by treating the time pattern as an independent vector for each frame and using a statistical distance measure expressed as a quadratic discriminant function. Using dynamic feature parameters, which represent the temporal change of the feature parameters, in combination improves speech recognition performance still further.

[0203] Third, by combining syllables or speech segments such as CV (consonant + vowel), VC (vowel + consonant), VCV (vowel + consonant + vowel), or CVC (consonant + vowel + consonant), the present invention can realize a highly accurate speech recognition method in which the recognition vocabulary can be changed easily.

[0204] Furthermore, by introducing a word spotting function, a highly practical recognition device that is robust against noise can be realized.

[0205] As described above, the present invention is a method that is effective in practical use, and its effect is great.

[Brief description of the drawings]

FIG. 1 is a flowchart showing the processing flow of the first embodiment of the present invention.

FIG. 2 is a conceptual diagram illustrating the method of creating standard patterns in the first embodiment.

FIG. 3 is a diagram showing the DP path in the first embodiment.

FIG. 4 is a flowchart showing the processing flow of the second embodiment.

FIG. 5 is a flowchart showing the processing flow of the third embodiment.

FIG. 6 is a conceptual diagram illustrating the method of creating standard patterns in the third embodiment.

FIG. 7 is a flowchart showing the processing flow of the conventional example.

FIG. 8 is a conceptual diagram illustrating the method of creating standard patterns in the conventional example.

[Explanation of symbols]

1 acoustic analysis unit
2 feature parameter extraction unit
3 speech section detection unit
4 standard pattern storage unit
5 DP matching unit
6 distance comparison unit
7 dynamic feature parameter extraction unit
8 kana-notation word dictionary
9 syllable standard pattern storage unit
10 time-axis linear normalization unit
11 distance calculation unit

Continuation of the front page

(72) Inventor: Katsuyuki Niyada, c/o Matsushita Giken Co., Ltd., 3-10-1 Higashi-Mita, Tama-ku, Kawasaki-shi, Kanagawa

(56) References:
JP-A-61-84695 (JP, A)
JP-A-61-188599 (JP, A)
JP-A-59-17600 (JP, A)
JP-A-1-298400 (JP, A)
JP-A-58-145999 (JP, A)
JP-A-1-136197 (JP, A)
JP-A-3-249700 (JP, A)
JP-A-58-57196 (JP, A)
JP-A-5-73087 (JP, A)
JP-A-2-23297 (JP, A)
JP-B-5-4678 (JP, B2)
JP-B-5-4679 (JP, B2)
JP-B-1-15076 (JP, B2)
Japanese Patent No. 2712586 (JP, B2)
Proceedings of the 1995 Spring Meeting of the Acoustical Society of Japan I, 1-Q-7, Katsutoshi Ohtsuki et al., "Feature analysis of phonemes at statistical phoneme centers", pp. 109-110 (issued March 14, 1995)
L. R. Rabiner, B.-H. Juang, "Fundamentals of Speech Recognition", 1993, Prentice-Hall, pp. 435-439

(58) Field searched (Int. Cl.7, DB name): G10L 15/00 - 17/00

Claims

(57) [Claims]

1. A speech recognition method comprising: extracting P feature parameters (P is a positive integer) for each frame of input speech; creating in advance, for each of N kinds (N is a positive integer) of word speech to be recognized, a word speech standard pattern by nonlinearly normalizing the training word speech data belonging to that word speech, from its start to its end, to J frames (J is a positive integer) for the whole word, such that consonant sections are taken as consecutive frames centered on a reference frame and the vowel section of each data item is linearly expanded or contracted, extracting P feature parameters (P is a positive integer) for each frame, and using the P × J-dimensional vector obtained by arranging them in temporal order; matching the input speech against the word speech standard patterns using a statistical distance measure, with consonant parts matched continuously and vowel parts matched with time alignment by expansion and contraction, to obtain the similarity between the input speech and each word speech standard pattern; and taking the word speech name corresponding to the word speech standard pattern with the highest similarity as the recognition result.

2. A speech recognition method comprising: extracting P feature parameters (P is a positive integer) for each frame of input speech; creating in advance, for each of N kinds (N is a positive integer) of word speech to be recognized, a word speech standard pattern by nonlinearly normalizing the training word speech data belonging to that word speech, from its start to its end, to J frames (J is a positive integer) for the whole word, such that consonant sections are taken as consecutive frames centered on a reference frame and the vowel section of each data item is linearly expanded or contracted, extracting P feature parameters (P is a positive integer) for each frame, and using the J P-dimensional vectors obtained by arranging them in temporal order; matching the input speech against the word speech standard patterns using a statistical distance measure, with consonant parts matched continuously and vowel parts matched with time alignment by expansion and contraction, to obtain the similarity between the input speech and each word speech standard pattern; and taking the word speech name corresponding to the word speech standard pattern with the highest similarity as the recognition result.

3. The speech recognition method according to claim 1 or 2, wherein the statistical distance measure is expressed as a linear discriminant function, such as a distance based on Bayes decision with a shared covariance matrix.

4. The speech recognition method according to claim 2, wherein the statistical distance measure is expressed as a quadratic discriminant function, such as a distance based on Bayes decision or a Mahalanobis distance.

5. The speech recognition method according to any one of claims 1 to 4, wherein the recognition targets are Japanese monosyllables.

6. The speech recognition method according to any one of claims 1 to 5, wherein the vowel parts are time-aligned and matched by dynamic programming (DP method).

7. The speech recognition method according to any one of claims 1 to 6, having a word spotting function that, by performing continuous DP matching using a statistical distance measure based on posterior probabilities, extracts the speech portion of the unknown input speech from a sufficiently long section including noise and recognizes it.