JPH1091186A

JPH1091186A - Voice recognizing method

Info

Publication number: JPH1091186A
Application number: JP29511197A
Authority: JP
Inventors: Katsuyuki Futayada; 勝行二矢田; Masakatsu Hoshimi; 昌克星見; Seiji Hiraoka; 省二平岡; Tatsuya Kimura; 達也木村
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-10-28
Filing date: 1997-10-28
Publication date: 1998-04-10

Abstract

PROBLEM TO BE SOLVED: To obtain a speech recognition method that is robust against vocabulary growth and noise and achieves a high recognition rate, by accumulating the distances computed between input vectors and partial patterns with a statistical distance measure and taking the word with the minimum accumulated distance as the recognition result. SOLUTION: A speech recognition device comprises an acoustic analysis section 1, a feature parameter extraction section 2, a multi-frame buffer 3, a partial distance calculation section 4, a partial standard pattern storage section 5, a path determination section 6, a distance accumulation section 7, a decision section 8, a speech section detection section 9, and a frame synchronization signal generation section 10. The partial distance between an input vector formed from a plurality of frames and a partial (standard) pattern of a word utterance is computed with a statistical distance measure based on posterior probability; the input vector is updated while the frame is shifted; the distances to the partial vectors are accumulated; and the word with the minimum accumulated distance is taken as the recognition result. A high recognition rate is thereby obtained in speaker-independent speech recognition.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition method for having a machine recognize the human voice.

【０００２】[0002]

[Prior Art] In recent years, speaker-independent recognition devices, which can recognize anyone's voice without registering the user's voice, have come into practical use. As practical speaker-independent methods, two patent applications previously filed by the present applicant (JP-A-61-188599 and JP-A-62-111293) are described here as conventional examples. JP-A-61-188599 is referred to as the first conventional example, and JP-A-62-111293 as the second conventional example.

[0003] The first conventional method finds the start and end points of the input speech to determine the speech section, linearly expands or contracts the speech section to a fixed length (I frames), and recognizes the word by pattern matching, computing the similarity to each word standard pattern with a statistical distance measure.

[0004] A word standard pattern is created by having many people utter each word to be recognized and collecting the speech samples, normalizing every sample to a fixed length of I frames (I = 16 in the embodiment), then computing per-word statistics across the samples (mean vector and covariance matrix) and processing them. Thus all word standard patterns have the same length (I frames), and in principle one standard pattern is prepared per word.

[0005] The first conventional example must detect the speech section before pattern matching; the second conventional example differs in that it requires no speech section detection. It makes possible a method (word spotting) that extracts and recognizes the speech portion of a noisy signal by pattern matching. Specifically, within a sufficiently long input section containing speech, a partial region is set and matched against the standard patterns while being expanded and contracted. The partial region is then shifted by one unit of time within the input section and matched again. This is repeated over the whole of the set input section, and the name of the word standard pattern giving the minimum distance over all matching computations is taken as the recognition result. To make word spotting possible, a statistical distance measure based on posterior probability is used as the distance measure for pattern matching.

【０００６】[0006]

[Problems to be Solved by the Invention] The conventional methods are practical and allow compact implementation; the second conventional example in particular is robust to noise and has begun to be used in practice.

[0007] However, the conventional techniques do not achieve a sufficient word recognition rate. They can be used in applications with a small vocabulary, but as the vocabulary grows the recognition rate falls and they become impractical. The conventional methods therefore limit the applications of a recognition device.

[0008] The present invention solves the above problems, and its object is to provide a speech recognition method with a high recognition rate that is robust against vocabulary growth and noise.

【０００９】[0009]

[Means for Solving the Problems] To solve this problem, the present invention comprises: a step of using speech data uttered by many people to divide each word to be recognized into partial sections that share adjacent frames, and generating in advance, for every word to be recognized, a standard pattern formed by concatenating the partial (standard) patterns that represent those partial sections; a step of analyzing the input speech at fixed time intervals (frames) to obtain feature parameters and forming an input vector from the feature parameters of a plurality of frames; a step of computing the partial distance between the input vector and each partial pattern with a statistical distance measure based on posterior probability; a step of computing an accumulated distance by accumulating the partial distances between the partial patterns and the input vectors generated while shifting the frame; and a step of comparing the accumulated distances for the standard patterns of all words to be recognized and taking the word with the minimum accumulated distance as the recognition result.

[0010] This provides a speech recognition method with a high recognition rate that is robust against vocabulary growth and noise.

【００１１】[0011]

[Embodiments of the Invention] The invention according to claim 1 comprises: a step of creating the standard pattern of each word to be recognized as a concatenation of partial patterns; a step of obtaining input vectors from the input speech while shifting the frame; a step of accumulating the distances computed between the input vectors and the partial patterns with a statistical distance measure to obtain an accumulated distance; and a step of taking the word with the minimum accumulated distance as the recognition result. The partial distances between the input vectors obtained from the input speech while shifting the frame and the partial (standard) patterns that make up the standard pattern of the word utterance are computed with a statistical distance measure and accumulated, and the word with the minimum accumulated distance is taken as the recognition result, which yields a high recognition rate in speaker-independent speech recognition.

[0012] The invention of claim 2 comprises: a step of using speech data uttered by many people to divide each word to be recognized into partial sections and generating in advance, for every word to be recognized, a standard pattern formed by concatenating the partial (standard) patterns that represent those sections; a step of analyzing the input speech at fixed time intervals (frames) to obtain feature parameters and forming an input vector from the feature parameters of a plurality of frames; a step of computing the partial distance between the input vector and each partial pattern of the standard pattern with a statistical distance measure; a step of computing an accumulated distance by accumulating the partial distances between the partial patterns and the input vectors generated while shifting the frame; and a step of comparing the accumulated distances for the standard patterns of all words to be recognized and taking the word with the minimum accumulated distance as the recognition result. The partial distance between an input vector formed from a plurality of frames and a partial (standard) pattern representing a partial section of the word utterance is computed with a statistical distance measure based on posterior probability; the input vector is updated while the frame is shifted; the distances to the partial vectors are accumulated; and the word that minimizes the accumulated distance is taken as the recognition result. In speaker-independent speech recognition this gives a high recognition rate that is robust against vocabulary growth and noise, and because the processing is simple, a compact recognition device capable of real-time operation can be realized with a digital signal processor (DSP) or the like.

[0013] The invention of claim 3 is the invention of claim 1 or 2 in which the partial sections of each word to be recognized are divided so as to include mutually overlapping sections. This reliably captures the dynamic information at section boundaries and allows more detailed partial patterns to be generated.

[0014] The invention of claim 4 comprises: a step of using speech data uttered by many people to divide each word to be recognized into partial sections each consisting of a plurality of frames, and generating in advance, for every word to be recognized, a standard pattern formed by concatenating the partial (standard) patterns that represent those sections; a step of analyzing the input speech at fixed time intervals (frames) to obtain feature parameters and forming an input vector from the feature parameters of a plurality of frames; a step of computing the partial distance between the input vector and each partial pattern with a statistical distance measure based on posterior probability; a step of computing an accumulated distance by accumulating the partial distances between the partial patterns and the input vectors generated while shifting the frame; and a step of comparing the accumulated distances for the standard patterns of all words to be recognized and taking the word with the minimum accumulated distance as the recognition result. Because each partial pattern spans a plurality of frames and the partial distance to the input vector is computed with a statistical distance measure based on posterior probability, partial distances can be obtained regardless of the input position or the partial pattern.

[0015] Embodiments of the present invention are described below with reference to the drawings. Embodiment 1 is an example for the case where the start and end of the input speech have been detected in advance. In this case pattern matching need only be performed within the speech section. Embodiment 2 is an example for the case where the start and end of the input speech are unknown. In this case, over a sufficiently wide section containing the input speech, the input signal is matched against the standard patterns while shifting by one unit of time across the whole section, and the partial section with the minimum distance is cut out. Methods of this kind are generally called word spotting.

[0016] (Embodiment 1) FIG. 1 shows a functional block diagram of the speech recognition device of Embodiment 1 of the present invention, which is described below.

[0017] In FIG. 1, the acoustic analysis section 1 A/D-converts the input signal (sampling frequency 10 kHz) and analyzes it at fixed time intervals (called frames; 10 ms in this embodiment). Linear predictive analysis (LPC analysis) is used in this embodiment. The feature parameter extraction section 2 extracts feature parameters from the analysis result. This embodiment uses twelve parameters: the LPC cepstrum coefficients (C₀ to C₁₀) and the differential power value V₀. If the feature parameters for one input frame are written as

【００１８】[0018]

【外１】 [Outside 1]

[0019] then the feature parameters are as shown in (Equation 1).

【００２０】[0020]

【数１】 (Equation 1)

[0021] Here j is the frame number of the input and p is the order of the cepstrum coefficients (p = 10). The frame synchronization signal generation section 10 generates a synchronization signal every 10 ms, and its output is supplied to all blocks. That is, the whole system operates in synchronization with the frame synchronization signal.

[0022] The speech section detection section 9 detects the start and end points of the speech in the input signal. Using the speech power is a simple and common way to detect the speech section, but any method may be used. In this embodiment, recognition starts at the moment the start of speech is detected, at which point j = 1.

[0023] The multi-frame buffer 3 combines the feature parameters of the frames neighboring the j-th frame to form the input vector used for pattern matching (partial matching). That is, the input vector corresponding to the j-th frame,

【００２４】[0024]

【外２】 [Outside 2]

[0025] is expressed by the following equation.

【００２６】[0026]

【数２】 (Equation 2)

[0027] That is, the input vector combines the feature parameters of frames j − L₁ through j + L₂, taken every m frames. With L₁ = L₂ = 3 and m = 1, the input vector has (p + 2) × (L₁ + L₂ + 1) = 12 × 7 = 84 dimensions. Although the frame interval m is constant in (Equation 2), it need not be. A variable m corresponds to thinning out the frames nonlinearly.
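The construction of the multi-frame input vector described in [0027] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_input_vector`, the 0-based frame indexing, and the choice to raise an error when the window runs past the available frames are all assumptions made for the example.

```python
def build_input_vector(frames, j, l1=3, l2=3, m=1):
    """Concatenate the feature vectors of frames j-l1 .. j+l2, taking every m-th frame.

    frames: list of per-frame feature vectors (12 values each in the embodiment).
    j:      0-based index of the centre frame (an assumption of this sketch).
    """
    if j - l1 < 0 or j + l2 >= len(frames):
        raise IndexError("window extends past the available frames")
    vector = []
    for idx in range(j - l1, j + l2 + 1, m):
        vector.extend(frames[idx])
    return vector


# Twenty dummy frames of 12 parameters each; frame t is filled with the value t.
frames = [[float(t)] * 12 for t in range(20)]
x_10 = build_input_vector(frames, j=10)
print(len(x_10))  # 7 frames x 12 parameters = 84
```

With L₁ = L₂ = 3 and m = 1 the window covers seven consecutive frames, which reproduces the 84 dimensions stated in the text.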

[0028] The partial standard pattern storage section 5 stores the standard pattern of each word to be recognized as a concatenation of partial patterns. The method of creating the standard patterns in this embodiment is now described in some detail.

[0029] To keep the explanation simple, assume that the words to be recognized are the ten Japanese digits "ichi", "ni", "san", "yon", "go", "roku", "nana", "hachi", "kyuu", and "zero". Using this example does not affect the generality of the description.

[0030] For example, the standard pattern for "san" is created by the following procedure.
(1) Prepare data of many speakers (100 in this example) uttering "san".

[0031] (2) Examine the distribution of the durations of the 100 speakers' utterances of "san" and obtain the average duration I₃ over the 100 speakers.

[0032] (3) Find, among the 100 speakers, a sample whose duration is I₃. If there are several such samples, compute the frame-by-frame average of those samples. The representative sample obtained in this way is shown in (Equation 3).

【００３３】[0033]

【数３】 (Equation 3)

[0034] Here

【００３５】[0035]

【外３】 [Outside 3]

[0036] is the parameter vector for one frame and, as in (Equation 1), consists of 11 LPC cepstrum coefficients and the differential power.

[0037] (4) Perform pattern matching between each of the 100 speakers' samples and the representative sample to obtain the correspondence (pairing of the most similar frames) between the representative sample and each sample. The Euclidean distance is used for the distance calculation. The distance d_{i,i'} between frame i of the representative sample and frame i' of a given sample is given by (Equation 4).

【００３８】[0038]

【数４】 (Equation 4)

[0039] Here t denotes the transpose. The frame correspondence can be obtained efficiently with dynamic programming (the DP method).

[0040] (5) For each frame of the representative sample (i = 1 to I₃), cut out a partial vector of the form of (Equation 2) from each of the 100 speakers' samples. For simplicity, let L₁ = L₂ = 3 and m = 1.

[0041] The partial vector of the n-th of the 100 samples that corresponds to the i-th frame of the representative sample is as follows.

【００４２】[0042]

【数５】 (Equation 5)

[0043] Here (i) denotes the frame in the n-th sample that corresponds to the i-th frame of the representative vector.
【００４４】[0044]

【外４】 [Outside 4]

[0045] is an 84-dimensional vector in this embodiment (n = 1 to 100).
(6) The mean of the above vectors over the 100 speakers,

【００４６】[0046]

【外５】 [Outside 5]

[0047] (k = 3 in this example; 84 dimensions), and the covariance matrix

【００４８】[0048]

【外６】 [Outside 6]

[0049] (84 × 84 dimensions) are computed (i = 1 to I₃). There are thus as many mean vectors and covariance matrices as the standard frame length I₃ (although they need not be created for every frame; they may be thinned out).

[0050] By the same procedure as (1) to (6) above, the 84-dimensional vectors and covariance matrices are also obtained for the words other than "san".
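Steps (4) and (6) of the procedure above can be sketched as follows, as an illustration only. The function names `align` and `mean_and_cov` are hypothetical, the three-predecessor symmetric DP step and the squared Euclidean local distance are assumed choices (the text only says that the DP method with the Euclidean distance is used), and the covariance here divides by the number of samples, which is also an assumption.

```python
def sq_euclid(a, b):
    """Squared Euclidean distance between two frame vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def align(rep, sample):
    """For each frame i of the representative sample, return the index of a matched frame in sample."""
    n, m = len(rep), len(sample)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for k in range(m):
            d = sq_euclid(rep[i], sample[k])
            if i == 0 and k == 0:
                cost[i][k] = d
                continue
            best, arg = inf, None
            # Assumed step set: diagonal, vertical, horizontal.
            for pi, pk in ((i - 1, k - 1), (i - 1, k), (i, k - 1)):
                if pi >= 0 and pk >= 0 and cost[pi][pk] < best:
                    best, arg = cost[pi][pk], (pi, pk)
            cost[i][k] = best + d
            back[i][k] = arg
    # Trace the optimal path back from the end; keep the earliest matched frame per i.
    match = [0] * n
    i, k = n - 1, m - 1
    while True:
        match[i] = k
        if back[i][k] is None:
            break
        i, k = back[i][k]
    return match


def mean_and_cov(vectors):
    """Mean vector and covariance matrix (divisor N) of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / n
            for b in range(dim)] for a in range(dim)]
    return mean, cov


rep = [[0.0], [1.0], [2.0]]
sample = [[0.0], [0.0], [1.0], [2.0], [2.0]]
print(align(rep, sample))
print(mean_and_cov([[0.0, 0.0], [2.0, 2.0]]))
```

In a full pipeline, the partial vectors cut out around each matched frame (step (5)) would be collected across speakers and passed to `mean_and_cov` once per representative frame i.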

[0051] Then, over all the sample data of the 100 speakers for all words, the moving average

【００５２】[0052]

【外７】 [Outside 7]

[0053] (84 dimensions) and the moving covariance matrix

【００５４】[0054]

【外８】 [Outside 8]

[0055] (84 × 84 dimensions) are computed. These are called the surrounding pattern. Next, standard patterns are created from the means and covariances.

[0056] a. The covariance matrices are pooled into a common matrix by (Equation 6).
【００５７】[0057]

【数６】 (Equation 6)

[0058] Here K is the number of words to be recognized (K = 10) and I_k is the standard duration of word k (k = 1, 2, …, K). g is the proportion in which the surrounding pattern is mixed in; normally g = 1.

[0059] b. The partial patterns of each word,

【００６０】[0060]

【外９】 [Outside 9]

[0061] and

【００６２】[0062]

【外１０】 [Outside 10]

[0063] are created.

【００６４】[0064]

【数７】 (Equation 7)

【００６５】[0065]

【数８】 (Equation 8)

[0066] The derivation of these equations is given later. FIG. 2 is a conceptual diagram of the standard pattern creation method. FIG. 2(a) shows the speech power pattern when the input signal is "san". FIG. 2(b) illustrates conceptually how the partial patterns are created. Between the start and end of each speech sample, the frame correspondence with the representative sample is found, and the speech sample is thereby divided into I₃ parts. In the figure, the frame corresponding to the representative sample is denoted (i). Then, for each position from the start of the speech, (i) = 1, to its end, (i) = I₃, the mean and covariance are computed using the data of the 100 speakers in the interval (i) − L₁ to (i) + L₂, and the partial patterns

【００６７】[0067]

【外１１】 [Outside 11]

【００６８】[0068]

【外１２】 [Outside 12]

[0069] are obtained. The standard pattern of word k is therefore a concatenation (collection) of I_k partial patterns that cover mutually overlapping sections. FIG. 2(c) shows how the surrounding pattern is created. For the surrounding pattern, the moving average and moving covariance are computed over all the data used to create the standard patterns, shifting a partial section of L₁ + L₂ + 1 frames one frame at a time as shown in the figure. The range used to create the surrounding pattern may include not only the speech sections but also the noise sections before and after them. In the second embodiment described later, the surrounding pattern must include noise sections.
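The moving average used for the surrounding pattern can be sketched as below. This is a simplified illustration under stated assumptions: the names `window_vectors` and `moving_average` are hypothetical, every window position that fits entirely inside an utterance is used, and only the mean is computed (the moving covariance would be accumulated over the same windows in the same way).

```python
def window_vectors(frames, l1=3, l2=3):
    """Yield the concatenated (l1 + l2 + 1)-frame vector at every position where the window fits."""
    for j in range(l1, len(frames) - l2):
        vec = []
        for idx in range(j - l1, j + l2 + 1):
            vec.extend(frames[idx])
        yield vec


def moving_average(utterances, l1=3, l2=3):
    """Average of all window vectors, shifting one frame at a time over every utterance."""
    total, count = None, 0
    for frames in utterances:
        for vec in window_vectors(frames, l1, l2):
            if total is None:
                total = [0.0] * len(vec)
            for d, value in enumerate(vec):
                total[d] += value
            count += 1
    if count == 0:
        raise ValueError("no window fits in the data")
    return [t / count for t in total]


# Two tiny one-parameter "utterances" with a three-frame window (l1 = l2 = 1).
utterances = [[[1.0], [2.0], [3.0]], [[3.0], [4.0], [5.0]]]
print(moving_average(utterances, l1=1, l2=1))
```

To include the noise sections before and after the speech, as the text permits, the frame lists passed in would simply extend beyond the detected speech section.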

[0070] Next, the calculation of partial distances is described. The partial distance calculation section 4 computes the distance (partial distance) between the partial standard patterns of each word, created in advance as described above, and the contents of the multi-frame buffer 3.

[0071] The partial distance is computed with a statistical distance measure between the input vector of (Equation 2), which contains information from several frames, and the partial patterns of each word. Because the distance for the whole word is obtained by accumulating the distances to the partial patterns (called partial distances), the partial distances must be computed in a way that makes the distance values mutually comparable regardless of the input position or the partial pattern. This requires a distance measure based on posterior probability. Let the input vector of the form of (Equation 2) be

【００７２】[0072]

【外１３】 [Outside 13]

[0073] (for simplicity, the indices i and j are omitted for the time being). The posterior probability of the partial pattern ω_k of word k,

【００７４】[0074]

【外１４】 [Outside 14]

[0075] is given by Bayes' theorem as follows.

【００７６】[0076]

【数９】 (Equation 9)

[0077] The first term on the right-hand side is treated as a constant, on the assumption that all words are equally probable. The probability in the second term on the right-hand side, with the parameters assumed to be normally distributed, is written as

【００７８】[0078]

【数１０】 (Equation 10)

[0079] as shown above.

【００８０】[0080]

【外１５】 [Outside 15]

[0081] is the sum of the probabilities over all input conditions that can occur, including the words and their surrounding information. When the parameters are LPC cepstrum coefficients or band-pass filter outputs, its distribution can be regarded as close to normal.

【００８２】[0082]

【外１６】 [Outside 16]

[0083] is assumed to follow a normal distribution; with the mean

【００８４】[0084]

【外１７】 [Outside 17]

[0085] and the covariance matrix

【００８６】[0086]

【外１８】 [Outside 18]

[0087] it is given by (Equation 11).

【００８８】[0088]

【数１１】 [Equation 11]

[0089] Substituting (Equation 10) and (Equation 11) into (Equation 9), taking the logarithm, dropping the constant terms, and multiplying by −2 gives the following equation.

【００９０】[0090]

【数１２】 (Equation 12)
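As a worked sketch of the derivation described in [0089]: assume that (Equation 9) is Bayes' theorem and that (Equation 10) and (Equation 11) are d-dimensional normal densities. The symbols used here are assumed names for illustration, with μ_k, W_k for the mean and covariance of the partial pattern ω_k and μ_e, W_e for those of the surrounding pattern. Under those assumptions the steps would read:

```latex
\begin{align*}
P(\omega_k \mid \mathbf{x}) &= \frac{P(\omega_k)\,P(\mathbf{x}\mid\omega_k)}{P(\mathbf{x})} \\
P(\mathbf{x}\mid\omega_k) &= (2\pi)^{-d/2}\,|W_k|^{-1/2}
  \exp\!\Bigl(-\tfrac12(\mathbf{x}-\boldsymbol{\mu}_k)^{t}W_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\Bigr) \\
P(\mathbf{x}) &= (2\pi)^{-d/2}\,|W_e|^{-1/2}
  \exp\!\Bigl(-\tfrac12(\mathbf{x}-\boldsymbol{\mu}_e)^{t}W_e^{-1}(\mathbf{x}-\boldsymbol{\mu}_e)\Bigr) \\
L_k = -2\log P(\omega_k \mid \mathbf{x}) + \text{const}
  &= (\mathbf{x}-\boldsymbol{\mu}_k)^{t}W_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)
   - (\mathbf{x}-\boldsymbol{\mu}_e)^{t}W_e^{-1}(\mathbf{x}-\boldsymbol{\mu}_e)
   + \log\frac{|W_k|}{|W_e|}
\end{align*}
```

The constant absorbs −2 log P(ω_k), which is the same for every word when all words are taken to be equally probable. In this form the third term, log(|W_k| / |W_e|), vanishes once a common covariance matrix is used for all patterns, which agrees with the statement in [0103] that the third term of (Equation 12) becomes 0.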

[0091] (Equation 12) is the Bayes distance converted to posterior-probability form; it discriminates well but is computationally expensive. It is reduced to a linear discriminant as follows. Assume that the covariance matrices of all partial patterns of all words, and of the surrounding pattern, are equal. Under this assumption the covariance matrices are pooled by (Equation 6), and, in (Equation 12),

【００９２】[0092]

【外１９】 [Outside 19]

【００９３】、[0093]

【００９４】[0094]

【外２０】 [Outside 20]

[0095] are replaced by

【００９６】[0096]

【外２１】 [Outside 21]

[0097] so that the first and second terms of (Equation 12) expand as follows.

【００９８】[0098]

【数１３】 (Equation 13)

【００９９】[0099]

【数１４】 [Equation 14]

[0100] In (Equation 13) and (Equation 14),

【０１０１】[0101]

【数１５】 (Equation 15)

【０１０２】[0102]

【数１６】 (Equation 16)

[0103] hold. The third term of (Equation 12) becomes 0. (Equation 12) therefore reduces to the following simple linear discriminant.

【０１０４】[0104]

【数１７】 [Equation 17]

[0105] Here, (Equation 17) is rewritten as the distance between the j-th frame component of the input (Equation 2) and the partial pattern for the i-th frame component of word k:

【０１０６】[0106]

【数１８】 (Equation 18)

[0107] Here

【０１０８】[0108]

【外２２】 [Outside 22]

[0109] is given by (Equation 7), and

【０１１０】[0110]

【外２３】 [Outside 23]

[0111] is given by (Equation 8). L_k^{i,j} is the partial similarity (partial distance) between the i-th partial pattern of word k and the vector around the j-th frame of the input.
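The reduction to a linear discriminant can be illustrated numerically. The sketch below assumes one plausible form of the coefficients, A = 2W⁻¹(μ_k − μ_e) and B = μ_kᵗW⁻¹μ_k − μ_eᵗW⁻¹μ_e, with the partial distance L = B − Aᵗx. These follow from expanding the difference of two quadratic forms with a common symmetric covariance W, but the sign convention and the names `linear_discriminant`, `partial_distance`, `mu_e`, and `w_inv` are assumptions of this example, not taken from (Equation 7), (Equation 8), or (Equation 18).

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_discriminant(w_inv, mu_k, mu_e):
    """Return (A, B) for one partial pattern, given the inverse common covariance w_inv."""
    diff = [a - b for a, b in zip(mu_k, mu_e)]
    a_vec = [2.0 * c for c in mat_vec(w_inv, diff)]
    b_val = dot(mu_k, mat_vec(w_inv, mu_k)) - dot(mu_e, mat_vec(w_inv, mu_e))
    return a_vec, b_val


def partial_distance(a_vec, b_val, x):
    """Assumed linear form L = B - A^t x (one inner product per partial pattern)."""
    return b_val - dot(a_vec, x)


def quadratic_form(w_inv, x, mu):
    diff = [a - b for a, b in zip(x, mu)]
    return dot(diff, mat_vec(w_inv, diff))


w_inv = [[1.0, 0.0], [0.0, 1.0]]      # identity, for a hand-checkable example
mu_k, mu_e, x = [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]
a_vec, b_val = linear_discriminant(w_inv, mu_k, mu_e)
print(partial_distance(a_vec, b_val, x))
print(quadratic_form(w_inv, x, mu_k) - quadratic_form(w_inv, x, mu_e))
```

Both printed values agree, which shows why the common-covariance assumption matters for cost: A and B can be precomputed per partial pattern, so each partial distance at recognition time needs only one inner product with the 84-dimensional input vector.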

[0112] In FIG. 1, the distance accumulation section 7 accumulates the partial distances for each word over the range i = 1 to I_k to obtain the distance for the whole word. In doing so, the accumulation must expand or contract the input speech length (J frames) to the standard duration I_k of each word. This can be computed efficiently with dynamic programming (the DP method).

[0113] Suppose, for example, that the accumulated distance for "san" is to be computed. Since k = 3 throughout, k is omitted in the formulas below.

[0114] If the partial distance L^{i,j} between the j-th frame portion of the input and the i-th partial pattern is written l(i, j), and the accumulated distance up to frame (i, j) is written g(i, j), then

【０１１５】[0115]

【数１９】 [Equation 19]

[0116] is obtained. The path determination unit 6 selects, from the three paths in (Equation 19), the one that gives the smallest cumulative distance.

[0117] FIG. 3 illustrates how the cumulative distance is obtained by the DP method. An asymmetric pen-shaped path is used, as shown in the figure, but various other paths are possible. Instead of the DP method, a linear expansion/contraction method may be used, or a hidden Markov model (HMM) method may be used.
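The accumulation described in [0114] to [0117] can be sketched as follows. This is a minimal illustration, not the patent's (Equation 19), which is not reproduced in this text: the three predecessor offsets in STEPS are an assumption chosen only to show a three-way asymmetric path choice, and the function name accumulate_distance is hypothetical. The table l[i][j] stands for the partial distance l(i, j) with 0-based indices.

```python
import math

# Assumed predecessor offsets (di, dj); the actual paths of (Equation 19)
# are not available in this text, so these are illustrative only.
STEPS = [(1, 1), (1, 2), (2, 1)]


def accumulate_distance(l):
    """Return g(I, J) for a table l[i][j] of partial distances.

    i indexes the partial patterns of one word, j indexes input frames.
    Both are 0-based here, so the result is g[I-1][J-1].
    """
    num_patterns = len(l)
    num_frames = len(l[0])
    g = [[math.inf] * num_frames for _ in range(num_patterns)]
    g[0][0] = l[0][0]  # matching starts at the first pattern and first frame
    for i in range(num_patterns):
        for j in range(num_frames):
            if i == 0 and j == 0:
                continue
            # Path selection: keep the predecessor with the smallest
            # cumulative distance among the three candidate paths.
            best = math.inf
            for di, dj in STEPS:
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0:
                    best = min(best, g[pi][pj])
            if best < math.inf:
                g[i][j] = best + l[i][j]
    return g[num_patterns - 1][num_frames - 1]
```

For a 2 x 3 table such as [[1, 5, 5], [5, 5, 2]], the only admissible route under these assumed steps goes from cell (0, 0) to cell (1, 2), so the function returns 3. Cells that no path can reach keep an infinite cumulative distance.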

[0118] In this way the distances are accumulated successively, and the cumulative distance G_k(I_k, J) at the point where i = I_k and j = J is obtained for each word.

[0119] The determination unit 8 finds the minimum of the cumulative distances G_k(I_k, J), and, according to (Equation 20), the recognition result

【０１２０】[0120]

【外２４】 [Outside 24]

[0121] is output.

【０１２２】[0122]

【数２０】 (Equation 20)
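(Equation 20) itself is not reproduced in this text. Based on the description in [0119], a form consistent with it would be the following, where the hat notation for the recognized word is an assumption made here because the original symbol is not available:

```latex
\hat{k} = \operatorname*{arg\,min}_{k} \; G_k(I_k, J)
```

That is, the recognized word is the one whose standard pattern gives the smallest cumulative distance at the end of the input (j = J).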

[0123] (Embodiment 2) Next, Embodiment 2 of the present invention is described with reference to FIG. 4, which is a functional block diagram of the speech recognition apparatus. In Embodiment 1, pattern matching was performed after speech interval detection; in Embodiment 2, speech interval detection is unnecessary. Words are recognized by extracting the portion of the input signal that has the minimum distance, which makes this one form of word spotting.

[0124] This method is based on the idea that, if the input signal contains the target speech, the distance (cumulative distance) to the correct standard pattern is smallest over the interval of that speech. Matching against the standard patterns is therefore carried out while shifting one frame at a time over a sufficiently long input interval that includes the noise intervals before and after the input speech. In FIG. 4, blocks bearing the same numbers as in FIG. 1 have the same functions. FIG. 4 differs from FIG. 1 in that it has no speech interval detection unit 9 and in that a distance comparison unit 12 and a temporary storage 11 take the place of the determination unit 8. Only the parts that differ from Embodiment 1 are described below.

[0125] First, the point at which pattern matching starts (j = 1) lies before the beginning of the speech, and the point at which pattern matching ends (j = J) lies after the end of the speech. Various methods of detecting the end of pattern matching are conceivable; in this embodiment, j = J is taken to be the point at which the distances to all standard patterns become sufficiently large.

[0126] The standard patterns are created in exactly the same way as in Embodiment 1. However, the surrounding pattern must be created from speech samples over a sufficiently wide interval extending before and after the speech interval. The reason is that the denominator term of (Equation 9),

【０１２７】[0127]

【外２５】 [Outside 25]

[0128] is defined as "the probability density over all parameters that are subject to pattern matching."

[0129] The greatest structural difference from Embodiment 1 is that the cumulative distances of the words are compared at every frame. Using (Equation 21), the distance comparison unit 12 compares the cumulative distances G_k(I_k, j) of the words at the j-th input frame, and the word with the smallest cumulative distance at the j-th frame,

【０１３０】[0130]

【外２６】 [Outside 26]

[0131] is obtained. The minimum value at that frame is obtained at the same time. That is,

【０１３２】[0132]

【数２１】 (Equation 21)

【０１３３】[0133]

【数２２】 (Equation 22)

[0134] The temporary storage 11 holds the minimum cumulative distance G_min that has appeared up to frame j-1 and the name k of the standard pattern that gave that minimum.

[0135] G_min and

【０１３６】[0136]

【外２７】 [Outside 27]

[0137] are compared.

【０１３８】[0138]

【外２８】 [Outside 28]

[0139] If the above condition holds, the temporary storage 11 is left unchanged and processing proceeds to the next frame (j = j + 1).

【０１４０】[0140]

【外２９】 [Outside 29]

[0141] If instead the above condition holds, the update

【０１４２】[0142]

【外３０】 [Outside 30]

[0143] is made, and processing proceeds to the next frame. In this way, the temporary storage 11 always holds the minimum value and the recognition result up to the current frame. When the end of the pattern matching range (j = J) is reached, the value held in the temporary storage 11,

【０１４４】[0144]

【外３１】 [Outside 31]

[0145] is the recognition result. Embodiment 2 is effective in cases where speech interval detection is difficult, such as utterances made in noise.
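The frame-by-frame comparison of [0129] to [0145] can be sketched as a running minimum. This is an illustration under stated assumptions: the function name spot_word and the dictionary input format are hypothetical, the conditions shown only as symbols in this text are taken to mean "keep the stored values when G_min is not larger than the current frame minimum, otherwise overwrite them", and ties are resolved by keeping the stored values.

```python
import math


def spot_word(frame_distances):
    """Track the smallest cumulative distance seen over all frames.

    frame_distances: an iterable with one entry per input frame j; each
    entry maps a word name k to its cumulative distance G_k(I_k, j).
    Returns (word, distance) as left in the temporary storage at j = J.
    """
    g_min = math.inf   # temporary storage: smallest distance so far
    best_word = None   # temporary storage: word that produced it
    for per_word in frame_distances:
        # Distance comparison: word with the smallest distance at this frame.
        k_hat = min(per_word, key=per_word.get)
        g_hat = per_word[k_hat]
        # Overwrite the storage only when the frame minimum is smaller.
        if g_hat < g_min:
            g_min = g_hat
            best_word = k_hat
    return best_word, g_min
```

With three frames whose per-word distances are {"ichi": 9.0, "san": 8.0}, {"ichi": 2.5, "san": 3.0} and {"ichi": 7.0, "san": 6.0}, the stored result after the last frame is ("ichi", 2.5), because the smallest distance over the whole matching range occurs at the second frame.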

[0146] To confirm the effect of the present invention, a recognition experiment was conducted using data for the ten digits uttered by a total of 150 male and female speakers. Standard patterns were created from the data of 100 of these speakers (50 male and 50 female), and the remaining 50 speakers were used for evaluation. The evaluation conditions are shown in (Table 1).

【０１４７】[0147]

【表１】 [Table 1]

[0148] The evaluation results are shown in (Table 2).

【０１４９】[0149]

【表２】 [Table 2]

[0150] As (Table 2) shows, the improvement in recognition rate obtained in this embodiment is very marked.

【０１５１】[0151]

[Effects of the Invention] The present invention relates to a method in which the partial distance between an input vector formed from a plurality of frames and a partial (standard) pattern of word speech is obtained with a statistical distance measure based on posterior probability, the input vector is updated while shifting frames, the distances to the partial patterns are accumulated, and the word with the minimum cumulative distance is taken as the recognition result. The method is robust against growth in vocabulary size and against noise, and yields a high recognition rate.

[0152] Furthermore, because the computation is simple, the method can easily be implemented as a compact device using a digital signal processor (DSP).

[0153] The present invention is thus a practically useful method, and its effect is considerable.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the speech recognition apparatus according to Embodiment 1 of the present invention.

FIG. 2 is a conceptual diagram explaining how the partial patterns and the surrounding pattern are created in the standard pattern creation method of the present invention.

FIG. 3 is a schematic diagram showing how the matching between the input speech and a standard pattern formed by concatenating partial patterns is computed by dynamic programming in the present invention.

FIG. 4 is a functional block diagram of the speech recognition apparatus according to Embodiment 2 of the present invention.

[Explanation of symbols]

1 Acoustic analysis unit
2 Feature parameter extraction unit
3 Multi-frame buffer
4 Partial distance calculation unit
5 Partial standard pattern storage unit
6 Path determination unit
7 Distance accumulation unit
8 Determination unit
9 Speech interval detection unit
10 Frame synchronization signal generation unit
11 Temporary storage
12 Distance comparison unit

Continued from the front page: (72) Inventor: Tatsuya Kimura, c/o Matsushita Giken Co., Ltd., 3-10-1 Higashi-Mita, Tama-ku, Kawasaki-shi, Kanagawa

Claims

1. A speech recognition method comprising: a step of creating a standard pattern of a recognition target word by concatenating partial patterns; a step of obtaining an input vector from input speech while shifting frames; a step of accumulating the distances obtained from the input vector and the partial patterns with a statistical distance measure to obtain a cumulative distance; and a step of taking, on the basis of the cumulative distances, the word with the minimum cumulative distance as the recognition result.

2. A speech recognition method comprising: a step of dividing each recognition target word into partial intervals using speech data uttered by a large number of speakers, and creating in advance, for all recognition target words, a standard pattern of the recognition target word by concatenating partial (standard) patterns that represent the partial intervals; a step of analyzing input speech at fixed time lengths (frames) to obtain feature parameters and forming an input vector from the feature parameters of a plurality of frames; a step of obtaining the partial distance between the input vector and each partial pattern on a statistical distance scale; a step of obtaining a cumulative distance by accumulating the partial distances between the partial patterns and the input vectors generated while shifting frames; and a step of comparing the cumulative distances to the standard patterns of all recognition target words with one another and taking the word with the minimum cumulative distance as the recognition result.

3. The speech recognition method according to claim 1, wherein the partial patterns of the recognition target word include sections (frames) overlapping each other.

4. A speech recognition method comprising: a step of dividing each recognition target word into a plurality of partial intervals using speech data uttered by a large number of speakers, and creating in advance, for all recognition target words, a standard pattern of the recognition target word by concatenating partial (standard) patterns that represent the partial intervals; a step of analyzing input speech at fixed time lengths (frames) to obtain feature parameters and forming an input vector from the feature parameters of a plurality of frames; a step of obtaining the partial distance between the input vector and each partial pattern on a statistical distance scale based on posterior probability; a step of obtaining a cumulative distance by accumulating the partial distances between the partial patterns and the input vectors generated while shifting frames; and a step of comparing the cumulative distances to the standard patterns of all recognition target words with one another and taking the word with the minimum cumulative distance as the recognition result.