JPH0854892A

JPH0854892A - Voice recognition method

Info

Publication number: JPH0854892A
Application number: JP19169994A
Authority: JP
Inventors: Hidetaka Miyazawa; 秀毅宮澤
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1994-08-16
Filing date: 1994-08-16
Publication date: 1996-02-27

Abstract

PURPOSE: To perform word recognition on a voice section that contains unnecessary words, without preparing standard patterns for those unnecessary words, by computing the degree of similarity over the sound portions only and taking the standard template that minimizes the degree of similarity as the word recognition result. CONSTITUTION: Let a point at which the voice section changes from silence to sound be a starting point stq (q = 1, 2, ...), and let a point at which it changes from sound to silence be an ending point edr (r = 1, 2, ...). Using the DP matching equation shown, the degree of similarity SS between the standard template and the part of the voice section from stq to edr is obtained. In the equation, Sn(i, j) is the DP matching score between the i-th frame of the input voice and the j-th frame of the n-th standard pattern, d(Ai, Bnj) is the partial distance between the i-th frame of the input voice and the j-th frame of the n-th standard pattern, and min() is the minimum of three scores. The word whose standard template minimizes the degree of similarity SS is taken as the word recognition result for the voice section.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a speech recognition method, and in particular to a method for recognizing sentences uttered as discrete words.

[0002]

[Prior Art] A man-machine interface for interacting with a computer lets a person enter data and give instructions to the computer through input devices such as a keyboard or a pointing device (a mouse, etc.), and lets the computer display data and respond to the person through output devices such as a display.

[0003] Besides these man-machine interface devices, some recent systems perform input and output by voice. In such systems the computer is provided with a speech synthesizer and a speech generator as voice output devices, and with a speech recognizer and a decoder as voice input devices.

[0004] One method for recognizing human speech is the One Pass DP matching method, which individually recognizes the words of a discrete word utterance, in which pauses (breaks) are placed between words, and further decodes the series of recognized words as a sentence. This method extends DP matching, a word recognition technique, to sentence recognition.

[0005] The DP matching method is described below.

[0006] For word recognition, the speech waveform is sampled at fixed time intervals and converted into a time series of multidimensional feature vectors, such as spectra, before it is processed. Likewise, the words to be recognized are converted into time series of multidimensional feature vectors in advance and registered in the computer as standard patterns.

[0007] In the recognition process, the similarity between the input feature vector time series and the feature vector time series of a standard pattern is computed for every standard pattern, and the word of the most similar standard pattern is taken as the recognized word.

[0008] In general, however, the input feature vector time series cannot be compared directly with that of a standard pattern. The time a person takes to utter a given sentence or word differs between individuals, and even the same person uttering the same word varies considerably with the day and with mood. Moreover, this stretching and compression of the utterance is not uniform; it varies nonlinearly.

[0009] The DP matching method uses dynamic programming to warp the time axis so that the feature vector time series of the input speech best matches that of the standard pattern, and then computes the similarity.

[0010] The concept of DP matching is explained with reference to FIG. 5. In the figure, the horizontal axis represents the input speech and the vertical axis represents the standard pattern of a word registered in the computer. Here, both the input speech and the standard pattern are assumed to be described as time series of phoneme labels rather than of feature vectors.

[0011] DP matching normally computes the similarity between the input speech and the standard pattern under a fixed-endpoint condition. Fixed endpoints are the constraint that the first frame of the input speech corresponds to the first frame of the standard pattern (fixed start point) and the last frame of the input speech corresponds to the last frame of the standard pattern (fixed end point).

[0012] Under this constraint, DP matching warps the time axis so that the input speech and the standard pattern match best, and obtains the similarity between the two.
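As an illustration of fixed-endpoint DP matching on phoneme label sequences, the following is a minimal Python sketch. The function name `dp_match`, the 0/1 label distance, and the choice of three predecessor cells are assumptions made for this example; the patent text does not specify them, and no length normalization is applied.

```python
def dp_match(inp, tmpl, dist=lambda a, b: 0.0 if a == b else 1.0):
    """Cumulative distance between two label sequences under fixed endpoints.

    inp  : input label sequence (one label per frame)
    tmpl : standard-pattern label sequence
    dist : partial distance between two labels (assumed 0/1 here)
    A smaller return value means the sequences are more similar.
    """
    if not inp or not tmpl:
        raise ValueError("both sequences must be non-empty")
    n_in, n_tm = len(inp), len(tmpl)
    inf = float("inf")
    score = [[inf] * n_tm for _ in range(n_in)]
    for i in range(n_in):
        for j in range(n_tm):
            d = dist(inp[i], tmpl[j])
            if i == 0 and j == 0:
                # Fixed start point: first frames must correspond.
                score[i][j] = d
                continue
            best_prev = min(
                score[i - 1][j] if i > 0 else inf,
                score[i - 1][j - 1] if i > 0 and j > 0 else inf,
                score[i][j - 1] if j > 0 else inf,
            )
            score[i][j] = d + best_prev
    # Fixed end point: last frames must correspond.
    return score[n_in - 1][n_tm - 1]
```

With this sketch, an utterance whose phonemes are merely stretched in time, such as the label string "tookyoo" against a template "tokyo", yields a distance of zero, which is the effect of warping the time axis.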

[0013] One Pass DP matching modifies the standard templates of word-recognition DP matching for discrete sentence recognition, as shown in FIG. 6. It performs fixed-endpoint DP matching between the start and the end of the discrete sentence and attempts to recognize each word contained in the sentence.

[0014]

[Problems to Be Solved by the Invention] The One Pass DP matching method achieves high recognition performance on discrete word utterances that contain no unnecessary words, but its performance on discrete word utterances that do contain unnecessary words is not very high.

[0015] Here, an unnecessary word is a word that is not registered in the word standard templates. For example, if a speaker who should say "Tokyo" instead says "um, Tokyo", the "um" is an unnecessary word.

[0016] One Pass DP matching is essentially a method that fixes the start and end points of the input speech and computes which words, if contained in that interval, give the best recognition score. When a discrete word utterance contains an unnecessary word, the method therefore tries to force a correspondence between the unnecessary word and some standard template, and recognition performance deteriorates.

[0017] One conceivable remedy is to register unnecessary words in the word standard templates in advance. However, unnecessary words are by nature unpredictable utterances, so it is impossible to cover all of them with standard templates.

[0018] Unnecessary words also include external noise and the like, in addition to words uttered by a person, and it is likewise impossible to create standard templates for these.

[0019] An object of the present invention is to provide a speech recognition method that, in recognizing a discrete word utterance, excludes unnecessary words contained in the uttered sound and recognizes only the original input words to be recognized.

[0020]

[Means for Solving the Problems] To solve the above problems, the present invention provides a speech recognition method based on the One Pass DP matching method, comprising: a voice spotting process that detects the voice sections of a discrete word utterance; a word recognition process that obtains a feature vector time series over the frames of each detected voice section, converts this feature vector time series into a phoneme label time series, and recognizes words by obtaining, with the fixed-endpoint DP matching method, the similarity between this phoneme label time series and the standard templates of phoneme label time series held as a dictionary; and a word determination process that narrows down the recognized word candidates according to their similarity to the standard patterns. In this method, the fixed-endpoint DP matching process obtains the similarity SS between the standard template and the part of the voice section running from a start point, where the voice section changes from a silent part to a sound part, to an end point, where it changes from a sound part to a silent part, according to the following equation,

[0021]

[Equation 2]

[0022] Sn(i, j): the DP matching score between the i-th frame of the input speech and the j-th frame of the n-th standard pattern.

[0023] d(Ai, Bnj): the partial distance between the i-th frame of the input speech and the j-th frame of the n-th standard pattern.

[0024] min(): the minimum of the three scores.

[0025] and the word of the standard template that minimizes the similarity SS is taken as the word candidate for that voice section.

[0026]

[Operation] When a voice section contains an unnecessary word because the silent interval between the unnecessary word and the target word is short, the similarity is obtained by fixed-endpoint DP matching over the sound parts only, and the standard template that minimizes this similarity is taken as the word recognition result.

[0027]

[Embodiment] In describing an embodiment of the present invention, the following is assumed about the input speech and the standard templates.

[0028] (1) The input speech consists of I frames, and its i-th frame is denoted Ai (i = 1, 2, ..., I).

[0029] (2) N words are assumed to be registered in the word dictionary of the recognition device; that is, the number of standard templates is N. The n-th standard template is denoted Tn (n = 1, 2, ..., N). Tn consists of Jn frames, and its j-th frame is denoted Bnj (j = 1, 2, ..., Jn).

[0030] FIG. 1 is a flowchart of the recognition processing for a discrete word utterance according to one embodiment of the present invention. The voice spotting process S1 detects the voice sections of input speech uttered as a sequence of discrete words. The word recognition process S2 performs word recognition on each voice section detected by the voice spotting process and obtains word candidates for each section. The word determination process S3 narrows down the word candidates according to their similarity to the standard patterns and outputs one word candidate as the recognition result. Each process is described in detail below.

[0031] (1) Voice spotting process: each voice section is detected from the discrete word utterance using the power information of the input speech.

[0032] First, for an input speech waveform such as that shown in FIG. 2(a), the power information is obtained as the logarithmic power of each frame by the following equation.

[0033]

[Equation 3]

[0034] P(i): logarithmic power of the i-th frame.
AM_k: amplitude of the k-th speech sample in the interval covered by the i-th frame.

Next, as shown in FIG. 2(b), the number of frames αp (p = 1, 2, ...) in each interval where the obtained logarithmic power is at or below the threshold θ is determined. If αp is larger than the pause-time threshold (PAUSE), the interval is treated as a silent section that separates the speech before and after it.

[0035] Conversely, the number of frames βp (p = 1, 2, ...) in each interval where the obtained logarithmic power is at or above the threshold θ is determined. If βp is larger than the duration threshold (DURATION), the interval is separated out as a sound section.

[0036] Through the processing described so far, the voice sections (sound sections) are spotted from the discrete word utterance.
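The spotting procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the equation for the logarithmic power is not reproduced in this text, so `log_power` assumes ten times the base-10 logarithm of the summed squared amplitudes; a frame is treated as sound when its power is strictly above θ; and a section is kept when it contains at least one sound run longer than DURATION. The function and parameter names are hypothetical.

```python
import math


def log_power(frame, eps=1e-10):
    """Assumed form of the per-frame logarithmic power P(i)."""
    return 10.0 * math.log10(sum(a * a for a in frame) + eps)


def spot_voice_sections(powers, theta, pause, duration):
    """Return (start, end) frame indices (end exclusive) of voice sections.

    powers   : per-frame logarithmic power
    theta    : power threshold separating silence from sound
    pause    : a silent run longer than this many frames ends a section
    duration : a section needs a sound run longer than this many frames
    """
    # Group consecutive frames into runs of sound or silence.
    runs = []
    for i, p in enumerate(powers):
        sound = p > theta
        if runs and runs[-1][0] == sound:
            runs[-1][2] = i + 1
        else:
            runs.append([sound, i, i + 1])

    sections = []
    start = end = None
    longest = 0  # longest sound run inside the current section
    for sound, s, e in runs:
        if sound:
            if start is None:
                start = s
            end = e
            longest = max(longest, e - s)
        elif e - s > pause and start is not None:
            # A long pause closes the current section.
            if longest > duration:
                sections.append((start, end))
            start = end = None
            longest = 0
        # Short silent runs stay inside the current section.
    if start is not None and longest > duration:
        sections.append((start, end))
    return sections
```

Note that a silent run shorter than the pause threshold stays inside a section, so one spotted voice section may contain several sound parts. This is the situation the word recognition process has to handle when an unnecessary word closely precedes the target word.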

[0037] (2) Word recognition process: word recognition is performed on each spotted voice section by the procedure shown in FIG. 3.

[0038] First, features are extracted from each frame by DFT (discrete Fourier transform) to obtain a feature vector time series (S2₁).

[0039] Next, a neural-network phoneme classifier converts the input feature vector time series into a phoneme label time series (S2₂).

[0040] Next, the similarity to each standard template, which the dictionary holds as a phoneme label time series, is obtained by the fixed-endpoint DP matching method to produce the word recognition result (S2₃).

[0041] If the silent interval between an unnecessary word and the target word is short, the voice section detected by the voice spotting process S1 contains both the unnecessary word and the target word.

[0042] In that case, ordinary fixed-endpoint DP matching matches the voice section containing both the unnecessary word and the target word as if it were a single word, so the similarity between the target word and the standard template cannot be obtained correctly.

[0043] To eliminate this problem, the present embodiment recognizes the target word by the following procedure.

[0044] (A) As shown in FIG. 4, each point in the voice section where a silent part changes to a sound part is taken as a start point stq (q = 1, 2, ...), and each point where a sound part changes to a silent part is taken as an end point edr (r = 1, 2, ...). Then, using DP matching according to the following equation, the similarity SS(stq, edr, n) between the part of the voice section from start point stq to end point edr and the standard template Tn is obtained.

[0045]

[Equation 4]

[0046] Sn(i, j): the DP matching score between the i-th frame of the input speech and the j-th frame of the n-th standard pattern.

[0047] d(Ai, Bnj): the partial distance between the i-th frame of the input speech and the j-th frame of the n-th standard pattern. The more similar Ai and Bnj are, the smaller the partial distance.

[0048] min(): the minimum of the three scores.
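The equation itself is an image that does not survive in this text. For orientation only, a commonly used recurrence that fits the definitions given here (a partial distance plus the minimum of three predecessor scores) can be written as follows. The choice of the three predecessors and the boundary conditions are assumptions, not taken from the patent.

```latex
S_n(i, j) = d(A_i, B_{nj}) + \min\bigl( S_n(i-1, j),\; S_n(i-1, j-1),\; S_n(i, j-1) \bigr),
\qquad
S_n(st_q, 1) = d(A_{st_q}, B_{n1}),
\qquad
SS(st_q, ed_r, n) = S_n(ed_r, J_n)
```

Under this assumed form, the recurrence is evaluated only for input frames i between stq and edr, which is what restricts the matching to the chosen part of the voice section.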

[0049] (B) For each pair q, r, the standard template that minimizes the similarity SS(stq, edr, n) and the corresponding minimum value SSmin(stq, edr, n) are obtained.

[0050] (C) Over q = 1, 2, ... and r = 1, 2, ..., the standard template that gives the smallest of the minimum values SSmin(stq, edr, n) is taken as the word recognition result (word candidate) for that voice section. In other words, the standard template Tn is regarded as matching the interval {stq, edr} with similarity SS(stq, edr, n).
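Steps (A) to (C) can be sketched in Python as a search over every start point, end point, and template. The names `recognize_section` and `dp_match`, the 0/1 label distance, the use of "_" as a silence label, and the three-predecessor recurrence are assumptions made for this example; the patent's own equation is not reproduced in this text.

```python
def dp_match(inp, tmpl):
    """Fixed-endpoint DP distance between two label sequences (assumed 0/1 cost)."""
    inf = float("inf")
    score = [[inf] * len(tmpl) for _ in inp]
    for i, a in enumerate(inp):
        for j, b in enumerate(tmpl):
            d = 0.0 if a == b else 1.0
            if i == 0 and j == 0:
                score[i][j] = d
                continue
            score[i][j] = d + min(
                score[i - 1][j] if i > 0 else inf,
                score[i - 1][j - 1] if i > 0 and j > 0 else inf,
                score[i][j - 1] if j > 0 else inf,
            )
    return score[-1][-1]


def recognize_section(labels, is_sound, templates):
    """Return (score, word, start, end) with the smallest DP distance.

    labels    : phoneme label per frame of one spotted voice section
    is_sound  : True for frames that belong to a sound part
    templates : dict mapping each word to its label sequence
    """
    n = len(labels)
    # (A) start points: silence -> sound; end points: sound -> silence.
    starts = [i for i in range(n)
              if is_sound[i] and (i == 0 or not is_sound[i - 1])]
    ends = [i + 1 for i in range(n)
            if is_sound[i] and (i == n - 1 or not is_sound[i + 1])]
    best = None
    for st in starts:
        for ed in ends:  # ed is exclusive
            if ed <= st:
                continue
            segment = labels[st:ed]
            # (B) best template for this (start, end) pair,
            # (C) overall minimum across all pairs.
            for word, tmpl in templates.items():
                s = dp_match(segment, tmpl)
                if best is None or s < best[0]:
                    best = (s, word, st, ed)
    return best


labels = list("eeto_tookyoo")          # "um", short pause, "Tokyo"
is_sound = [c != "_" for c in labels]
templates = {"tokyo": "tokyo", "osaka": "osaka"}
result = recognize_section(labels, is_sound, templates)
```

In this toy input the filler "eeto" and the target word share one voice section, and the search settles on the interval that covers only the target word. Because the sketch uses an unnormalized cumulative distance, shorter intervals tend to score lower; a practical system would probably need some form of length normalization, which the text here does not describe.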

[0051] By performing fixed-endpoint DP matching in the manner described above, the target word can be recognized stably even in a voice section from which the unnecessary word has not been separated.

[0052] (3) Word determination process: among all the word candidates obtained by the word recognition process, every candidate whose value is smaller than the threshold γ is taken as a final word recognition result.
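The word determination step amounts to a threshold filter over the candidates. A minimal Python sketch follows, in which the function name `determine_words` and the representation of a candidate as a (word, score) pair are assumptions for illustration.

```python
def determine_words(candidates, gamma):
    """Keep the candidates whose DP distance is below the threshold gamma.

    candidates : list of (word, score) pairs, one per voice section
    gamma      : acceptance threshold; smaller scores mean closer matches
    """
    return [(word, score) for word, score in candidates if score < gamma]


accepted = determine_words(
    [("tokyo", 0.5), ("osaka", 3.0), ("kyoto", 1.2)], gamma=1.5
)
```

Candidates that matched their best template only poorly, for example a voice section consisting of nothing but noise, are rejected by this filter rather than being forced onto a dictionary word.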

[0053]

[Effects of the Invention] As described above, according to the present invention, when fixed-endpoint DP matching is used to obtain the similarity between the phoneme label time series of a voice section and the phoneme label time series of a standard pattern, the similarity is obtained over the sound parts only, and the standard template that minimizes this similarity is taken as the word recognition result. Word recognition can therefore be performed even on voice sections that contain unnecessary words, without preparing standard patterns for the unnecessary words.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention.

FIG. 2 is a diagram illustrating the voice spotting process in the embodiment.

FIG. 3 is a flowchart of the word recognition process in the embodiment.

FIG. 4 shows an example of DP matching for word spotting.

FIG. 5 shows an example of conventional DP matching.

FIG. 6 shows an example of conventional One Pass DP matching.

Claims

1. A speech recognition method based on the One Pass DP matching method, comprising: a voice spotting process that detects the voice sections of a discrete word utterance; a word recognition process that obtains a feature vector time series over the frames of each detected voice section, converts this feature vector time series into a phoneme label time series, and recognizes words by obtaining, with the fixed-endpoint DP matching method, the similarity between this phoneme label time series and the standard templates of phoneme label time series held as a dictionary; and a word determination process that narrows down the recognized word candidates according to their similarity to the standard patterns; characterized in that the fixed-endpoint DP matching process obtains the similarity SS between the standard template and the part of the voice section from a start point, where the voice section changes from a silent part to a sound part, to an end point, where it changes from a sound part to a silent part, according to the following equation, in which Sn(i, j) is the DP matching score between the i-th frame of the input speech and the j-th frame of the n-th standard pattern, d(Ai, Bnj) is the partial distance between the i-th frame of the input speech and the j-th frame of the n-th standard pattern, and min() is the minimum of the three scores, and the word of the standard template that minimizes the similarity SS is taken as a word candidate for the voice section.