JP3291073B2

JP3291073B2 - Voice recognition method

Info

Publication number: JP3291073B2
Application number: JP15757393A
Authority: JP
Inventors: 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1993-06-28
Filing date: 1993-06-28
Publication date: 2002-06-10
Anticipated expiration: 2017-06-10
Also published as: JPH0713590A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a speech recognition method for recognizing uttered speech.

[0002]

[Prior Art] Recognizing speech by word spotting, rather than recognizing the entire speech section of an utterance exhaustively (word for word), avoids problems such as inserted filler words and pauses, and is known to be well suited to spoken dialogue systems and speech understanding systems.

[0003] Likewise, even when recognizing discretely uttered words, it is advantageous to apply a spotting technique to a pattern that still includes the silent portions, rather than first cutting out the speech section using information such as power and then recognizing it, because the result is then unaffected by noise, tongue clicks, and similar sounds picked up during sound reception.

[0004]

[Problems to Be Solved by the Invention] Spotting, however, suffers from the problem of partial matching. For example, suppose the vocabulary to be recognized contains both "Shin-Yokohama" and "Yokohama". When the speaker utters "Shin-Yokohama", the utterance contains "Yokohama", so both "Yokohama" and "Shin-Yokohama" are recognized with a high score (small distance), and it is impossible to tell which of the two was spoken. Partial matching is a particularly serious problem when recognizing numbers. For example, the word "31" has the partially matching words "30", "3", "10", "11", and "1".

[0005] This partial matching is, however, asymmetric. A long word ("Shin-Yokohama" in the example above) may be misrecognized as a short word ("Yokohama"), but the reverse seldom happens. The method of Japanese Patent Application Laid-Open No. 4-230797, for example, exploits this asymmetry. A similarity table is compiled statistically in advance, recording, for instance, that "Yokohama" has a high similarity for the input "Shin-Yokohama", whereas "Shin-Yokohama" does not have a high similarity for the input "Yokohama". Ordinary matching is performed first; the similarities to all words obtained there are then compared with the similarity table, and the word whose similarity tendency is closest (smallest distance) is taken as the recognition result.

[0006] In the prior art, however, a similarity table between all pairs of words to be recognized must be prepared in advance, so the method is effective only for applications in which the recognition vocabulary is fixed. It cannot solve the partial matching problem in a recognition device whose vocabulary can be changed freely, such as a speaker-dependent recognition device. A further drawback is that the processing is complicated because matching is performed in two stages.

[0007] An object of the present invention is to provide a matching method that requires only one-stage matching and avoids the partial matching problem even when the recognition vocabulary is changed.

[0008]

[Means for Solving the Problems] To solve the above problems, the present invention converts input speech into an input pattern that is a time series of feature vectors, defines the similarity as a first value given in advance minus the distance between a feature vector of the input pattern and a feature vector of the standard pattern, regards the accumulation of this similarity over the sequence of feature vectors of the standard pattern as the similarity to the input speech, and recognizes the input speech accordingly. The first value given in advance is determined either for each feature vector of the standard pattern or, with the feature vectors of the standard pattern divided into a plurality of clusters in advance, for each cluster. Furthermore, after matching is performed on an input pattern whose utterance content is known, the feature vectors of the input pattern are associated with the feature vectors of the standard pattern on the basis of the matching path, and when the distance between two associated feature vectors is larger than a predetermined second value, the first value predetermined for each standard pattern, or the first value predetermined for each cluster to which the standard pattern belongs, is increased.

[0009]

[Operation] According to the present invention, the similarity is first defined as a first value given in advance minus the distance between a feature vector of the input pattern and a feature vector of the standard pattern. The first value given in advance is determined either for each feature vector of the standard pattern or, with the feature vectors of the standard pattern divided into a plurality of clusters in advance, for each cluster. The input speech is then converted into an input pattern that is a time series of feature vectors, and the similarity accumulated over the sequence of feature vectors of the standard pattern is regarded as the similarity to the input speech. Recognizing the input speech in this way avoids the partial matching problem with only one-stage matching, so that words are recognized correctly. Moreover, because no information about the recognition vocabulary is needed in advance, the method also works in a recognition device whose vocabulary is changed, such as a speaker-dependent speech recognition device. Furthermore, after matching is performed on an input pattern whose utterance content is known, the feature vectors of the input pattern are associated with the feature vectors of the standard pattern on the basis of the matching path, and when the distance between two associated feature vectors is larger than a predetermined second value, the first value predetermined for each standard pattern, or the first value predetermined for each cluster to which the standard pattern belongs, is increased. This keeps the similarity between the associated feature vectors from taking a small value, which lowers the likelihood that the standard pattern fails to be spotted and thus enables correct speech recognition.

[0010]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a schematic block diagram of a speech recognition device according to the present invention. Referring to FIG. 1, the speech recognition device has a speech input unit 1, such as a microphone or telephone handset, for inputting speech; a feature extraction unit 2 that converts the input speech signal into an input pattern that is a time series of feature vectors; a standard pattern storage unit 6 that stores standard patterns of speech; a matching unit 3 that matches the extracted input pattern against a standard pattern; a similarity calculation unit 4 that calculates the similarity between the input pattern and the standard pattern being matched; and a matching path search unit 5 that searches for the corresponding positions at which the similarity between the input pattern and the standard pattern is high.

[0011] Of the many published parameters useful for speech recognition, the feature extraction unit 2 extracts the input pattern using features such as the LPC mel-cepstrum. Hereinafter, the input pattern X is written as X = x_1 x_2 ... x_I, where I is the total number of frames of the input pattern.

[0012] The matching unit 3 matches the feature vectors of the input pattern extracted by the feature extraction unit 2 against the feature vectors of a standard pattern in the standard pattern storage unit 6. Suppose the standard pattern Y of a word K is expressed as the vector sequence Y = y_1 y_2 ... y_J, where J is the total number of frames of the standard pattern. The distance between the feature vector x_i of the i-th frame of the input pattern and the feature vector y_j of the j-th frame of the standard pattern Y is written d(x_i, y_j). Many definitions of distance are known, such as the city-block distance, the Euclidean distance, and the Mahalanobis distance, and any of them may be used.
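The three distance measures named above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names are hypothetical, feature vectors are plain Python lists, and the Mahalanobis variant assumes a diagonal covariance (one variance per dimension) to keep the example short.

```python
import math

def city_block(x, y):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance, assuming a diagonal covariance matrix
    whose diagonal entries are given in `variances` (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))
```

Any of these can serve as d(x_i, y_j); the rest of the method only needs a function that returns a non-negative number for a pair of frames.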

[0013] The similarity calculation unit 4 defines the similarity between the feature vector x_i of the i-th frame of the input pattern and the feature vector y_j of the j-th frame of the standard pattern Y as follows.

[0014]

(Equation 1)

$$r(x_i, y_j) = S - d(x_i, y_j)$$

where S is the first value given in advance.
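As a small sketch of this definition, the frame similarity is the first value S minus the frame distance, so it is positive for close frames and negative for distant ones. The function names and the choice of the city-block distance are assumptions made for illustration.

```python
def city_block(x, y):
    """City-block distance, used here only as an example distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def similarity(x, y, s, distance=city_block):
    """Frame similarity r(x, y) = S - d(x, y).

    `s` is the first value given in advance; `distance` is any frame
    distance function (hypothetical parameter name).
    """
    return s - distance(x, y)
```

With S = 1.0, a frame pair at distance 0.5 contributes +0.5 to a cumulative score, while a pair at distance 3.0 contributes -2.0, which is what lets a mismatched portion pull a template's score down.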

[0015] A method of spotting the standard pattern Y within the input pattern X on the basis of the similarity obtained above is described next. Various spotting methods are known, and the invention is not limited to this one.

[0016] The matching path search unit 5 searches by the following procedure. First, it prepares an array D(i,j), which stores the cumulative score when matching has proceeded up to the j-th feature vector of the standard pattern at input frame i, and an array B(i,j), which stores the start point of that matching path. At the initial time point, the arrays D(i,j) and B(i,j) are determined as follows.

[0017]

(Equation 2)
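The image for Equation (2) is not reproduced in this text. Read together with the case distinction that follows it, where case (a) inherits the start point B(i-1,1) and case (b) starts a new path at frame i, the equation plausibly has the form below. This is a reconstruction under that assumption, not the published formula; r(x_i, y_1) denotes the frame similarity of Equation (1).

```latex
D(i,1) = \max \left\{
  \begin{array}{ll}
    D(i-1,1) + r(x_i, y_1) & \text{(a)} \\
    r(x_i, y_1)            & \text{(b)}
  \end{array}
\right.
```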

[0018] Here, in Equation (2), when the larger value is (a), B(i,1) = B(i-1,1) is set, and when the larger value is (b), B(i,1) = i is set. At intermediate time points, the arrays D(i,j) and B(i,j) are determined as follows.

[0019]

(Equation 3)
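The image for Equation (3) is likewise not reproduced. The case distinction that follows it takes the start point from B(i-1,j) in case (c) and from B(i-1,j-1) in case (d), which suggests a recurrence over those two predecessors. The form below is a reconstruction under that assumption; the published equation may differ, for example in path weights.

```latex
D(i,j) = \max \left\{
  \begin{array}{ll}
    D(i-1,j)   + r(x_i, y_j) & \text{(c)} \\
    D(i-1,j-1) + r(x_i, y_j) & \text{(d)}
  \end{array}
\right. \qquad (2 \le j \le J)
```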

[0020] Here, in Equation (3), when the larger value is (c), B(i,j) = B(i-1,j) is set, and when the larger value is (d), B(i,j) = B(i-1,j-1) is set.

[0021] With this computation, D(i,J) is the score for the word K, and the recognition result is that K was present in the input speech from frame B(i,J) to frame i. This result leaves i free; to narrow the recognition results down to one, select the i that maximizes D(i,J).
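The search procedure above can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the equation images are missing from this text, so the recurrences are assumed to be a maximum over the two predecessors named in cases (a)/(b) and (c)/(d). The function name `spot`, the use of a single common S, and the 1-indexed frame numbers in the result are choices made here.

```python
def city_block(x, y):
    """Example frame distance (any distance may be substituted)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def spot(inputs, template, s, distance=city_block):
    """Spot `template` in `inputs`; return (score, start_frame, end_frame).

    Keeps only the previous input frame's column of D and B.
    Frames are numbered from 1, as in the description.
    """
    neg_inf = float("-inf")
    n_tmpl = len(template)
    prev_d = [neg_inf] * n_tmpl   # D(i-1, j)
    prev_b = [0] * n_tmpl         # B(i-1, j)
    best = (neg_inf, 0, 0)
    for i, x in enumerate(inputs, start=1):
        cur_d = [neg_inf] * n_tmpl
        cur_b = [0] * n_tmpl
        r = s - distance(x, template[0])
        if prev_d[0] + r > r:     # case (a): extend the existing path
            cur_d[0], cur_b[0] = prev_d[0] + r, prev_b[0]
        else:                     # case (b): start a new path at frame i
            cur_d[0], cur_b[0] = r, i
        for j in range(1, n_tmpl):
            r = s - distance(x, template[j])
            if prev_d[j] >= prev_d[j - 1]:   # case (c): stay on y_j
                cur_d[j], cur_b[j] = prev_d[j] + r, prev_b[j]
            else:                            # case (d): advance from y_(j-1)
                cur_d[j], cur_b[j] = prev_d[j - 1] + r, prev_b[j - 1]
        # D(i, J) is the word score for a match ending at frame i.
        if cur_d[-1] > best[0]:
            best = (cur_d[-1], cur_b[-1], i)
        prev_d, prev_b = cur_d, cur_b
    return best

# Toy 1-D "features": hypothetical values, chosen only for illustration.
SHIN_YOKOHAMA = [[1.0], [2.0], [5.0], [6.0], [7.0]]
YOKOHAMA = [[5.0], [6.0], [7.0]]
utterance = [[0.0]] + SHIN_YOKOHAMA + [[0.0]]  # silence, word, silence
print(spot(utterance, SHIN_YOKOHAMA, 1.0))
print(spot(utterance, YOKOHAMA, 1.0))
```

On this toy utterance the longer template obtains the larger score, because every extra matched frame adds a positive similarity. When only the shorter word is spoken, the leading frames of the longer template are forced onto non-matching input and its score drops below zero.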

[0022] Next, methods of setting the value of S are described.

(1) When the value of S is common to all standard patterns

S is set through a preliminary experiment so that max D(i,J) is positive for a standard pattern whose content matches the utterance and negative for a standard pattern whose content differs from the utterance.
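One way such a preliminary experiment could be organized is a simple sweep over candidate values of S. The sketch below is an assumption about procedure, not something the description specifies: `score` stands for any function returning max D(i,J) for a given S, utterance, and standard pattern, and `trials` is a hypothetical list of labeled test cases.

```python
def choose_common_s(candidates, trials, score):
    """Return the candidate S values that separate all trials.

    trials: iterable of (utterance, template, same_content) tuples.
    score:  callable (s, utterance, template) -> max D(i, J).
    A candidate is kept only if every same-content pair scores above
    zero and every different-content pair scores below zero.
    """
    accepted = []
    for s in candidates:
        separates = True
        for utterance, template, same_content in trials:
            value = score(s, utterance, template)
            if same_content and value <= 0:
                separates = False
            if not same_content and value >= 0:
                separates = False
        if separates:
            accepted.append(s)
    return accepted
```

If no candidate separates every trial, the per-vector or per-cluster settings described next give more freedom than a single common value.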

[0023] (2) When the value of S is determined for each feature vector of the standard pattern

Let S_m denote the value of S for a feature vector y_m of a standard pattern. S_m is set by adding a positive constant d_0 to d_ave_m, the average distance between y_m and the training data from which y_m was created. As one example, d_0 can be determined experimentally using Equation (4), where N is the number of feature vectors of the standard pattern.

[0024]

(Equation 4)

[0025] Alternatively, S_m may be set so that most of the training data (for example, 95%) has a positive similarity to the vector y_m.
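Both per-vector settings can be sketched from a list of distances between y_m and its training frames. The function names, the `margin` parameter, and the rank-based reading of "most of the training data" are assumptions for illustration; Equation (4) for d_0 is not reproduced in this text, so d_0 is simply passed in.

```python
import math

def s_from_mean(distances, d0):
    """S_m = d_ave_m + d_0: mean training distance plus a positive constant."""
    return sum(distances) / len(distances) + d0

def s_from_coverage(distances, coverage=0.95, margin=1e-6):
    """Smallest S_m (plus a small margin) for which at least `coverage`
    of the training distances give a positive similarity S_m - d > 0."""
    ordered = sorted(distances)
    needed = math.ceil(coverage * len(ordered))  # samples that must be positive
    return ordered[needed - 1] + margin
```

The mean-based rule is simple but sensitive to outliers in the training distances, whereas the coverage rule fixes the fraction of training frames that score positively regardless of how far the remaining frames lie.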

[0026] (3) When the feature vectors of the standard pattern are divided into a plurality of clusters in advance and the value of S is determined for each cluster

The clusters may be formed per phoneme or per group of phonemes (vowels, voiceless fricatives, nasals, plosives, and so on). Let S_M denote the value of S for a cluster M. For each element m belonging to cluster M, the average distance d_ave_m between y_m and the training data from which y_m was created is obtained; S_M is set by adding a positive constant d_0 to the mean of these averages (d_0 can be determined in the same way as above). Alternatively, S_M may be set so that, for the feature vectors y_m belonging to cluster M, most of the training data (for example, 95%) has a positive similarity to y_m.
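The per-cluster rule can be sketched as a grouped average. The dictionary-based data layout and the names below are hypothetical; the cluster labels would come from whatever phoneme or phoneme-group assignment is in use.

```python
def cluster_first_values(d_ave, cluster_of, d0):
    """Compute S_M for every cluster M.

    d_ave:      dict mapping feature-vector index m -> d_ave_m
    cluster_of: dict mapping feature-vector index m -> cluster label
    d0:         positive constant added to each cluster mean
    """
    totals, counts = {}, {}
    for m, mean_distance in d_ave.items():
        label = cluster_of[m]
        totals[label] = totals.get(label, 0.0) + mean_distance
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] + d0 for label in totals}
```

For example, two vowel vectors with d_ave values 1.0 and 3.0 and d_0 = 0.5 give S = 2.5 for the vowel cluster.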

[0027] Furthermore, an input pattern whose utterance content is known is matched against the vector sequence of the standard pattern with the same content, and backtracking is then performed. Suppose that, on the basis of this matching path, the feature vector x_i of the input pattern is associated with the feature vector y_m of the standard pattern. If the similarity r(x_i, y_m) of these two associated vectors is smaller than a preset threshold TH, the value of S for y_m is increased by S_m = S_m + α(TH - r(x_i, y_m)). Likewise, when the feature vectors of the standard pattern are divided into clusters and M is the cluster to which the standard pattern feature vector y_m belongs, the value of S is increased by S_M = S_M + α(TH - r(x_i, y_m)).

[0028] Here, the learning coefficient α is set to, for example, about 0.1. As one example, TH can be determined experimentally using Equation (5), where N is the number of feature vectors of the standard pattern. If TH is set rather small, "no recognition result" errors increase even for correct utterances, but misrecognitions decrease. If TH is set rather large, "no recognition result" errors decrease, but misrecognitions increase. TH must therefore be chosen with this trade-off in mind, according to the application in which the speech recognition device is used.

[0029]

(Equation 5)
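The per-vector update S_m = S_m + α(TH - r(x_i, y_m)) can be sketched as a pass over an alignment path. The path format (0-indexed pairs of input frame and template frame), the function names, and the choice to apply updates sequentially along the path are assumptions; the description does not fix these details, and Equation (5) for TH is not reproduced here, so TH is passed in.

```python
def city_block(x, y):
    """Example frame distance (any distance may be substituted)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def raise_first_values(s_values, inputs, template, path, th, alpha,
                       distance=city_block):
    """Return a copy of the per-vector S values after one adaptation pass.

    s_values: list of S_m, one per template frame
    path:     list of (i, m) index pairs obtained by backtracking
    th:       similarity threshold TH
    alpha:    learning coefficient (about 0.1 in the description)
    """
    updated = list(s_values)
    for i, m in path:
        r = updated[m] - distance(inputs[i], template[m])
        if r < th:
            # Similarity too small for a known-correct pair: raise S_m.
            updated[m] += alpha * (th - r)
    return updated
```

Each update closes a fraction α of the gap between the observed similarity and TH, so repeated passes over training utterances raise S_m gradually instead of jumping to the value that would satisfy a single outlier.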

[0030] FIG. 2 shows an example of the similarities in partial pattern matching when the input speech is "Shin-Yokohama". FIG. 2(a) shows the input speech. FIG. 2(b) shows the distance between the input speech and the standard pattern "Shin-Yokohama". FIG. 2(c) shows the distance between the input speech and the standard pattern "Yokohama". As the figure shows, the similarity for the "Shin" portion, marked A in FIG. 2, is positive, so the matching score for the standard pattern "Shin-Yokohama" is larger by that amount, and "Shin-Yokohama" is correctly recognized.

[0031] FIG. 3 shows an example of the similarities in partial pattern matching when the input speech is "Yokohama". FIG. 3(a) shows the input speech. FIG. 3(b) shows the distance between the input speech and the standard pattern "Shin-Yokohama". FIG. 3(c) shows the distance between the input speech and the standard pattern "Yokohama". As this example shows, when matching against the standard pattern "Shin-Yokohama", the "Shin" portion of the standard pattern is matched against a non-speech section or against some other word that is not "Shin", so the similarity of this portion is negative (the part marked B in FIG. 3). The matching score for the standard pattern "Shin-Yokohama" is therefore small, and "Yokohama" is again correctly recognized.
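The effect shown in the two figures can be illustrated with made-up numbers. In the sketch below, the per-frame distances along each matching path and the value S = 2 are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the figures.

```python
def accumulated_similarity(frame_distances, s):
    """Sum of (S - d) over the frames aligned along a matching path."""
    return sum(s - d for d in frame_distances)

S = 2  # hypothetical common first value

# Input "Shin-Yokohama": both templates align with well-matching frames.
fig2_long = accumulated_similarity([1, 1, 1, 1, 1], S)   # "Shin-Yokohama"
fig2_short = accumulated_similarity([1, 1, 1], S)        # "Yokohama"

# Input "Yokohama": the "Shin" frames of the long template align with
# non-matching input, so their distances are large.
fig3_long = accumulated_similarity([4, 4, 1, 1, 1], S)   # "Shin-Yokohama"
fig3_short = accumulated_similarity([1, 1, 1], S)        # "Yokohama"

print(fig2_long, fig2_short, fig3_long, fig3_short)
```

In the first case the two extra matched frames add positive similarity, so the longer word wins; in the second, the same two frames contribute negative similarity, so the shorter word wins. A purely distance-based score would not reward the extra matched frames in the first case.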

[0032]

[Effects of the Invention] As described above, the present invention avoids the partial matching problem with only one-stage matching and recognizes words correctly. Moreover, because no information about the recognition vocabulary is needed in advance, it also works in a recognition device whose vocabulary is changed, such as a speaker-dependent speech recognition device. As for the effect of claims 5 and 6: in the prior art, when the similarity between a feature vector of the input pattern and the feature vector of the standard pattern associated with it on the matching path takes a small value (for example, a negative number), the standard pattern is likely not to be spotted; in the present invention, in such a case the value of S is reset to a larger value so as to increase the similarity, which enables correct speech recognition.

[Brief description of the drawings]

[FIG. 1] A schematic block diagram of a speech recognition device according to the present invention.

[FIG. 2] An embodiment to which the present invention is applied.

[FIG. 3] Another embodiment to which the present invention is applied.

[Explanation of symbols]

1: speech input unit, 2: feature extraction unit, 3: matching unit, 4: similarity calculation unit, 5: matching path search unit, 6: standard pattern storage unit.

Continued from the front page

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/10

Claims

(57) [Claims]

1. A speech recognition method using a speech input unit for inputting speech, a feature extraction unit for converting the speech input from the speech input unit into an input pattern that is a time series of feature vectors, and a matching unit for calculating the distance between a feature vector of the input pattern extracted by the feature extraction unit and a feature vector of a standard pattern, wherein the difference between a first value given in advance and the distance calculated by the matching unit is calculated as a similarity, the similarity is accumulated over the sequence of feature vectors of the standard pattern for a word to be recognized, and the accumulated value is used as the similarity of the word to be recognized.

2. The speech recognition method according to claim 1, wherein the first value given in advance is a value common to feature vectors of all standard patterns.

3. The speech recognition method according to claim 1, wherein the first value given in advance is determined for each feature vector of a standard pattern.

4. The speech recognition method according to claim 1, wherein the feature vectors of the standard pattern are divided into a plurality of clusters in advance, and the first value given in advance is determined for each cluster.

5. The speech recognition method according to claim 3, wherein, after matching is performed on an input pattern whose utterance content is known, the feature vectors of the input pattern are associated with the feature vectors of the standard pattern on the basis of the matching path, and when the distance between two associated feature vectors is larger than a predetermined second value, the first value predetermined for each feature vector of the standard pattern is increased.

6. The speech recognition method according to claim 4, wherein, after matching is performed on an input pattern whose utterance content is known, the feature vectors of the input pattern are associated with the feature vectors of the standard pattern on the basis of the matching path, and when the distance between two associated feature vectors is larger than a predetermined second value, the first value predetermined for each cluster to which the standard pattern belongs is increased.