JPS6051898A

JPS6051898A - Continuous voice recognition equipment

Info

Publication number: JPS6051898A
Application number: JP58159316A
Authority: JP
Inventors: 斉藤　悦生; 浮田　輝彦
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1983-08-31
Filing date: 1983-08-31
Publication date: 1985-03-23

Abstract

(57) [Abstract] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

[Technical field of the invention]

The present invention relates to a continuous speech recognition device that can efficiently recognize continuously uttered input speech.

[Technical background of the invention and its problems]

In Japanese word processors and voice typewriters that use speech as the means of information input, a key challenge is how to efficiently recognize speech that is uttered continuously and naturally. One conventionally known approach to continuous speech recognition takes units of roughly phoneme size as the recognition unit: the time series of feature parameters of the input speech is first converted into a sequence of phoneme labels, or a so-called segment lattice, from which words and sentences are then extracted. In continuously uttered input speech, however, even the same phoneme undergoes so-called coarticulation depending on the phonemes before and after it, and as a result its acoustic realization is deformed in many ways. This makes accurate conversion to phoneme labels difficult, and the approach has had little practical value.

In contrast, a method has been proposed in which the recognition unit is roughly word-sized: words are identified directly from the time series of feature parameters, and the resulting word string is then recognized as a sentence. By holding standard patterns at the word level, this method avoids the coarticulation problem described above. Word identification methods of this kind fall into two broad groups: those that detect word boundary positions in the input speech and identify a word for each partial interval delimited by those boundaries, and those that detect no boundaries and instead identify words on the assumption that a word may exist in every partial interval of the input speech. Boundary detection is performed, for example, by extracting feature parameters such as the speech power or spectral change of the input speech and finding the extrema of their time series. However, when, for example, the digit "2" (/ni/) and the digit "1" (/itʃi/) are uttered in succession and become (/ni:tʃi/), the word boundary cannot be detected.

For this reason, the latter word identification method has been put to practical use in some applications. Its basic algorithm prepares, for each word in the vocabulary (a word here being defined as a unit of speech recognition, not in the linguistic sense), a standard pattern in the form of a time series of feature parameters analyzed at fixed time intervals. The distance to these standard patterns is then computed for every partial interval of the input speech, and the word giving the minimum distance is selected.

In doing so, the distance between feature parameters obtained at each analysis interval (the inter-frame distance) is calculated, and dynamic programming is used for time normalization to obtain the distance between time-series patterns. The distance between the input speech and a word string is then evaluated for all combinations of partial intervals, and the word string that has the minimum cumulative distance and covers the entire input speech is taken as the recognition result.
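The time normalization by dynamic programming described above can be illustrated with a minimal sketch. It assumes a Euclidean inter-frame distance and the simplest symmetric step pattern; the source specifies neither, and the function names are hypothetical.

```python
def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_distance(pattern, segment):
    """Accumulated inter-frame distance along the best time-warping path."""
    inf = float("inf")
    rows, cols = len(pattern), len(segment)
    # cost[i][j]: best accumulated distance aligning pattern[:i] with segment[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(pattern[i - 1], segment[j - 1])
            # Allow a repeated pattern frame, a repeated input frame, or a diagonal step.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]
```

Running this for every standard pattern against every partial interval is what makes the conventional method expensive once many patterns per word are needed.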

This method works well when the speaker is known, but it runs into the following problem with unspecified speakers. Because the speech pattern of a word differs greatly from speaker to speaker, a very large number of word standard patterns must be prepared to cover the speakers. For unspecified speakers, therefore, an infinite number of standard patterns is required in principle, which makes the method extremely difficult to realize.

Recently, it has therefore been proposed to prepare only a small, finite number of standard patterns for each word and to apply clustering techniques to resolve the standard-pattern problem for unspecified speakers. With this approach, however, the recognition rate for word strings (sentences) drops markedly, to a level that is not acceptable in practice. Moreover, the distance must be computed one by one for every word category, and for each of the several time-series standard patterns of each category, so the total amount of computation becomes enormous, which is a fatal drawback. For these reasons it has been very difficult to recognize continuously uttered input speech efficiently and effectively.

[Purpose of the invention]

The present invention has been made in view of these circumstances. Its object is to provide a highly practical continuous speech recognition device that can recognize input speech uttered continuously by unspecified speakers with high accuracy and, through real-time processing, with high efficiency.

[Summary of the invention]

In the present invention, each standard pattern is held as a feature vector sequence, that is, a time series of feature vectors of a constant dimension, each taken along the frequency axis at a specific time point. The inter-frame similarity is computed between the feature parameter vector obtained by analyzing the input speech at fixed time intervals and the feature vector at each time point of each standard pattern. From these inter-frame similarities, the similarity of each standard pattern to a partial interval of the input speech is computed, giving the candidate standard pattern (candidate word) for that partial interval and its unit similarity. Finally, for each sequence of partial intervals that together span the same interval as the input speech, the candidate word string making up that sequence is evaluated from the sum of its unit similarities.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to the drawings. The recognition unit of the input speech is described here as a word, but the word is defined not in the linguistic sense but as the unit in which speech is handled in recognition processing. The unit may equally be a syllable, a phrase, or something similar.

FIG. 1 is a schematic configuration diagram of the embodiment device, and FIG. 2 shows its main processing procedure. Input speech uttered continuously by an unspecified speaker is fed to the acoustic analysis unit 1, where it is analyzed at fixed analysis intervals and converted into feature parameters. The acoustic analysis unit 1 is composed, for example, of a filter bank of band-pass filters that divides the speech band into about 16 to 30 bands and performs spectrum analysis. A feature vector made up of the feature parameters of the input speech is thus obtained at fixed time intervals.

This feature vector of the input speech is fed to the inter-frame similarity calculation unit 2, which calculates its inter-frame similarity to the feature vector at each time point of the standard patterns registered in advance in the standard pattern storage unit (memory) 3, and the similarity values are held. Taking these similarity values as input, the unit similarity calculation and determination unit 4 calculates the similarity to each word for the partial intervals of the input speech in which a word can exist.

FIG. 2 outlines the processing procedure carried out by these units. In this device the similarity is calculated using, for example, the multiple similarity method known from pattern recognition.

Here the speech pattern of a word is represented as M-dimensional feature vectors, each consisting of M feature parameters along the frequency axis, arranged at N points along the time axis. The N points on the time axis are determined by dividing the duration of the word speech linearly into N interior division points, and the M points on the frequency axis are determined, for example, by associating them with the outputs of the M band-pass filters of the filter bank. The standard patterns of the words (recognition units) used in the multiple similarity method, registered in advance in the storage unit 3, are obtained beforehand by statistical processing of, for example, words uttered by a large number of unspecified speakers. Specifically, for each word category $i$ ($i = 1, 2, \ldots, I$) and each time point $n$ ($n = 1, 2, \ldots, N$), the correlation matrix corresponding to the distribution in the M-dimensional space (along the frequency axis) is computed independently, and its eigenvectors, arranged in descending order of eigenvalue, are obtained as

$$r_{i1}^{n},\ r_{i2}^{n},\ \ldots,\ r_{ij}^{n},\ \ldots,\ r_{iJ}^{n}.$$

The standard pattern of each word is thus represented by mutually orthogonal feature vectors.
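As a rough illustration of how the eigenvectors for one word category and one time point might be derived from training vectors, the sketch below builds the correlation matrix and extracts its leading eigenvectors by power iteration with deflation. The choice of power iteration, the function names, and the iteration count are assumptions for illustration; the source does not say how the eigenvectors are computed.

```python
def correlation_matrix(vectors):
    """Average outer product x x^T over the training vectors of one (i, n) pair."""
    m = len(vectors[0])
    r = [[0.0] * m for _ in range(m)]
    for x in vectors:
        for a in range(m):
            for b in range(m):
                r[a][b] += x[a] * x[b] / len(vectors)
    return r


def top_eigenvectors(matrix, count, iterations=200):
    """Leading (eigenvalue, eigenvector) pairs, largest eigenvalue first."""
    m = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for k in range(count):
        v = [1.0 + 0.1 * (a + k) for a in range(m)]  # arbitrary start vector
        for _ in range(iterations):
            w = [sum(work[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = sum(c * c for c in w) ** 0.5
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        length = sum(c * c for c in v) ** 0.5
        v = [c / length for c in v]
        value = sum(v[a] * sum(work[a][b] * v[b] for b in range(m)) for a in range(m))
        pairs.append((value, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for a in range(m):
            for b in range(m):
                work[a][b] -= value * v[a] * v[b]
    return pairs
```

Because the correlation matrix is symmetric, the vectors returned this way are mutually orthogonal up to numerical error, which matches the property stated in the text.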

Against such standard patterns, the word similarity for word category $i$ is calculated as follows. When the feature vector $x$ of the input speech at a given time is supplied, the calculation unit 2 computes the inter-frame similarity $S_i^n$ between $x$ and the feature vectors of the standard pattern for every word category $i$ and time sample point $n$ registered in the storage unit 3, as a sum over $j = 1, \ldots, J$ of terms in $(x \cdot r_{ij}^{n})$, where $(x \cdot r_{ij}^{n})$ is the inner product of the vector $x$ and the vector $r_{ij}^{n}$. As for the amount of computation, if for example the number of filters in the filter bank is M = 10, the number of sample points along the time axis is N = 16, the number of eigenvectors per standard pattern is J = 5, and the number of word categories is I = 10 (taking digits as an example), then M × N × J × I = 8000 multiply-add operations must be performed within one acoustic analysis interval. Since an analysis interval of about 16 msec is sufficient for speech signals, each of the 8000 multiply-add operations need only complete within 2 μsec, so real-time processing is well within reach. The inter-frame similarities to be stored amount, in this example, to N × I = 160 words for each time point, which is not a large burden on memory.
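The arithmetic above can be checked with a short sketch. The similarity function is only an assumed form (an unweighted sum of squared inner products, normalized by the energy of the input vector); the exact expression is not legible in the source, and the names used here are hypothetical.

```python
def inter_frame_similarity(x, eigenvectors):
    """Assumed form: sum over j of (x . r_j)^2, divided by |x|^2, for one (i, n) pair."""
    energy = sum(c * c for c in x)
    if energy == 0.0:
        return 0.0
    total = 0.0
    for r in eigenvectors:
        dot = sum(a * b for a, b in zip(x, r))  # one M-term inner product
        total += dot * dot
    return total / energy


# Figures taken from the text: M filters, N time points, J eigenvectors, I categories.
M, N, J, I = 10, 16, 5, 10
multiply_adds_per_frame = M * N * J * I                 # 8000
frame_period_us = 16_000                                # 16 msec analysis interval
budget_us = frame_period_us / multiply_adds_per_frame   # 2.0 microseconds each
stored_values_per_frame = N * I                         # 160 words
```

The budget follows directly from dividing the analysis interval by the operation count, which is the reasoning the text uses to argue that real-time processing is feasible.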

In this way the inter-frame similarity between the feature vector of the input speech and the feature vector at every time point of every standard pattern is obtained.

From the inter-frame similarities $S_i^n$ obtained in this way, the unit similarity calculation unit 4 resamples the inter-frame similarities along the time axis for every partial interval of the input speech, up to the current time, in which a word may exist. From the resampled inter-frame similarities $S_i^n$ it obtains the similarity $S_i$ of that partial interval to each recognition unit (word, that is, category $i$) as the sum over the sample points:

$$S_i = \sum_{n=1}^{N} S_i^{n}.$$

For each partial interval, the word category $i$ with the largest similarity value is taken as the recognition result for that interval and is stored together with its similarity value and the position of the interval. This calculation needs only an addition over the N sample points, repeated once for each of the I categories, so the amount of computation is small.
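A minimal sketch of this step follows, assuming the inter-frame similarities are kept in a nested table indexed as `frame_sims[t][i][n]` (input frame, category, pattern time point) and that the N resampling points are spread linearly over the partial interval. The table layout, the rounding rule, and the function names are illustrative assumptions.

```python
def resample_positions(start, end, n_points):
    """N frame indices spread linearly over the inclusive span start..end."""
    if n_points == 1:
        return [start]
    step = (end - start) / (n_points - 1)
    return [round(start + k * step) for k in range(n_points)]


def unit_similarity(frame_sims, category, start, end, n_points):
    """Sum S_i^n over n, reading pattern time point n from its resampled input frame."""
    frames = resample_positions(start, end, n_points)
    return sum(frame_sims[t][category][n] for n, t in enumerate(frames))


def best_category(frame_sims, start, end, n_points):
    """Category index with the largest unit similarity for one partial interval."""
    n_categories = len(frame_sims[start])
    scores = [unit_similarity(frame_sims, i, start, end, n_points)
              for i in range(n_categories)]
    best = max(range(n_categories), key=lambda i: scores[i])
    return best, scores[best]
```

Only table lookups and N additions per category are needed here, since the inner products were already computed frame by frame.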

The unit sequence evaluation and determination unit 5 then selects, from all combinations of partial intervals, the sequences of partial intervals whose start and end coincide with those of the speech input interval. For each such sequence it sums the word similarities $S_i$ obtained for the individual partial intervals, compares the sums across sequences, and evaluates the word string making up each sequence according to their relative magnitudes.

For example, the sequence of partial intervals with the largest similarity sum is judged to match the continuously uttered input speech over its entire length, and the string of word categories $i$ obtained for the partial intervals of that sequence is output as the recognition result.
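The selection of interval sequences and the comparison of their similarity sums can be sketched as a simple exhaustive enumeration. The dictionary layout, the example words, and the similarity values in the usage note are hypothetical, chosen only to show the mechanics.

```python
def covering_sequences(candidates, start, end):
    """Yield every chain of candidate intervals that tiles start..end exactly.

    candidates maps (first_frame, last_frame) to (word, unit_similarity).
    """
    if start > end:
        yield []
        return
    for (first, last), entry in candidates.items():
        if first == start and last <= end:
            for rest in covering_sequences(candidates, last + 1, end):
                yield [((first, last), entry)] + rest


def best_word_sequence(candidates, start, end):
    """Word string of the chain whose unit similarities sum to the largest value."""
    best_words, best_total = None, float("-inf")
    for chain in covering_sequences(candidates, start, end):
        total = sum(sim for _, (_, sim) in chain)
        if total > best_total:
            best_words = [word for _, (word, _) in chain]
            best_total = total
    return best_words, best_total
```

For instance, with candidates `{(0, 2): ("ni", 0.8), (3, 6): ("ichi", 0.9), (0, 6): ("nana", 1.2)}` over frames 0 to 6, the two-word chain wins because its sum exceeds the single-word score.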

The above is how the device performs continuous speech recognition.

This is explained in more detail below with reference to FIGS. 3 to 6. Suppose the time series of feature vectors obtained by analyzing the input speech at fixed intervals is as shown at A in FIG. 3. For the input feature vector at each sample time, the inter-frame similarities to the feature vectors at each time point of the standard patterns are obtained as B1, B2, ..., BL. That is, at each sampling time the inter-frame similarity of the input feature vector $x$ is obtained for every combination of $i$ and $n$ and is held, for example, as a table. This partial similarity calculation is carried out successively at each analysis interval as the speech input proceeds.

The unit similarity calculation and determination unit 4 determines, as shown in FIG. 4, the candidate intervals in which a word can exist in the input speech from the start of input up to the current time, and treats them as partial intervals. The length of a partial interval that can contain a word is largely confined to a certain range; measured in units of the analysis interval, for example, the shortest is set to 3 unit times and the longest to 11 unit times.
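The candidate intervals that end at the current analysis frame can be listed with a few lines of code. This is a sketch that assumes frames are numbered from 0 and that intervals are inclusive; the 3 and 11 unit-time limits are the example values from the text.

```python
def candidate_intervals(current, min_len=3, max_len=11):
    """Inclusive (start, end) frame spans of allowed length that end at `current`."""
    spans = []
    for length in range(min_len, max_len + 1):
        start = current - length + 1
        if start >= 0:  # the span may not begin before the start of the input
            spans.append((start, current))
    return spans
```

At frame 4, for example, only the spans of length 3, 4, and 5 fit inside the input, while from frame 10 onward all nine lengths are available.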

Under these input conditions, partial intervals of 3 unit times, 4 unit times, ..., 11 unit times are each hypothesized, taking for example the current time as the reference. For each of these partial intervals, the similarity to each standard pattern is calculated from the partial similarities obtained at the sample times belonging to that interval. Because the partial intervals differ in length, the duration of the input word differs accordingly, and the inter-frame similarities must be resampled to absorb this variation. The resampling points of the inter-frame similarities are therefore set, as shown for example in FIG. 5, so that partial intervals of different lengths, all referenced to the current time, each receive the same number of points. The resampling points determine the positions of the indices $(i, n)$ of the inter-frame similarities $S_i^n$ used in the calculation, and the similarity $S_i$ for the partial interval is obtained from the inter-frame similarities selected in this way, as shown at C in FIG. 3. Similarities to the several standard patterns are thus obtained for each partial interval. The word category $i$ that gives the largest similarity value, whose value exceeds a predetermined threshold, and whose margin over the second-largest value is sufficiently wide is recognized as the candidate word for that partial interval.
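The acceptance rule for a candidate word (largest similarity, above a threshold, and clearly ahead of the runner-up) might look like the sketch below. The threshold and margin are hypothetical tuning values; the source gives no numbers for them.

```python
def accept_candidate(scores, threshold, margin):
    """Return the winning category index, or None if there is no clear winner.

    scores: unit similarity of each category for one partial interval.
    threshold, margin: assumed tuning parameters, not taken from the source.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = ranked[0]
    if scores[best] <= threshold:
        return None  # even the best match is too weak
    if len(ranked) > 1 and scores[best] - scores[ranked[1]] < margin:
        return None  # the runner-up is too close to trust the winner
    return best
```

Rejecting ambiguous intervals at this stage keeps the number of candidates passed on to the sequence evaluation small.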

Arranging the candidate word of each partial interval and the similarity with which it was obtained by the position of the interval gives the result shown in FIG. 6. The unit sequence evaluation and determination unit 5 selects the sequences of partial intervals that together span the speech interval, in this example (L, J, B), (L, G, C), (K, H, C), and (I, B), and computes the sum of the similarities for each sequence. The value of this sum indicates how well the sequence matches the input speech over its entire length. The sequences may also be evaluated by dynamic programming, as known from continuous word speech recognition based on VCV syllable units, or by a parallel search over the task domain. Intermediate results of the word similarities obtained up to a given time may also be used successively. In this way the recognition result can be obtained in real time, as soon as the speech input ends.
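One way to reuse intermediate results as the input arrives is a frame-synchronous dynamic programming update, sketched below. It keeps, for each frame boundary, the best similarity sum of any chain of candidate words covering the input so far. The data layout and names are assumptions for illustration, not the patent's own procedure.

```python
def extend_best_scores(best, candidates_ending_here, t):
    """Append the best chain covering frames 0..t to `best`.

    best[k] is (score, words) for the best chain covering frames 0..k-1, or None
    if no chain covers them; best[0] must be (0.0, []).
    candidates_ending_here: list of (start_frame, word, unit_similarity) for
    candidate intervals whose last frame is t.
    """
    assert len(best) == t + 1, "frames must be processed in order"
    winner = None
    for start, word, sim in candidates_ending_here:
        prev = best[start]
        if prev is None:
            continue  # nothing covers the frames before this candidate
        total = prev[0] + sim
        if winner is None or total > winner[0]:
            winner = (total, prev[1] + [word])
    best.append(winner)
    return best
```

Calling this once per analysis frame means the final entry is ready the moment the last frame has been processed, without enumerating every sequence afterwards.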

According to the present invention, therefore, the speech pattern of a word, the recognition unit, is represented by arranging feature vectors of a constant dimension along the frequency axis in time series, and each time the feature parameters corresponding to the frequency axis of the input speech are obtained, the similarity to each time point of the standard pattern of each word is calculated. Continuous speech can therefore be processed in real time.

Moreover, because the standard pattern of a word is formed by statistically processing feature vectors of a constant dimension along the frequency axis and arranging the results in time series for recognition, variation in the patterns caused by differences between the utterances of unspecified speakers can be absorbed simply and effectively. The practical advantage is considerable.

The present invention is not limited to the embodiment described above. For example, the word similarity may instead be calculated using the Mahalanobis distance or a statistical discriminant function. In that case the distance value or function value is mapped to a value that serves as the similarity. The recognition unit may also be a syllable, a phrase, or the like, and these may of course be combined. In short, the present invention can be implemented with various modifications without departing from its gist.
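As an illustration of mapping a distance to a similarity, the sketch below computes a squared Mahalanobis distance under a diagonal covariance and maps it through a decaying exponential. Both the diagonal covariance and the particular mapping are assumptions; the source says only that the distance value is mapped to a similarity.

```python
import math


def mahalanobis_squared(x, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances))


def distance_to_similarity(distance_squared):
    """One possible monotone mapping: 1 at zero distance, falling toward 0."""
    return math.exp(-0.5 * distance_squared)
```

Any monotonically decreasing mapping would serve the same purpose, since the later stages only compare and sum the resulting values.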

[Effects of the invention]

As described above, according to the present invention a standard pattern is represented as a feature vector sequence, a time series of constant-dimension feature vectors of feature parameters along the frequency axis. This makes it possible to cope with the wide variation in speech patterns among unspecified speakers and to recognize speech with high accuracy. In addition, the similarity for a partial interval is obtained as a function of the similarities to the feature vectors at the individual time points, and the candidate word of the partial interval is determined from it, so the similarity calculation can be broken down into parts performed at each analysis interval, which makes real-time processing possible. Highly accurate recognition can therefore be performed in real time, which is of great practical benefit.

[Brief explanation of the drawing]

The drawings show an embodiment of the present invention. FIG. 1 is a schematic configuration diagram of the embodiment device, FIG. 2 is a diagram showing the processing procedure of the device, and FIGS. 3 to 6 are diagrams showing the processing concepts at each stage of the recognition process.

1: acoustic analysis unit; 2: inter-frame similarity calculation unit; 3: standard pattern storage unit; 4: unit similarity calculation and determination unit; 5: unit sequence evaluation and determination unit.

Applicant's agent: patent attorney Takehiko Suzue.

[FIGS. 4 and 6: drawing labels only; partial intervals A, B, ... and candidate words (WL, WJ, WB)]

Procedural amendment, addressed to Kazuo Wakasugi, Commissioner of the Patent Office. 1. Indication of the case: Japanese Patent Application No. Sho 58-159316. 2. Title of the invention: Continuous speech recognition device. 3. Party making the amendment: patent applicant (307) Tokyo Shibaura Electric Co., Ltd. 4. Agent

Claims

[Claims]

(1) A continuous speech recognition device comprising: a memory in which each of a plurality of recognition-unit standard patterns is stored as a time series, consisting of a predetermined number of time points, of feature vectors of a constant dimension; means for determining the similarity between the feature vector of the input speech and the vector at each time point of the feature vector sequence of each standard pattern; means for calculating, from these similarities, the similarity of each standard pattern to a partial interval of the input speech, and for determining the name of the standard pattern giving the maximum value together with its unit similarity; and means for evaluating, from the unit similarities of the partial intervals in a sequence of partial intervals that forms an interval equal to the input speech interval, the sequence of standard pattern names constituting that sequence of partial intervals.

(2) The continuous speech recognition device according to claim 1, wherein the recognition unit is defined as a syllable, a word, or a phrase.

(3) The continuous speech recognition device according to claim 1, wherein the partial section of the input speech is defined as a section of recognition units that can be taken up to the present point in the analysis processing process.

(4) The continuous speech recognition device according to claim 1, wherein the degree of similarity is determined based on frame statistics.

(5) The continuous speech recognition device according to claim 1, wherein the feature vector sequence representing the standard pattern of a recognition unit consists of fewer vectors than the number of acoustic analysis intervals of that recognition unit.