JPS6011898A

JPS6011898A - Voice recognition equipment

Info

Publication number: JPS6011898A
Application number: JP58119327A
Authority: JP
Inventors: 洋一竹林; 篠田　英範
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1983-06-30
Filing date: 1983-06-30
Publication date: 1985-01-22
Also published as: JPH0469800B2

Abstract

(57) [Summary] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical field of the invention]

The present invention relates to a highly practical speech recognition device capable of recognizing the phonemes, words, and other units of input speech with high accuracy.

[Technical background of the invention]

In recent years, advances in electronics have led to the development of various speech recognition devices. In speaker-dependent word recognition, for example, a considerably high recognition rate can now be obtained even when the vocabulary is on the order of several hundred words. Speaker-independent word recognition has also become feasible for vocabularies of several dozen words.

However, when the target is a large vocabulary or continuously spoken speech, it is difficult to train on a word-by-word basis and to change the vocabulary to be recognized. It is therefore considered preferable to first perform recognition at the level of syllables or phonemes, the constituent elements of the input speech, and then to perform recognition using linguistic information and a word dictionary. It is known that, with phonemes as the basic unit of recognition, the recognition results for roughly 20 phoneme types suffice to recognize not only isolated words but also continuous speech. Because recognition performance in this case depends strongly on the phoneme recognition rate, how accurately and efficiently the speech analysis and phoneme recognition can be performed is an important issue.

[Problems with background technology]

Phonemes, the constituent elements of speech, can be classified in various ways according to how they are produced. They are first divided broadly into vowels and consonants, and consonants are further classified by voicing (voiced or unvoiced), place of articulation, manner of articulation, and so on. Among these phonemes, vowels are stationary and have a long duration.

Consonants, by contrast, have a short duration.

Voiceless plosives (p, t, k), for example, change rapidly and have a short duration, and their distinguishing features appear at the moment of the burst.

Conventional speech recognition, however, analyzes the input speech with a fixed analysis time length and a fixed analysis frame period, without regard to these properties of the individual phonemes, and performs recognition using the feature parameters so obtained.

As a result, when vowels are recognized, the analysis time length is too short: spectral analysis by the discrete Fourier transform or the like has poor frequency resolution, and the analysis result is destabilized by the pitch frequency of the glottal source, so misrecognition occurs easily. For rapidly changing consonants such as the voiceless plosives (p, t, k), on the other hand, the analysis time length and the analysis frame length are too long, so the important information about the change in the consonant is buried among other information in the feature parameters obtained by the analysis.
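The frequency-resolution problem can be made concrete with a little arithmetic. For a DFT over a window of length T, adjacent frequency bins are spaced 1/T apart. The sketch below is only illustrative: the function name is hypothetical, the window lengths are example values quoted in the embodiments of this description, and the extra broadening caused by the window function is ignored.

```python
def dft_bin_spacing_hz(window_ms):
    """Spacing between adjacent DFT bins for a window of the given length.

    Illustrative helper (not from the patent): a window of T seconds
    gives bins 1/T Hz apart, before any window-function broadening.
    """
    return 1000.0 / window_ms


# Example window lengths taken from the embodiments in this description.
for window_ms in (10.0, 20.0, 24.0):
    spacing = dft_bin_spacing_hz(window_ms)
    print(f"{window_ms:4.0f} ms window: {spacing:6.1f} Hz bin spacing")
```

With a 10 ms window the bins are 100 Hz apart, which is on the order of the pitch frequency of many voices, so individual pitch harmonics can shift the measured spectrum from frame to frame. A longer window narrows the spacing and steadies the vowel spectrum, at the cost of time resolution.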

There are also word recognition methods that match the pattern of the whole word without performing phoneme recognition. As the vocabulary grows, however, it comes to contain many similar words, which become difficult to tell apart. In such methods, too, the feature parameters of the input speech are extracted with a fixed analysis time length and analysis frame period, so the word feature vector tends to be unstable or to show few salient features.

[Purpose of the invention]

The present invention has been made in view of these circumstances. Its object is to provide a highly practical speech recognition device that makes effective use of the features of the phonemes constituting the input speech and can thereby recognize phonemes, words, and the like simply and with high accuracy.

[Summary of the invention]

The present invention obtains a time series of feature parameters of the input speech for each of a plurality of analysis times that differ in analysis time length and analysis frame period, extracts part of these feature parameter time series as a speech feature vector, and performs speech recognition by pattern matching.

That is, when the input speech is analyzed by, for example, digital signal processing, time series X1, X2, X3, ... of feature parameters of the input speech are obtained for each of a plurality of analysis times (T1, F1), (T2, F2), (T3, F3), ..., in which the analysis time length T and the analysis frame period F are varied, by means of a bank of digital bandpass filters, the discrete Fourier transform (DFT), cepstrum analysis, linear predictive analysis (LPC), or the like. Partial sequences P1i, P2i, P3i, ... (i = 1, 2, 3, ..., N) of these time series X1, X2, X3, ... are extracted as speech feature vectors, and a similarity measure such as a distance or likelihood is computed against a speech category dictionary prepared in advance, so as to recognize input speech such as phonemes, syllables, words, or sentences. Needless to say, the feature parameter time series X1, X2, X3, ... can also be obtained by analog processing.
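The idea of analyzing one input under several (T, F) pairs can be sketched in a few lines. This is a minimal illustration, not the patented circuit: the function names are hypothetical, the per-frame feature is simply mean power rather than a filter-bank or cepstral vector, the 10 kHz sampling rate corresponds to the 100 µs sampling interval given as an example in the description, and the two (T, F) pairs are the example values of 20 ms / 10 ms and 10 ms / 5 ms.

```python
import math

# (analysis time length T, frame period F) in milliseconds.
# Example values from the description; the labels are ours.
ANALYSIS_TIMES = {"long": (20.0, 10.0), "short": (10.0, 5.0)}


def frame_signal(samples, sample_rate_hz, length_ms, period_ms):
    """Cut samples into frames of length_ms, one frame every period_ms."""
    length = round(sample_rate_hz * length_ms / 1000)
    period = round(sample_rate_hz * period_ms / 1000)
    return [samples[start:start + length]
            for start in range(0, len(samples) - length + 1, period)]


def analyze(samples, sample_rate_hz=10_000):
    """Return one feature time series per (T, F) pair.

    The feature here is mean power per frame, a stand-in for the
    richer feature parameters the description mentions.
    """
    series = {}
    for name, (length_ms, period_ms) in ANALYSIS_TIMES.items():
        frames = frame_signal(samples, sample_rate_hz, length_ms, period_ms)
        series[name] = [sum(s * s for s in f) / len(f) for f in frames]
    return series


# 100 ms of a 200 Hz tone sampled at 10 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 10_000) for n in range(1000)]
result = analyze(tone)
print(len(result["long"]), len(result["short"]))
```

The same input yields two time series of different lengths and smoothness, from which partial sequences can be cut out as feature vectors.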

[Effect of the invention]

Thus, according to the present invention, the time series of feature vectors of the input speech is obtained for each of a plurality of analysis times that differ in analysis time length and analysis frame period. For relatively stable, stationary phonemes such as vowels, the time series of feature parameters obtained with a long analysis time length and a long analysis frame period is used as the feature vector, so that these phonemes can be recognized effectively and accurately. For rapidly changing, non-stationary phonemes such as plosive consonants, the time series of feature parameters obtained with a short analysis time length and a short analysis frame period is used as the feature vector, so that they can be recognized simply and with high accuracy. The input speech can then be recognized effectively by pattern matching the information on the phonemes recognized in this way, each with its own properties taken into account (that is, the time-series pattern of phonemes), against the speech category dictionary.

Moreover, because the time series of feature parameters is obtained while varying the analysis time as described above, the accuracy of vowel analysis is improved and recognition that gives weight to vowels becomes possible. Because the features of consonants are also extracted well, similar words that differ only in their consonants are easy to tell apart. The several kinds of feature vectors obtained for the different analysis times can thus be used effectively to achieve simple, accurate recognition, which is of great practical benefit. In short, high-accuracy speech recognition that makes effective use of the properties of the phonemes constituting speech becomes possible.

[Embodiments of the invention]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a schematic block diagram of an embodiment in which the phoneme is the basic unit of recognition. The input speech is fed to first and second analysis circuits 1 and 2, which differ in analysis time length and analysis frame period, and is analyzed at each of the prescribed analysis times to extract feature parameters. The analysis circuits 1 and 2 are configured as shown in FIGS. 2 to 5, described later, and apply digital bandpass filtering, discrete Fourier transform processing, cepstrum analysis, linear predictive analysis, analog bandpass filtering, or the like to the input speech to obtain time series of its feature parameters. The time series obtained by the analysis circuits 1 and 2 are temporarily stored in a phoneme feature vector memory, after which part of each is read out as a feature vector and supplied to similarity calculation circuits 3 and 4, respectively.

The similarity calculation circuit 3 takes the feature vectors obtained by the analysis circuit 1 with a long analysis time (for example, an analysis time length of 20 ms and a frame period of 10 ms), matches them against a phoneme dictionary 5 by similarity calculation, and recognizes phonemes of a stationary, stable nature such as vowels. The other similarity calculation circuit 4 takes the feature vectors obtained by the analysis circuit 2 at each short analysis time (for example, an analysis time length of 10 ms and a frame period of 5 ms), matches them against a phoneme dictionary 6 by similarity calculation, and recognizes rapidly changing phonemes such as consonants. The time series of phoneme recognition results obtained by the similarity calculation circuits 3 and 4, namely a vowel phoneme pattern and a consonant phoneme pattern, are stored in a phoneme pattern memory 7, for example as strings of phoneme symbols with their similarity values, and are combined there.
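Matching a feature vector against a phoneme dictionary might look like the sketch below. It is an assumption-laden illustration: the description does not fix the similarity measure used by circuits 3 and 4, so cosine similarity is swapped in here for simplicity, and the dictionary contents, vector sizes, and function names are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def recognize(feature, dictionary):
    """Return (label, similarity) of the best-matching dictionary entry."""
    scored = {label: cosine_similarity(feature, ref)
              for label, ref in dictionary.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy reference vectors (hypothetical values, 3 dimensions for brevity).
vowel_dictionary = {"a": [1.0, 0.2, 0.1], "i": [0.1, 1.0, 0.3]}
consonant_dictionary = {"p": [0.9, 0.1, 0.0], "k": [0.0, 0.2, 1.0]}

# One feature vector from each analysis path.
vowel_label, vowel_sim = recognize([0.9, 0.3, 0.1], vowel_dictionary)
cons_label, cons_sim = recognize([0.1, 0.1, 0.8], consonant_dictionary)

# Store symbols together with their similarity values, as the phoneme
# pattern memory is described as doing.
phoneme_pattern = [(cons_label, cons_sim), (vowel_label, vowel_sim)]
print(phoneme_pattern)
```

Keeping the similarity value next to each symbol lets the later word-level matching weigh uncertain phoneme decisions instead of treating every symbol as equally reliable.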

The phoneme pattern of the input speech obtained in the phoneme pattern memory 7 is then supplied to a matching circuit 8, where its similarity to a speech category dictionary 9, in which, for example, standard phoneme patterns of words are registered, is calculated and the speech is recognized.

In this device, stationary phonemes of the input speech such as vowels are recognized effectively from the time series of feature parameters that the analysis circuit 1 obtains with a long analysis time length T and a long frame period F.

In the analysis circuit 2, where the analysis time length T and the frame period F are set short, rapidly changing phonemes of the input speech such as consonants can be obtained effectively from the time series of feature parameters produced by an analysis that is sensitive to changes in the input speech. Because the input speech as a whole is recognized using the phoneme pattern formed from the time series of vowel and consonant phonemes obtained in these two ways, the recognition accuracy can be made sufficiently high. The recognition processing is also simple, which is a considerable practical advantage.

The analysis circuits 1 and 2 can be configured, for example, as follows. FIG. 2 shows an analysis circuit that extracts feature parameters by frequency analysis with a bank of digital bandpass filters. In this circuit, the input speech is converted to a digital signal by an A/D converter 11, for example every 100 µs, and is fed to a plurality of bandpass filters (DBPF) 12 arranged in parallel, which separate it into frequency components on 16 channels. The output of each channel is fed to a squaring circuit 13 to obtain its power. These power components are then fed to a first group of low-pass filters (LPF) 14 and a second group of LPFs 15, where they are smoothed, and the resulting phoneme feature parameters are stored in memories 16 and 17.

The first group of LPFs 14 has an analysis time length (integration time) of 24 ms and extracts feature parameters xi at a frame period of 8 ms. The second group of LPFs 15 has an analysis time length of 8 ms and extracts feature parameters yi at a frame period of 8 ms. The time series of the feature parameters xi obtained in this way is used to recognize phonemes such as vowels, and the time series of the feature parameters yi is used to recognize phonemes such as consonants.
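The effect of the two integration times on one filter channel can be sketched as follows. This is a simplified model under stated assumptions: the low-pass filters are approximated by plain moving averages over the integration time, the function name is hypothetical, and the input is a synthetic 2 ms burst standing in for the squared output of a single bandpass channel's signal at the 10 kHz rate implied by 100 µs sampling.

```python
def smoothed_power(channel_output, sample_rate_hz, integration_ms, period_ms):
    """Square one channel's output and average it over the integration time.

    A moving average stands in for the low-pass filter; one value is
    emitted per frame period.
    """
    window = round(sample_rate_hz * integration_ms / 1000)
    step = round(sample_rate_hz * period_ms / 1000)
    power = [s * s for s in channel_output]  # squaring circuit
    return [sum(power[start:start + window]) / window
            for start in range(0, len(power) - window + 1, step)]


# 48 ms of silence with a 2 ms burst starting at 24 ms.
signal = [0.0] * 480
for n in range(240, 260):
    signal[n] = 1.0

x = smoothed_power(signal, 10_000, integration_ms=24, period_ms=8)  # vowel path
y = smoothed_power(signal, 10_000, integration_ms=8, period_ms=8)   # consonant path
print(max(x), max(y))
```

The short integration time keeps the burst as a sharp peak in a single frame, while the long integration time spreads it thinly over several frames. Both paths reuse the same squared channel output, which is why the description notes that the extra computation is small.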

With an analysis circuit of this configuration, feature parameters xi and yi of different time lengths can be obtained simply by changing the smoothing frame length (integration time) of the LPFs 14 and 15, while the bandpass filters 12 keep the same configuration or are shared. The amount of computation for the analysis hardly increases, so the circuit is economical and practical. It is also effective to give the LPFs 14 and 15 separate frame periods, for example 8 ms and 4 ms.

When the analysis circuits 1 and 2 are built with a bank of analog bandpass filters, they may be configured as shown in FIG. 3. The input speech is passed through a group of 16-channel analog bandpass filters (BPF) 21 and then fed to a group of squaring circuits 22 to obtain the power. The outputs are filtered by analog low-pass filters (LPF) 23 and 24, converted to digital form by groups of A/D converters 25 and 26, and stored in the memories 16 and 17 as the feature parameters xi and yi. In this case, the time constant of the LPF 23 is set long, for example 30 ms, and the A/D converter 25 samples at a period of 15 ms, yielding feature parameters xi suited to recognizing phonemes such as vowels. The time constant of the LPF 24 is set short, for example 10 ms, and the A/D converter 26 samples at a period of, for example, 5 ms, yielding feature parameters suited to recognizing phonemes such as consonants.

With this configuration too, the BPF 21 can be shared, so feature extraction performance can be improved without greatly increasing the size of the circuit, and the circuit is economical and efficient. This embodiment can also be realized with switched-capacitor filters.

FIG. 4 shows an example of an analysis circuit based on cepstrum analysis.

In this case, the input speech received through the A/D converter 11 is split into two paths and supplied to a time window circuit 31 with, for example, a 24 ms Hanning window and to a time window circuit 32 with a 12 ms Hanning window. The signals cut out by these time windows are fed to discrete Fourier transformers (DFT) 33 and 34 and converted to spectra, and the outputs are passed through absolute value circuits 35 and 36 and logarithmic conversion circuits (LOG) 37 and 38, and then converted to cepstra by inverse discrete Fourier transformers (IDFT) 39 and 40. The output of the IDFT 39 is thus the cepstrum of the signal in the 24 ms time window, and its leading 16 cepstrum parameters are extracted as the phoneme feature parameters ci.

The IDFT 40 likewise yields the cepstrum of the signal in the 12 ms time window, and its leading 16 cepstrum parameters are taken as the phoneme feature parameters c'i. These are stored in the memories 16 and 17, respectively; the time series of the feature parameters ci is used to recognize phonemes such as vowels, and the time series of the feature parameters c'i is used to recognize phonemes such as consonants.
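The window, DFT, absolute value, logarithm, and inverse DFT chain can be sketched with the standard library alone. The following is a minimal illustration rather than the circuit of FIG. 4: the function names are hypothetical, a direct O(N^2) DFT is used for clarity instead of a fast transform, a small constant is added before the logarithm to avoid log(0), and "leading 16 parameters" is interpreted here as cepstral indices 0 to 15, which is our assumption.

```python
import cmath
import math


def hanning(n):
    """Hanning window of n points (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dft(x):
    """Direct discrete Fourier transform, O(N^2), for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, n_coeffs=16):
    """Window -> DFT -> |.| -> log -> inverse DFT, keeping n_coeffs values."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hanning(n))]
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(windowed)]
    # Inverse DFT of the log magnitude; only the requested indices are computed.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n_coeffs)]


# A 300 Hz tone with an offset, sampled at 10 kHz (illustrative input).
speech = [0.5 + math.sin(2 * math.pi * 300 * t / 10_000) for t in range(240)]

c_long = real_cepstrum(speech[:240])   # 24 ms window: vowel-oriented
c_short = real_cepstrum(speech[:120])  # 12 ms window: consonant-oriented
print(len(c_long), len(c_short))
```

Both paths deliver vectors of the same size, so the downstream similarity calculation can be identical for the two; only the window length, and with it the balance between frequency and time resolution, differs.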

Cepstrum processing with different analysis time windows thus also yields both a time series of feature parameters ci that reflects the features of vowels and a time series of feature parameters c'i that reflects the features of consonants.

FIG. 5 shows a modification of the analysis circuit using the bank of digital bandpass filters described above. The outputs of the LPFs 15 are fed, a predetermined number at a time, to adders (ADD) 18, whose sums give 4-channel feature parameters yi that are stored in a memory 19. In this example, the feature parameter for recognizing stationary phonemes such as vowels has one frame and 16 dimensions in the frequency direction.

The feature parameter for recognizing rapidly changing phonemes such as consonants, by contrast, spans four frames and has four dimensions in the frequency direction. Although the frequency resolution of this consonant feature parameter is reduced, its feature vector extends over four frames and is sensitive to change, so it serves well for recognizing consonants and the like. The vowel feature parameter covers only one frame, but because vowels are stationary, a sufficiently high frequency resolution is all that is needed. The configuration can therefore be made considerably simpler than the circuit of FIG. 2. As a further modification of this embodiment, the LPFs 14 can be omitted, and the outputs of the LPFs 15 can instead be added over several frames and transferred to the memory 16.
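The trade between frequency and time resolution in FIG. 5 can be sketched as below. The grouping is an assumption: the description says only that a predetermined number of LPF outputs are added, so summing four adjacent channels is our choice, and the function names and toy data are hypothetical.

```python
def merge_channels(frame16, group=4):
    """Sum adjacent channels in groups, e.g. 16 channels -> 4 channels."""
    return [sum(frame16[i:i + group]) for i in range(0, len(frame16), group)]


def consonant_vector(frames16, n_frames=4):
    """4 consecutive frames x 4 merged channels -> one 16-dimensional vector."""
    vector = []
    for frame in frames16[:n_frames]:
        vector.extend(merge_channels(frame))
    return vector


def vowel_vector(frame16):
    """1 frame x 16 channels -> one 16-dimensional vector."""
    return list(frame16)


# Toy filter-bank output: 4 frames, channel c carries the value c.
frames = [[float(c) for c in range(16)] for _ in range(4)]

print(merge_channels(frames[0]))
print(len(consonant_vector(frames)), len(vowel_vector(frames[0])))
```

Both vectors end up with 16 dimensions, so the two recognition paths can use dictionaries and similarity calculations of the same size even though one favors frequency detail and the other favors change over time.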

The analysis circuits 1 and 2 can thus be built relatively simply, for example by sharing part of their circuitry. The analysis time length T, the frame period F, and so on can easily be changed by varying the LPF time constants, the length of the time window, the sampling rate of the A/D converters, and the like, so feature parameters suited to the nature of each phoneme can be obtained simply and effectively.

The embodiments so far use the phoneme as the basic unit of recognition, but the present invention can also be implemented as a speech recognition device that performs word-level matching. FIG. 6 is a schematic block diagram of such a device. The analysis circuits 1 and 2 are built from, for example, a bank of 16-channel digital bandpass filters as shown in FIG. 2. The analysis circuit 1 obtains a stationary time series of feature parameters with, for example, an analysis time length of 30 ms and a frame period of 15 ms. The analysis circuit 2 obtains a time series of feature parameters sensitive to changes in the speech with an analysis time length of 10 ms and a frame period of 5 ms. Resampling circuits 41 and 42 resample these time series at equal intervals from the start frame to the end frame of the utterance, which are determined from the power information of the input speech, to obtain time-frequency patterns of, for example, 16 dimensions in the frequency direction by 16 dimensions along the time axis, 256 dimensions in total. These time-frequency patterns are stored in word feature vector memories 43 and 44, respectively.
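Resampling an utterance of arbitrary length to a fixed 16 x 16 pattern might be sketched as follows. The details are assumptions: the nearest frame is picked at each resampling point rather than interpolating, the start and end frames are passed in directly instead of being detected from power, and the function names are hypothetical.

```python
def resample_frames(frames, start, end, n_out=16):
    """Pick n_out frames at equal intervals from frames[start] to frames[end]."""
    span = end - start
    return [frames[start + round(k * span / (n_out - 1))] for k in range(n_out)]


def word_vector(frames, start, end, n_out=16):
    """Flatten the resampled frames into one time-frequency vector."""
    return [value
            for frame in resample_frames(frames, start, end, n_out)
            for value in frame]


# Toy input: 40 frames of 16 channels; frame t holds the value t everywhere.
frames = [[float(t)] * 16 for t in range(40)]

# Suppose the utterance was found to run from frame 5 to frame 35.
pattern = word_vector(frames, start=5, end=35)
print(len(pattern), pattern[0], pattern[-1])
```

Because every word is reduced to the same 256 dimensions regardless of how long it was spoken, it can be compared directly with fixed-size entries in the word dictionaries.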

These time-frequency patterns differ because of the different analysis time lengths and frame periods used in the respective analyses. The pattern obtained in the vector memory 43 has poorer resolution along the time axis than the pattern obtained in the vector memory 44 and is, so to speak, a blurred pattern, but it is correspondingly more stable. Similarity calculation circuits 45 and 46 calculate the similarity between these time-frequency patterns and word dictionaries 47 and 48 to obtain the word similarities.

Useful similarity calculation methods here include the multiple similarity method and statistical measures such as the Mahalanobis distance.

An overall decision circuit 49 then evaluates and recognizes the word comprehensively, using the similarity values and similarity differences obtained for each word category with these different feature vectors. It is desirable to use the similarity results based on the stable word feature vector, obtained with a long analysis time length and frame period, for coarse classification into word categories, and to use the similarity results based on the word feature vector obtained with a short analysis time length and frame period for discriminating between similar words. The several kinds of matching in this embodiment can also be performed hierarchically.
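A hierarchical decision of this kind can be sketched as a coarse shortlist followed by a fine comparison. Everything concrete here is assumed for illustration: the function name, the shortlist size, the rejection margin, and the similarity values are hypothetical, and the description does not specify how circuit 49 combines the similarity values and differences.

```python
def overall_decision(sim_long, sim_short, shortlist_size=2, min_margin=0.0):
    """Coarse classification on the stable (long-window) similarities,
    then fine discrimination on the short-window similarities.

    Returns the chosen word, or None if the best and second-best
    candidates differ by less than min_margin.
    """
    shortlist = sorted(sim_long, key=sim_long.get, reverse=True)[:shortlist_size]
    ranked = sorted(shortlist, key=lambda word: sim_short[word], reverse=True)
    best = ranked[0]
    if len(ranked) > 1:
        margin = sim_short[best] - sim_short[ranked[1]]  # similarity difference
        if margin < min_margin:
            return None
    return best


# Hypothetical similarities for one utterance.
sim_long = {"hachi": 0.92, "hashi": 0.91, "ichi": 0.60, "san": 0.20}
sim_short = {"hachi": 0.70, "hashi": 0.88, "ichi": 0.95, "san": 0.10}

print(overall_decision(sim_long, sim_short))
print(overall_decision(sim_long, sim_short, min_margin=0.5))
```

In this toy case the long-window similarities cannot separate the two similar words, but they do exclude a word that scores well only on the short-window path; the short-window similarities then settle the choice between the remaining candidates.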

The analysis circuits 1 and 2 used for this word matching can of course also be analysis circuits using analog bandpass filters, cepstrum analysis, the discrete Fourier transform, or the like, as shown in FIGS. 3 to 5.

As described above, the present invention recognizes speech by making full use of the features of the phonemes that constitute it, and therefore yields highly accurate recognition results. These results are obtained effectively with simple processing, which is a great practical advantage.

The present invention is not limited to the embodiments described above. For example, the analysis time length and the frame period may be chosen according to the specifications of the input speech, and any number of analysis times defined by analysis time length and frame period may be used, provided there is more than one.

In short, the present invention can be practiced with various modifications without departing from its gist.

[Brief explanation of the drawing]

FIG. 1 is a schematic block diagram of one embodiment of the present invention, FIGS. 2 to 5 each show an example configuration of the analysis circuit, and FIG. 6 is a schematic block diagram of another embodiment of the present invention.

1, 2: analysis circuits; 3, 4: similarity calculation circuits; 5, 6: phoneme dictionaries; 7: phoneme pattern memory; 8: matching circuit; 9: speech category dictionary; 11: A/D converter; 12: DBPF; 13: squaring circuit; 14, 15: LPF; 16, 17, 19: feature vector memories; 21: BPF; 22: squaring circuit; 23, 24: LPF; 25, 26: A/D converters; 31, 32: time window circuits; 33, 34: DFT; 35, 36: absolute value circuits; 37, 38: logarithmic conversion circuits; 39, 40: IDFT; 41, 42: resampling circuits; 43, 44: word feature vector memories; 45, 46: similarity calculation circuits; 47, 48: word dictionaries; 49: overall decision circuit.

Claims

[Claims]

(1) A speech recognition device comprising: means for analyzing input speech at each of a plurality of analysis times defined by different analysis time lengths and analysis frame periods, to obtain time series of feature parameters of the input speech; and means for extracting part of these feature parameter time series as a speech feature vector and matching it against a pre-registered speech category dictionary.

(2) The speech recognition device according to claim 1, wherein the means for obtaining the time series of feature parameters of the input speech applies different sampling processes to the same intermediate result obtained in the course of a series of analysis steps on the input speech, thereby obtaining time series of feature parameters for each of the analysis times that differ in analysis time length and analysis frame period.