JPH0323920B2

Info

Publication number: JPH0323920B2
Application number: JP56062157A
Authority: JP
Inventors: Atsukimi Kobayashi
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1981-04-24
Filing date: 1981-04-24
Publication date: 1991-03-29
Also published as: JPS57177198A

Description

[Detailed description of the invention]

The present invention relates to a speech recognition processing device. More specifically, it relates to a speech recognition device that converts the feature parameter vector of each frame of uttered speech into subclassified sound type information (finely subdivided phonemes) and performs recognition using the time series of that information. In the invention, the conversion into subclassified sound type information (sound type conversion) is performed separately for each of a plurality of feature parameter vectors, and the resulting plurality of time series are used together. This yields a synergistic effect equivalent to increasing the number of subclassified sound types.

In a conventional word speech recognition device, recognition over a large vocabulary requires that the speech patterns of all registered words be held in a dictionary and that the input speech be matched against every one of them. The word dictionary memory and the amount of computation therefore become enormous, which imposes severe constraints on a practical device, and the recognition rate is still not sufficient.

In a device that recognizes input speech one monosyllable at a time, only the speech patterns of about 100 syllables need to be held in the dictionary and matched against the input in the case of Japanese, so dictionary memory and computation are not a serious problem. However, syllables containing consonants, plosives in particular, are difficult to recognize with current technology. Several attempts have been made, but the recognition rate has not yet reached a practical level.

Meanwhile, a method has been proposed in which the feature parameter vector of each frame of uttered speech is converted into subclassified sound type information and recognition is performed using the time series of that information (Proceedings of the Acoustical Society of Japan, October 1980, pp. 399-400). As described in that paper, a subclassified sound type is a class obtained by dividing spectra on a physical scale without regard to phonemes. For example, the spectral patterns of all frames analyzed with a 13-channel filter (200 to 3400 Hz) are classified by the k-means clustering method, and the spectral pattern at the center of each cluster is taken as a subclassified sound type.

An object of the present invention is to reduce the loss of speech feature information when uttered speech is converted frame by frame into subclassified sound type information. Two or more different feature parameter vectors are used, so that two or more kinds of subclassified sound type information are obtained, one per feature vector, and the synergistic effect of these kinds reduces the loss.

To achieve this object, the present invention is a speech recognition processing device having a speech analysis processing unit that analyzes input speech for each frame at a predetermined time interval, a feature extraction processing unit that extracts speech features from the analysis results, a time series conversion processing unit that converts the input speech into a time series of subclassified sound types obtained by subdividing phonemes, a word dictionary that stores in advance the time series of subclassified sound types of a plurality of registered utterances, a matching processing unit that calculates the distance between the input speech and each registered utterance by matching the time series of subclassified sound types of the input speech against those of the registered utterances in the word dictionary, and a determination processing unit that determines which registered utterance has the smallest distance to the input speech.

The device is characterized as follows. The time series conversion processing unit uses a plurality of feature parameters and is configured to perform the conversion into subclassified sound types for each feature. A subclassified sound type dictionary unit, coupled to the time series conversion processing unit, holds subclassified sound type information created in advance for each feature. The word dictionary holds subclassified sound type information for each feature of the registered words. During recognition, the time series conversion processing unit uses the subclassified sound type dictionary unit to convert the input speech into a time series of subclassified sound type information for each feature, and the matching processing unit matches each resulting time series against the corresponding subclassified sound type information in the word dictionary.

The principle of the present invention is explained first. Let $P_I$ be the set of subclassified sound types obtained from feature I. The time series of feature parameter vectors within the speech interval, extracted by the speech feature extraction unit, is converted by the time series conversion unit into a time series $V_I$ of subclassified sound type numbers.

$$V_I = P_{a_1}^{(1)} P_{a_2}^{(2)} \cdots P_{a_n}^{(n)} \qquad \text{(Equation 1)}$$

$$\forall\, P_{a_i}^{(i)} \in P_I \quad (1 \le a_i \le M)$$

Here the superscript $(i)$ is the frame number of the input speech, and $M$ is the total number of subclassified sound types for feature I.

Similarly, let $P_{II}$ be the set of subclassified sound types obtained from feature II. The time series of feature parameter vectors within the speech interval, extracted by the speech feature extraction unit, is converted by the time series conversion unit into a time series $V_{II}$ of subclassified sound type numbers.

$$V_{II} = Q_{b_1}^{(1)} Q_{b_2}^{(2)} \cdots Q_{b_n}^{(n)} \qquad \text{(Equation 2)}$$

$$\forall\, Q_{b_i}^{(i)} \in P_{II} \quad (1 \le b_i \le N)$$

Here the superscript $(i)$ is the frame number of the input speech, and $N$ is the total number of subclassified sound types for feature II.

In the time series $V_I$ obtained from feature I, the subclassified sound type of each frame is one element of the set $P_I$, and $P_{a_i}^{(i)}$ describes the information about feature I in the i-th frame of the input speech. If the time series $V_{II}$, obtained from the other feature II, is added, one of the N subclassified sound types in the set $P_{II}$ is selected for the i-th frame of $V_{II}$. When the subclassified sound type number obtained from feature I for the i-th frame is used together with the number obtained from feature II, the number of possible combinations of the two numbers is (M×N).

It follows that using the subclassified sound types obtained from feature I and feature II has the same effect as having (M×N) subclassified sound types. For example, if the numbers of subclassified sound types obtained from feature I and feature II are M = 100 and N = 100, the device effectively has 10,000 subclassified sound types, and uttered speech can be expressed as a time series of items selected from those 10,000 types. Compared with the conventional case in which both feature I and feature II are described by a single set of 100 subclassified sound types, the loss of speech feature information is greatly reduced. Moreover, compared with holding 10,000 subclassified sound types, only 200 (= M + N) need to be held in this example, so 1/50 of the memory suffices.
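
The arithmetic in this example can be checked with a short Python sketch. The variable names are illustrative and do not appear in the patent; the values of M and N are the ones given in the text.

```python
# Illustrative check of the codebook-size argument (names are assumptions).
M = 100  # number of subclassified sound types for feature I
N = 100  # number of subclassified sound types for feature II

# Each frame is labelled by a pair (a_i, b_i), so the number of
# distinguishable frame labels is the product of the two set sizes.
effective_types = M * N

# Only the two separate sets of sound types have to be stored.
stored_types = M + N

memory_ratio = stored_types / effective_types

print(effective_types)  # 10000
print(stored_types)     # 200
print(memory_ratio)     # 0.02, i.e. 1/50
```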

The present invention is described below with reference to the drawing.

The figure is a block diagram of a speech recognition processing device according to an embodiment of the present invention. In the figure, 1 is an input medium unit, 2 is a speech analysis processing unit, 3 is a feature extraction processing unit, 4 is a time series conversion processing unit, 5 is a matching processing unit, 6 is a determination processing unit, 7-1 to 7-n are subclassified sound type dictionary units that hold the subclassified sound types, 8-1 to 8-n are word dictionary units, and 9 is the result output.

First, uttered speech is input through input medium unit 1 (a microphone or the like) and is analyzed in speech analysis processing unit 2 for each frame, at intervals of several ms to several tens of ms, by a method such as LPC analysis or FFT analysis.

Next, in feature extraction processing unit 3, speech feature parameters are extracted from the analysis results according to the characteristics of a plurality of band-pass filters spaced at equal intervals on a frequency scale based on auditory characteristics, or several peak frequencies (formant frequencies) of the obtained spectral shape are extracted. Feature extraction processing unit 3 further extracts n different pieces of feature information from n bands, for example a high-frequency band and a low-frequency band, each described by a plurality of parameters, and outputs the time series of n feature parameter vectors.
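
As a rough illustration of extracting n feature parameter vectors from n bands, the sketch below splits the filter outputs of one frame into contiguous bands. The function name `split_into_bands` and the even split are assumptions made for the example; the patent does not specify how the bands are delimited.

```python
def split_into_bands(frame, n):
    """Split one frame's filter outputs into n contiguous bands.

    Hypothetical helper: the channels are divided as evenly as possible,
    with any remainder going to the lowest bands.
    """
    size, extra = divmod(len(frame), n)
    bands, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        bands.append(frame[start:end])
        start = end
    return bands

# One frame with six filter outputs, split into a low and a high band (n = 2).
frame = [0.2, 0.5, 0.9, 0.7, 0.4, 0.3]
low, high = split_into_bands(frame, 2)
print(low)   # [0.2, 0.5, 0.9]
print(high)  # [0.7, 0.4, 0.3]
```

Applying this to every frame gives one sequence of vectors per band, which corresponds to the n time series described above.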

In addition, speech data uttered in advance by the speaker (registration data) is processed by speech analysis processing unit 2 and feature extraction processing unit 3 to produce time series of n feature parameter vectors. From all of this time series data, frame-level subclassified sound types are created for each feature and registered, together with identification numbers, in subclassified sound type dictionary units 7-1 to 7-n. Cluster analysis, for example, is a known way to create the subclassified sound types.
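
A minimal sketch of creating the subclassified sound types for one feature by cluster analysis is given below. It uses plain k-means, the method mentioned for the prior-art paper; the function names, the evenly spaced initialization, and the toy data are assumptions for illustration, not details taken from the patent.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_codebook(frames, k, iterations=20):
    """Cluster frame vectors with k-means and return the k cluster centers.

    Initial centers are taken at evenly spaced positions in the data,
    a simplification chosen to keep the sketch deterministic.
    """
    n = len(frames)
    centroids = [list(frames[j * n // k]) for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in frames:
            nearest = min(range(k), key=lambda j: squared_distance(f, centroids[j]))
            clusters[nearest].append(f)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy registration data for one feature: two well-separated groups.
frames = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
          [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
codebook = build_codebook(frames, k=2)
# Entry j of the codebook would be registered with identification number j + 1.
```

Running this once per feature would give the contents of the n dictionary units, one codebook per feature.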

Using the n kinds of subclassified sound types registered above, speech is converted into subclassified sound type numbers separately for each of the n feature parameter vector sequences. This conversion is performed by time series conversion processing unit 4, which outputs a time series of subclassified sound type numbers for each of the n kinds of feature information.
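
The conversion can be sketched as a nearest-entry lookup performed independently for each feature. The patent does not state the distance measure used for the lookup; squared Euclidean distance, the function names, and the toy codebooks below are all assumptions.

```python
def nearest_code(vector, codebook):
    """Return the 1-based number of the codebook entry closest to vector."""
    best = min(
        range(len(codebook)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(vector, codebook[j])),
    )
    return best + 1

def to_code_sequences(feature_streams, codebooks):
    """Convert n streams of feature vectors into n streams of code numbers."""
    return [
        [nearest_code(v, cb) for v in stream]
        for stream, cb in zip(feature_streams, codebooks)
    ]

# Toy setup with n = 2 features and three frames.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # stands in for dictionary unit 7-1
    [[0.0], [5.0], [10.0]],     # stands in for dictionary unit 7-2
]
feature_streams = [
    [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]],  # feature 1, frames 1 to 3
    [[9.0], [4.0], [0.5]],                 # feature 2, frames 1 to 3
]
sequences = to_code_sequences(feature_streams, codebooks)
print(sequences)  # [[1, 2, 1], [3, 2, 1]]
```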

The same processing is applied to utterances of the words to be recognized, converting each into n time series of subclassified sound type numbers, and these are registered in advance in word dictionary units 8-1 to 8-n. That is, word dictionary units 8-1 to 8-n each hold the time series of subclassified sound type numbers for one of the n features of the registered speech uttered by the speaker.

Matching between the n word dictionary units 8-1 to 8-n and the n time series of subclassified sound type numbers for the n features of the input speech to be recognized is computed in matching processing unit 5 by a method such as the DP method. In DP matching, for example, the distance between a frame of the input speech and a frame of a registered utterance, which the DP method requires, is obtained by adding the per-feature distances between those two frames over all features. Each per-feature distance is obtained by looking up, with the subclassified sound type number of the input frame and that of the registered frame, a distance matrix that stores the distances between the subclassified sound types of that feature. Finally, determination processing unit 6 outputs the name of the registered word with the minimum distance as the recognition result.
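
The frame distance and the DP matching described above can be sketched as follows. The patent only says "the DP method"; the basic symmetric recurrence without slope constraints used here, the function names, and the toy distance matrices are assumptions for illustration.

```python
def frame_distance(input_codes, ref_codes, distance_matrices):
    """Sum the per-feature distances looked up from each distance matrix.

    input_codes and ref_codes hold one 1-based sound type number per feature.
    """
    return sum(
        dm[a - 1][b - 1]
        for dm, a, b in zip(distance_matrices, input_codes, ref_codes)
    )

def dp_distance(input_seq, ref_seq, distance_matrices):
    """Accumulated distance of the best DP alignment between two sequences."""
    rows, cols = len(input_seq), len(ref_seq)
    inf = float("inf")
    g = [[inf] * (cols + 1) for _ in range(rows + 1)]
    g[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(input_seq[i - 1], ref_seq[j - 1], distance_matrices)
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[rows][cols]

# Toy example: 2 features with 2 sound types each (hypothetical distances).
distance_matrices = [
    [[0, 1], [1, 0]],  # feature 1
    [[0, 2], [2, 0]],  # feature 2
]
unknown = [(1, 1), (1, 1), (2, 2)]
word_dictionary = {
    "A": [(1, 1), (2, 2)],
    "B": [(2, 1), (2, 2)],
}
scores = {w: dp_distance(unknown, ref, distance_matrices)
          for w, ref in word_dictionary.items()}
result = min(scores, key=scores.get)
print(scores)  # {'A': 0.0, 'B': 2.0}
print(result)  # A
```

Because each frame distance is a sum of table lookups, no spectral distance has to be computed at recognition time once the input has been converted to sound type numbers.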

The present invention is described below using an operation example.

To make the example easy to follow, two kinds of feature parameter vectors are analyzed: the first feature parameter has 50 subclassified sound types and the second has 80. The input speech is one second long, and speech analysis processing unit 2 analyzes it in frames at 10 ms intervals. The words to be recognized are the three words "Osaka", "Kanazawa", and "Tokyo". Matching uses the linear matching method.

Suppose that the following time series of subclassified sound type information (specifically, time series of subclassified sound type numbers) covering 100 frames (= 1 second / 10 ms) is obtained for the unknown input speech.

Time series information for the first feature (obtained with subclassified sound type dictionary unit 7-1):

    Frame     (1)  (2)  (3)  (4)  ...  (100)
    Number     43    5   18   36  ...    24

Time series information for the second feature (obtained with subclassified sound type dictionary unit 7-2):

    Frame     (1)  (2)  (3)  (4)  ...  (100)
    Number     14   59   64   75  ...    33

Suppose also that the word dictionary units hold the following time series of subclassified sound type information for the three words "Osaka", "Kanazawa", and "Tokyo", uttered in advance.

Time series information for the first feature, stored in word dictionary 8-1:

    Word       (1)  (2)  (3)  (4)  ...  (100)
    Osaka       20   50   40   25  ...    49
    Kanazawa     9   24   31   15  ...    30
    Tokyo       40    7   15   38  ...    29

Time series information for the second feature, stored in word dictionary 8-2:

    Word       (1)  (2)  (3)  (4)  ...  (100)
    Osaka       25   18   78   25  ...    59
    Kanazawa    70   28   31   42  ...    11
    Tokyo       12   63   58   73  ...    36

Matching processing unit 5 obtains the distance between the time series of subclassified sound type information for the unknown input speech and each time series stored in the word dictionary units by looking up the distance matrix.

Suppose that the following distances are obtained for the first feature:

    Word       (1)  (2)  (3)  (4)  ...  (100)
    Osaka       23   45   22   11  ...    25
    Kanazawa    34   19   13   21  ...     6
    Tokyo        3    2    3    2  ...     5

and the following distances for the second feature:

    Word       (1)  (2)  (3)  (4)  ...  (100)
    Osaka       11   41   14   50  ...    26
    Kanazawa    56   31   33   33  ...    22
    Tokyo        2    4    6    2  ...     3

In this case, adding up the distances over every feature and every frame for each candidate word shows that the word "Tokyo" has the smallest distance to the unknown input speech, and "Tokyo" is output as the recognition result.

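The summation in this example can be sketched as below. Only the distances actually listed in the text (frames 1 to 4 and frame 100) are used; frames 5 to 99 are elided in the source and are deliberately left out, so the totals are partial sums for illustration rather than the full 100-frame distances. The variable names are assumptions.

```python
# Per-frame distances listed in the text for frames 1-4 and frame 100.
# Frames 5-99 are not given in the source and are omitted here.
feature1 = {
    "Osaka":    [23, 45, 22, 11, 25],
    "Kanazawa": [34, 19, 13, 21, 6],
    "Tokyo":    [3, 2, 3, 2, 5],
}
feature2 = {
    "Osaka":    [11, 41, 14, 50, 26],
    "Kanazawa": [56, 31, 33, 33, 22],
    "Tokyo":    [2, 4, 6, 2, 3],
}

# Linear matching: frame i of the input is compared with frame i of the
# registered word, and all per-feature, per-frame distances are added.
totals = {w: sum(feature1[w]) + sum(feature2[w]) for w in feature1}
best = min(totals, key=totals.get)

print(totals)  # {'Osaka': 268, 'Kanazawa': 268, 'Tokyo': 32}
print(best)    # Tokyo
```

Even on the listed frames alone, "Tokyo" is far closer than the other two words, which is consistent with the result stated in the text.
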
As described above, according to the present invention, combining a small number of kinds of subclassified sound types creates a situation in which a large number of subclassified sound types effectively exist, so the loss of feature information in uttered speech can be suppressed. The amount of dictionary memory can also be reduced.

[Brief explanation of the drawing]

The figure is a block diagram of a speech recognition processing device according to an embodiment of the present invention. In the figure, 1 is an input medium unit, 2 is a speech analysis processing unit, 3 is a feature extraction processing unit, 4 is a time series conversion processing unit, 5 is a matching processing unit, 6 is a determination processing unit, 7-1 to 7-n are subclassified sound type dictionary units, 8-1 to 8-n are word dictionary units, and 9 is the result output.

Claims

[Claims]

1. A speech recognition processing device having: a speech analysis processing unit 2 that analyzes input speech for each frame at a predetermined time interval; a feature extraction processing unit 3 that extracts speech features from the analysis results; a time series conversion processing unit 4 that converts the input speech into a time series of subclassified sound types obtained by subdividing phonemes; a word dictionary 8 that stores in advance the time series of subclassified sound types of a plurality of registered utterances; a matching processing unit 5 that calculates the distance between the input speech and the plurality of registered utterances by matching the time series of subclassified sound types of the input speech against the time series of subclassified sound types of the registered utterances in the word dictionary 8; and a determination processing unit 6 that determines which of the plurality of registered utterances has the smallest distance to the input speech; the device being characterized in that the time series conversion processing unit 4 uses a plurality of feature parameters and is configured to perform the conversion into subclassified sound types for each feature; that the device further comprises subclassified sound type dictionary units 7-1 to 7-n, which are coupled to the time series conversion processing unit 4 and hold subclassified sound type information created in advance for each feature; that the word dictionary 8 consists of word dictionaries 8-1 to 8-n, which hold subclassified sound type information for each feature of the registered words; and that, during speech recognition processing, the time series conversion processing unit 4 converts the input speech into a time series of subclassified sound type information for each feature using the subclassified sound type dictionary units 7-1 to 7-n, and the matching processing unit 5 matches the time series of subclassified sound type information obtained for each feature against the corresponding subclassified sound type information in the word dictionaries 8-1 to 8-n.