JPS6355599A

JPS6355599A - Voice recognition equipment

Info

Publication number: JPS6355599A
Application number: JP61199637A
Authority: JP
Inventors: 紀代原
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1986-08-26
Filing date: 1986-08-26
Publication date: 1988-03-10

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] Field of industrial application: The present invention relates to a speech recognition device designed to improve the recognition rate.

Prior art: Phoneme recognition is a field expected to be put to practical use as a man-machine interface, for example for input to word processors and computers.

As the unit for recognizing input speech, a speech recognition device may use monosyllables (CV, where C denotes a consonant and V a vowel), CV and VCV units, or phonemes (C and V).

Devices also divide into a speaker-dependent (registration) type, in which the user utters and registers reference speech in advance before recognition processing begins, and a speaker-independent type, in which universal standard patterns are prepared by statistical processing of a large amount of utterance data so that no user registration is required. For feature extraction, methods based on linear predictive analysis or a filter bank are the mainstream. In this specification, both the conventional example and the embodiment describe a speaker-independent speech recognition device that uses CV and VCV as the recognition units and linear predictive analysis as the feature extraction method, but the invention is not limited to these. An example of a conventional speech recognition device is described below with reference to the drawings.

FIG. 2 is a block diagram showing the configuration of a speaker-independent speech recognition device. Speech input from the speech input terminal 1 is analyzed in the linear predictive analysis unit 2 by the autocorrelation method of order 15, with a window length of 20 msec and a frame shift of 5 msec, and is output as a sequence of 16 parameters per frame: 15 cepstral coefficients plus the residual power (the 0th-order cepstral coefficient). (For linear predictive analysis, see Markel and Gray, translated by 鈴木久喜, "Linear Prediction of Speech", Corona Publishing, 1980.)
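The analysis described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the 8 kHz sampling rate, the Hamming window, the use of the log residual energy as the 0th parameter, and the function names `levinson_durbin`, `lpc_to_cepstrum`, and `analyze` are all assumptions, since the text only fixes the window length, frame shift, order, and number of output parameters.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err) where A(z) = 1 + sum_k a[k] z^-k and err is the
    residual (prediction error) energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_to_cepstrum(a, order):
    """Cepstral coefficients c[1..order] of the all-pole model 1/A(z)."""
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n))
        c[n] = -a[n] - acc / n
    return c

def analyze(signal, fs=8000, win_ms=20, shift_ms=5, order=15):
    """Return an (n_frames, order + 1) array of parameters per frame."""
    win = int(fs * win_ms / 1000)           # 20 msec window
    shift = int(fs * shift_ms / 1000)       # 5 msec frame shift
    window = np.hamming(win)                # assumed window shape
    n_frames = max(0, 1 + (len(signal) - win) // shift)
    out = np.zeros((n_frames, order + 1))
    for t in range(n_frames):
        frame = signal[t * shift:t * shift + win] * window
        r = np.array([np.dot(frame[:win - lag], frame[lag:])
                      for lag in range(order + 1)])
        r[0] += 1e-9                        # guard against all-zero frames
        a, err = levinson_durbin(r, order)
        c = lpc_to_cepstrum(a, order)
        c[0] = np.log(max(err, 1e-12))      # assumed form of the residual power
        out[t] = c
    return out
```

With these assumed settings a one-second signal yields 197 frames of 16 parameters each, which is the kind of parameter sequence the later stages consume.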

Next, the silence detection unit 3 uses the residual power to determine the beginning and end of the word and any silent portions within the word. The vowel recognition unit 4 reads coefficients from the discriminant function storage unit 5, which stores the coefficients of vowel discriminant functions obtained in advance by processing a large amount of utterance data (see Yasuda, "Social Statistics" (社会統計学), Chapter 2, Section 7, Maruzen, 1969), and performs vowel recognition frame by frame on all portions other than the silent portions detected by the silence detection unit 3. The stationary point detection unit 6 extracts the stable results from the frame-by-frame vowel recognition results obtained by the vowel recognition unit 4 and outputs them as a stationary point sequence. The number of stationary points indicates the number of syllables in the input speech. The phoneme recognition unit 8 performs DP matching between the standard patterns read from the standard pattern storage unit 9 and the parameter sequence obtained from the input speech, and outputs the standard pattern with the minimum distance as the recognized phoneme sequence.
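One way to picture stationary point detection is as a search for runs of frames that carry the same vowel label. The sketch below is only an assumption about how "stable" results might be picked out: the text does not give the stability criterion, so the minimum run length `min_run`, the use of `None` for silent frames, and the function name `detect_stationary_points` are hypothetical.

```python
from itertools import groupby

def detect_stationary_points(labels, min_run=4):
    """Return (center_frame, vowel) for each sufficiently long vowel run.

    labels: per-frame vowel labels, with None marking silent frames.
    min_run: assumed minimum number of identical consecutive labels.
    """
    points = []
    pos = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label is not None and length >= min_run:
            points.append((pos + length // 2, label))
        pos += length
    return points

frames = [None] * 3 + ["a"] * 6 + ["i"] * 2 + ["o"] * 5 + [None] * 2
points = detect_stationary_points(frames)
print(points)        # stationary point sequence
print(len(points))   # estimated number of syllables
```

In this toy input the short "i" run is discarded as unstable, so two stationary points remain, corresponding to two syllables.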

The standard pattern storage unit 9 stores universal standard patterns created in advance from a large amount of utterance data by statistical processing. The language processing unit 10 applies language processing to the phoneme recognition result obtained by the phoneme recognition unit 8 and delivers the final recognition result to the recognition result output terminal 12. The language dictionary 11 stores the dictionary used by the language processing unit 10. (See, for example, Mifune et al., Institute of Electronics and Communication Engineers of Japan, PRI 83-40. That paper uses a filter bank rather than linear predictive analysis, and detects stationary points from the inter-frame variance rather than from vowel recognition results, but it can be cited as a conventional example because it performs phoneme recognition after detecting vowel stationary points.)

Problems to be solved by the invention: In such a conventional speech recognition device, phoneme recognition is controlled using the stationary point detection results, so the stationary point detection results and the vowel recognition results at the stationary points have a large effect on the recognition rate. Countermeasures exist for improving the vowel recognition rate at stationary points, such as using vowel candidates down to second place, but they have the problem of increasing processing time.

The present invention has been made in view of this point. After stationary point detection, vowel recognition is performed again at the detected stationary points to determine the certainty that each stationary point exists and the reliability of the first vowel candidate, with the aim of improving the stationary point detection rate and the vowel recognition rate.

Means for solving the problems: The present invention is a speech recognition device provided with means for performing vowel recognition again at the detected stationary points after stationary point detection.

Operation: With the above configuration, the present invention performs vowel recognition again at the detected stationary points after stationary point detection, determines the certainty that each stationary point exists and the reliability of the first vowel candidate, and, where necessary, performs recognition processing without fixing the vowel candidate, thereby improving the recognition rate.

Embodiment: FIG. 1 is a block diagram showing the configuration of a speaker-independent speech recognition device according to an embodiment of the present invention. In the figure, solid lines indicate the flow of processing and dotted lines indicate data references. Speech input from the speech input terminal 1 is analyzed in the linear predictive analysis unit (feature extraction unit) 2 by the autocorrelation method of order 15, with a window length of 20 msec and a frame shift of 5 msec, and is output as a sequence of 16 parameters per frame: 15 cepstral coefficients plus the residual power (the 0th-order cepstral coefficient).

Next, the silence detection unit 3 uses the residual power to determine the beginning and end of the word and any silent portions within the word. The first vowel recognition unit 4 reads coefficients from the discriminant function storage unit 5, which stores the coefficients of vowel discriminant functions obtained in advance by processing a large amount of utterance data, and performs vowel recognition frame by frame on all portions other than the silent portions detected by the silence detection unit 3.

The stationary point detection unit 6 extracts the stable results from the frame-by-frame vowel recognition results obtained by the vowel recognition unit 4 and outputs them as a stationary point sequence. The second vowel recognition unit 7 performs more detailed vowel recognition on the detected stationary points. The first and second vowel recognition means are explained in detail later. The phoneme recognition unit 8 performs DP matching between the standard patterns read from the standard pattern storage unit 9 and the parameter sequence obtained from the input speech, and outputs the standard pattern with the minimum distance as the recognized phoneme sequence. The standard pattern storage unit 9 stores universal standard patterns created in advance from a large amount of utterance data by statistical processing. The language processing unit 10 applies language processing to the phoneme recognition result obtained by the phoneme recognition unit 8 and delivers the final recognition result to the recognition result output terminal 12. The language dictionary 11 stores the dictionary used by the language processing unit 10.
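The DP matching step of the phoneme recognition unit 8 can be illustrated with a basic dynamic-programming alignment. This is a sketch under stated assumptions: the text does not specify the local distance, the path constraints, or the normalization, so the Euclidean frame distance, the three symmetric steps, the division by the summed lengths, and the names `dp_distance` and `recognize` are all hypothetical choices.

```python
import math

def dp_distance(seq, ref):
    """Length-normalized DP (dynamic time warping) distance between two
    sequences of feature vectors."""
    n, m = len(seq), len(ref)
    inf = float("inf")
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(seq[i - 1], ref[j - 1])   # local frame distance
            g[i][j] = cost + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m] / (n + m)

def recognize(seq, templates):
    """Return the name of the standard pattern with the minimum distance."""
    return min(templates, key=lambda name: dp_distance(seq, templates[name]))

# Toy two-dimensional "standard patterns"; real ones would hold 16 parameters per frame.
templates = {
    "ka": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "sa": [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)],
}
utterance = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(recognize(utterance, templates))
```

The alignment absorbs the repeated first frame of the toy utterance, so it matches the "ka" pattern at zero cost even though the two sequences differ in length.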

Next, the first and second vowel recognition units are described in detail. Methods using the Mahalanobis generalized distance are already known as a vowel recognition technique.

Let V be the covariance matrix of the sample set used to create the discriminant functions, x_v (v = a, ..., N) the mean for each vowel, and x the input pattern. The Mahalanobis generalized distance between each vowel and the input pattern is then given by equation (1).

$$ d(x, v) = x^{t} V^{-1} x - 2\, x_v^{t} V^{-1} x + x_v^{t} V^{-1} x_v \qquad (1) $$

Here V is a 15-by-15 matrix and x is a 15-dimensional vector (15 matches the analysis order in the feature extraction unit).

In equation (1), $x_v^{t} V^{-1}$ in the second term and $x_v^{t} V^{-1} x_v$ in the third term are constants, so equation (1) can be rewritten as equation (2).

$$ d(x, v) = x^{t} V^{-1} x - 2\, l_v x + D_v \qquad (2) $$

where $l_v = x_v^{t} V^{-1}$ and $D_v = x_v^{t} V^{-1} x_v$. The first term of equation (2) involves a 15-by-15 matrix operation and is therefore computationally expensive, but it does not depend on the vowel and is determined by the input pattern alone. If the only goal is to decide which vowel is closest, the first term can be ignored. The first vowel recognition means therefore performs vowel recognition using equation (3), which requires little computation, as its distance measure, and the result is used for stationary point detection.

$$ d'(x, v) = -2\, l_v x + D_v \qquad (3) $$

The second vowel recognition means then computes $x^{t} V^{-1} x$ only for the detected stationary points and obtains the Mahalanobis generalized distance from equation (4).

$$ d(x, v) = x^{t} V^{-1} x + d'(x, v) \qquad (4) $$

The Mahalanobis generalized distance is a statistical distance and follows a χ² distribution with 15 degrees of freedom, so a reliability can be derived from it.
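The two-stage computation in equations (1) to (4) can be sketched as follows. This is an illustrative reading rather than the patent's implementation: the function names `build_tables`, `first_stage`, and `second_stage` are hypothetical, and the random means and covariance in the usage lines stand in for statistics that would really come from training data.

```python
import numpy as np

def build_tables(means, cov):
    """Precompute the vowel-dependent constants.

    means: (n_vowels, dim) array of per-vowel mean vectors x_v.
    cov:   (dim, dim) pooled covariance matrix V.
    """
    v_inv = np.linalg.inv(cov)
    l = means @ v_inv                          # row v holds l_v = x_v^t V^-1
    d_const = np.einsum("vi,vi->v", l, means)  # D_v = x_v^t V^-1 x_v
    return v_inv, l, d_const

def first_stage(x, l, d_const):
    """Equation (3): cheap score d'(x, v) computed for every frame."""
    return -2.0 * (l @ x) + d_const

def second_stage(x, v_inv, d_prime):
    """Equation (4): add the vowel-independent term at stationary points only."""
    return x @ v_inv @ x + d_prime

rng = np.random.default_rng(1)
means = rng.standard_normal((6, 15))           # six vowel classes, 15 dimensions
a = rng.standard_normal((15, 15))
cov = a @ a.T + np.eye(15)                     # symmetric positive definite
x = rng.standard_normal(15)

v_inv, l, d_const = build_tables(means, cov)
d_prime = first_stage(x, l, d_const)
d_full = second_stage(x, v_inv, d_prime)
print(int(np.argmin(d_prime)), int(np.argmin(d_full)))
```

Because the omitted term is the same for every vowel, the first-stage score ranks the vowels exactly as the full distance does; the full value is only needed when an absolute distance is wanted for a reliability estimate.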

For example, when d(x, a) = 11.04, the probability (= reliability) that x is a is found to be 75%. Vowels are therefore re-determined according to the following rules.
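The conversion from distance to reliability can be checked numerically as the upper-tail probability of a χ² distribution. The sketch below assumes 15 degrees of freedom, matching the 15-dimensional input vector; under that assumption the figures quoted in the text (d = 11.04, and d ≥ 22.3 for a reliability of 10% or less) come out at about 75% and 10%. The function name `chi2_sf` and the series-based implementation are illustrative choices, not part of the patent.

```python
import math

def chi2_sf(d, dof):
    """Upper-tail probability P(X > d) for a chi-square variable.

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the moderate distances used here.
    """
    if d <= 0:
        return 1.0
    a, x = dof / 2.0, d / 2.0
    term = math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0))
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return 1.0 - total

for distance in (11.04, 22.3):
    print(distance, round(chi2_sf(distance, 15), 3))
```

A smaller distance gives a reliability closer to 1, so the thresholds in the rules can be expressed either as probabilities or as distances.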

Rule 1) When the reliability of the first-candidate vowel is 10% or less (d(x, v) ≥ 22.3), all vowels are treated as candidates.

Rule 2) When the reliability of the first-candidate vowel is 50% or less, vowels down to the second candidate are considered.

Rule 3) When the reliability of the first-candidate vowel is 60% or more and the reliability of the second candidate is not 20% or more, both the first and second vowel candidates are considered.

Rule 4) In all other cases, only the first candidate is used.
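The four rules can be written as a small decision function. This sketch transcribes the rules literally as they are worded, including Rule 3, and applies them in the order given. The function name `select_vowel_candidates` and the input format (a list of vowel and reliability pairs sorted best first, with at least two entries) are assumptions made for illustration.

```python
def select_vowel_candidates(ranked):
    """Choose which vowels to pass on to phoneme recognition.

    ranked: list of (vowel, reliability) pairs, best candidate first,
            with reliabilities expressed as fractions between 0 and 1.
    """
    (first, p1), (second, p2) = ranked[0], ranked[1]
    if p1 <= 0.10:                    # Rule 1: nothing is trustworthy
        return [vowel for vowel, _ in ranked]
    if p1 <= 0.50:                    # Rule 2: keep the top two
        return [first, second]
    if p1 >= 0.60 and p2 < 0.20:      # Rule 3: literal reading of the text
        return [first, second]
    return [first]                    # Rule 4: first candidate only

print(select_vowel_candidates([("a", 0.05), ("o", 0.03), ("i", 0.01)]))
print(select_vowel_candidates([("a", 0.40), ("o", 0.30), ("i", 0.10)]))
print(select_vowel_candidates([("a", 0.55), ("o", 0.30), ("i", 0.10)]))
```

Returning a list of candidates rather than a single vowel lets the later matching stage restrict its search only as far as the reliability justifies.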

As described above, according to this embodiment, the vowel recognition rate can be improved by providing means for performing vowel recognition again on the detected stationary points and determining the vowel candidates according to their reliability.

Effect of the invention: As described in detail above, the present invention improves the vowel recognition rate, and its practical value is great.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention, and FIG. 2 is a block diagram of a conventional speech recognition device. 2: feature extraction unit; 3: silence detection unit; 4: first vowel recognition unit; 5: discriminant function storage unit; 6: stationary point detection unit; 7: second vowel recognition unit; 8: phoneme recognition unit; 9: standard pattern storage unit.

Claims

[Claims]

A speech recognition device characterized by comprising: speech input means; feature extraction means for extracting features at regular intervals from the speech data input from the speech input means and outputting a feature parameter sequence; vowel recognition means for performing vowel recognition on the feature parameter sequence; stationary point detection means for detecting stable portions from the results of the vowel recognition and outputting them as a vowel stationary point sequence; phoneme recognition means for recognizing the phonemes of the feature parameter sequence; and means for performing vowel recognition again at the vowel stationary points obtained by the stationary point detection means.