JPH04254897A - Voice recognition system

Info

Publication number: JPH04254897A
Application number: JP3036706A
Authority: JP
Inventors: Takashi Ariyoshi; 有吉　敬
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-02-06
Filing date: 1991-02-06
Publication date: 1992-09-10

Abstract

PURPOSE: To significantly reduce the amount of processing in DP matching between input speech and registered speech. CONSTITUTION: A feature extraction unit 1 extracts, as feature quantities of the input speech, a first feature quantity carrying relatively little information and a second feature quantity carrying relatively more information. A matching unit 4 first determines, by dynamic programming, a matching path between the input speech and each registered speech in a registered speech memory 3, on the basis of the first feature quantity of the input speech and the first feature quantity of each registered speech. With the matching path thus tightly restricted, the likelihood between the input speech and each registered speech is then obtained along this path using the second feature quantity. One candidate is selected on the basis of the likelihoods of the registered speeches and is output as the recognition result.

Description

[Detailed description of the invention]

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method that recognizes input speech by matching the feature quantities of the input speech against the feature quantities of standard patterns.

[0002]

[Prior Art] When input speech is to be recognized by matching its feature quantities against those of standard patterns, the duration of the input speech generally varies from utterance to utterance even for the same speaker, and it stretches and shrinks nonlinearly. Time normalization is therefore required: the time axis is warped nonlinearly so that identical phonemes in the input speech and in the standard pattern of the registered speech correspond to each other. For this reason, this kind of speech recognition method employs matching based on dynamic programming (DP), hereinafter called DP matching.

[0003] However, DP matching usually requires a large amount of processing. In particular, continuous DP, which is used in the word spotting method suited to speech recognition in noise, requires even more.

[0004] As a way of reducing the amount of processing in DP matching, a speech recognition method such as that disclosed in Japanese Patent Laid-Open No. 1-283599 is known. In that method, at every predetermined interval (frame), BPF (band-pass filter) output values, LPC (linear prediction) analysis results, or the like are extracted as a first feature quantity of the input speech, and the rising or falling trend of the short-time energy (power) and the formant transition state are extracted as a second feature quantity. In DP matching, the inter-frame distance is then computed from the first feature quantity of the input speech and the first feature quantity of the registered speech. At that time, DP matching with a limited matching range is executed using information on the temporal correspondence between the input and registered patterns, which is derived from the local similarity between the second feature quantity of the input speech and that of the registered speech. The final recognition result is obtained from the resulting matching score.

[0005]

[Problems to be Solved by the Invention] In the conventional speech recognition method described above, the information on the temporal correspondence between the input and registered patterns, derived from the local similarity between their second feature quantities, locally restricts the DP matching path for the first feature quantity, and this reduces the amount of processing to some extent. However, because the first feature quantity itself carries a large amount of information, locally restricting the DP matching path cannot reduce the amount of processing substantially.

[0006] An object of the present invention is to provide a speech recognition method capable of greatly reducing the amount of processing in DP matching between the feature quantities of the input speech and those of the registered speech.

[0007]

[Means for Solving the Problems] To achieve the above object, the present invention is a speech recognition method that recognizes input speech by matching feature quantities of the input speech against feature quantities of a plurality of registered speeches registered in advance, characterized in that: the feature quantities of the input speech and of each registered speech each consist of a first feature quantity carrying a relatively small amount of information and a second feature quantity carrying a relatively large amount of information; a matching path between the input speech and each registered speech is determined by dynamic programming on the basis of the first feature quantity of the input speech and the first feature quantity of each registered speech; the likelihood between the input speech and each registered speech is obtained using the second feature quantity of the input speech and the second feature quantity of each registered speech that correspond temporally along the matching path; and the recognition result for the input speech is obtained on the basis of the likelihood.

The invention is further characterized in that the first feature quantity of the input speech and of each registered speech is a feature quantity based on the rate of change of the speech power.

It is further characterized in that, when the first feature quantity of the input speech is determined, if the power of the input speech is below a predetermined lower limit of power, the lower limit is used as the power of the input speech.

It is further characterized in that, when the matching path between the input speech and each registered speech is determined on the basis of their first feature quantities, the distance between the first feature quantity of the input speech and that of each registered speech is obtained, the candidates are narrowed down according to that distance, and the likelihood is obtained using the second feature quantity only for the registered speeches of the narrowed-down candidates.

[0008]

[Operation] In the present invention, the matching path between the input speech and each registered speech is determined by dynamic programming using the first feature quantity, which carries relatively little information. Once the matching path has been tightly restricted in this way, the likelihood between the input speech and each registered speech is obtained along that path using the second feature quantity, which carries relatively more information. The amount of processing can therefore be reduced.

[0009] The rate of change of the speech power can be used as the first feature quantity carrying relatively little information. In this case, if the power of the input speech is below a predetermined lower limit of power, the lower limit is used as the power of the input speech.

[0010] Furthermore, if, when the matching path between the input speech and each registered speech is determined, the distance between the first feature quantity of the input speech and that of each registered speech is obtained, the registered speeches (that is, the candidates) are narrowed down according to that distance, and the likelihood is then obtained using the second feature quantity, the amount of processing can be reduced further.

[0011]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of one embodiment of the present invention. Referring to FIG. 1, this embodiment comprises a feature extraction unit 1 that extracts feature quantities from the input speech at a fixed frame period (10 to 20 ms), a speech interval detection unit 2 that detects the speech interval of the input speech, a registered speech memory 3 in which the feature quantities of a plurality of registered speeches are stored in advance, and a matching unit 4 that performs DP matching between the feature quantities of the input speech and those of the registered speech using dynamic programming (DP).

[0012] The input speech supplied to the feature extraction unit 1 has been digitized by an A/D converter after passing through, for example, a microphone amplifier and an anti-aliasing filter. The feature extraction unit 1 extracts, as feature quantities of the input speech, a first feature quantity carrying relatively little information and a second feature quantity carrying relatively more information.

[0013] As the first feature quantity of the input speech, the rate of change of the power of the input speech is used, for example. In this case, with e(i) denoting the power of the input speech in each frame, the first feature quantity Px(i) of the input speech is extracted as

[0014]

[Equation 1]

where i is the frame number.
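The image for Equation 1 is not reproduced in this text. The operation count given later in paragraph [0041] (one division e(i)/e(i-1) and one logarithm per frame) suggests the form Px(i) = log(e(i)/e(i-1)). The sketch below assumes that form; the function name `power_change_rate` and the plain-list representation are illustrative choices, not taken from the patent.

```python
import math


def power_change_rate(e):
    """Return Px(i) = log(e[i] / e[i-1]) for i = 1 .. len(e) - 1.

    `e` is a sequence of strictly positive per-frame powers
    (assumed form of Equation 1; see the lead-in).
    """
    return [math.log(e[i] / e[i - 1]) for i in range(1, len(e))]


# Scaling every frame by the same gain leaves the feature unchanged,
# because the gain cancels in the ratio e(i) / e(i-1).
quiet = [1.0, 2.0, 4.0, 2.0]
loud = [10.0 * v for v in quiet]
print(power_change_rate(quiet))
print(power_change_rate(loud))
```

Because only one scalar per frame is produced, the distance d(i, j) used during path search reduces to a single subtraction and absolute value.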

[0015] As the second feature quantity, the well-known short-time spectrum Sx(i, ω) obtained from a band-pass filter (BPF) bank is used, for example.

[0016] The speech interval detection unit 2 detects the speech interval according to, for example, whether the power e(i) of the input speech exceeds a predetermined threshold.
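A minimal sketch of threshold-based interval detection follows. The patent states only that the power is compared with a threshold. Treating the interval as the span from the first to the last frame above the threshold is an assumption made here for illustration, as is the function name `detect_speech_interval`.

```python
def detect_speech_interval(e, threshold):
    """Return (start, end) frame indices, inclusive and 0-based, spanning the
    first to the last frame whose power exceeds `threshold`.

    Returns None when no frame exceeds the threshold.
    (Assumption: short dips below the threshold inside the span are kept.)
    """
    above = [i for i, p in enumerate(e) if p > threshold]
    if not above:
        return None
    return above[0], above[-1]


powers = [0.1, 0.2, 5.0, 6.0, 0.3, 4.0, 0.1]
print(detect_speech_interval(powers, threshold=1.0))
```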

[0017] The registered speech memory 3 stores, as the feature quantities of each registered speech, a first feature quantity Ptk(j) and a second feature quantity Stk(j, ω), in forms corresponding to the first and second feature quantities of the input speech. Here j is the frame number and k is the registered speech number.

[0018] The matching unit 4 performs DP matching by dynamic programming (DP) using the first feature quantity Px(i) of the input speech in the speech interval detected by the speech interval detection unit 2 and the first feature quantity Ptk(j) of each registered speech. It stores the minimum distance as the distance between Px(i) and Ptk(j), and also stores the matching path that yielded that distance. After the first-feature distances to all registered speeches have been obtained in this way, registered speeches whose distance is equal to or greater than a threshold are excluded from the recognition targets, which narrows down the candidates. Next, the matching unit 4 computes the distance between the second feature quantity Sx(i, ω) of the input speech and the second feature quantity Stk(j, ω) of each remaining candidate, using only the matching path stored for that registered speech, and takes this as the distance between the input speech and that registered speech. If the smallest of these distances is equal to or less than a predetermined threshold, the category of the registered speech that yielded it is output as the recognition result.

[0019] Next, the speech recognition operation of this configuration is described. In this embodiment, the power variation Px(i) of each frame i, as given by Equation 1, is used as the first feature quantity of the input speech. The first feature quantity Px of the input speech is thus expressed as the time series

[0020]

[Equation 2]

and the first feature quantity Ptk of the registered speech with registered speech number k is expressed as the time series

[0021]

[Equation 3]

[0022] Considering the plane spanned by Px and Ptk as shown in FIG. 2, the matching path L, that is, the time-axis correspondence between Px and Ptk, can be expressed as a sequence F of lattice points c = (i, j) on this plane, namely

[0023]

[Equation 4]

If the distance between the two feature quantities Px(i) and Ptk(j) is written d(c) = d(i, j), the total distance Dk(F) along F can be expressed as

[0024]

[Equation 5]

The smaller this value, the better the correspondence between Px and Ptk. Here w_l is a positive weighting function associated with F.

[0025] The matching unit 4 uses dynamic programming (DP) to minimize Equation 5 with respect to F under the following constraints. As the monotonicity and continuity conditions,

[0026]

[Equation 6]

is imposed; as the boundary conditions,

[0027]

[Equation 7]

is imposed; and as the condition of the matching window W (that is, with r a constant, to prevent extreme stretching or shrinking),

[0028]

[Equation 8]

is imposed. If, in addition, w_l is chosen so that the denominator of Equation 5 becomes a constant that does not depend on F, Equation 5 simplifies. For example, with

[0029]

[Equation 9]

w_l corresponds to a path along the vertical and horizontal lines of a Go board, that is, the city-block distance, and

[0030]

[Equation 10]

holds. Equation 5 then becomes

[0031]

[Equation 11]

and the objective function to be minimized becomes additive. To carry out this minimization, dynamic programming considers the partial sum

[0032]

[Equation 12]

and, using the conditions of Equations 6 to 8 together with Equation 9, expresses it by the recurrence

[0033]

[Equation 13]

Starting from g(1, 1) = 2d(1, 1) and j = 1, Equation 13 is computed while i is varied within the matching window W. Then j is incremented and the same computation is repeated until j = J. Finally, the time-normalized distance Dk between the two time series, the feature quantity Px(i) of the input speech and the feature quantity Ptk(j) of the registered speech, is obtained as

[0034]

[Equation 14]
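The images for Equations 4 to 14 are not reproduced in this text. One reconstruction consistent with the surrounding description is the standard symmetric DP-matching formulation shown below. It agrees with the initial value g(1, 1) = 2d(1, 1), with the operation counts stated for cases (a), (b), and (c) of Equation 13, and with the path bookkeeping in paragraph [0035]. It should be read as a plausible reading under those constraints, not as the exact notation of the original publication; the convention i_0 = j_0 = 0 is an assumption added so that w_1 = 2.

```latex
\begin{align*}
F &= c_1, c_2, \ldots, c_N, \qquad c_l = (i_l, j_l) && (4)\\
D_k(F) &= \frac{\sum_{l=1}^{N} d(c_l)\, w_l}{\sum_{l=1}^{N} w_l} && (5)\\
0 \le i_l - i_{l-1} &\le 1, \qquad 0 \le j_l - j_{l-1} \le 1 && (6)\\
i_1 = j_1 &= 1, \qquad i_N = I, \quad j_N = J && (7)\\
|i_l - j_l| &\le r && (8)\\
w_l &= (i_l - i_{l-1}) + (j_l - j_{l-1}), \qquad i_0 = j_0 = 0 && (9)\\
\sum_{l=1}^{N} w_l &= I + J && (10)\\
D_k &= \frac{1}{I + J}\, \min_F \sum_{l=1}^{N} d(c_l)\, w_l && (11)\\
g(c_m) &= \min_{c_1, \ldots, c_{m-1}} \sum_{l=1}^{m} d(c_l)\, w_l && (12)\\
g(i, j) &= \min
\begin{cases}
g(i, j-1) + d(i, j) & \text{(a)}\\
g(i-1, j-1) + 2\, d(i, j) & \text{(b)}\\
g(i-1, j) + d(i, j) & \text{(c)}
\end{cases} && (13)\\
D_k &= \frac{g(I, J)}{I + J} && (14)
\end{align*}
```

With the weights of (9), a diagonal step counts twice and a horizontal or vertical step counts once, so every admissible path has total weight I + J, and this is what makes the normalizer in (11) independent of F.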

[0035] The distance Dk obtained in this way, and the matching path L that yielded it, that is, the history of the (i, j) selected in the computation of Equation 13, are stored in a predetermined memory (not shown). For example, with the matching-window constant r, there are at most (2r + 1) paths, and the maximum length of each is (Imax + 2r + 1), where Imax is the maximum frame length of the input speech. Storing the matching path L therefore requires (2r + 1)(Imax + 2r + 1) memory entries. The matching path MP is stored in memory as follows. When (a) is selected in Equation 13, (i, j) is appended to the path of g(i, j-1). When (b) is selected, (i, j) and (i, j) are appended to the path of g(i-1, j-1). When (c) is selected, (i, j) is appended to the path of g(i-1, j). Whichever path results is stored in the memory that held the path of g(i-1, j-1). The final path L is the path of g(I, J) and is stored, for example, as shown in FIG. 3. Each datum may also be represented and stored using only 2 bits of information, indicating whether i increases and whether j increases.
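The path search and path storage described above can be sketched as follows. This is a minimal sketch, not the patent's memory layout: it assumes the symmetric recurrence with cases (a), (b), (c) weighted 1, 2, 1, uses d(i, j) = |Px(i) - Ptk(j)|, keeps one (di, dj) step record per lattice point (the 2-bit information mentioned in the text), and rebuilds the path by backtracking instead of copying partial paths. The name `dp_match` is hypothetical.

```python
def dp_match(px, pt, r):
    """Symmetric DP matching of two 1-D sequences within the window |i - j| <= r.

    Returns (Dk, path). `path` lists 1-based lattice points from (1, 1) to
    (I, J); points reached by a diagonal step appear twice, so len(path) is
    I + J. Returns (inf, []) when the window excludes (I, J).
    """
    I, J = len(px), len(pt)
    INF = float("inf")

    def d(i, j):
        return abs(px[i - 1] - pt[j - 1])

    g = [[INF] * (J + 1) for _ in range(I + 1)]
    step = [[None] * (J + 1) for _ in range(I + 1)]  # (di, dj) per point
    g[1][1] = 2 * d(1, 1)
    for j in range(1, J + 1):
        for i in range(max(1, j - r), min(I, j + r) + 1):
            if i == 1 and j == 1:
                continue
            dij = d(i, j)
            candidates = (
                (g[i][j - 1] + dij, (0, 1)),          # case (a)
                (g[i - 1][j - 1] + 2 * dij, (1, 1)),  # case (b)
                (g[i - 1][j] + dij, (1, 0)),          # case (c)
            )
            g[i][j], step[i][j] = min(candidates, key=lambda c: c[0])
    if g[I][J] == INF:
        return INF, []

    path, i, j = [], I, J
    while True:
        di, dj = (1, 1) if (i, j) == (1, 1) else step[i][j]
        path.extend([(i, j)] * (di + dj))  # diagonal steps are stored twice
        if (i, j) == (1, 1):
            break
        i, j = i - di, j - dj
    path.reverse()
    return g[I][J] / (I + J), path


dk, path = dp_match([0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0], r=2)
print(dk, path)
```

Storing the step record per point rather than a full partial path per window column uses more memory than the (2r + 1)(Imax + 2r + 1) layout in the text, but it keeps the sketch short and yields the same final path.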

[0036] Thereafter, the matching unit 4 excludes from the recognition targets those registered speeches whose distance Dk, among the distances stored in memory for each registered speech, is equal to or greater than a predetermined threshold, and thereby narrows down the candidates. Then, only for the remaining candidates, the distance D'(L) between the second feature quantity Stk(j, ω) of each candidate and the second feature quantity Sx(i, ω) of the input speech is computed, using the path L obtained as described above, as

[0037]

[Equation 15]

where d'(ik(n), jk(n)) is

[0038]

[Equation 16]

Here ω corresponds to frequency, and the number of channels E is, for example, 15 one-third-octave channels from 250 Hz to 6350 Hz (E is generally about 8 to 30).
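The images for Equations 15 and 16 are not reproduced in this text. The operation count in paragraph [0044] (per channel: one logarithm, one subtraction, one absolute value, then E - 1 additions, repeated for I + J path entries) suggests a per-entry distance that sums absolute log-spectral differences over the E channels and is accumulated along the stored path. The sketch below assumes that form, assumes the result is normalized by the path length, and assumes both spectra are already in the log domain. The name `path_distance` is hypothetical.

```python
def path_distance(sx_log, st_log, path):
    """Average log-spectral distance along a stored matching path.

    sx_log[i - 1][w]: log band output of input frame i, channel w.
    st_log[j - 1][w]: log band output of registered frame j, channel w.
    path: non-empty list of 1-based (i, j) entries, with diagonal steps
          listed twice, as in the path-storage description.
    """
    total = 0.0
    for i, j in path:
        total += sum(abs(a - b) for a, b in zip(sx_log[i - 1], st_log[j - 1]))
    return total / len(path)


sx_log = [[0.0, 1.0], [2.0, 3.0]]
st_log = [[0.0, 1.0], [2.0, 4.0]]
print(path_distance(sx_log, st_log, [(1, 1), (1, 1), (2, 2), (2, 2)]))
```

No minimization takes place here: the costly E-channel distance is evaluated only at the I + J entries of the path already chosen with the scalar first feature quantity, rather than at every point of the matching window.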

[0039] In this way, the distance D'(L) between the second feature quantity Sx(i, ω) of the input speech and the second feature quantity Stk(j, ω) of each registered speech remaining as a candidate is computed, and the smallest of these distances is found. If it is equal to or less than a predetermined threshold, the category of the candidate that yielded this minimum distance is output as the recognition result.
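The two-stage decision (prune on the first-feature distance, then pick the minimum second-feature distance subject to a rejection threshold) can be sketched as follows. The function name `recognize`, the dictionary-based interface, and the callback `second_dist_fn` are illustrative assumptions. The comparison directions follow the text: distances equal to or greater than the pruning threshold are excluded, and the winner is accepted only if its distance is equal to or less than the acceptance threshold.

```python
def recognize(first_dist, second_dist_fn, prune_threshold, accept_threshold):
    """Return the recognized label, or None if nothing is accepted.

    first_dist: dict mapping label -> first-feature distance Dk.
    second_dist_fn: callable(label) -> second-feature distance D'(L);
                    it is called only for labels that survive pruning.
    """
    candidates = [k for k, dk in first_dist.items() if dk < prune_threshold]
    if not candidates:
        return None
    scored = {k: second_dist_fn(k) for k in candidates}
    best = min(scored, key=scored.get)
    return best if scored[best] <= accept_threshold else None


first = {"a": 0.2, "b": 0.9, "c": 0.3}
second = {"a": 1.5, "b": 0.1, "c": 0.7}
print(recognize(first, second.get, prune_threshold=0.5, accept_threshold=1.0))
```

In this example "b" is pruned in the first stage even though its second-feature distance would have been the smallest, which shows the trade-off: the pruning threshold saves work but must be loose enough not to discard the correct entry.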

[0040] In this embodiment, the amount of processing in DP matching is the sum of the processing required to find the matching path using the first feature quantity and the processing required to compute the distance D'(L) of the second feature quantity.

[0041] Finding the matching path using the first feature quantity first requires computing d(i, j). For each i, j, this takes one division {e(i)/e(i-1)} of the input speech power e(i) in Equation 1, one logarithm of the result, one subtraction between the first feature quantity Px(i) of the input speech computed by Equation 1 and the first feature quantity Ptk(j) of the registered speech, and one absolute value, for a total of four operations.

[0042] After d(i, j) has been computed in this way, g(i, j) must be computed. For each i, j, this takes one addition for (a) of Equation 13, two additions for (b), one addition for (c), and two comparisons among (a), (b), and (c), for a total of six operations.

[0043] Each i, j therefore requires (4 + 6) = 10 operations, each i requires [10·(2r + 1)] operations, and computing the first-feature distance requires [10·(2r + 1)·I] operations in total.

[0044] To compute the distance D'(L) of the second feature quantity, Equation 16 requires, for each channel, one logarithm, one subtraction, and one absolute value, so with E channels this stage requires 3E operations. Summing the per-channel results requires a further (E - 1) additions. Equation 16 therefore requires (4E - 1) operations for each n, and computing the distance D'(L) of the second feature quantity by Equation 15 requires [(4E - 1)·(I + J)] operations in total.

[0045] The total amount of processing Q in this embodiment is therefore

[0046]

[Equation 17]

[0047] Next, the amount of processing in the speech recognition method of this embodiment is compared with that of a conventional speech recognition method such as the one disclosed in Japanese Patent Laid-Open No. 1-283599. In the conventional method, if d(i, j) and g(i, j) are computed as

[0048]

[Equation 18]

[0049]

[Equation 19]

respectively, then computing d(i, j) requires, for each channel, one logarithm, one subtraction, and one absolute value, so with E channels this stage requires 3E operations. Summing the per-channel results requires a further (E - 1) additions. Computing d(i, j) by Equation 18 therefore requires (4E - 1) operations in total for each i, j.

[0050] Computing g(i, j) by Equation 19 requires four additions and two comparisons, for a total of six operations for each i, j.

[0051] Each i therefore requires [(4E + 5)·(2r + 1)] operations, and the total amount of processing R is

[0052]

[Equation 20]

[0053] Now let the number of channels E be 15, the matching-window constant r be 20, and I and J each be 60. In the conventional speech recognition method, the total amount of processing R required for DP matching is about 160,000 operations by Equation 20, whereas in the speech recognition method of this embodiment the total amount of processing Q required for DP matching is only about 32,000 operations by Equation 17. In this example, the amount of processing can be reduced to roughly 1/5 of the conventional amount.
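The images for Equations 17 and 20 are not reproduced in this text, but the per-step counts in paragraphs [0043], [0044], and [0051] suggest Q = 10(2r + 1)I + (4E - 1)(I + J) and R = (4E + 5)(2r + 1)I. The sketch below assumes those forms and evaluates them for the quoted parameters. The function names are illustrative.

```python
def proposed_cost(E, r, I, J):
    """Q: scalar-feature DP over the window, plus the E-channel distance
    evaluated only along the I + J entries of the chosen path."""
    return 10 * (2 * r + 1) * I + (4 * E - 1) * (I + J)


def conventional_cost(E, r, I):
    """R: E-channel distance plus recurrence at every point of the window."""
    return (4 * E + 5) * (2 * r + 1) * I


Q = proposed_cost(E=15, r=20, I=60, J=60)
R = conventional_cost(E=15, r=20, I=60)
print(Q, R, Q / R)  # Q = 31680, R = 159900
```

These values round to the "about 32,000" and "about 160,000" quoted in the text, and their ratio is just under 0.2, which matches the stated reduction to roughly 1/5. Both counts are per registered speech and do not include the additional saving from candidate pruning.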

[0054] As described above, in this embodiment the matching path between the input speech and each registered speech is determined by dynamic programming using the first feature quantity, which carries relatively little information. Once the matching path has been tightly restricted in this way, the distance between the input speech and each registered speech is computed along that path using the second feature quantity, which carries relatively more information. The amount of processing can therefore be reduced greatly even when DP matching is used.

[0055] Furthermore, in this embodiment the candidates are narrowed down according to the first-feature distance obtained when the matching path is determined, so the amount of processing can be reduced further.

[0056] In addition, because this embodiment uses the rate of change of the input speech power e(i) as the first feature quantity, the influence of differences in the level of the input speech can be reduced.

[0057] When the rate of change of the input speech power e(i) is obtained as the first feature quantity, if the value of e(i) is below a predetermined lower limit of power e_min (> 0), it is preferable to use the lower limit e_min as the value of e(i). This reduces the error in the first feature quantity when the input speech power e(i) is small. Alternatively, in this case, the power smoothed over several frames may be used as the input speech power e(i).
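The effect of the lower limit can be seen in a small sketch. It assumes the log-ratio form of the first feature quantity, Px(i) = log(e(i)/e(i-1)), which the text does not show explicitly; the function name `floored_power_change_rate` and the example values are illustrative. The smoothing alternative mentioned in the text is not covered here.

```python
import math


def floored_power_change_rate(e, e_min):
    """Px(i) = log(e'[i] / e'[i-1]) with e'[i] = max(e[i], e_min), e_min > 0."""
    if e_min <= 0:
        raise ValueError("e_min must be positive")
    clamped = [max(p, e_min) for p in e]
    return [math.log(clamped[i] / clamped[i - 1]) for i in range(1, len(clamped))]


# Two near-silent frames followed by speech. Without the floor, the first
# ratio would be undefined (division by zero); with the floor it is 0.
frames = [0.0, 1e-6, 1.0]
print(floored_power_change_rate(frames, e_min=1e-3))
```

Without the floor, tiny fluctuations between near-silent frames produce large log ratios, and a frame with zero power makes the ratio undefined; clamping removes both problems at the cost of one comparison per frame.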

[0058] As the first feature quantity, the logarithm of the input speech power normalized by the maximum power may also be used, or these values before the logarithmic transformation may be used. Alternatively, zero crossings may be used.
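Two of the alternatives named above can be sketched as follows. Both functions are illustrative: the patent does not give formulas for them, so the normalization by the maximum power over the utterance and the sign-change definition of a zero crossing are assumptions, as are the function names.

```python
import math


def normalized_log_power(e):
    """log(e[i] / max(e)) per frame; requires strictly positive powers."""
    e_max = max(e)
    return [math.log(p / e_max) for p in e]


def zero_crossings(frame):
    """Count sign changes between consecutive waveform samples in one frame.

    A sample equal to zero is treated as non-negative.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


print(normalized_log_power([1.0, 10.0, 100.0]))
print(zero_crossings([1, -1, 1, -1]))
```

Like the power change rate, both yield one scalar per frame, so the per-point cost of the path search stays small.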

[0059] The short-time spectrum used as the second feature quantity in the above example can also be obtained by FFT. Other feature quantities, such as the cepstrum or mel-cepstrum obtained by LPC analysis, may also be used as the second feature quantity.

[0060] The speech interval detection unit 2 may also detect the speech interval by another method, such as the two-threshold method.

[0061]

[Effects of the Invention] As described above, according to the present invention, the matching path between the input speech and each registered speech is determined by dynamic programming using the first feature quantity, which carries relatively little information. Once the matching path has been tightly restricted in this way, the likelihood between the input speech and each registered speech is obtained along that path using the second feature quantity, which carries relatively more information. The amount of processing can therefore be reduced.

[0062] In addition, using the rate of change of the speech power as the first feature quantity carrying relatively little information prevents differences in the level of the input speech from affecting the result. In this case, if the power of the input speech is below a predetermined lower limit of power, using the lower limit as the power of the input speech reduces the error in the first feature quantity when the input speech power is small.

[0063] Furthermore, if, when the matching path between the input speech and each registered speech is determined, the distance between the first feature quantity of the input speech and that of each registered speech is obtained, the registered speeches (that is, the candidates) are narrowed down according to that distance, and the likelihood is then obtained using the second feature quantity, the amount of processing can be reduced further.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of one embodiment of the present invention.

FIG. 2 is a diagram for explaining the matching path between input speech and registered speech.

FIG. 3 is a diagram showing concrete data for a matching path between input speech and registered speech.

[Explanation of Symbols]

1 Feature extraction unit
2 Speech interval detection unit
3 Registered speech memory
4 Matching unit

Claims

[Claims]

1. A speech recognition method that recognizes input speech by matching feature quantities of the input speech against feature quantities of a plurality of registered speeches registered in advance, characterized in that: the feature quantities of the input speech and of each registered speech each consist of a first feature quantity carrying a relatively small amount of information and a second feature quantity carrying a relatively large amount of information; a matching path between the input speech and each registered speech is determined by dynamic programming on the basis of the first feature quantity of the input speech and the first feature quantity of each registered speech; the likelihood between the input speech and each registered speech is obtained using the second feature quantity of the input speech and the second feature quantity of each registered speech that correspond temporally along the matching path; and a recognition result for the input speech is obtained on the basis of the likelihood.

2. The speech recognition method according to claim 1, wherein the first feature quantity of the input speech and each registered speech is a feature quantity using a rate of change in the power of the speech.

3. The speech recognition method according to claim 2, wherein, when the first feature quantity of the input speech is determined, if the power of the input speech is below a predetermined lower limit of power, the lower limit is used as the power of the input speech.

4. The speech recognition method according to claim 1, wherein, when the matching path between the input speech and each registered speech is determined on the basis of the first feature quantity of the input speech and the first feature quantity of each registered speech, the distance between the first feature quantity of the input speech and the first feature quantity of each registered speech is obtained, the candidates are narrowed down according to the value of that distance, and the likelihood is obtained using the second feature quantity only for the registered speeches of the narrowed-down candidates.