JPH0527120B2 - - Google Patents

Info

Publication number
JPH0527120B2
JPH0527120B2 (JP H0527120 B2); application JP58094750A (JP 9475083 A)
Authority
JP
Japan
Prior art keywords
speech
patterns
pattern
voice
frequency
Prior art date
Legal status
Expired - Lifetime
Application number
JP58094750A
Other languages
Japanese (ja)
Other versions
JPS59219800A (en)
Inventor
Junichiro Fujimoto
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to JP9475083A priority Critical patent/JPS59219800A/en
Publication of JPS59219800A publication Critical patent/JPS59219800A/en
Publication of JPH0527120B2 publication Critical patent/JPH0527120B2/ja
Granted legal-status Critical Current

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to a speech pattern comparison method for use in a speech recognition device.

Prior Art

In recent years, speech recognition devices have been put into practical use to realize man-machine dialogue. A critical part of speech recognition is the matching section, which compares the feature patterns registered in a dictionary against the feature pattern of the input speech. This matching ordinarily faces two problems: first, the duration of an utterance varies from one utterance to the next; second, formants differ from speaker to speaker, producing frequency variation. To absorb the first kind of variation, a pattern matching method based on dynamic programming (DP) is known. To absorb variation in the time direction, this DP matching method stretches or compresses the time length of the two patterns being compared so as to maximize their similarity. However, because the method evaluates similarity over every possible correspondence between the two patterns, the amount of computation is large, and absorbing frequency variation as well would require an enormous amount of DP computation. No method for absorbing the second kind of variation, frequency variation, has yet been established.

Purpose

The present invention has been made in view of the above circumstances, and aims to provide a speech pattern comparison method that absorbs both time variation and frequency variation with a small amount of computation and thereby matches patterns accurately.

Configuration

To achieve the above object, the present invention provides, first, a speech pattern comparison method for comparing speech patterns each consisting of a time series of feature values obtained by frequency analysis of speech, in which the result of the frequency analysis is binarized to obtain a feature distribution, the feature distribution is thinned in the frequency axis direction to form a speech pattern, that thinned pattern is used as at least one of the patterns to be compared, and the two speech patterns are compared after being aligned in the time axis direction by dynamic programming. Alternatively, the invention provides, second, a speech pattern comparison method in which the peaks of the band powers obtained by frequency analysis are detected, the speech pattern formed by the time series of those peaks is used as at least one of the patterns to be compared, and the two speech patterns are again compared after being aligned in the time axis direction by dynamic programming. Embodiments are described below.
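As a rough illustration of the binarize-and-thin step described above, the following Python sketch reduces each vertical run of 1s within a frame to its single center band. The array layout (frequency bands by frames) and the center-of-run rule are assumptions for illustration, not details fixed by the patent.

```python
import numpy as np

def thin_frequency(binary):
    """Thin a binarized feature distribution along the frequency axis.

    binary: shape (I, J) array of 0/1 values (bands x frames).
    Each vertical run of 1s within a frame is reduced to its single
    center band, giving a thin-line pattern like B(i, j) in the text.
    """
    I, J = binary.shape
    thin = np.zeros((I, J), dtype=int)
    for j in range(J):
        i = 0
        while i < I:
            if binary[i, j] == 1:
                start = i
                while i < I and binary[i, j] == 1:
                    i += 1
                thin[(start + i - 1) // 2, j] = 1  # keep center of the run
            else:
                i += 1
    return thin
```

A run of 1s spanning bands 1..3 of a frame, for example, collapses to a single 1 in band 2.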

First, the ordinary DP matching method will be explained with reference to Figs. 1 and 2.

Consider comparing the pattern of Fig. 1a with that of Fig. 1b. The figure shows patterns 1, 2, ... obtained by sampling a speech pattern at fixed intervals along the time axis; each of these samples is called a frame. The DP method first associates the first frame of Fig. 1a with the first frame of Fig. 1b, computes the difference between the two waveforms, and obtains the shaded portion of Fig. 2. In the same way it associates the first frame of a with the second frame of b, the first frame of a with the third frame of b, ..., the second frame of a with the first frame of b, the second frame of a with the second frame of b, and so on, and establishes the frame-to-frame (that is, time axis) correspondence that minimizes the waveform difference. The method is therefore effective for patterns with little frequency variation, such as two utterances produced by the same person. However, when two waveforms are similar in shape but shifted in frequency, like the broken and solid curves of Fig. 2, it cannot treat them as the same waveform. This situation arises when the speakers of utterances a and b differ, and is caused by the formant differences between individuals.

The present invention was made to overcome this drawback of the DP matching method. Its operating principle will be explained with reference to Fig. 3. First, the pattern sampled as described above is sampled along the frequency axis and along the time axis; let i = 1, 2, ..., I index the frequency bands from the lowest upward, let j = 1, 2, ..., J index the time axis, and denote the two patterns A(i, j) and B(i, j). The pattern to be registered in the dictionary is divided by the filter bank into bands i = 1, 2, ..., I, binarized with a threshold, and registered as A(i, j_A), j_A = 1, ..., J_A. The speech to be recognized is likewise binarized and then thinned, giving B(i, j_B), j_B = 1, 2, ..., J_B. The question is then how to associate j_A with j_B; this association is shown in Fig. 3, where A(i, j_A) drawn on the i-j_A plane appears as in panel a, B(i, j_B) drawn on the i-j_B plane appears as in panel b, and the parts binarized to 1 (rather than 0) are hatched. At each point of the mesh (j_A, j_B) formed by the sample points j_A and j_B, the similarity r(j_A, j_B) between A(i, j_A) and B(i, j_B) is defined by

r(j_A, j_B) = Σ_{i=1}^{I} A(i, j_A) · B(i, j_B) ...(1)

and, writing R(j_A, j_B) for the cumulative similarity from j_A = 1, j_B = 1 up to (j_A, j_B), the points (j_A, j_B) are determined so that

R(j_A, j_B) = r(j_A, j_B) + max{R(j_A, j_B - 1), R(j_A - 1, j_B - 1), R(j_A - 1, j_B)} ...(2)

where max denotes taking the largest of the values in braces. Although expression (1) is written as a product, it may equally be computed as a logical operation, or the sum may be restricted to the values of i at which B(i, j_B) is 1. The result of expression (2) may also be normalized by the number of frames, I + J. The start frames of the two patterns, and likewise their end frames, are assumed to be made to correspond.
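Expressions (1) and (2) can be sketched directly in Python. The function below accumulates the similarity over a monotone path with the start and end frames forced to correspond; the array shapes, names, and the use of numpy are illustrative assumptions, not part of the patent.

```python
import numpy as np

def cumulative_similarity(A, B):
    """DP alignment of two binary spectro-temporal patterns.

    A: dictionary pattern, shape (I, JA), binarized band energies (0/1).
    B: input pattern, shape (I, JB), binarized and thinned (0/1).
    Returns the maximum cumulative similarity R(JA, JB), with the start
    and end frames of the two patterns forced to correspond.
    """
    I, JA = A.shape
    _, JB = B.shape
    # r(jA, jB) = sum_i A(i, jA) * B(i, jB)   ... expression (1)
    r = A.T @ B                        # shape (JA, JB)
    R = np.full((JA, JB), -np.inf)
    R[0, 0] = r[0, 0]                  # start frames correspond
    for ja in range(JA):
        for jb in range(JB):
            if ja == 0 and jb == 0:
                continue
            # R = r + max{R(jA, jB-1), R(jA-1, jB-1), R(jA-1, jB)}  ... (2)
            best = max(
                R[ja, jb - 1] if jb > 0 else -np.inf,
                R[ja - 1, jb - 1] if ja > 0 and jb > 0 else -np.inf,
                R[ja - 1, jb] if ja > 0 else -np.inf,
            )
            R[ja, jb] = r[ja, jb] + best
    return R[JA - 1, JB - 1]           # end frames correspond
```

As the text notes, the returned score could further be divided by the number of frames for normalization before comparing candidates of different lengths.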

Fig. 4 is a block diagram showing one embodiment of the present invention constructed according to the above operating principle. In the figure, 1 is a microphone, 2 a filter bank, 3 a speech interval detection section, 4 a binarization section, 5 a switch, 6 a dictionary section, 7 a thinning section, 8 a similarity calculation section, 9 a j_A, j_B updating section, 10 a similarity detection section, 11 an R calculation section, 12 a stepping section that advances j_A or j_B by one step, 13 a section that finds the maximum of R, and 14 a recognition result output section. According to the present invention, pattern a has width in the frequency axis direction while pattern b is narrow, so even when the frequency varies with the speaker and pattern b therefore shifts along the frequency axis, the variation is absorbed as long as b does not stray outside the width of pattern a.

Fig. 5 shows another embodiment of the present invention. Here a peak detection section 15 is provided in front of the dictionary section 6; it detects the peaks along the frequency axis of the speech feature pattern, and the resulting peak pattern is registered in the dictionary. When a pattern to be matched is input, it is binarized to 0 or 1 with a certain threshold (the parts that become 1 are called the feature distribution), and the similarity between it and the dictionary pattern is maximized by time warping under dynamic programming according to expression (2) above.
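As a sketch of the kind of peak picking the peak detection section 15 might perform, the following marks strict local maxima of band power along the frequency axis within each frame. The frame layout and the strict-maximum rule are assumptions for illustration; the patent does not specify the peak criterion.

```python
import numpy as np

def peak_pattern(power):
    """Mark local maxima of band power along the frequency axis.

    power: shape (I, J) array of band powers (frequency bands x frames).
    Returns a binary (I, J) pattern with 1 at each within-frame peak,
    i.e. the time series of spectral peaks registered in the dictionary.
    """
    I, J = power.shape
    peaks = np.zeros((I, J), dtype=int)
    for j in range(J):
        col = power[:, j]
        for i in range(I):
            left = col[i - 1] if i > 0 else -np.inf
            right = col[i + 1] if i < I - 1 else -np.inf
            if col[i] > left and col[i] > right:
                peaks[i, j] = 1  # strict local maximum in this frame
    return peaks
```

A pattern produced this way is narrow along the frequency axis, like the thinned pattern b, and so can be matched against a binarized dictionary pattern that retains its frequency width.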

Effects

As is clear from the above description, the present invention provides a highly accurate speech pattern comparison method that can absorb both time variation and frequency variation with a small amount of computation.

BRIEF DESCRIPTION OF THE DRAWINGS

Figs. 1 and 2 are diagrams for explaining the DP matching method, Fig. 3 is a diagram for explaining the principle of the present invention, and Figs. 4 and 5 are block diagrams for explaining embodiments of the present invention.

1: microphone; 2: filter bank; 3: speech interval detection section; 4: binarization section; 5: switch; 6: dictionary section; 7: thinning section; 8: similarity calculation section; 9: j_A, j_B updating section; 10: maximum similarity calculation section; 11: R calculation section; 12: j_A (j_B) stepping section; 13: maximum-of-R calculation section; 14: result output section; 15: peak detection section.

Claims (1)

[Claims]

1. A speech pattern comparison method for comparing speech patterns each consisting of a time series of feature values obtained by frequency analysis of speech, characterized in that the result of the frequency analysis is binarized to obtain a feature distribution, the feature distribution is thinned in the frequency axis direction to form a speech pattern, the thinned pattern is used as at least one of the speech patterns to be compared, and the comparison is performed after aligning both speech patterns in the time axis direction by dynamic programming.

2. A speech pattern comparison method for comparing speech patterns each consisting of a time series of feature values obtained by frequency analysis of speech, characterized in that the peaks of the band powers obtained as a result of the frequency analysis are detected, the speech pattern showing those peaks as a time series is used as at least one of the speech patterns to be compared, and the comparison is performed after aligning both speech patterns in the time axis direction by dynamic programming.
JP9475083A 1983-05-27 1983-05-27 Voice pattern collator Granted JPS59219800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP9475083A JPS59219800A (en) 1983-05-27 1983-05-27 Voice pattern collator

Publications (2)

Publication Number Publication Date
JPS59219800A JPS59219800A (en) 1984-12-11
JPH0527120B2 (en) 1993-04-20

Family

ID=14118797

Family Applications (1)

Application Number Title Priority Date Filing Date
JP9475083A Granted JPS59219800A (en) 1983-05-27 1983-05-27 Voice pattern collator

Country Status (1)

Country Link
JP (1) JPS59219800A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5313531A (en) * 1990-11-05 1994-05-17 International Business Machines Corporation Method and apparatus for speech analysis and speech recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5023941A (en) * 1973-07-02 1975-03-14

Also Published As

Publication number Publication date
JPS59219800A (en) 1984-12-11

Similar Documents

Publication Publication Date Title
JPS58130393A (en) Voice recognition equipment
JPH07104952B2 (en) Pattern matching device
Elenius et al. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system
JPH0527120B2 (en)
JP2997007B2 (en) Voice pattern matching method
Saha et al. Modified mel-frequency cepstral coefficient
Niederjohn et al. Computer recognition of the continuant phonemes in connected English speech
Tanaka A dynamic processing approach to phoneme recognition (part I)--Feature extraction
JP2557497B2 (en) How to identify male and female voices
JPS61260299A (en) Voice recognition equipment
JP2514983B2 (en) Voice recognition system
JPS61233791A (en) Voice section detection system for voice recognition equipment
JP2655637B2 (en) Voice pattern matching method
JP2996977B2 (en) Voice recognition device
JP2901976B2 (en) Pattern matching preliminary selection method
JPH07104675B2 (en) Speech recognition method
JPS63223698A (en) Monosyllable voice recognition equipment
Chaudhuri et al. Automatic Recognition of Isolated Spoken Words with New Features
JPS61203498A (en) Preselection system for voice recognition equipment
JPH0367279B2 (en)
Haus et al. Department of Electrical Engineering and Computer Science and Research Laboratory of Electronics Massachusetts Institute of Technology
JPS6229798B2 (en)
JPS6075895A (en) Voice pattern analogy calculation system
JPS61252595A (en) Voice recognition processing system
JPS60262198A (en) Consonant section detector