JPS6229799B2

Info

Publication number: JPS6229799B2
Application number: JP55075120A
Authority: JP
Inventors: Hidefumi Ooga; Hidekazu Tsuboka; Hidekazu Yabuchi
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1980-06-03
Filing date: 1980-06-03
Publication date: 1987-06-29
Also published as: JPS57700A

Description

[Detailed description of the invention]

The present invention relates to a speaker-dependent speech recognition device.

First, an outline of a speech recognition device is given with reference to FIGS. 1 and 2, the problems of conventional recognition devices are made clear, and the object of the present invention is explained.

In FIG. 1, a speech signal input from a microphone 10 is amplified by an amplifier 11 and analyzed by a frequency analysis section 12. A speech pattern conversion section 13 then normalizes the energy (amplitude), extracts features, and converts the result into a speech pattern. With switch 14 connected in advance to the 15 side, the output of the speech pattern conversion section 13 is stored in a registered pattern area 18. The words to be recognized are uttered one after another, and their speech patterns are stored in turn in the registered pattern area 18. With switch 14 connected to the 16 side, a pattern matching section 17 matches each registered pattern in the registered pattern area 18 against the output of the speech pattern conversion section 13 and calculates the similarity between each registered pattern and the input pattern (the output of 13). A decision section 19 selects the largest of these similarities and examines whether it is valid. If it is, the category number of the registered pattern with the largest similarity is output as the result, and the input speech has been recognized.

To perform speech recognition, the speech signal must thus be analyzed by some means and converted into a speech pattern. In FIG. 1, frequency analysis is used as the analysis means. This analysis section has long been the obstacle to reducing the size or cost of speech recognition devices.

FIG. 2 shows an example of the frequency analysis section of FIG. 1.
Reference numerals 20, 21, and 22 denote band-pass filters with different center frequencies. The outputs of these band-pass filters are passed through rectifier circuits 23 and low-pass filters 24 to convert them to nearly DC levels. A multiplexer 25, controlled by a control signal 27, connects each low-pass filter output in turn to an A/D converter 26, which converts the analog signal into an easily processed digital signal. As many as 8 or 16 band-pass filters are generally required, so implementing FIG. 2 makes the whole device large, which is a problem for miniaturization, as is the difficulty of adjusting the band-pass filters.

The FFT (fast Fourier transform) is another means of frequency analysis, but the complexity of the processing makes real-time operation difficult, and achieving it raises problems such as cost.

Alternatively, FIG. 2 can be implemented with digital filters, but this is as difficult as the FFT. In the case of the FFT and of digital filters,
multiple multipliers are required, which is the source of the difficulty.

Besides frequency analysis, there are also analysis methods that compute autocorrelation coefficients, linear prediction coefficients, and the like, but all of them require multiple multipliers and were therefore problematic in terms of reducing the size or cost of the whole device.

A means of analyzing the speech signal and converting it into a speech pattern with simpler processing has therefore long been desired. That is the object of the present invention: to reduce the size or cost of a speech recognition device through simpler processing. Embodiments of the present invention are described below with reference to the drawings.

The Fourier transform is one means of converting a signal into the frequency domain. However, transforms based on orthogonal function systems are not limited to the trigonometric system. The present invention analyzes the speech signal using the Walsh function system instead of the trigonometric system. The elements of the Walsh transform matrix are only +1 and −1, so the transform can be computed with additions and subtractions alone. Moreover, the fast Walsh-Hadamard transform is available as a transform for the Walsh function system; using it makes the analysis far simpler than conventional methods and allows the whole speech recognition device to be made smaller and cheaper. The already well-known Walsh functions and transform method are briefly explained below.

First, the Walsh functions. A Walsh function is a two-valued function taking only the values +1 and −1. On the interval −1/2 ≤ θ < 1/2 normalized by the period T (θ = t/T), it is defined by the following difference equation.

$$
\begin{aligned}
\mathrm{wal}(2j+p,\theta) &= (-1)^{[j/2]+p}\left\{\mathrm{wal}\!\left[j,\,2\left(\theta+\tfrac14\right)\right] + (-1)^{j+p}\,\mathrm{wal}\!\left[j,\,2\left(\theta-\tfrac14\right)\right]\right\},\\
&\qquad p = 0 \text{ or } 1,\quad j = 0, 1, 2, \ldots\\
\mathrm{wal}(0,\theta) &= 1 \quad \left(-\tfrac12 \le \theta < \tfrac12\right)\\
\mathrm{wal}(0,\theta) &= 0 \quad \left(\theta < -\tfrac12,\ \theta \ge \tfrac12\right)
\end{aligned}
\qquad \cdots (1)
$$

Writing the even Walsh functions defined above as cal(i, θ) and the odd ones as sal(i, θ), the following relations hold.

$$
\mathrm{cal}(i,\theta) = \mathrm{wal}(2i,\theta), \qquad \mathrm{sal}(i,\theta) = \mathrm{wal}(2i-1,\theta) \qquad \cdots (2)
$$

Because the Walsh function system defined in this way is a complete orthonormal system, a function x(θ) can be expanded in a Walsh series in the same way as in a Fourier series based on the trigonometric system. Truncating this series at N (= 2ⁿ) terms gives the following approximation.

[Formula] ……(3)

FIG. 3a shows the Walsh function system for N = 8 (= 2³) and its correspondence with the trigonometric system: cal corresponds to cos and sal to sin. What corresponds to "frequency" in the trigonometric system is the "sequency" (number of alternations) for Walsh functions, and the sequency spectrum is given by

$$
S_i = \sqrt{a_i^2 + b_i^2}, \qquad S_0 = a_0 \qquad \cdots (4)
$$

In other words, by finding the a_i and b_i of equation (3) from the input signal and computing the sequency spectrum from equation (4), the input signal can be transformed into the sequency domain, just as the Fourier transform takes it into the frequency domain.

The rows of the Hadamard matrix defined by the following equation can be put into correspondence with the Walsh functions.

[Formula] ……(5)

FIG. 3c shows the matrix of equation (5) for N = 8 (the natural-form Hadamard matrix), and FIG. 3b shows the 8th-order Walsh-Hadamard matrix based on equation (1). The correspondence between the rows of the Hadamard matrix and the Walsh functions is indicated by 30. With input vector X(N) and output vector B(N), the Hadamard transform is expressed as

$$
B(N) = \frac{1}{N}\,H(n)\,X(N) \qquad \cdots (6)
$$

and, by matching the rows of the Hadamard matrix to the Walsh functions, B(N) can be regarded as the vector whose elements are the a_i and b_i. FIG. 4 shows the permutation matrix T(N) that performs this matching; multiplying the input vector X(N) or the output vector B(N) by it yields the a_i and b_i in order of sequency. Algorithms based on the Gray code have been devised for obtaining this permutation matrix.

The above is summarized with reference to FIG. 5. The speech signal 50 is sampled as shown to obtain the time series 51 of x(0) to x(7). Multiplying this by the permutation matrix of FIG. 4 gives 52. Multiplying 52 by the natural-form Hadamard matrix of FIG. 3c gives the coefficients in order of sequency, as shown at 53, and the sequency spectrum shown at 54 is finally obtained from equation (4). The coefficients 53 are also obtained in order of sequency if the time series 51 is multiplied directly by the Walsh-form Hadamard matrix of FIG. 3b, without first applying the permutation matrix. However, for the fast Hadamard transform described next, applying the permutation matrix and then transforming with the natural-form Hadamard matrix greatly reduces the amount of computation.

FIG. 6 shows the 8th-order fast Hadamard transform. There are various ways of computing the natural-form Hadamard transform quickly; one example is shown. In FIG. 6, a plain arrow (61, etc.) denotes addition, and an arrow marked with a minus sign (62) denotes subtraction.
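As a rough illustration of the transform just described, the Python sketch below implements a butterfly scheme of the kind FIG. 6 depicts, together with a Gray-code-based row ordering. It is a minimal sketch, not the circuit of the embodiment: the function names `fwht`, `sequency_permutation`, and `walsh_transform` are hypothetical, the bit-reversed Gray code is one common way of building the permutation (the text only states that Gray-code-based algorithms exist), and the 1/N scaling of equation (6) is left out.

```python
def fwht(samples):
    """Natural-order fast Hadamard transform using only additions and subtractions.

    Returns (coefficients, number_of_add_sub_operations). The length must be a
    power of two. No 1/N scaling is applied (an assumption of this sketch).
    """
    a = list(samples)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    ops = 0
    half = 1
    while half < n:                       # log2(N) stages
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                u, v = a[i], a[i + half]
                a[i] = u + v              # plain arrow in FIG. 6: addition
                a[i + half] = u - v       # arrow marked "-": subtraction
                ops += 2
        half *= 2
    return a, ops


def sequency_permutation(n):
    """perm[w] = row of the natural-form Hadamard matrix that has w sign changes.

    Assumed construction: Gray code of w followed by bit reversal.
    """
    bits = n.bit_length() - 1
    perm = []
    for w in range(n):
        gray = w ^ (w >> 1)
        reversed_bits = 0
        for _ in range(bits):
            reversed_bits = (reversed_bits << 1) | (gray & 1)
            gray >>= 1
        perm.append(reversed_bits)
    return perm


def walsh_transform(samples):
    """Coefficients in sequency order: natural-order transform, then reordering."""
    natural, _ = fwht(samples)
    return [natural[k] for k in sequency_permutation(len(samples))]
```

For eight samples, `fwht` runs three stages of eight operations each, and `sequency_permutation(8)` gives the row order 0, 4, 6, 2, 3, 7, 5, 1. The sketch reorders the output for simplicity; FIG. 5 instead applies the permutation to the input before the transform.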
In this case the number of additions and subtractions is 24 (3 × 8), whereas 64 (8 × 8) would be needed without such a method, so the reduction is substantial. For order N, the amount of computation is N log₂ N.

This concludes the explanation of the already well-known Walsh functions and transform method. The explanation mainly concerned N = 8, but the same processing applies when N is increased by factors of 2.

Speech recognition using the Walsh-Hadamard transform is now described with reference to FIG. 7.

The speech signal input from the microphone 71 is amplified and applied to the A/D converter 73, where it is converted into a digital signal at the sampling interval set by the oscillator 75 and written sample by sample into area 74 or area 76. Switches 89 and 88 form a two-gang switch: when switch 89 is on the i₁ side, switch 88 is on its side 2, and when switch 89 is on the i₂ side, switch 88 is on its side 1.
These switches are controlled by the oscillator 75: each time a fixed number N of samples has been written into area 74 or area 76, the control signal 81 changes the switches over. The reordering section 77 applies a permutation matrix such as that of FIG. 4 to the contents of area 74 or area 76. The reordered input vector is then Hadamard-transformed in the fast Hadamard transform section 78 (the same processing as in FIG. 6). From the output of 78, section 82 computes the sequency spectrum of equation (4). The processing of 77, 78, and 82 is carried out using area 74 or area 76. While reordering, Hadamard transformation, and spectrum calculation are in progress on one area, the output of the A/D converter is being written into the other, and switches 88 and 89 perform this changeover.

With a sampling interval of 100 µs and N = 256 (a 256th-order Hadamard transform), the sequency spectrum of the speech signal is obtained every 25.6 ms (100 µs × 256). Section 82 calculates a 129-point spectrum, including the DC component.
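To make the frame arithmetic and the 129-point count concrete, the following Python sketch pairs sequency-ordered coefficients into a spectrum according to equation (4). It is an illustrative sketch under stated assumptions: the function name `sequency_spectrum` is hypothetical, the input is assumed to be already in sequency order (wal(0), sal(1), cal(1), sal(2), ...), the assignment of a_i to cal and b_i to sal follows the cos/sin analogy, and the handling of the last coefficient, which has no cal partner, is this sketch's own choice.

```python
import math

SAMPLE_INTERVAL_US = 100      # sampling interval given in the text
FRAME_LENGTH = 256            # N given in the text

frame_duration_us = SAMPLE_INTERVAL_US * FRAME_LENGTH   # 25 600 us = 25.6 ms
resolution_hz = 1_000_000 / frame_duration_us           # spacing of spectrum points


def sequency_spectrum(walsh_coeffs):
    """Return N/2 + 1 spectrum values from N sequency-ordered coefficients."""
    n = len(walsh_coeffs)
    spectrum = [abs(walsh_coeffs[0])]            # S_0: magnitude of the DC term a_0
    for i in range(1, n // 2):
        b_i = walsh_coeffs[2 * i - 1]            # assumed: coefficient of sal(i) = wal(2i - 1)
        a_i = walsh_coeffs[2 * i]                # assumed: coefficient of cal(i) = wal(2i)
        spectrum.append(math.hypot(a_i, b_i))    # S_i = sqrt(a_i^2 + b_i^2), eq. (4)
    spectrum.append(abs(walsh_coeffs[n - 1]))    # sal(N/2) alone (assumption)
    return spectrum
```

With 256 coefficients the function returns 1 + 127 + 1 = 129 values, which matches the point count stated above.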
Section 79 calculates the sum of squares of the spectrum excluding the DC component, or the square root of that sum, so that the speech power is obtained every 25.6 ms. From this per-frame speech power, the word segmentation section 80 detects whether speech is present or the frame is silent, and sends the compression section 83 a signal 91 indicating whether or not it should operate. Section 80 also sends the input pattern area 90 a signal 92 used to remove the silence before and after the uttered word and so cut out the word. Some words contain a silent period (for example, in "Tokyo" there is a silent period between "To" and "kyo"), so a silent period does not necessarily mean the end of the word. The word is taken to have ended only after silence has continued for a certain length of time. From the moment speech appears, the output of the compression section 83 is stored in the input pattern area 90, and it continues to be stored until the end of the word. Using signal 92, the input pattern area is processed so that the silence near the end of the word is removed and only the word itself is finally stored.

The output of the sequency spectrum section 82 cannot be used as a speech pattern as it is, and is therefore compressed in section 83. The compression section 83 compresses the amount of information and normalizes it so that it is independent of the speech power level.
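Before the compression stage is examined in detail, the end-of-word rule of section 80 described above (a short internal silence does not end the word, a sufficiently long one does) can be sketched in Python as follows. This is a hedged illustration only: `segment_word`, `threshold`, and `max_gap_frames` are hypothetical names, and the text gives no numerical values for the power threshold or for the silence duration that ends a word.

```python
def segment_word(frame_powers, threshold, max_gap_frames):
    """Return (start, end) frame indices of the first word, end exclusive, or None.

    A frame counts as speech when its power is at least `threshold`. The word
    ends once `max_gap_frames` consecutive silent frames have been seen; those
    trailing silent frames are not part of the returned range.
    """
    start = None
    last_speech = None
    for i, power in enumerate(frame_powers):
        is_speech = power >= threshold
        if start is None:
            if is_speech:                 # leading silence is skipped
                start = i
                last_speech = i
        elif is_speech:                   # speech resumes after a short pause
            last_speech = i
        elif i - last_speech >= max_gap_frames:
            break                         # silence lasted long enough: word is over
    if start is None:
        return None
    return start, last_speech + 1
```

For example, with powers `[0, 0, 5, 6, 0, 7, 8, 0, 0, 0, 0, 9]`, a threshold of 3, and a gap of 3 frames, the single silent frame inside the word is kept and the result is frames 2 to 6.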
With N = 256 and a sampling interval of 100 µs, a 129-point sequency spectrum is obtained every 25.6 ms, but it varies with the loudness of the speech and carries a large amount of information. This is why the compression and normalization mentioned above are needed. FIG. 8 shows a more detailed block diagram of the compression section 83.

The channel division section 95 divides the N spectrum values (excluding the DC component) into M groups and sums each group, converting the N spectrum values into M channels of information.

FIG. 9 shows the case of division into M channels with N = 256 and a sampling interval of 100 µs. For example, for channel 8,

[Formula] is computed and taken as the value of channel 8, where S_i follows equation (4). The normalization section 96 makes the information independent of the speech power, according to the following formula.

[Formula]

L_k is an integer, and this makes it independent of the speech power. If K is set to 16, for example, the maximum value of L_k is 16, and 4 bits suffice for L_k. The amount of information per 25.6 ms is then 4 bits × 16 channels = 64 bits.
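The channel division and normalization just described can be illustrated with a short Python sketch. The normalization formula itself is not reproduced in the text, so the rule used here, L_k = floor(K · C_k / ΣC), is an assumption chosen only to be consistent with the stated properties (integer L_k, maximum value K, independence from speech power). The names `channel_sums` and `normalize` and the band edges in the usage note are likewise hypothetical.

```python
def channel_sums(spectrum, edges):
    """Sum the spectrum over M bands; band m covers spectrum[edges[m]:edges[m + 1]].

    Starting the first band at index 1 leaves out the DC component.
    """
    return [sum(spectrum[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def normalize(channels, k=16):
    """Scale channel values to integers that no longer depend on the overall level.

    Assumed rule (not taken from the source): L_m = floor(k * C_m / sum(C)).
    """
    total = sum(channels)
    if total == 0:
        return [0] * len(channels)        # nothing to normalize
    return [int(k * c / total) for c in channels]
```

As a usage example, `normalize([1, 3, 4, 8])` and `normalize([10, 30, 40, 80])` both give `[1, 3, 4, 8]`: multiplying every channel by the same factor, as a louder utterance would, leaves the pattern unchanged. With the hypothetical edges `[1, 2, 3, 5, 9, 17, 33, 65, 129]`, `channel_sums` reduces a 129-point spectrum to 8 channels.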
With an 8-bit A/D converter, 8 bits × 256 samples of information are thus compressed to 64 bits and normalized so as to be independent of the speech power.

The spectrum can be divided either equally, with a constant width, or logarithmically, so that the ratio between adjacent center values, taken as the geometric mean of each division, is roughly constant. The latter is preferable. With a sampling interval of 100 µs and N = 256, the spectral resolution is 39 Hz and the sequency spectrum is obtained in units of 39 Hz, whereas the first formant of speech lies in the low-sequency region. With equal division, the channels computed from the high-sequency region would carry very little information, and the features of the speech signal would be hard to bring out.

Because of amplifier drift, offset, and the like, the DC component is best left out of both the power calculation and the spectrum division.

When dividing the spectrum into channels, it would be more natural to take the sum of squares of the spectrum values and then the square root than simply to add them, but this complicates the processing. In experiments, simple addition was sufficient. Taking the sum of squares without the square root and then normalizing is also conceivable, but the experimental results for the three methods (sum, sum of squares, square root of the sum of squares) did not differ much. The plain sum is advantageous because it is the easiest to compute.

If this compression is applied where the speech power is low, a stable speech pattern cannot be obtained because of the S/N ratio. Speech and silence are therefore detected, and where the speech power is small the frame is treated as silence: no compression is performed, and the speech pattern is set to a fixed value (for example, all zeros).

Noise is in fact superimposed on the silent sections of the speech signal, and if it were normalized as it is, it would be amplified to the same level as the true signal components. Moreover, the positions at which this noise appears in the spectrum are not fixed, so they generally differ from the positions of the noise components in the pre-registered speech patterns. When the speech pattern of the input signal was matched against the registered patterns, this could show up as a very large difference in distance and cause misrecognition. Setting the silent sections to a constant value, as described above, solves this problem.

This completes the description of the compression section 83. The subsequent processing is the same as that described for the conventional example of FIG. 1.

With switch 87 set to the e side, the input pattern is transferred to the registered pattern area 84 and the word is registered. For recognition, switch 87 is set to the R side, and pattern matching is performed between the pattern in the input pattern area 90 and the registered patterns in the registered pattern area 84. The decision section 86 receives the output of the pattern matching section 85, selects the category number of the most similar registered pattern, and examines whether the similarity is valid. If it is, the category number is output and the input word has been recognized.

Equation (4) was given as the sequency spectrum, but computing it requires squaring and square roots.
The power calculation likewise requires squaring and square roots. Whereas the Hadamard transform itself needs only additions and subtractions, these calculations need squares and square roots, which complicates the processing.

The sequency spectrum was therefore defined as a sum of absolute values by equation (8) below and used for the calculation in the sequency spectrum section of FIG. 7, with the power taken as the sum of the sequency spectrum. In experiments, the recognition results were almost the same as with equation (4).

$$
S_i = |a_i| + |b_i| \qquad \cdots (8)
$$

Taking absolute values in this way also makes it unnecessary to evaluate equation (8) explicitly. It is enough to take the absolute value of every coefficient after the Hadamard transform and perform the channel division directly on them, using boundaries equal to twice the spectrum numbers listed in the table below. The processing then becomes extremely simple.
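The shortcut described above can be checked with a small Python sketch: summing the absolute values of the sequency-ordered coefficients over a doubled index range gives the same channel value as first forming S_i by equation (8). The function names and the exact index convention (S_i built from coefficients 2i − 1 and 2i, following equation (2)) are assumptions of this sketch; the table of spectrum numbers is not reproduced in the text.

```python
def abs_spectrum(coeffs):
    """S_i = |a_i| + |b_i| (eq. (8)) from sequency-ordered coefficients."""
    n = len(coeffs)
    spectrum = [abs(coeffs[0])]                   # DC term
    for i in range(1, n // 2):
        spectrum.append(abs(coeffs[2 * i]) + abs(coeffs[2 * i - 1]))
    spectrum.append(abs(coeffs[n - 1]))           # last term has no partner (assumption)
    return spectrum


def channel_from_spectrum(coeffs, lo, hi):
    """Channel value as the sum of S_lo ... S_(hi-1), valid for 1 <= lo < hi <= N/2."""
    return sum(abs_spectrum(coeffs)[lo:hi])


def channel_direct(coeffs, lo, hi):
    """Same channel value without forming S_i: doubled boundaries on |coefficients|."""
    return sum(abs(c) for c in coeffs[2 * lo - 1:2 * hi - 1])
```

For the coefficients `[3, -1, 4, -1, 5, -9, 2, -6]`, both functions return 11 for the channel covering spectrum numbers 1 and 2, so the per-point spectrum step can be skipped.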

[Table]

However, calculating the speech power from absolute values causes some trouble in detecting speech and silence. Because the power computed from the sum of absolute values has no clear correlation with the actual speech power, speech and silence were sometimes detected incorrectly and recognition failed.

FIG. 9 shows an embodiment that processes absolute values while solving this problem. Section 100 computes the absolute-value sums of the output of the Hadamard transform section 78; without computing the sequency spectrum of equation (8), channel division is performed directly in section 95 as described above, and the normalization section 96 converts the result into a speech pattern.

The power is not calculated from the Hadamard-transformed, absolute-valued data; as shown in the figure, it is calculated from the speech signal itself before the Hadamard transform. The power is obtained from the A/D-converted speech signal x(n) by the following formula.

[Formula]

Using this value, speech and silence can be detected more accurately.

The other blocks in FIG. 9 are the same as in FIG. 7. However, the squaring and square-root calculations are eliminated and all processing can be done with simple additions and subtractions, which makes it easier still to reduce the size or cost of the device.

As described above, according to the present invention, the information compression means, which compresses the transform values obtained by the orthogonal transformation means and converts them into a speech pattern, consists of channel information conversion means, which takes the absolute value of each spectrum component of the transform values, divides these absolute values into M groups, sums each divided region, and converts the result into M channels of information, and normalization means, which normalizes the channel information so that the total over the divided regions is constant. In addition, the normalization means is configured to replace silent sections of the speech signal with a constant value. The information compression therefore greatly reduces the amount of computation, and the normalization method avoids unnatural normalization of the silent sections of the speech signal, so that recognition accuracy is greatly improved.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a speaker-dependent speech recognition device. FIG. 2 is a block diagram of the frequency analysis section of the device of FIG. 1. FIG. 3 is a diagram for explaining the Walsh-Hadamard functions. FIG. 4 shows the permutation matrix for converting a natural-form Hadamard matrix into a Walsh-form Hadamard matrix. FIG. 5 is a diagram for explaining the method of transformation into the sequency domain. FIG. 6 shows an example of the fast Hadamard transform. FIG. 7 is a block diagram of a speech recognition device according to one embodiment of the present invention. FIG. 8 is a block diagram of the compression section of that device. FIG. 9 is a block diagram of another embodiment.

71: microphone; 72: amplifier; 73: A/D converter; 75: oscillator; 74: area; 76: area; 77: reordering section; 78: fast Hadamard transform section; 79: power calculation section; 80: word segmentation section; 82: sequency spectrum section; 83: compression section; 84: registered pattern area; 85: pattern matching section; 86: decision section; 87, 88, 89: switches; 90: input pattern area.

Claims

[Claims]

1. A speech recognition device comprising:
orthogonal transformation means for performing an orthogonal transformation of a speech signal;
information compression means for compressing the information of the transform values obtained by the orthogonal transformation means and converting it into a speech pattern; and
pattern matching means for performing pattern matching between the speech pattern and pre-registered speech patterns,
wherein the information compression means comprises channel information conversion means that takes the absolute value of each spectrum component of the transform values obtained by the orthogonal transformation means, divides these absolute values into M groups, sums each divided region, and converts the result into M channels of information, and normalization means that normalizes the channel information so that the total over the divided regions is constant, and
wherein the normalization means replaces silent sections of the speech signal with a constant value.