JPS61281300A

JPS61281300A - Voice recognition equipment

Info

Publication number: JPS61281300A
Application number: JP60123904A
Authority: JP
Inventors: 誠赤羽; 雅男渡; 曜一郎佐古; 平岩　篤信
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1985-06-07
Filing date: 1985-06-07
Publication date: 1986-12-11

Abstract

(57) [Summary] This publication consists of application data recorded before electronic filing, so no abstract data is available.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to a speech recognition device, and in particular to a technique for normalizing sound source characteristics so as to remove variation caused by differences in power level and between speakers.

[Summary of the invention]

The invention concerns a device in which an acoustic analysis section obtains a speech spectrum from the input speech, the sound source spectrum characteristic is obtained from this speech spectrum by the least squares method, the obtained sound source spectrum is subtracted from the original speech spectrum to give a speech spectrum whose sound source characteristics are normalized, and speech recognition is performed using that spectrum. In such a device, the invention provides correction means that yields a speech spectrum equivalent to one divided at equal logarithmic intervals even when the acoustic analysis section does not divide the frequency axis at equal logarithmic intervals. The acoustic analysis section can therefore be designed in a way suited to speech recognition. In addition, because the slope of the sound source spectrum is always removed from the speech spectrum regardless of whether the sound is voiced or unvoiced, the variable components of the sound source characteristics are reduced, and the performance of a speaker-independent speech recognition device can be improved.

[Conventional technology]

Speech is a phenomenon that changes along the time axis; distinct words are produced by uttering sounds whose spectral pattern changes from moment to moment. Speech recognition is the technology of automatically recognizing the words that humans utter, but achieving recognition comparable to human hearing is at present extremely difficult. For this reason, most speech recognition now in practical use works, under fixed conditions of use, by pattern matching between the standard patterns of the words to be recognized and the input pattern.

When the input is speaker-independent, that is, when the speaker is not limited to registered speakers, the method of speaker normalization is an important issue.

FIG. 2 is a block diagram of an example of a speech recognition device that takes this into account. Speech input from a microphone (1) is supplied to an acoustic analysis circuit (2), which extracts acoustic parameters representing the features of the input speech pattern. Various methods of acoustic analysis can be used to extract these parameters. In one example, a bandpass filter and a rectifier circuit form one channel, a number of such channels are arranged with passbands that together divide the speech band, and the time variation of the spectral pattern is extracted as the output of this bandpass filter group.

That is, in the acoustic analysis circuit (2), the speech signal from the microphone (1) is supplied through an amplifier (21) and a band-limiting low-pass filter (22) to an A/D converter (23), where it is converted into a 12-bit digital speech signal at a sampling frequency of, for example, 12.5 kHz. This digital speech signal is supplied to the digital bandpass filters (24₁), (24₂), ..., (24₁₆) of the channels of, for example, a 16-channel bandpass filter bank. The digital bandpass filters (24₁), (24₂), ..., (24₁₆) are, for example, fourth-order Butterworth digital filters, and their passbands are the bands obtained by dividing the range from 250 Hz to 5.5 kHz at equal intervals on a logarithmic axis.

In other words, the 16-channel bandpass filter bank is formed by dividing the frequency range at equal intervals on a log scale.
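The log-scale division described above can be illustrated with a short Python sketch. It only computes band edges for 16 channels between 250 Hz and 5.5 kHz; the function name and the choice of the geometric mean as a centre frequency are assumptions made for illustration and are not stated in the patent.

```python
import math


def log_spaced_band_edges(f_low=250.0, f_high=5500.0, channels=16):
    """Return channels + 1 band-edge frequencies equally spaced on a log axis."""
    # Equal spacing on a log axis means a constant ratio between adjacent edges.
    ratio = (f_high / f_low) ** (1.0 / channels)
    return [f_low * ratio ** k for k in range(channels + 1)]


edges = log_spaced_band_edges()

# Assumption: the geometric mean of the two edges is used as the centre frequency.
centers = [math.sqrt(lo * hi) for lo, hi in zip(edges[:-1], edges[1:])]
```

With this division every channel has the same relative bandwidth, which is why the low-frequency channels become narrow in hertz.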

The output signals of the digital bandpass filters (24₁), (24₂), ..., (24₁₆) are supplied to rectifier circuits (25₁), (25₂), ..., (25₁₆), respectively, and the outputs of these rectifier circuits are supplied to digital low-pass filters (26₁), (26₂), ..., (26₁₆), respectively. The digital low-pass filters (26₁), (26₂), ..., (26₁₆) are, for example, FIR low-pass filters with a cutoff frequency of 52.8 Hz.

The output signals of the digital low-pass filters (26₁), (26₂), ..., (26₁₆), which form the output of the acoustic analysis circuit (2), are supplied to a sampler (3) that constitutes a feature extraction circuit. The sampler (3) samples these output signals once every frame period of 5.12 msec. This yields the sample time series $A_i(n)$ ($i = 1, 2, \ldots, 16$; $n$ is the frame number, $n = 1, 2, \ldots, N$).

The output of the sampler (3), that is, the sample time series $A_i(n)$ of the speech spectrum, is supplied to a sound source information normalization circuit (4), which removes from the speech to be recognized the differences in vocal cord sound source characteristics between speakers and the differences in loudness of utterance.

A method has previously been proposed in which loudness and individual differences in vocal cord sound source characteristics are normalized using the least-squares straight line fitted to the speech spectrum, with both amplitude and frequency expressed logarithmically (reference: Miwa, Kokan, Makino, Kido, "A method of word speech recognition using nonlinear spectral matching," Transactions of the Institute of Electronics and Communication Engineers '81/1, Vol. J64-D, No. 1, pp. 46-53).

That is, in the sound source information normalization circuit (4), the sample time series $A_i(n)$ supplied from the sampler (3) every frame period undergoes the logarithmic transformation

$A_i(n) = \log\bigl(A_i(n) + B\bigr)$ ... (1)

where the left-hand side denotes the transformed value. In equation (1), $B$ is a bias, set to a value large enough to mask the noise level.

Here, the vocal cord sound source characteristic is modeled by the equation $y_i = a \cdot i + b$.

The coefficients $a$ and $b$ are then found by taking $y_i$, which represents the vocal cord sound source characteristic, to be the least-squares straight line fitted to the speech spectrum $x_i$.

That is, the coefficients $a$ and $b$ are determined by the following equations.

[Equations (2) and (3), each annotated with $I = 16$, are not reproduced in this text.]

The normalized speech spectrum $z_i$ is then given by the following equation.

$z_i = x_i - y_i$ ... (4)

Here, however, the difference between unvoiced and voiced sounds is taken into account: when the slope of the least-squares line satisfies $a \ge 0$, only the speech level is normalized and the overall slope of the spectrum is left unchanged.

That is, for voiced sounds $a < 0$, as shown in FIG. 3, whereas for unvoiced sounds $a \ge 0$, as shown in FIG. 4. For unvoiced sounds, therefore, the slope of the vocal cord sound source spectrum characteristic is not removed and only the speech level is normalized.
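As a rough illustration of this conventional procedure, the following Python sketch applies the log transformation, fits the line $y_i = a \cdot i + b$ over the channel index, and subtracts it only when $a < 0$. The closed-form fit is the standard least-squares solution rather than a transcription of the patent's equations, and the $a \ge 0$ branch (subtracting the mean level) is an assumption, because the corresponding equation is not available in this text. The function names and the default bias are also illustrative.

```python
import math


def fit_line(x):
    """Standard least-squares line y_i = a*i + b through (i, x_i), i = 1..I."""
    n = len(x)
    idx = range(1, n + 1)
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_x = sum(x)
    s_ix = sum(i * v for i, v in zip(idx, x))
    a = (n * s_ix - s_i * s_x) / (n * s_ii - s_i * s_i)
    b = (s_x - a * s_i) / n
    return a, b


def normalize_conventional(A, bias=1.0):
    """Normalize one frame of channel outputs A_1..A_I (conventional method)."""
    x = [math.log(v + bias) for v in A]  # log transformation with bias B
    a, b = fit_line(x)
    if a < 0:
        # Negative slope (treated as voiced): remove slope and level together.
        return [xi - (a * i + b) for i, xi in enumerate(x, start=1)]
    # Non-negative slope (treated as unvoiced): level only.
    # ASSUMPTION: "level normalization" is taken here as subtracting the mean.
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]
```

The branch on the sign of `a` is exactly the voiced/unvoiced distinction that the invention later removes.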

To summarize the above:

Let $P_i(n)$ be the source-normalized parameter. When $a(n) < 0$, the parameter $P_i(n)$ is expressed as

$P_i(n) = A_i(n) - \bigl(a(n) \cdot i + b(n)\bigr)$ ... (5)

When $a(n) \ge 0$, only the level is normalized, and the parameter $P_i(n)$ is given by equation (6), which is not reproduced in this text. In this way, the sound source information normalization circuit (4) outputs the acoustic parameter time series $P_i(n)$ from which differences in vocal cord sound source characteristics have been normalized out.

The acoustic parameters $P_i(n)$ from the sound source information normalization circuit (4) are supplied to a speech-interval parameter memory (5). On receiving the speech interval judgment signal from a speech interval judgment circuit (6), the memory (5) stores the parameters $P_i(n)$ for each judged speech interval.

The speech interval judgment circuit (6) consists of a zero-cross counter (61), a power calculation circuit (62), and a speech interval decision circuit (63). The digital speech signal from the A/D converter (23) is supplied to the zero-cross counter (61) and the power calculation circuit (62). Every frame period of 5.12 msec, the zero-cross counter (61) counts the zero crossings of the 64 samples of the digital speech signal within that frame period, and the count is supplied to the first input of the speech interval decision circuit (63). Every frame period, the power calculation circuit (62) computes the power of the digital speech signal within that frame period, that is, its sum of squares, and the resulting power signal is supplied to the second input of the speech interval decision circuit (63). The third input of the speech interval decision circuit (63) receives the values of the source normalization information $a$ and $b$ from the sound source information normalization circuit (4). The speech interval decision circuit (63) processes the zero-cross count, the in-interval power, and the source normalization information together to distinguish silence, unvoiced sound, and voiced sound, and thereby determines the speech interval.
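The two per-frame measurements named here, the zero-cross count and the sum of squares, can be sketched as follows. The frame length of 64 samples follows from the figures in the text (12.5 kHz sampling and a 5.12 msec frame). The function name, the use of non-overlapping frames, and the convention that a sample of exactly zero counts as non-negative are assumptions; the decision logic that combines these values with $a$ and $b$ is not described in the text and is not modeled.

```python
def frame_features(samples, frame_len=64):
    """Return a list of (zero_cross_count, power) pairs, one per full frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Count sign changes between consecutive samples.
        # ASSUMPTION: a sample equal to zero is treated as non-negative.
        zero_crossings = sum(
            1
            for prev, cur in zip(frame[:-1], frame[1:])
            if (prev < 0) != (cur < 0)
        )
        # Power is the sum of squares within the frame, as stated in the text.
        power = sum(s * s for s in frame)
        features.append((zero_crossings, power))
    return features
```

A trailing partial frame is simply dropped in this sketch.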

The speech interval judgment signal indicating the interval decided by the speech interval decision circuit (63) is supplied, as the output of the speech interval judgment circuit (6), to the speech-interval parameter memory (5), and the source-normalized acoustic parameter time series $P_i(n)$ is stored in the memory (5) for each judged speech interval.

The acoustic parameter time series $P_i(n)$ stored in the memory (5) in this way is supplied to a speech recognition section (7) and used for speech recognition.

That is, for example, the distance to the acoustic parameter time series of each pre-registered recognition target word is calculated, and the word with the smallest calculated distance is output as the recognition result.

In this case, so-called DP matching, which uses dynamic programming in the distance calculation, may be performed for time normalization.
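A minimal Python sketch of minimum-distance word selection with DP matching is given below. It uses the textbook symmetric dynamic time warping recurrence with a Euclidean frame distance; the patent does not specify path constraints or the local distance, so both are assumptions, as are the function names and the dictionary of templates.

```python
def dp_matching_distance(ref, inp):
    """Accumulated DP-matching distance between two sequences of frames."""
    def local(u, v):
        # ASSUMPTION: Euclidean distance between two parameter vectors.
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    inf = float("inf")
    n, m = len(ref), len(inp)
    # D[i][j] is the best accumulated distance aligning ref[:i] with inp[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = local(ref[i - 1], inp[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]
            )
    return D[n][m]


def recognize(templates, inp):
    """Return the word whose registered pattern is closest to the input."""
    return min(templates, key=lambda w: dp_matching_distance(templates[w], inp))
```

Here `templates` maps each registered word to its standard pattern, a list of frames, and `inp` is the input pattern in the same form.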

Alternatively, time normalization may be performed by estimating the trajectory that the acoustic parameter time series $P_i(n)$ traces in its parameter space and resampling along this trajectory, so that a new, time-normalized parameter time series $Q_i(m)$ is formed from $P_i(n)$. The standard patterns of the recognition target words are then registered in the form of this new time series $Q_i(m)$, and the distance calculation at recognition time is also performed on $Q_i(m)$ (see, for example, Japanese Patent Application No. 59-106177).

[Problems to be solved by the invention]

The conventional speaker normalization method described above relies on the fact that the sound source spectrum characteristic $y_i$ can be obtained by a straight-line approximation when the frequency axis is divided at equal logarithmic intervals. When a bandpass filter bank is used for acoustic analysis, the channels therefore had to be defined by dividing the frequency axis at equal logarithmic intervals.

On the other hand, there is a demand to use, for the bandpass filter bank, the mel scale, whose frequency axis corresponds to human auditory characteristics.

Moreover, when a multi-channel bandpass filter bank is built by dividing the speech band at equal logarithmic intervals, the low-frequency bands become narrow, which makes the filters difficult to design and increases the number of channels. To remove these drawbacks, there is also a demand to build the filter bank on a frequency axis that combines the advantages of the mel scale and the logarithmic axis, for example by dividing the low-frequency side at equal intervals on the mel scale while leaving the high-frequency side at equal logarithmic intervals.

However, because the conventional method presupposes a bandpass filter bank with equal logarithmic spacing, the source information normalization method based on the computationally simple least squares method cannot be used as it stands in these cases.

Furthermore, the conventional normalization method described above distinguishes voiced from unvoiced sounds and does not remove the slope of the sound source spectrum for unvoiced sounds.

This is because the phonemic features of voiced sounds are considered to be determined by the formant positions, with the spectral slope being irrelevant, whereas for unvoiced sounds the spectral slope as well as the formants is considered important.

However, the spectral slope can be expected to vary considerably with the speaker and the manner of speaking. In the conventional method, therefore, this large source of variation still remains in the sound source information.

[Means for solving problems]

The present invention has been made to solve the problems described above. It comprises acoustic analysis means for obtaining the speech spectrum of an input speech signal; correction means for correcting the speech spectrum so that its frequency axis is equivalent to one divided at equal logarithmic intervals; means for obtaining the sound source spectrum characteristic from the speech spectrum as a least-squares straight line; subtraction means for subtracting the sound source spectrum from the corrected speech spectrum; and means for performing speech recognition using the source-normalized speech spectrum from the subtraction means.

[Effect]

Even if the center frequencies of the bandpass filter bank channels were not determined by dividing the frequency band at equal logarithmic intervals, the correction means corrects the spectrum so that it is equivalent to one divided at equal logarithmic intervals.

The sound source information can therefore be normalized using the sound source spectrum obtained on the speech spectrum by the least squares method.

In addition, the subtraction means subtracts the sound source spectrum from the speech spectrum without distinguishing voiced from unvoiced sounds, so the slope of the sound source spectrum is removed and variation in the sound source information, including differences between speakers, is removed more thoroughly.

〔Example〕

FIG. 1 is a block diagram of an embodiment of the present invention. Speech input from a microphone (1) is supplied to an acoustic analysis section (2) that uses a bandpass filter bank, and the analysis output is supplied to a sampler (3), which yields the acoustic parameters.

These acoustic parameters are supplied to a correction circuit (41) of the sound source information normalization circuit (4).

Here, let $y_i$ be the sound source power spectrum at frequency $f_i$, and model the sound source characteristic by the following function:

$\log y_i = c \log f_i + d$ ... (7)

This model means that when the center frequencies and bandwidths of the bandpass filters are taken at equal intervals on a log scale, the logarithmic power spectrum of the sound source is represented by a straight line. If the least-squares straight line fitted to the logarithmic power spectrum of the speech is then taken as the sound source power spectrum, the sound source characteristic can be obtained by the least squares method.

Thus, when the frequency axis of the logarithmic power spectrum is on a log scale, the sound source characteristic is obtained by simply applying the least squares method; when it is not on a log scale, correction is needed.

The correction circuit (41) performs this correction. The following three kinds of correction can be considered for the correction circuit (41).

■ Bandwidth correction, in the case of a bandpass filter bank
■ Conversion of the frequency axis to a log scale
■ Weighting of the errors when evaluating the sum of squared errors

The first, bandwidth correction, is performed as follows.

Let $f_i$ be the center frequency of the bandpass filter of channel $i$, $F_{i-1}$ its lower band-edge frequency, $F_i$ its upper band-edge frequency, and $\Delta F_i$ its bandwidth. Then

$\Delta F_i = F_i - F_{i-1}$ ... (8)

Let $x_i$ be the average output power of channel $i$. Taking $x_1$ as the reference and correcting so that every channel gives the same output when a signal with a flat spectrum is input, the corrected output $x_i'$ satisfies

$\log x_i' = \log x_i - \log(\Delta F_i / \Delta F_1)$ ... (10)

Writing $\log x_i' = X_i'$, $\log x_i = X_i$, and $\log(\Delta F_i / \Delta F_1) = DF_i$, this becomes

$X_i' = X_i - DF_i$ ... (11)

so that the output level of each bandpass filter is corrected.
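The bandwidth correction of equations (8) to (11) can be sketched in a few lines of Python. The function name and the list layout (`edges` holding $F_0, \ldots, F_I$ and `X` holding $X_1, \ldots, X_I$) are assumptions made for illustration.

```python
import math


def bandwidth_correction(X, edges):
    """Apply X_i' = X_i - DF_i with DF_i = log(dF_i / dF_1).

    X     : log average output powers X_1..X_I
    edges : band-edge frequencies F_0..F_I (one more entry than X)
    """
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]  # equation (8)
    DF = [math.log(w / widths[0]) for w in widths]  # channel 1 is the reference
    return [x - d for x, d in zip(X, DF)]  # equation (11)
```

For a flat input spectrum each channel's output power is proportional to its bandwidth, so after this correction all channels report the same level.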

The second correction, conversion of the frequency axis to a log scale, is given by

$h_i = \log f_i$ ... (12)

The third correction, weighting of the errors, is performed as follows.

When the frequency axis of the speech power spectrum is not on a log scale, the sample points of the speech spectrum are unevenly spaced on the log scale. To obtain the same effect as evenly spaced points on the log scale, the squared error of each spectrum value is multiplied by a weight $\omega_i$.

Here, letting $Y_i = \log y_i$, the sum of squared errors $e$ is given by equation (13), which is not reproduced in this text. The weight $\omega_i$ is determined as follows.

$\Delta h_i = \log F_i - \log F_{i-1}$ ... (14)

Let $\Delta g$ be the bandwidth for equal spacing on the log scale. [Equations (15) and (16) are not reproduced in this text; the source notes only that $\Delta g = \Delta h_N$ is assumed.]

Each value of the speech spectrum corrected in this way is supplied to a sound source spectrum characteristic extraction circuit (42), which obtains the approximating straight line by the least squares method.

That is, substituting equations (7) and (12) into equation (13), the conditions $\partial e / \partial c = 0$ and $\partial e / \partial d = 0$ give equations (17) and (18), respectively, and solving these gives equations (19) and (20), from which the slope $c$ and intercept $d$ of equation (7) are obtained. [Equations (17) to (20) are not reproduced in this text.]
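For orientation, the standard weighted least-squares derivation that fits this description is sketched below. It assumes that the sum of squared errors has the usual weighted form built from the quantities defined above; it is not a transcription of the patent's own equations (13) and (17) to (20), and the shorthand sums $S_{\omega}$, $S_h$, $S_{hh}$, $S_X$, $S_{hX}$ are introduced here only for brevity.

```latex
% Assumed form of the weighted error sum, with Y_i = c h_i + d:
e = \sum_{i=1}^{I} \omega_i \bigl( X_i' - c\,h_i - d \bigr)^2

% Setting the partial derivatives to zero gives the normal equations:
\frac{\partial e}{\partial c} = -2 \sum_i \omega_i h_i \bigl( X_i' - c\,h_i - d \bigr) = 0,
\qquad
\frac{\partial e}{\partial d} = -2 \sum_i \omega_i \bigl( X_i' - c\,h_i - d \bigr) = 0

% With S_\omega = \sum_i \omega_i,\; S_h = \sum_i \omega_i h_i,\; S_{hh} = \sum_i \omega_i h_i^2,
% S_X = \sum_i \omega_i X_i',\; S_{hX} = \sum_i \omega_i h_i X_i':
c = \frac{S_\omega S_{hX} - S_h S_X}{S_\omega S_{hh} - S_h^2},
\qquad
d = \frac{S_X - c\,S_h}{S_\omega}
```

When all weights are equal these expressions reduce to the ordinary least-squares line.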

The sound source spectrum characteristic is expressed by these $c$ and $d$.

Each value of the sound source spectrum obtained in this way is supplied to a subtraction circuit (43) and subtracted from the corresponding value of the speech spectrum from the correction circuit (41). That is, letting $Z_i$ be the source-normalized logarithmic power spectrum, the subtraction circuit (43) performs the following operation.

$Z_i = X_i' - Y_i = X_i' - (c \cdot h_i + d)$ ... (21)

The subtraction circuit (43) thus outputs the source-normalized speech power spectrum $Z_i$.
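The chain from the corrected spectrum to $Z_i$ can be sketched in Python as follows. The weighted fit is the standard weighted least-squares line. The weight formula $\omega_i = \Delta h_i / \Delta g$ with $\Delta g$ taken as the last channel's log bandwidth is an assumption, since the defining equations are not available in this text, and the function names are illustrative. Unlike the conventional method, the line is subtracted regardless of the sign of its slope.

```python
import math


def weighted_line_fit(h, X, w):
    """Weighted least-squares line X ~ c*h + d with weights w."""
    s_w = sum(w)
    s_h = sum(wi * hi for wi, hi in zip(w, h))
    s_hh = sum(wi * hi * hi for wi, hi in zip(w, h))
    s_x = sum(wi * xi for wi, xi in zip(w, X))
    s_hx = sum(wi * hi * xi for wi, hi, xi in zip(w, h, X))
    c = (s_w * s_hx - s_h * s_x) / (s_w * s_hh - s_h * s_h)
    d = (s_x - c * s_h) / s_w
    return c, d


def normalize_source(X_corr, centers, edges):
    """Return Z_i = X_i' - (c*h_i + d) for one frame.

    X_corr  : bandwidth-corrected log powers X_1'..X_I'
    centers : centre frequencies f_1..f_I
    edges   : band-edge frequencies F_0..F_I
    """
    h = [math.log(f) for f in centers]  # equation (12)
    dh = [math.log(hi) - math.log(lo)
          for lo, hi in zip(edges[:-1], edges[1:])]  # equation (14)
    # ASSUMPTION: weight proportional to each channel's width on the log axis,
    # scaled by the last channel's width as the reference spacing.
    w = [v / dh[-1] for v in dh]
    c, d = weighted_line_fit(h, X_corr, w)
    return [x - (c * hi + d) for x, hi in zip(X_corr, h)]  # equation (21)
```

If the corrected spectrum lies exactly on a straight line in $h$, the output is zero in every channel, whatever the weights.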

This is supplied to the speech recognition section (7), where, as described above, its distance to the registered standard pattern of each recognition target word is calculated, and the recognition output is obtained by finding the minimum calculated distance.

In this case, as mentioned above, so-called DP matching may be performed. Alternatively, the trajectory that the acoustic parameters trace in their parameter space may be estimated and resampled to obtain new, time-normalized parameters, which are then used to calculate the distance to the standard patterns.

〔Effect of the invention〕

According to the present invention, even when the frequency axis is not divided at equal logarithmic intervals in the acoustic analysis, it is possible to use the source information normalization method in which the sound source spectrum is obtained by the computationally simple least squares method and subtracted from the original speech spectrum.

Consequently, when a bandpass filter bank is used in the acoustic analysis section, the passbands of its channels can be designed on any of various frequency axes suited to speech recognition, which increases design freedom.

In addition, the present invention removes the slope of the sound source spectrum for unvoiced as well as voiced sounds. Given that the slope of the sound source spectrum of unvoiced sounds varies between speakers, this reduces the variable components of the sound source characteristics and improves the performance of a speaker-independent speech recognition device.

[Brief explanation of drawings]

FIG. 1 is a block diagram of an embodiment of the main part of the present invention, FIG. 2 is a block diagram showing the configuration of an example of a speech recognition device, and FIGS. 3 and 4 are characteristic diagrams for explaining sound source spectrum characteristics.

(2) is the acoustic analysis section, (4) is the sound source information normalization circuit, (41) is the correction circuit, (42) is the sound source spectrum characteristic extraction circuit, and (43) is the subtraction circuit.

Claims

[Scope of Claims]

A speech recognition device comprising:
(a) acoustic analysis means for obtaining the speech spectrum of an input speech signal;
(b) correction means for correcting the speech spectrum so that its frequency axis is equivalent to one divided at equal logarithmic intervals;
(c) means for obtaining a sound source spectrum characteristic from the speech spectrum as a least-squares straight line;
(d) subtraction means for subtracting the sound source spectrum from the corrected speech spectrum; and
(e) means for performing speech recognition using the source-normalized speech spectrum from the subtraction means.