JPS6147439B2 - Google Patents

Info

Publication number
JPS6147439B2
JPS6147439B2 JP55030753A JP3075380A
Authority
JP
Japan
Prior art keywords
voiced
unvoiced
input
speech
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP55030753A
Other languages
Japanese (ja)
Other versions
JPS56126897A (en)
Inventor
Yasunobu Ina
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanyo Electric Co Ltd
Original Assignee
Sanyo Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanyo Electric Co Ltd filed Critical Sanyo Electric Co Ltd
Priority to JP3075380A priority Critical patent/JPS56126897A/en
Publication of JPS56126897A publication Critical patent/JPS56126897A/en
Publication of JPS6147439B2 publication Critical patent/JPS6147439B2/ja
Granted legal-status Critical Current

Description

[Detailed Description of the Invention] The present invention relates to a speech recognition device.

In recent years, analytical research on human speech has advanced, and many methods are being developed for extracting parameters that represent the characteristics of speech from the speech itself. Speech recognition devices using such speech parameters have been realized.

As these speech parameters, ones that carry as little information as possible while remaining effective for recognition are desired. At present, the detected output values of a bank of roughly 10 to 20 bandpass filters representing the frequency spectrum, and roughly 10th-order partial autocorrelation coefficients, a kind of linear prediction coefficient, are used.
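The bandpass-filter parameters described above can be approximated in software. The following is a minimal sketch, not the patent's analog implementation: the band edges, frame size, and sample rate are illustrative assumptions, and the energy in each frequency band of one analysis frame is measured with a direct DFT.

```python
import math

def band_energies(frame, sample_rate, bands):
    """Rough spectral envelope of one frame: energy in each (lo, hi) Hz band,
    computed by summing squared DFT magnitudes over the bins in that band."""
    n = len(frame)
    energies = []
    for lo, hi in bands:
        k_lo = int(lo * n / sample_rate)
        k_hi = max(k_lo + 1, int(hi * n / sample_rate))
        e = 0.0
        for k in range(k_lo, k_hi):
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            e += re * re + im * im
        energies.append(e / n)
    return energies

# A 500 Hz tone should put almost all its energy in the band containing 500 Hz.
sr = 8000
frame = [math.sin(2 * math.pi * 500 * t / sr) for t in range(256)]
bands = [(100, 300), (300, 700), (700, 1500), (1500, 4000)]
e = band_energies(frame, sr, bands)
```

A real filter bank with quarter-octave spacing would use many more, narrower bands; four are used here only to keep the example short.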

Such a speech recognition device compares the speech parameters of the input speech with the speech parameters of a large number of standard reference speeches stored in advance, selects the most similar reference speech parameters, and recognizes the reference speech having those parameters as the current input speech. However, in human speech, even when the same speaker utters the same word repeatedly, the duration of the utterance varies considerably. To obtain a practical recognition rate, it is therefore necessary, when comparing speech parameters, to align the time axes of the input and reference speech parameters.

At present, with speech recognition performed word by word, two kinds of time-axis alignment means exist: linear alignment means, which linearly stretch or shrink the time axis so that the duration of the input speech matches that of the reference speech; and nonlinear alignment means, which use dynamic programming to stretch and shrink the time axis nonlinearly so as to minimize the squared error between the two patterns. When the nonlinear alignment means is used in a speech recognition device, a higher recognition rate is obtained than with the linear alignment means, but its computational load is so large that real-time processing is difficult on an ordinary microprocessor.
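The conventional whole-word linear alignment can be sketched as a simple resampling of the input parameter sequence to the reference length. This is a minimal illustration only; the nearest-neighbour index mapping is an assumption, not a detail taken from the patent.

```python
def linear_warp(frames, target_len):
    """Uniformly stretch or shrink a whole-word parameter sequence:
    output index i maps back to input index i * n / target_len."""
    n = len(frames)
    return [frames[min(n - 1, i * n // target_len)] for i in range(target_len)]

stretched = linear_warp([1, 2, 3, 4], 8)      # word lengthened to 8 frames
shrunk = linear_warp([1, 2, 3, 4, 5, 6], 3)   # word shortened to 3 frames
```

Because the warp is uniform over the whole word, it cannot model the fact that voiced sections vary in length much more than unvoiced ones, which is the weakness the invention addresses.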

In view of this situation, the present invention was made by noting that variation in the duration of human speech depends, in almost all cases, on variation in the duration of the voiced sections. It provides a speech recognition device that aligns the time axes of the input speech parameters and the reference speech parameters separately in the voiced and unvoiced sections of the speech.

FIG. 1 is a block diagram showing the configuration of the speech recognition device of the present invention. In the figure, 1 is a microphone that converts the input speech into a speech signal; 2 is an amplifier that amplifies this speech signal; and 3 is speech feature extraction means that extracts the feature parameters of the input speech from the speech signal obtained by the amplifier 2. Specifically, it consists of a plurality of bandpass filters 4, 4, ... whose passbands are spaced at roughly quarter-octave intervals within the speech band (roughly 100 Hz to 4 kHz); a multiplexer 5 that sequentially selects and extracts the respective frequency band components from the bandpass filters 4, 4, ...; and an A/D converter 6 that converts each frequency band component value from the multiplexer 5 into digital input speech parameters. 7 is determination means that determines the voiced and unvoiced sections of the input speech based on the speech signal from the amplifier 2 and obtains a voiced/unvoiced pattern. 8 is input speech storage means that temporarily stores the input speech parameters obtained from the speech feature extraction means 3 together with the voiced/unvoiced pattern obtained from the determination means 7. 9 is reference speech storage means that stores a plurality of pairs of voiced/unvoiced patterns of reference speech and reference speech parameters in association with each other. 10 is comparison means that compares, on the time axis, the voiced/unvoiced pattern in the input speech storage means 8 with the voiced/unvoiced patterns in the reference speech storage means 9; for each corresponding section it derives the ratio of the time length of the voiced or unvoiced section of the reference speech to that of the input speech. 11 is time-axis alignment means that, based on the comparison result from the comparison means 10, proportionally stretches or shortens the input speech parameters in the input speech storage means 8 along the time axis, separately in each voiced and unvoiced section. 12 is recognition processing means that performs computations between a reference speech parameter and the input speech parameters that have been time-axis aligned to that reference parameter by the time-axis alignment means 11, executes this for a large number of reference parameters, recognizes the reference speech indicated by the reference parameter judged most similar as the current input speech, and outputs a signal corresponding to the input speech.

FIG. 2 shows an example of the voiced/unvoiced determination means used in such a speech recognition device of the present invention. In the figure, 13 is a high-pass filter that extracts only the high-frequency components of the speech signal; 14 is a comparator that compares the high-frequency component value from this filter 13 with a predetermined threshold voltage; 15 is a counter that counts the zero crossings obtained from the comparator 14; and 16 is a discriminator that judges whether the count value of the counter 15 per unit time is large or small. A determination means of this configuration is a common voiced/unvoiced determination means based on zero crossings.
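The zero-crossing decision of Fig. 2 can be mimicked digitally. In this sketch the comparator and counter become a sign-change count and the discriminator becomes a rate threshold; the 1500 Hz threshold value is an illustrative assumption, and the high-pass filter stage of the circuit is omitted for brevity.

```python
import math

def is_voiced(frame, sample_rate, zcr_threshold_hz=1500.0):
    """Voiced/unvoiced decision by zero-crossing rate."""
    # Comparator + counter: count sign changes between adjacent samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    # Discriminator: a low zero-crossing rate indicates a voiced sound.
    rate = crossings * sample_rate / len(frame)
    return rate < zcr_threshold_hz

sr = 8000
vowel_like = [math.sin(2 * math.pi * 200 * t / sr) for t in range(400)]  # low ZCR
hiss_like = [math.sin(2 * math.pi * 3000 * t / sr) for t in range(400)]  # high ZCR
```

Voiced sounds concentrate energy at low frequencies and so cross zero rarely, while fricative (unvoiced) sounds are noise-like with high-frequency content and cross zero often, which is why this simple counter works.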

A case in which digits are recognized using the speech recognition device of the present invention will now be described concretely. In this case, recognition becomes possible by storing in advance, in the reference speech storage means, the standard speech parameters and voiced/unvoiced patterns for the digits 0 through 9. During this recognition operation, when the speaker utters, for example, "ichi" into the microphone 1, the input speech signal is amplified by the amplifier 2 and fed both into the speech feature extraction means 3 and into the voiced/unvoiced determination means 7. In the speech feature extraction means 3, the frequency components of the input speech "ichi" obtained from the plurality of bandpass filters 4, each having a different passband, are sequentially output via the multiplexer 5 and A/D converter 6, yielding digital input speech parameters. Meanwhile, the determination means 7 outputs a binary signal, for example an L level for the two voiced sections (i) of the input speech "ichi" and an H level for the unvoiced section (ch), as shown schematically in FIG. 3a and b. The voiced/unvoiced pattern of the input speech "ichi" thus obtained from the determination means 7, expressed in the order (voiced)(unvoiced)(voiced), and the input speech parameters of "ichi" obtained from the speech feature extraction means 3 are both stored in the input speech storage means 8. Thereafter, the voiced/unvoiced pattern is first transmitted from the input speech storage means 8 to the comparison means 10, where it is compared with the voiced/unvoiced patterns of the reference speeches for the digits 0 to 9 in the reference speech storage means 9. At this time, the order and number of voiced and unvoiced sections in the patterns are compared first; in this case, 1 (ICHI) and 6 (ROKU), which have three-section voiced/unvoiced patterns in the order (voiced)(unvoiced)(voiced), are selected from among the reference speeches. Next, for the reference speech 1 (ICHI), which has the voiced/unvoiced pattern and reference speech parameters shown schematically in FIG. 3d and e, the time-length ratio of the first voiced section (I) of the reference speech to the first voiced section (i) of the input speech is derived, and similar time-length ratios are likewise derived for the unvoiced sections (ch), (CH) and the second voiced sections (i), (I). Based on these comparison results, the time-axis alignment means 11 linearly stretches or shrinks the time axis of the input speech parameters of "ichi" separately in each voiced and unvoiced section, as shown schematically in FIG. 3c, so that the input speech parameters of "ichi" are time-axis aligned with the reference speech parameters of ICHI. The input speech parameters of "ichi" and the reference parameters of ICHI thus time-aligned are arithmetically processed in the recognition processing means 12 and the result is temporarily stored. The same speech parameter processing is performed for the reference parameters of the other candidate selected by the comparison means 10, namely 6 (ROKU); based on the computation results in the recognition processing means 12, the input speech "ichi" is recognized as 1, and a signal corresponding to this 1 is output from the recognition processing means 12.
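The per-section alignment described in this embodiment can be sketched as follows. Section boundaries are represented here as frame counts, and the frame labels are illustrative placeholders; the nearest-neighbour resampling within each section is a minimal stand-in for the linear stretching the patent describes.

```python
def align_by_sections(input_frames, input_secs, ref_secs):
    """For each corresponding voiced/unvoiced section, resample the input
    section linearly to the reference section's length, then concatenate."""
    assert len(input_secs) == len(ref_secs), "patterns must have equal sections"
    out, start = [], 0
    for in_len, ref_len in zip(input_secs, ref_secs):
        sec = input_frames[start:start + in_len]
        # Linear stretch/shrink of this section alone.
        out.extend(sec[min(in_len - 1, i * in_len // ref_len)]
                   for i in range(ref_len))
        start += in_len
    return out

# "ichi": sections (voiced, unvoiced, voiced) of 4, 2, 2 frames,
# aligned to a reference ICHI whose sections have 2, 2, 4 frames.
frames = ['i1', 'i2', 'i3', 'i4', 'ch1', 'ch2', 'I1', 'I2']
aligned = align_by_sections(frames, [4, 2, 2], [2, 2, 4])
```

Note that each section is warped with its own ratio, so a long first vowel is compressed while a short final vowel is stretched, which a single whole-word ratio could not do.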

As is clear from the above description, the speech recognition device of the present invention comprises: comparison means for comparing the voiced/unvoiced pattern extracted from the input speech with the voiced/unvoiced patterns of the reference speeches; time-axis alignment means for aligning, based on the comparison result, the input speech parameters extracted from the input speech with the reference speech parameters, separately in each corresponding voiced and unvoiced section of the two parameter sets; and recognition processing means for recognizing the input speech by performing computations between the time-axis-aligned input speech parameters and the reference speech parameters for a large number of reference parameters. This makes effective time-axis alignment of the speech parameters possible. Consequently, a higher recognition rate is obtained than with the conventional linear time-axis alignment means, and the processing time in the time-axis alignment means can be greatly reduced compared with the conventional nonlinear time-axis alignment means.

Further, the speech recognition device of the present invention has two comparison processing means: the above comparison means 10 for the voiced/unvoiced patterns, and the above recognition processing means 12 for the speech parameters. In the comparison means 10, the first-stage comparison processing means, it is very effective to exclude from recognition, at the time of this comparison, any reference speech whose voiced/unvoiced pattern differs greatly in form from that of the input speech. That is, since a voiced/unvoiced pattern consists of a combination of voiced and unvoiced sections of differing time lengths, the order and number of the voiced and unvoiced sections of the input speech are taken as the acceptance criterion when comparing the two patterns; a rough comparison that ignores the time lengths of these sections is performed, the reference patterns to be subjected to recognition are selected, and the speech parameters are then compared by the recognition processing means 12, the second-stage comparison processing means, for the final recognition. The amount of computation in the recognition processing means 12 is therefore reduced, leading to shorter processing time.
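The first-stage pre-screening can be sketched as a simple shape comparison that ignores section lengths. The reference names, digit labels, and section lengths below are illustrative, chosen to mirror the "ichi" example of the embodiment.

```python
def prescreen(input_pattern, references):
    """First-stage comparison: keep only references whose voiced/unvoiced
    pattern matches the input's section order and count, ignoring the
    time length of each section."""
    shape = [v for v, _ in input_pattern]          # e.g. ['V', 'U', 'V']
    return [name for name, pat in references
            if [v for v, _ in pat] == shape]

refs = [
    ('ICHI', [('V', 20), ('U', 10), ('V', 15)]),
    ('NI',   [('V', 30)]),
    ('ROKU', [('V', 18), ('U', 12), ('V', 20)]),
    ('SAN',  [('V', 25), ('U', 8)]),
]
# Input "ichi": voiced-unvoiced-voiced; section lengths differ from every
# reference, but only the order and count matter at this stage.
candidates = prescreen([('V', 28), ('U', 6), ('V', 9)], refs)
```

Only the surviving candidates proceed to the expensive parameter comparison, which is what reduces the computation in the second stage.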

Owing to the above effects, the present invention realizes a speech recognition device that combines a high recognition rate with a short response time, and is thus of great practical benefit.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of the speech recognition device of the present invention; FIG. 2 is a block diagram showing the configuration of the determination means used in the device of the present invention; and FIG. 3 is a schematic diagram representing speech parameters and voiced/unvoiced patterns. 3 denotes the speech feature extraction means, 7 the determination means, 8 the input speech storage means, 9 the reference speech storage means, 10 the comparison means, 11 the time-axis alignment means, and 12 the recognition processing means.

Claims (1)

[Scope of Claims] 1. A speech recognition device characterized by comprising: a microphone that converts input speech into a speech signal; speech feature extraction means that extracts feature parameters of the input speech from the speech signal; determination means that determines the voiced and unvoiced sections of the input speech based on the speech signal; input speech storage means that temporarily stores the voiced/unvoiced pattern obtained from the determination means together with the input speech parameters obtained from the speech feature extraction means; separately from this, reference speech storage means that stores a plurality of reference speech parameters in association with voiced/unvoiced patterns; comparison means that compares, on the time axis, each corresponding voiced or unvoiced section of the voiced/unvoiced pattern in the input speech storage means with the plural voiced/unvoiced patterns in the reference speech storage means; time-axis alignment means that, based on the time-length comparison result from the comparison means, performs linear time-axis alignment between the input speech parameters of the input speech storage means and the reference speech parameters of the reference speech storage means, separately for each corresponding voiced or unvoiced section of both parameters; and recognition processing means that executes computations between the input speech parameters time-axis aligned by the time-axis alignment means and the reference speech parameters for a large number of reference speech parameters, and recognizes the speech of the reference speech parameter judged most similar to the input speech parameters as the input speech. 2. The speech recognition device according to claim 1, characterized in that the comparison means is provided with a predetermined acceptance criterion, and when the difference between a voiced/unvoiced pattern in the reference speech storage means and the voiced/unvoiced pattern in the input speech storage means deviates from the acceptance criterion, the reference speech parameters associated with that voiced/unvoiced pattern are excluded from the computation targets for the input speech parameters.
JP3075380A 1980-03-10 1980-03-10 Voice recognision device Granted JPS56126897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP3075380A JPS56126897A (en) 1980-03-10 1980-03-10 Voice recognision device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP3075380A JPS56126897A (en) 1980-03-10 1980-03-10 Voice recognision device

Publications (2)

Publication Number Publication Date
JPS56126897A JPS56126897A (en) 1981-10-05
JPS6147439B2 true JPS6147439B2 (en) 1986-10-18

Family

ID=12312435

Family Applications (1)

Application Number Title Priority Date Filing Date
JP3075380A Granted JPS56126897A (en) 1980-03-10 1980-03-10 Voice recognision device

Country Status (1)

Country Link
JP (1) JPS56126897A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6048098A (en) * 1983-08-26 1985-03-15 Mitsubishi Electric Corp Speaker collator

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS52142904A (en) * 1976-05-24 1977-11-29 Hiroya Fujisaki Voice recognition system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS52142904A (en) * 1976-05-24 1977-11-29 Hiroya Fujisaki Voice recognition system

Also Published As

Publication number Publication date
JPS56126897A (en) 1981-10-05

Similar Documents

Publication Publication Date Title
US5732394A (en) Method and apparatus for word speech recognition by pattern matching
CA1311059C (en) Speaker-trained speech recognizer having the capability of detecting confusingly similar vocabulary words
EP0077194B1 (en) Speech recognition system
CA2366892C (en) Method and apparatus for speaker recognition using a speaker dependent transform
CN110661923A (en) Method and device for recording speech information in conference
Sholtz et al. Spoken Digit Recognition Using Vowel‐Consonant Segmentation
JPS6147439B2 (en)
US5425127A (en) Speech recognition method
Koc Acoustic feature analysis for robust speech recognition
JPS645320B2 (en)
JP2658104B2 (en) Voice recognition device
JPH0534679B2 (en)
JPS61273599A (en) Voice recognition equipment
JPS59211098A (en) Voice recognition equipment
Angus et al. Low-cost speech recognizer
Mobin et al. A novel technique of data compression for spoken word recognition systems
JPH05313695A (en) Voice analyzing device
JPH0314360B2 (en)
JPH0316038B2 (en)
JPH0731506B2 (en) Speech recognition method
JPH027000A (en) Pattern matching system
Shahbaz et al. SPEAKER RECOGNITION SYSTEM
JPH0632008B2 (en) Voice recognizer
JPS62211698A (en) Detection of voice section
JPH0769713B2 (en) Speech recognition response device