JPS59176794A

JPS59176794A - Word voice recognition equipment

Info

Publication number: JPS59176794A
Application number: JP58050698A
Authority: JP
Inventors: 鬼頭　淳悟
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1983-03-25
Filing date: 1983-03-25
Publication date: 1984-10-06
Also published as: JPH0480398B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION <Technical Field> The present invention relates to a word speech recognition device with improved performance against unknown input speech.

<Background> In a word speech recognition device, the uttered word speech is generally amplified by an amplifier such as a microphone amplifier, and an acoustic analysis section then analyzes it into feature parameters that can express the characteristics of the speech, for example a power spectrum obtained from a BPF group (a bank of band-pass filters), an autocorrelation function, or a zero-crossing count. A feature extraction section then determines the word interval and creates an input pattern by a predetermined algorithm, for example by compressing the feature parameter time series within the word interval along the time axis in order to reduce the pattern memory size and the computation time during matching.

The input pattern is then matched against standard patterns that were either registered beforehand by the same method (speaker-dependent devices) or created from many speakers (speaker-independent devices), and the most similar one is output as the recognition result.

In this kind of recognition, sounds other than the words to be recognized are often misrecognized as words, in particular speech or throat-clearing carelessly uttered by the user, and sudden noises in the surrounding environment.

<Purpose of the Invention> The present invention remedies this drawback by adding duration information for each word to the standard pattern and performing pattern matching under a word length restriction. This addresses the following points.

(1) Depending on the expressive power of the speech features, when the features discriminate poorly between phonemes, very similar patterns are created even for words that are not recognition targets and whose durations differ, and this causes misrecognition. This misrecognition is reduced.

(2) When the standard pattern length is fixed regardless of differences in word length in order to reduce standard pattern storage, the partial loss of features grows as the word length increases, which degrades performance. Misrecognition caused by this is reduced.

(3) When the variance of word length among the recognition target words registered as standard patterns is large, that is, when many target words differ in length, matching can be restricted to those standard patterns whose word length tolerance contains the input word length.

This is a preliminary selection operation that, by restricting word length, narrows the standard patterns subject to matching down to a small number, and it helps reduce matching time.

<Embodiment> An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing one embodiment of the present invention.

Word speech uttered into microphone 1 is amplified by preprocessing section 2 and, where necessary, subjected to processing such as pre-emphasis. If the subsequent stages are digital, the speech waveform is converted into a digital signal in this preprocessing section 2.

Next, feature analysis section 3 analyzes quantities that can represent the characteristics of the speech, for example the short-time power spectrum, the autocorrelation function, or the zero-crossing count.

From this feature time series, word interval extraction section 4 extracts the feature time series 101 of the word interval and the duration information 102. Pattern creation section 5 compresses the word-interval feature time series 101 along the time axis and produces pattern 103.

Pattern 103 and duration 102 are sent to either standard pattern storage section 6 or input pattern storage section 7 according to the operation of changeover switch SW1 or SW2.

Normally, in a speaker-dependent recognition device, SW1 is closed so that the recognition target words are registered in advance in standard pattern storage section 6. For recognition, SW2 is closed and the pattern of the input word is temporarily stored in input pattern storage section 7.

For speaker-independent operation, changeover switches SW1 and SW2 are absent, and standard pattern storage section 6 holds in advance the feature patterns and standard word durations extracted from words uttered by many speakers. In this case storage section 6 consists of a ROM or the like.

During recognition, pattern matching section 8 compares the patterns stored in standard pattern storage section 6 with the current input pattern in input pattern storage section 7 and outputs, for example, the number of the most similar standard pattern as recognition result 104.

Control section 9 controls each section and the interaction between sections. It has an external input section 10 through which the tolerance range, which is computed from the duration of each word in the standard patterns, can be adjusted externally. The tolerance setting code 105 entered from external input section 10 expresses the tolerance width either as a time or as upper and lower limit ratios (%) relative to the standard duration.
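The two forms of the tolerance setting can be illustrated with a minimal Python sketch. The `ToleranceSetting` type, its field names, and the `"time"` / `"ratio"` mode labels are hypothetical; the source does not specify how code 105 is encoded, only that it expresses either a time width or upper and lower ratios relative to the standard duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToleranceSetting:
    """Hypothetical decoded form of tolerance setting code 105."""
    mode: str     # "time": absolute width; "ratio": fraction of the standard duration
    lower: float  # allowance below the standard duration
    upper: float  # allowance above the standard duration


def duration_limits(standard_duration, setting):
    """Return (lower_limit, upper_limit) for one standard word duration Y."""
    if setting.mode == "time":
        low = standard_duration - setting.lower
        high = standard_duration + setting.upper
    elif setting.mode == "ratio":
        low = standard_duration * (1.0 - setting.lower)
        high = standard_duration * (1.0 + setting.upper)
    else:
        raise ValueError(f"unknown tolerance mode: {setting.mode!r}")
    # A duration cannot be negative, so clamp the lower limit at zero.
    return max(low, 0.0), high
```

With a ratio setting of 0.2 on both sides, a standard duration of 50 frames gives limits of roughly 40 and 60 frames.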

FIG. 2 shows the part enclosed by the dotted line in FIG. 1 implemented with a one-chip microcomputer 11. Microcomputer 11 includes a CPU 12, a ROM 13, and a RAM 14; for speaker-independent operation, the standard patterns are stored in ROM 13.

For speaker-dependent word speech recognition, boards and LSIs have been developed toward lower cost thanks to recent advances in digital signal processing and LSI technology, but speaker-independent recognition remains confined to large and expensive equipment.

In speaker-independent word speech recognition, no feature parameter with little speaker variation is applicable to all phonemes. For coarse classification, however, the zero-crossing count is known as a parameter with relatively small individual differences. After taking up the zero-crossing count as the parameter and examining it in various ways, it was found that by adopting short-time constant-level crossing count (level crossing) analysis, a speaker-independent speech recognition device, excluding the analog circuit portion as described above, can be realized as a one-chip CMOS LSI.

FIG. 3 shows the definition of the short-time constant-level crossing count.

This feature quantity is obtained by counting, for each short-time frame (of a fixed period), the number of times the speech waveform V crosses a fixed threshold level th. By setting the threshold level slightly above the steady ambient noise, word speech intervals can be distinguished from silent intervals.
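The per-frame count described here can be sketched in Python as follows. This is an illustration under assumptions, not the circuit of the embodiment: the function name and parameters are hypothetical, crossings in both directions are counted, and crossings that straddle a frame boundary are ignored for simplicity.

```python
def level_crossing_counts(samples, threshold, frame_len):
    """Count, per frame of `frame_len` samples, how often the waveform crosses `threshold`."""
    counts = []
    # Only complete frames are analyzed; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        crossings = 0
        for prev, cur in zip(frame, frame[1:]):
            # A crossing occurs when consecutive samples lie on opposite sides of the level.
            if (prev < threshold) != (cur < threshold):
                crossings += 1
        counts.append(crossings)
    return counts
```

Because the threshold sits above the steady noise level, frames that contain only background noise yield a count of zero, which is what lets the same series separate speech from silence.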

The feature can represent the relative intensity of the speech spectrum. FIG. 4 shows an example of the feature time series obtained by constant-level crossing count analysis (for the utterance "stop"): (a) is the speech waveform and (b) is the constant-level crossing series. As the figure shows, the value is low for voiced sounds and high for unvoiced sounds, especially fricatives such as "S" and "SH" and affricates such as "TH". In the zero-level crossing analysis method, feature analysis section 3 extracts this per-frame constant-level crossing count as the feature.

Further, in FIG. 4(b), point A is the beginning of the word and point B is the end of the word, and word interval extraction section 4 extracts the word interval on this basis. Where a one-chip microcomputer 11 is used as in FIG. 2, this processing is performed inside the one-chip microcomputer 11.

FIG. 5 shows an example of the standard patterns stored in standard pattern storage section 6 or in ROM 13 of the one-chip microcomputer 11; for each word, a feature series X and a standard word duration Y are stored. As shown, the pattern may also include information such as a result output code Z that is output when the pattern is selected.

FIG. 6 is a flowchart of the main part of the recognition procedure; the portion inside the dotted line is the matching procedure using the word length tolerance.

After the start, the input state of tolerance setting code 105 from external input section 10 is first checked, and the code is read and stored for the upper and lower limit calculation described later.

Next, the beginning of the input word, such as point A in FIG. 4, is detected. Once the beginning is detected, the time up to the end of the word at point B, that is, the duration of the uttered word, is measured. An input pattern is then created from the feature time series of this word interval and stored in input pattern storage section 7 (RAM 14 of one-chip microcomputer 11 in FIG. 2). The input pattern includes the word length information measured above. Processing then enters the matching by word length tolerance, shown inside the dotted line of the flowchart.
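As a rough illustration of the word beginning, word end, and duration measurement, the following Python sketch takes the per-frame level crossing series and treats the first and last frames with a non-zero count as points A and B. The function name and the `min_count` parameter are hypothetical, and the rule itself is an assumption: the source does not describe how section 4 decides the end points, and a practical detector would also need to handle short pauses inside a word.

```python
def find_word_interval(counts, min_count=1):
    """Return (start_frame, end_frame, duration_in_frames), or None if no speech is found."""
    # Frames whose crossing count reaches min_count are treated as speech.
    active = [i for i, c in enumerate(counts) if c >= min_count]
    if not active:
        return None
    start, end = active[0], active[-1]  # point A and point B
    return start, end, end - start + 1
```

The returned duration is in frames; multiplying by the frame period gives the word duration in time units.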

First, the standard word duration Y of a standard pattern stored in standard pattern storage section 6 (ROM 13 of one-chip microcomputer 11 in FIG. 2) is fetched, and the upper and lower limits of the word length tolerance are calculated from it and tolerance setting code 105. It is then judged whether the input word falls within the tolerance; if it does, the inter-pattern distance to that standard pattern is calculated. After the distance has been calculated, or after the distance calculation has been skipped because the input word is outside the tolerance, it is judged whether the standard patterns are exhausted, and the standard word duration Y of the next standard pattern to be compared is fetched. These steps are repeated for each word until all standard patterns to be compared have been processed.
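The loop inside the dotted line of the flowchart can be sketched in Python as below. Several details are assumptions made only for illustration: the `StandardPattern` type and its field names, a symmetric ratio tolerance, patterns of equal fixed length, and a city-block distance (the source does not name the distance measure).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StandardPattern:
    features: Sequence[float]  # feature series X
    duration: float            # standard word duration Y
    output_code: str           # result output code Z


def city_block(a, b):
    """Assumed distance measure: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_with_length_limit(input_features, input_duration, patterns, ratio=0.25):
    """Return (distance, pattern) pairs only for patterns whose tolerance contains the input duration."""
    distances = []
    for pattern in patterns:
        low = pattern.duration * (1 - ratio)
        high = pattern.duration * (1 + ratio)
        if low <= input_duration <= high:
            distances.append((city_block(input_features, pattern.features), pattern))
        # Otherwise the distance calculation is skipped for this pattern.
    return distances
```

A pattern with identical features but a very different standard duration is never scored, which is the mechanism by which both misrecognition and matching time are reduced.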

When the features discriminate phonemes poorly, very similar patterns are created even for words that are not recognition targets and whose durations differ. With the word length restriction described above, however, the inter-pattern distance calculation is omitted for patterns outside the tolerance, so misrecognition is reduced.

Likewise, when the standard pattern is fixed to a constant length regardless of word length in order to reduce standard pattern storage, the partial loss of features grows with word length and degrades performance. Because the above restriction uses word length itself as information in the comparison, the resulting misrecognition can be reduced.

Furthermore, restricting word length allows a preliminary selection that narrows the standard patterns subject to matching down to a small number, for instance by omitting inter-pattern distance calculations, which helps reduce matching time.

After this comparison, the standard pattern with the minimum distance is searched for among those for which a distance was calculated. Here a predetermined threshold is additionally set, and only results at or above the predetermined value are judged valid and output as the result. Those below the predetermined value are output as rejects.
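The final selection step can be sketched in Python as follows. The function name and the `reject_threshold` parameter are hypothetical. The source states the validity test in terms of a value at or above a predetermined threshold; this sketch assumes a distance measure, where smaller is better, so a candidate is accepted only when its distance does not exceed the threshold. That reading is an interpretation, not something the source confirms.

```python
def select_result(distances, reject_threshold):
    """Pick the output code with the minimum distance, or None to signal a reject.

    `distances` is a list of (distance, output_code) pairs for the
    standard patterns that passed the word length check.
    """
    if not distances:
        return None  # no pattern was within the word length tolerance
    best_distance, best_code = min(distances, key=lambda item: item[0])
    if best_distance > reject_threshold:
        return None  # even the closest pattern is too dissimilar
    return best_code
```

Returning `None` stands in for the reject output; an input whose duration matched no standard pattern is rejected without any distance being compared.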

Processing then returns to reading tolerance setting code 105 from external input section 10. Any tolerance setting code 105 can be entered from external input section 10, and if it changes, the changed code 105 is read. This allows the optimum tolerance width to be set according to the usage conditions and the speaker. In practice, it was confirmed that even with a restriction of, for example, a ±20 to 30% tolerance range, performance against unknown input speech improves while the effect on the recognition rate remains small.

The above embodiment showed a method that searches for similar patterns using the inter-pattern distance as the measure. In a device whose pattern creation section represents patterns as symbols and yields a sequence of their transitions along the time axis, and which takes as the recognition result the pattern that exactly matches this transition sequence, it is equally easy to construct the algorithm so that the match is found first, the word length tolerance is then calculated, and the number of the matched pattern is output if the input word length is within the tolerance, the input word being rejected otherwise.
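A minimal Python sketch of this symbolic variant follows. The tuple layout of `patterns`, the function name, and the symmetric ratio tolerance are assumptions for illustration; the sketch also stops at the first exactly matching sequence, which the source does not address.

```python
def recognize_symbolic(input_symbols, input_duration, patterns, ratio=0.25):
    """Exact-match recognition followed by a word length check.

    `patterns` is an iterable of (symbol_sequence, standard_duration, number).
    Returns the matched pattern number, or None when the input is rejected.
    """
    for symbols, duration, number in patterns:
        if tuple(symbols) == tuple(input_symbols):
            # The tolerance is computed only after an exact match is found.
            low, high = duration * (1 - ratio), duration * (1 + ratio)
            if low <= input_duration <= high:
                return number
            return None  # sequence matched but word length is out of tolerance
    return None  # no standard pattern has this transition sequence
```

Compared with the distance-based procedure, the length check here acts as a final verification rather than a preselection.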

<Effects of the Invention> As described above, according to the present invention, duration information for each word is added to the standard pattern and pattern matching is performed with the word length restricted accordingly, so that a useful word speech recognition device with improved performance against unknown input speech (ambient noise in actual use, etc.) can be provided.

In particular, for speaker-independent use, if the constant-level crossing count, which varies relatively little between speakers, is adopted as the feature parameter, then, given that this feature by its nature yields similar phoneme sequences, recognition [...] is improved, and the device can be realized as an integrated circuit such as a one-chip LSI.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention; FIG. 2 is a block diagram for the case where a one-chip microcomputer is used; FIG. 3 is a diagram showing a speech waveform; FIG. 4 is a diagram contrasting the speech waveform (a) with the feature series (b); FIG. 5 is a memory map of the standard patterns; and FIG. 6 is a flowchart explaining the operation of the main part.

1: microphone, 2: preprocessing section, 3: feature analysis section, 4: word interval extraction section, 5: pattern creation section, 6: standard pattern storage section, 7: input pattern storage section, 8: pattern matching section, 9: control section, 10: external input section, 11: one-chip microcomputer, 12: CPU, 13: ROM, 14: RAM. X: feature series, Y: standard word duration.

Agent: Patent attorney 福士 愛彦 (and 2 others)

Claims

[Claims]
1. A word speech recognition device comprising: means for storing, as standard patterns to be compared, standard word duration information in addition to feature series patterns; means for measuring the duration of an input word; means for calculating a tolerance from each item of standard word duration information in the standard patterns; means for determining whether the duration of the input word falls within the tolerance; and means for performing pattern matching only on standard patterns that are within the tolerance according to the result of the determination.
2. The word speech recognition device according to claim 1, wherein the tolerance is set externally on the basis of the word durations in the standard patterns.
3. The word speech recognition device according to claim 1, wherein a constant-level crossing waveform is used as the speech feature and the device is integrated on one chip.