JPS6344699A

JPS6344699A - Voice recognition equipment

Info

Publication number: JPS6344699A
Application number: JP61189246A
Authority: JP
Inventors: 西山　敏雄; 弘岡本; 貞治江守
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1986-08-11
Filing date: 1986-08-11
Publication date: 1988-02-25

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] "Field of Industrial Application" This invention relates to a speech recognition device that recognizes input speech by a top-down method using phonemes or syllables as recognition units.

"Prior Art" In this type of speech recognition device, the input speech signal is stored in memory as a time series of feature parameters such as power and spectrum, while the words to be recognized are stored in a dictionary memory as sequences of phonemes or syllables. During recognition, a word in the dictionary is assumed, and for each phoneme or syllable in that word's sequence the length, level, and so on are determined according to predetermined rules. The device then judges whether the input speech signal can be divided to correspond to that phoneme or syllable. If the corresponding phoneme or syllable can be cut out, a value indicating the degree of correspondence (similarity) is assigned and the next phoneme or syllable is cut out; that is, segmentation and scoring are performed. If the input speech signal can be segmented by the assumed word, the scores are totaled, and if the total is at or above a predetermined value, the input speech signal is recognized as the assumed word.
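The conventional top-down procedure described above can be sketched roughly as follows. This is an illustrative sketch only: `verify_word`, the `Segmenter` signature, and `toy_segment` are hypothetical names introduced here, and the toy segmenter works on frames that are already labelled, which a real device would not have.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical segmenter: given the frames, a start index and a unit label,
# return (end index, score) if the unit can be cut out there, else None.
Segmenter = Callable[[Sequence[str], int, str], Optional[Tuple[int, float]]]


def verify_word(frames: Sequence[str], units: Sequence[str],
                segment: Segmenter, threshold: float) -> Optional[float]:
    """Return the total score if the assumed word is accepted, else None."""
    pos = 0
    total = 0.0
    for unit in units:
        result = segment(frames, pos, unit)
        if result is None:
            return None      # the unit could not be cut out: reject the word
        pos, score = result
        total += score       # scoring accompanies each segmentation
    # Accept only if the summed score reaches the predetermined value.
    return total if total >= threshold else None


def toy_segment(frames: Sequence[str], start: int,
                unit: str) -> Optional[Tuple[int, float]]:
    """Toy stand-in: consume a run of frames labelled `unit`, score 1.0."""
    end = start
    while end < len(frames) and frames[end] == unit:
        end += 1
    return (end, 1.0) if end > start else None
```

Note that nothing in this loop looks at how much input remains, which is the gap the invention addresses.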

In conventional top-down speech recognition devices that use phonemes or syllables as recognition units, the duration of a phoneme was handled, when the phoneme is a vowel, simply by assigning a fixed duration (Acoustical Society of Japan, Speech Research Group material S-82-62 (Dec. 20, 1982)). When the content or manner of the speaker's utterance is not restricted, the deviation from the fixed duration becomes large, which lowers the recognition rate.

In addition, consonant durations differ greatly between consonant groups such as voiceless plosives and voiceless fricatives (IEICE, ed., "Hearing and Speech", 3rd edition, 1982), so using a single fixed duration for all consonants lumped together also lowers the recognition rate.

To address these drawbacks, conventional speech recognition devices loosened the constraints on phoneme or syllable duration without using any concrete decision means. As a result, even when a phoneme or syllable whose duration clearly differs from the speech signal is assumed, the assumed word is hard to eliminate, and the amount of processing increases.

Furthermore, because processing proceeds sequentially from the first phoneme or syllable, even when the number of phonemes or syllables in the assumed word is clearly too small or too large for the length of the input speech signal, processing is carried out without taking that mismatch into account, which also increases the amount of processing.

The object of this invention is to eliminate both the drop in recognition rate caused by assigning fixed durations to vowels and consonants and the increase in processing caused by loosening duration constraints without concrete decision means. To this end, it provides a speech recognition device that actively exploits temporal features: based on statistical data on the duration of phonemes or syllables, individually or by group, the speech length estimated from the assumed word is compared with the actual speech length of the input speech signal, and a statistical decision is made.

"Means for Solving the Problems" The principal feature of this invention is that the temporal characteristics of speech, which have conventionally been used only in an auxiliary way or avoided altogether, are used, in the form of phoneme or syllable durations, in a speech recognition device whose recognition units are phonemes or syllables.

Specifically, statistical data on phoneme or syllable durations are obtained for each phoneme or syllable, individually or by group, and stored in a duration memory. Following the phoneme or syllable sequence of the assumed word, each time the segmentation of a phoneme or syllable is completed, the device calculates the length of the input speech that remains and the estimated speech length obtained from the duration statistics of the unprocessed phonemes or syllables in the assumed word. The remaining speech length and the estimated speech length are compared, and if they differ by a predetermined value or more, for example by at least the estimation error expected from the variance of the duration statistics of each phoneme or syllable, the subsequent phoneme or syllable recognition processing (segmentation) is terminated, cutting out unnecessary processing.
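One way to picture the duration memory and the estimate derived from it is sketched below. All names (`DurationStats`, `lookup`, `estimate_remaining`) and all numbers are hypothetical placeholders, not values from this publication. The error bound assumes independent durations, so that variances add, and uses two standard deviations; the text only says that the bound is derived from the variance.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class DurationStats:
    mean: float      # mean duration, in frames
    variance: float  # variance of the duration, in frames squared


# Invented placeholder figures; a real device would hold measured statistics.
PER_UNIT: Dict[str, DurationStats] = {"a": DurationStats(10.0, 9.0)}
PER_GROUP: Dict[str, DurationStats] = {
    "vowel": DurationStats(9.0, 16.0),
    "voiceless_plosive": DurationStats(6.0, 4.0),
    "voiceless_fricative": DurationStats(11.0, 9.0),
}
GROUP_OF: Dict[str, str] = {
    "a": "vowel", "i": "vowel",
    "k": "voiceless_plosive", "s": "voiceless_fricative",
}


def lookup(unit: str) -> DurationStats:
    """Prefer statistics for the individual unit, else fall back to its group."""
    if unit in PER_UNIT:
        return PER_UNIT[unit]
    return PER_GROUP[GROUP_OF[unit]]


def estimate_remaining(units: Sequence[str]) -> Tuple[float, float]:
    """Return (estimated length, error bound) for the unprocessed units."""
    stats = [lookup(u) for u in units]
    length = sum(s.mean for s in stats)
    # Assumption: independent durations, bound = 2 standard deviations.
    bound = 2.0 * math.sqrt(sum(s.variance for s in stats))
    return length, bound
```

Keeping group-level entries alongside per-unit ones mirrors the "individually or by group" wording: a unit without its own statistics still gets a duration that reflects its consonant or vowel class.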

"Embodiment" FIG. 1 shows an embodiment of this invention.

A speech signal input from input terminal 11 is converted into a digital signal by A/D converter 12, and the digital signal is converted into a feature parameter time series by feature extraction unit 13. The output of feature extraction unit 13 is stored in time-series memory 14.
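As a rough illustration of what feature extraction unit 13 might produce, the sketch below computes a short-time log power series from digitized samples. Power is one of the feature parameters the text names; the function name `log_power_series`, the frame length, and the hop size are assumptions chosen for illustration and do not come from this publication.

```python
import math
from typing import List, Sequence


def log_power_series(samples: Sequence[float], frame_len: int = 160,
                     hop: int = 80) -> List[float]:
    """Short-time log power (dB) of a digitized signal, one value per frame."""
    series: List[float] = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small floor keeps log10 defined for silent frames.
        series.append(10.0 * math.log10(power + 1e-12))
    return series
```

The resulting list plays the role of the time series held in time-series memory 14; its length in frames is what the later length comparisons would be measured in.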

Recognition processing unit 15 reads from dictionary memory 16 the phoneme or syllable sequence of the word assumed for the input speech signal and performs recognition processing, that is, segmentation and scoring, on the input speech signal one phoneme or syllable at a time (step S1 in FIG. 2). When the recognition processing of one phoneme or syllable is finished and the assumed word has not been rejected (step S2), the speech length of the unprocessed part of the input speech signal (L - Lp) is calculated (step S3), and for the subsequent phonemes or syllables the duration statistics are read from duration memory 17 and the estimated speech length ΣLk, given by their sum, is calculated (step S4).

The speech length of this unprocessed part of the speech signal is compared with the estimated speech length. If the difference is at or above the estimation error E expected from the variance of the phoneme or syllable duration statistics, recognition processing for the subsequent phonemes or syllables is terminated; if it is within E, processing returns to step S1 and recognition continues (step S5).
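The flow of steps S1 to S5 can be sketched as a loop with an early exit. This is a minimal sketch under stated assumptions: `verify_with_pruning`, `toy_segment`, the status strings, and the use of a single fixed `error_e` are all hypothetical, lengths are counted in frames, and scoring is omitted to keep the focus on the length check.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

Segmenter = Callable[[Sequence[str], int, str], Optional[Tuple[int, float]]]


def verify_with_pruning(frames: Sequence[str], units: Sequence[str],
                        segment: Segmenter, mean_len: Mapping[str, float],
                        error_e: float) -> Tuple[str, int]:
    """Return (status, units segmented); status is 'done', 'rejected' or 'pruned'."""
    pos = 0
    for p, unit in enumerate(units, start=1):
        result = segment(frames, pos, unit)               # step S1
        if result is None:                                # step S2: rejected
            return "rejected", p - 1
        pos, _score = result
        remaining = len(frames) - pos                     # step S3: L - Lp
        estimated = sum(mean_len[u] for u in units[p:])   # step S4: sum of Lk, k > p
        if abs(remaining - estimated) >= error_e:         # step S5
            return "pruned", p
    return "done", len(units)


def toy_segment(frames: Sequence[str], start: int,
                unit: str) -> Optional[Tuple[int, float]]:
    """Toy stand-in: consume a run of frames labelled `unit`, score 1.0."""
    end = start
    while end < len(frames) and frames[end] == unit:
        end += 1
    return (end, 1.0) if end > start else None
```

For example, with seven input frames and an assumed four-unit word whose remaining three units are expected to take fifteen frames, the check after the first unit already finds a mismatch of eleven frames and the word is dropped without attempting the remaining segmentations.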

FIG. 3 is a conceptual diagram of the case where processing has been carried out from the beginning up to the p-th phoneme or syllable and is terminated at that point.

The termination decision at this point is made when the following expression is satisfied.
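The expression itself is not reproduced in this text. Judging from steps S3 to S5 and the symbol definitions that follow, it presumably has the form below; this is a reconstruction, not the published formula.

```latex
\left| \,(L - L_p) \;-\; \sum_{k=p+1}^{N} L_k \,\right| \;\ge\; E
```

How E is computed from the variances σi is likewise not recoverable here. One plausible choice, stated purely as an assumption, is a multiple of the standard deviation of the summed remaining durations, E = c·sqrt(Σ σi) over i = p+1, ..., N, which holds if the individual durations are treated as independent.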

Here,
L: speech length of the speech signal
Lp: speech length up to the point where the p-th phoneme or syllable has been processed
Lk: duration statistic of the k-th phoneme or syllable (k > p)
N: number of phonemes or syllables in the word assumed for the speech signal
E: estimation error
σi: variance of the i-th phoneme

Using this expression, FIG. 4 shows how processing is actually terminated when /aizuwakamatu/ is assumed from dictionary memory 16 for the speech signal /abashiri/ in time-series memory 14.

That is, the input speech signal /abashiri/ is shown by curve 21, and for the opening part of the assumed word /aizuwakamatu/ there are five segmentation candidates: two variants of /a/ and three variants of the combination /ai/. Because the initial /a/ of the assumed word is a vowel, recognition processing first checks whether a vowel onset (marked * in the figure) exists in the input speech signal. Recognition processing for /a/ is then performed on the first candidate, and the input speech signal is segmented by /a/. At this point the speech length of the remaining part of the input speech, /b/ through /i/, is compared with the estimated speech length of the remaining part of the assumed word, /i/ through /u/. Because the latter is longer by at least the estimation error E, recognition processing for this candidate is terminated and processing moves on to the next candidate. For the second candidate, recognition processing is likewise terminated after segmentation for /a/.

For the third and fourth candidates, /ai/ is tentatively matched to /a/ or /ab/ in the input speech signal, after which the remaining speech length and the estimated remaining speech length are compared and recognition processing is terminated. For the fifth candidate, /ai/ is matched to /a/ in the input speech signal. Because this /ai/ is short, the difference between the speech length of the remaining part of the input speech signal, /b/ through /i/, and the estimated speech length of the remaining part of the assumed word, /z/ through /u/, is within the estimation error E. Processing therefore moves on to recognition of the next unit, /z/, which is matched to /b/ in the input speech signal. Comparing the remaining speech length with the estimated speech length at that point shows a difference of at least E, and recognition processing is terminated.

The portions drawn with thick lines in the figure are those where recognition processing is cut off by this invention. In a conventional device, recognition processing would be carried out over these thick-line portions and only terminated afterwards, at the points marked X. In other words, this invention reduces the amount of processing, relative to the conventional approach, by the amount represented by the thick-line portions.

As is clear from the above, compared with the conventional technique the device of this invention reduces the amount of processing. Moreover, because it allows an estimation error that takes into account the statistical variation in the duration of each phoneme or syllable, the risk of terminating processing inadvertently is small, and the drop in recognition rate caused by using phoneme or syllable durations can be kept small.

"Effects of the Invention" As explained above, applying this invention makes duration available as a temporal feature of phonemes or syllables, and on that basis a decision can be made by comparing the estimated speech length of the word assumed for the speech signal with the actual input speech length of the speech signal. This has the advantage of reducing the processing spent on assumed words whose phoneme or syllable sequences clearly differ from the input in number or kind.

When recognition applying this invention was carried out phoneme by phoneme under the above conditions, the amount of processing was reduced by about 20% with a 100-word dictionary memory (100-word recognition) and by about 30% with a 1000-word dictionary memory (1000-word recognition).

In addition, because the comparison between the estimated speech length and the input speech length in this invention allows for an estimation error that accounts for mistakes in the estimate, it also has the advantage of limiting the drop in recognition rate that results from using statistical data on phoneme or syllable durations.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of this invention, FIG. 2 is a flowchart showing the main part of its recognition processing, FIG. 3 is a conceptual diagram of deciding to terminate processing on the basis of the estimate, and FIG. 4 is a diagram showing how processing is actually terminated by this invention. Patent applicant: Nippon Telegraph and Telephone Corporation. Agent: Taku Kusano.

Claims

[Claims]

(1) A speech recognition device in which an input speech signal is stored in a time-series memory as a time series of its feature parameters, a word is assumed from a dictionary memory storing the words to be recognized, the parameter time series of the input speech signal is segmented phoneme by phoneme or syllable by syllable following the phoneme or syllable sequence of that word, and each segmentation is scored for recognition, the device comprising: a duration memory storing statistical data on the durations of phonemes or syllables, individually or by group; speech length calculation means for calculating, at each segmentation, the remaining speech length of the input speech signal that has not yet been segmented; estimated speech length calculation means for reading from the duration memory the duration statistics corresponding to the phonemes or syllables in the assumed word and calculating the estimated speech length of the sequence of phonemes or syllables remaining after the phoneme or syllable whose segmentation has been completed in the assumed word; and determination means for comparing the estimated speech length calculated by the estimated speech length calculation means with the speech length calculated by the speech length calculation means and determining whether to terminate recognition processing for the assumed word.