JPS6165297A

JPS6165297A - Voice recognition system

Info

Publication number: JPS6165297A
Application number: JP59186342A
Authority: JP
Inventors: 正和秋山; 吉明北爪; 利一安江; 遠藤　武之
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1984-09-07
Filing date: 1984-09-07
Publication date: 1986-04-03

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

[Field of Application of the Invention]

The present invention relates to a speech recognition device, and in particular to a speech recognition method suited to preventing misrecognition caused by non-stationary noise.

[Background of the Invention]

A speech recognition device generally monitors the power of the input speech, cuts out the speech section to be recognized, and performs recognition processing within that section. Consequently, whenever speech or noise above a certain power level enters through the microphone or the like, a recognition result is output for that section. As a result, the voices of other people nearby, the speaker's own muttering, and similar sounds are output as unrecognizable. Furthermore, because matching is performed over the entire input speech section, saying "Eh... Tokyo" when the word "Tokyo" is meant to be recognized results in a recognition failure. Conventional recognition devices thus misrecognize or fail to recognize because of non-stationary noise or human utterance errors.

As a countermeasure, a method has been proposed, for example in Japanese Patent Application Laid-Open No. 58-23097, "Speech Recognition Device", in which noises are registered in advance and no result is output for those sounds. This approach is constrained in terms of registration effort and the number of recognizable words, and cannot be called sufficient.

[Object of the Invention]

The object of the present invention is to provide a highly practical speech recognition method with a high recognition rate that avoids the misrecognition and recognition failures caused by non-stationary noise arising during speech input and by noise attached to the beginning or end of the word to be recognized.

[Summary of the Invention]

The present invention provides a speech recognition method that uses continuous DP processing (continuous matching processing) and exploits its advantage that speech section extraction is unnecessary. That is, matching against the registered patterns is not confined to the input speech section as a whole, and if no suitable candidate is found within the speech section, the input is regarded as noise or a mis-utterance and recognition continues. The result is a speech recognition method (hereinafter called the word spotting method) that recognizes only the intended input speech without being affected by unnecessary speech or non-stationary noise.

[Embodiments of the Invention]

FIG. 1 shows the configuration of a speech recognition device to which the present invention is applied. In the figure, reference numeral 1 denotes an A/D conversion unit that digitizes the input speech, and 2 denotes a speech analysis unit that obtains the speech power in each frequency band and extracts a feature pattern.

Reference numeral 3 denotes a distance calculation unit that calculates the distance between the input feature pattern and the standard patterns registered in advance in standard pattern memory 5; 4 denotes a matching unit that performs matching by continuous DP processing based on the distance calculation results; and 6 denotes a control unit that controls units 2, 3, and 4 and performs recognition by selecting among the matching results obtained from matching unit 4. For the matching unit based on continuous DP processing, see Japanese Patent Application Laid-Open No. 57-83880, "Matching Method Switching Control Method", and Japanese Patent Application Laid-Open No. 55-2205, "Real-Time Continuous Speech Recognition Device".

In the continuous DP method applied in the present invention, when continuous speech is input, each input frame is matched against each of the standard patterns registered in advance. When speech similar to a standard pattern is present, the number of that standard pattern is output together with the matched speech interval and the matching score (hereinafter the DP value). In other words, a matching result is obtained for every input frame. Therefore, when matching the input pattern against a registered pattern, it is not necessary to cut out the input speech section accurately and match over the entire section.

Accordingly, as shown in the flowchart of FIG. 2, once speech is input, continuous DP processing continues until the speech section ends, and candidates with small DP values are preferentially adopted as recognition results regardless of the input speech section. As a validity check on the recognition result, candidates with short speech intervals, which tend to arise in continuous DP, are discarded to guard against misrecognition. Alternatively, the maximum and minimum matching interval may be fixed in advance for each word to be recognized, and a word accepted only when its matched interval falls within those limits.

For example, suppose the standard pattern "Tokyo" is registered and the speech "Eh... Tokyo" is input, giving the speech section shown as (c) in FIG. 3. Over this section the matching value against the standard pattern "Tokyo" varies as shown by curve α in FIG. 3, and at local minimum A the pattern number of "Tokyo", the matched speech interval, and the DP value are output. The matched speech interval here is (a), but the recognition result is obtained independently of the input speech section.

The noise "Eh", on the other hand, is not registered, so a local minimum of the DP value rarely appears for it. Even if, depending on the registered words, a local minimum such as point B on curve β in FIG. 3 does appear, its DP value cannot be low, and it is removed.

Suppose further that an input such as "Kyo wa ..." ("Today ...") arrives as noise. Because continuous DP processing runs continuously during recognition, the "kyo" of "Tokyo" resembles the "kyo" of "kyo" ("today"), and a local minimum of the DP value such as point C on curve γ in FIG. 3 may appear. The matched interval is also output at this point, giving a matching length such as (b). In this case the validity check excludes the candidate because its matching interval is shorter than specified. By removing candidates that matched over a speech interval shorter than the reference value for the registered word, misrecognition due to partial matching, which is a weakness of continuous DP processing, can be prevented. The matching value and the valid matching interval are determined as follows.

The matching value must be chosen according to the constraint on the number of arithmetic bits in the matching unit, that is, its dynamic range. Previous studies have shown that about 10 bits is sufficient for the normalized output. Allowing a margin for the decision, the maximum value must be 4 to 5 times the matching value of a correct match, so under this constraint the threshold for accepting a match is set to about 200 (a 10-bit output has a maximum of 1023, and 1023/5 ≈ 205).

The matching interval must be chosen according to the length of the words being handled. A range of about 0.8 to 1.2 times the word length stored in the standard pattern table for each word is accepted as correct. This is because experiments to date have shown that the expansion and contraction along the time axis that can occur in human utterances is about ±20%.

These matching values and matching intervals are, of course, subject to change depending on the system.

[Effects of the Invention]

According to the present invention, words uttered by mistake during speech recognition, surrounding non-stationary noise, the speaker's breath, and the like do not cause recognition failure or misrecognition, which markedly improves the recognition rate of speech recognition devices at a practical level.
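The per-frame matching that the embodiment attributes to continuous DP can be illustrated with a minimal sketch. It is not the continuous DP recurrence of the cited publications: it substitutes a plain open-begin, open-end dynamic time warping with symmetric steps, uses one scalar feature per frame, and normalizes the score by the template length. The function name `continuous_dp` and all of these simplifications are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

INF = float("inf")

def continuous_dp(frames: Sequence[float],
                  template: Sequence[float]) -> List[Tuple[float, int]]:
    """For each input frame t, return (dp_value, start_frame) of the best
    alignment of the whole template that ends at frame t.

    Simplified stand-in for continuous DP (assumption): scalar features,
    absolute-difference distance, symmetric DTW steps.
    """
    n = len(template)
    prev = [(INF, 0)] * n          # (accumulated cost, start frame) at t - 1
    results = []
    for t, x in enumerate(frames):
        cur = []
        for j, ref in enumerate(template):
            d = abs(x - ref)
            if j == 0:
                # Open beginning: a match may start at any input frame.
                cur.append((d, t))
            else:
                # Best predecessor: stay on template frame j, advance both,
                # or advance only the template.
                cost, start = min(prev[j], prev[j - 1], cur[j - 1])
                cur.append((cost + d, start))
        end_cost, start = cur[-1]
        results.append((end_cost / n, start))   # one result per input frame
        prev = cur
    return results

# Toy input: the "word" 1, 2, 3 embedded in other sound.
frames = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
template = [1.0, 2.0, 3.0]
results = continuous_dp(frames, template)
best_end = min(range(len(results)), key=lambda t: results[t][0])
dp_value, start = results[best_end]
print(f"matched frames {start}..{best_end}, DP value {dp_value}")
```

Because a result is produced at every frame, the matched interval (`start` to `best_end`) falls out of the search itself, so no prior cutting-out of the speech section is needed for the match.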

[Brief explanation of drawings]

FIG. 1 is a block diagram of a speech recognition device to which the present invention is applied; FIG. 2 is a flowchart explaining the recognition method; and FIG. 3 is a waveform diagram showing the relationship between the speech section, speech power, and DP value.

1: A/D conversion unit, 2: speech analysis unit, 3: distance calculation unit, 4: matching unit, 5: standard pattern memory, 6: control unit.

Agent: Patent Attorney Akio Takahashi (高橋 明夫)

Claims

[Claims]

In a speech recognition device comprising a speech input unit, an analysis unit, a distance calculation unit, a matching unit, a standard pattern memory, and a control unit, a speech recognition method characterized in that: the matching unit is provided with continuous matching means; speech digitized by the speech input unit is input to the analysis unit, which extracts the features contained in the speech; the control unit uses that data to detect speech sections; meanwhile, the registered patterns stored in advance in the standard pattern memory are continuously matched against the input speech pattern in the matching unit; and the control unit receives the matching value and the matching interval output from the matching unit and adopts the word as the recognition result only when the matching value is smaller than a preset threshold and the input speech interval and the matching interval have a predetermined relationship.
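As a rough illustration of the acceptance condition in the claim, the sketch below filters candidates by DP value and matched length and then picks the one with the smallest DP value. It assumes that the "predetermined relationship" is the 0.8 to 1.2 length ratio given in the description, measured against the registered word length, and it uses the description's threshold of about 200. The names `Candidate`, `is_valid`, and `recognize` are hypothetical, and the numbers in the usage lines are invented test values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

DP_THRESHOLD = 200.0              # "about 200" in the description
MIN_RATIO, MAX_RATIO = 0.8, 1.2   # +/-20% time-axis variation

@dataclass
class Candidate:
    word: str
    dp_value: float       # matching value output by the matching unit
    matched_frames: int   # length of the matched interval
    word_frames: int      # registered word length (standard pattern table)

def is_valid(c: Candidate) -> bool:
    """Accept only low DP values with a plausible matched length."""
    ratio = c.matched_frames / c.word_frames
    return c.dp_value < DP_THRESHOLD and MIN_RATIO <= ratio <= MAX_RATIO

def recognize(candidates: Iterable[Candidate]) -> Optional[str]:
    """Return the valid candidate with the smallest DP value, or None
    when everything is treated as noise or a mis-utterance."""
    valid = [c for c in candidates if is_valid(c)]
    if not valid:
        return None
    return min(valid, key=lambda c: c.dp_value).word

# Invented values: a full match, a partial match on "kyo", and noise.
full = Candidate("tokyo", dp_value=120.0, matched_frames=48, word_frames=50)
partial = Candidate("tokyo", dp_value=90.0, matched_frames=20, word_frames=50)
noise = Candidate("tokyo", dp_value=450.0, matched_frames=50, word_frames=50)

print(recognize([partial, noise, full]))
print(recognize([partial, noise]))
```

The partial match has the lowest DP value of the three, so a rule based on DP value alone would accept it; the length check is what removes it.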