JPS63123100A

JPS63123100A - Voice recognition

Info

Publication number: JPS63123100A
Application number: JP61269116A
Authority: JP
Inventors: 二矢田　勝行
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1986-11-12
Filing date: 1986-11-12
Publication date: 1988-05-26
Anticipated expiration: 2013-02-10
Also published as: JP2710045B2

Abstract

(57) [Summary] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition method that enables a machine to recognize the human voice.

Prior Art

Word speech recognition devices have grown more capable, smaller, and cheaper in recent years, and have gradually come into wider use. They are nevertheless far from easy to use, because they impose various restrictions: a limit on vocabulary size, limits on the operating environment, a requirement that the user say no superfluous words, and so on. Speaker-independent devices in particular are used by people who do not know about these restrictions. For this reason, a speaker-independent word speech recognition device is generally combined with a voice response device to form a question-and-answer system. That is, guide speech emitted by the voice response device steers what the user says, so that the user can operate the device without being aware of its restrictions.

For example, reserving a train ticket with a device that recognizes 12 words (the ten digits plus "hai" (yes) and "iie" (no)) proceeds as follows.

Example 1:

Device: How many tickets do you need? Please answer with a number.

User: 4 (yon).

Device: Four Shinkansen tickets, correct?

User: Yes (hai).

Fig. 4 is a functional block diagram of a speaker-independent recognition response system currently in use (conventional example). When a task on the host computer 51 is started, guide speech is output from the speech synthesis unit 52. When the user speaks as the guide speech instructs, the utterance is recognized by the speech recognition unit 50, and the recognition result is sent to the host computer 51. The host computer 51 interprets the user's intention, composes a response sentence, and outputs the response speech from the speech synthesis unit 52. The user then makes the next utterance, and the task proceeds by repeating this procedure.

In the speech recognition unit 50, recognition proceeds as follows. First, the acoustic analysis unit 53 performs filter analysis, LPC analysis, power calculation, and the like, and extracts parameters.

The speech interval detection unit 54 detects the speech interval of the input speech from the acoustic analysis results. The similarity calculation unit 55 calculates the similarity between the parameters contained in the speech interval and the standard pattern of each utterance stored in the standard pattern unit 56, and takes the word corresponding to the most similar standard pattern as the recognition result.

Problems to Be Solved by the Invention

Used in the question-and-answer format described above, the device can be operated by anyone with some understanding of speech recognition devices.

However, people who know nothing about the limitations of speech recognition devices will not necessarily reply with the expected expression.

In the case of Example 1 above, for instance, the user might reply as in the marked replies of Example 2 below.

Example 2:

Device: How many tickets do you need? Please answer with a number.

① User: Um, four tickets. (Eeto, yon-mai desu.)

Device: Four Shinkansen tickets, correct?

② User: Yes, that's right. (Hai, sou desu.)

The user's marked replies in Example 2 are expressions used naturally in conversation. Speech recognition that targets isolated words, however, cannot recognize sentence speech of this kind, so such replies are usually either rejected or misrecognized. In other words, a conventional word speech recognition device requires the user to reply with a fixed expression, as in Example 1, and tolerates not even the slightest variation.

This is a very severe restriction in practice and a major reason why the field of application of speech recognition devices remains narrow.

The present invention aims to solve this problem of the conventional example and to ease the restrictions on using word speech recognition devices.

That is, its object is to ease those restrictions by providing a means of recognizing the input correctly even when the input speech contains some superfluous parts in addition to the expression to be recognized.

Means for Solving the Problems

The present invention achieves the above object. Its technical means is a speech recognition method characterized by the following. An input signal containing noise and unnecessary speech is analyzed and converted into an input parameter time series. The similarity between a subinterval of the input parameter time series and the standard patterns of the recognition target words, prepared in advance, is calculated continuously while the subinterval is shifted one unit interval at a time from the beginning to the end of the input parameter time series. One or more recognition word candidate intervals in which the similarity is high are cut out, and the word names corresponding to those intervals are taken as recognition word candidates. Separately, the pitch frequency and its temporal change pattern are obtained continuously from the input signal. The recognized word is then determined from among the recognition word candidates using the magnitude of the pitch frequency and its change pattern in or near the recognition word candidate intervals.

Operation

The present invention does not treat the whole speech interval as a single word. From a sufficiently wide interval that contains the recognition target word (together with the superfluous words before and after it and the noise before and after the speech), it cuts out and recognizes only the recognition target word, using the similarity values from pattern matching and the magnitude and movement of the pitch frequency. That is, for one part of the sufficiently wide interval containing the recognition target word, the similarity to each standard pattern of the recognition target words is calculated; the part is then shifted by one unit interval and the similarity is calculated in the same way; and so on, until the movement of the similarity to each standard pattern has been obtained over the whole interval. The intervals (there may be more than one) in which the similarity is high are then found. Next, the pitch frequency in each high-similarity interval is examined, and the interval is recognized as the interval of a recognition target word only if its pitch frequency is stably higher than in its surroundings.

If the pitch frequency is not stably higher than in the surroundings, the interval is rejected. By spotting and recognizing only the recognition target word in this way, correct recognition results can be obtained even for input that contains superfluous words or noise. This eases the restrictions on using a word speech recognition device and makes an easy-to-use speech recognition response system possible. It also allows people unfamiliar with the device to use it, which broadens the applications of speech recognition devices.

Embodiment

An embodiment of the present invention is described below.

The present invention is a method that cuts keywords out of uttered speech by pattern matching (word spotting) and recognizes them by verifying them against the pitch frequency change pattern. Word spotting scans each word standard pattern across the whole of the input speech while calculating the similarity, and extracts the intervals in which the similarity becomes high together with the corresponding word names. The word extracted with the maximum similarity is not necessarily the correct one, however, so a criterion is set and several candidates that satisfy it are extracted. These candidates are then narrowed down to a single word using the pitch frequency change pattern and the similarity values.

Pitch frequency is used in order to exploit the fact that people utter the important parts of what they say at a higher frequency, and more distinctly, than the other parts. Experience with speech analysis shows that this tendency is especially strong in conversation. In question-and-answer sentences such as those of Example 2, the important word (the keyword) is the recognition target word, so the pitch frequency can be used to locate the recognition target word.

Fig. 1 is a functional block diagram of a recognition response system that uses the speech recognition method of one embodiment of the present invention, which is based on this idea. In the figure, the functions of the host computer 51 and the speech synthesis unit 52 are exactly the same as in the conventional example, so only the speech recognition unit 6 is described.

The input speech is divided into analysis intervals (frames) by the acoustic analysis unit 1, LPC analysis is performed on each frame, and LPC cepstrum coefficients are extracted.

The sampling frequency is 8 kHz, the frame period is 10 msec, and the analysis window is a 20 msec Hamming window. The LPC analysis order is 10, and the LPC cepstrum uses the coefficients up to the fifth order ($C_1$ to $C_5$) together with the power term $C_0$. The pitch extraction unit 4 obtains the pitch frequency for each frame and accumulates the values for a fixed period. Of the various pitch extraction methods, the simplest one, the waveform correlation method, is used. Letting the input signal be $x_i$, the correlation function $V_\tau$ is given by equation (1), which is not legible in the source text. Letting $\hat{\tau}$ be the value of $\tau$ that maximizes $V_\tau$, the pitch frequency is obtained as

$$f_0 = 8000 / \hat{\tau} \tag{2}$$
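The pitch extraction step can be sketched as follows. Because equation (1) is not legible in the source, this sketch assumes a plain (unnormalized) autocorrelation as the correlation function; the search range `f_min`/`f_max` and the function name are hypothetical choices for illustration, not values from the patent. Only the final step, dividing 8000 by the best lag, comes from equation (2).

```python
import math

FS = 8000  # sampling frequency in Hz, as stated in the text


def pitch_by_autocorrelation(frame, fs=FS, f_min=60.0, f_max=400.0):
    """Estimate the pitch frequency (Hz) of one frame; return 0.0 if undecidable.

    Assumption: the correlation function is a plain autocorrelation
    sum_i x[i] * x[i + lag]. f_min and f_max are illustrative search bounds.
    """
    if sum(x * x for x in frame) == 0.0:
        return 0.0  # silent frame: no pitch
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    if lag_max < lag_min:
        return 0.0  # frame too short for the requested search range
    best_lag, best_v = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlation of the frame with itself shifted by `lag` samples.
        v = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if v > best_v:
            best_lag, best_v = lag, v
    # Equation (2): pitch frequency = sampling frequency / best lag.
    return fs / best_lag


# Usage: one 20 msec frame (160 samples) of a 200 Hz sine wave.
frame = [math.sin(2 * math.pi * 200 * i / FS) for i in range(160)]
f0 = pitch_by_autocorrelation(frame)
```

Running this once per 10 msec frame and storing the results would give the accumulated pitch contour that the later decision step examines.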

The similarity calculation unit 2 successively compares the input parameters (LPC cepstrum coefficients) with the standard pattern of each word in the word standard pattern unit 3, and cuts out and stores the parts with high similarity as words.

The similarity calculation is carried out not only where speech is present but over a sufficiently wide interval that includes the noise before and after it, so that speech interval detection is unnecessary.

Next, the method of spotting words in noise and speech by similarity calculation is described.

First, the distance measure used for pattern matching (a statistical distance measure) is described.

The input word speech is linearly expanded or contracted to a fixed length of $J$ frames. Letting $\mathbf{a}_j$ be the parameter vector of frame $j$, the input vector $\mathbf{A}$ is as follows, where $t$ denotes transposition:

$$\mathbf{A}^t = (\mathbf{a}_1^t, \mathbf{a}_2^t, \ldots, \mathbf{a}_J^t)$$

Here each $\mathbf{a}_j$ is a $p$-dimensional vector.

Let the standard pattern of word $\omega_n$ ($n = 1, 2, \ldots, N$) consist of a mean vector $\boldsymbol{\mu}_n$ and a covariance matrix $\mathbf{W}_n$. The recognition result is then the word that maximizes the posterior probability $P(\omega_n \mid \mathbf{A})$.

By Bayes' theorem,

$$P(\omega_n \mid \mathbf{A}) = P(\omega_n) \cdot P(\mathbf{A} \mid \omega_n) \,/\, P(\mathbf{A}) \tag{3}$$

The first factor on the right-hand side, $P(\omega_n)$, can be regarded as a constant. Assuming a normal distribution, the second factor is

$$P(\mathbf{A} \mid \omega_n) = (2\pi)^{-d/2}\,|\mathbf{W}_n|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{A} - \boldsymbol{\mu}_n)^t \mathbf{W}_n^{-1} (\mathbf{A} - \boldsymbol{\mu}_n)\right) \tag{4}$$

where $d = Jp$ is the dimension of $\mathbf{A}$. The denominator $P(\mathbf{A})$ can be regarded as a constant when the input parameters are the same, but it is not a constant when different inputs are compared with one another. Here $P(\mathbf{A})$ is assumed to follow a normal distribution with mean $\boldsymbol{\mu}_a$ and covariance matrix $\mathbf{W}_a$:

$$P(\mathbf{A}) = (2\pi)^{-d/2}\,|\mathbf{W}_a|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{A} - \boldsymbol{\mu}_a)^t \mathbf{W}_a^{-1} (\mathbf{A} - \boldsymbol{\mu}_a)\right) \tag{5}$$

Taking the logarithm of (3), omitting the constant terms (and scaling by $-2$, so that a smaller value corresponds to a larger posterior probability), and writing the result as $M_n$ gives

$$M_n = (\mathbf{A} - \boldsymbol{\mu}_n)^t \mathbf{W}_n^{-1} (\mathbf{A} - \boldsymbol{\mu}_n) - (\mathbf{A} - \boldsymbol{\mu}_a)^t \mathbf{W}_a^{-1} (\mathbf{A} - \boldsymbol{\mu}_a) + \log|\mathbf{W}_n| - \log|\mathbf{W}_a| \tag{6}$$

Here $\mathbf{W}_n$ and $\mathbf{W}_a$ are all set to a common matrix $\mathbf{W}$, namely

$$\mathbf{W} = (\mathbf{W}_1 + \mathbf{W}_2 + \cdots + \mathbf{W}_N + \mathbf{W}_a) \,/\, (N + 1) \tag{7}$$

Expanding equation (6) then gives

$$M_n = C_n - \mathbf{E}_n^t \mathbf{A} \tag{8}$$

where

$$\mathbf{E}_n = 2\,(\mathbf{W}^{-1}\boldsymbol{\mu}_n - \mathbf{W}^{-1}\boldsymbol{\mu}_a) \tag{9}$$

$$C_n = \boldsymbol{\mu}_n^t \mathbf{W}^{-1} \boldsymbol{\mu}_n - \boldsymbol{\mu}_a^t \mathbf{W}^{-1} \boldsymbol{\mu}_a \tag{10}$$

Equation (8) is a linear discriminant that requires little computation. Writing $\mathbf{E}_n^t = (\mathbf{e}_1^{(n)t}, \mathbf{e}_2^{(n)t}, \ldots, \mathbf{e}_J^{(n)t})$, equation (8) can be rewritten as

$$M_n = C_n - \sum_{j=1}^{J} d_j^{(n)}, \qquad d_j^{(n)} = \mathbf{e}_j^{(n)t}\,\mathbf{a}_j \tag{11}$$

That is, $M_n$ is obtained with $J$ additions of the per-frame partial similarities $d_j^{(n)}$ and one subtraction.
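As a numerical check on the derivation, the sketch below assumes the shared-covariance form described in the text: with a common covariance matrix, the difference of the two quadratic forms should collapse to the linear discriminant $M_n = C_n - \mathbf{E}_n^t \mathbf{A}$ with $\mathbf{E}_n = 2\mathbf{W}^{-1}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_a)$ and $C_n = \boldsymbol{\mu}_n^t \mathbf{W}^{-1} \boldsymbol{\mu}_n - \boldsymbol{\mu}_a^t \mathbf{W}^{-1} \boldsymbol{\mu}_a$. The two-dimensional values and the function names are made up for illustration; the inverse covariance (`precision`) is supplied directly to keep the example free of matrix inversion.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_form(x, mean, precision):
    """(x - mean)^t * precision * (x - mean)."""
    diff = [a - b for a, b in zip(x, mean)]
    return dot(diff, matvec(precision, diff))


def linear_discriminant(mu_n, mu_a, precision):
    """Return (E_n, C_n) for the shared-covariance case."""
    e_n = [2.0 * v for v in matvec(precision, [a - b for a, b in zip(mu_n, mu_a)])]
    c_n = dot(mu_n, matvec(precision, mu_n)) - dot(mu_a, matvec(precision, mu_a))
    return e_n, c_n


# Illustrative values only (not from the patent). `precision` is W^-1 and must be symmetric.
precision = [[2.0, 0.5], [0.5, 1.0]]
mu_n = [1.0, 2.0]   # mean of word n
mu_a = [0.0, 0.5]   # mean of the "all input" distribution
A = [0.8, 1.7]      # one input vector

# Quadratic form: the log-determinant terms cancel because W is shared.
m_quadratic = quadratic_form(A, mu_n, precision) - quadratic_form(A, mu_a, precision)

# Linear form: one dot product and one subtraction per word.
e_n, c_n = linear_discriminant(mu_n, mu_a, precision)
m_linear = c_n - dot(e_n, A)
```

The two values agree up to rounding, which is why only `e_n` and `c_n` need to be stored per word and the per-input cost reduces to a dot product.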

Next, the method of spotting and recognizing speech with the above distance measure is described, together with a way of reducing the amount of computation.

Word spotting takes a sufficiently long interval that is certain to contain the speech to be recognized, and sets a reference point $i$ within it. Various subintervals are considered relative to $i$; for each subinterval the similarity to each word is obtained with equation (11), and the word whose similarity is greatest over all the subintervals is taken as the recognition result for reference point $i$. The same operation is then repeated while advancing $i$ one unit interval at a time over the range $1$ to $I$. In this embodiment, the words whose similarity ranks within the top three are obtained together with their intervals.
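The scanning procedure can be sketched as a brute-force loop over end frames and subinterval lengths. This is a minimal illustration under stated assumptions, not the patent's implementation: the frame score is a stand-in (negative squared Euclidean distance, so larger is better) rather than the statistical measure of the text, the rounding used for linear time normalization is one possible choice, and keeping at most one candidate per word is an interpretation of "top three". All names are hypothetical.

```python
def warp_index(end, length, j, n_template):
    """Input frame aligned with template frame j when the `length` frames
    ending at `end` are linearly mapped onto n_template template frames."""
    start = end - length + 1
    if n_template == 1:
        return end
    return start + round(j * (length - 1) / (n_template - 1))


def frame_similarity(a, e):
    """Stand-in frame score (assumption): negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, e))


def spot_words(frames, templates, min_len, max_len, top_k=3):
    """Return up to top_k candidates (score, word, start, end), best first."""
    best = {}
    for end in range(len(frames)):                  # reference point i
        for length in range(min_len, max_len + 1):  # subinterval lengths
            start = end - length + 1
            if start < 0:
                break                               # longer intervals start even earlier
            for word, tmpl in templates.items():
                score = sum(
                    frame_similarity(frames[warp_index(end, length, j, len(tmpl))], tmpl[j])
                    for j in range(len(tmpl))
                )
                if word not in best or score > best[word][0]:
                    best[word] = (score, word, start, end)
    return sorted(best.values(), reverse=True)[:top_k]


# Usage with one-dimensional toy features: "yon" is embedded in frames 2..4.
templates = {"yon": [[1.0], [2.0], [3.0]], "hai": [[5.0], [5.0], [5.0]]}
frames = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0], [0.0]]
candidates = spot_words(frames, templates, min_len=2, max_len=4)
```

No speech interval detection is involved: every end frame is tried, including those in the leading and trailing silence.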

Executing this similarity calculation as is would require an enormous amount of computation, but the amount can be reduced greatly by limiting the subinterval length in view of word durations and by sharing the partial similarities $d_j$ during the calculation. Fig. 2 illustrates the method.

Fig. 2 shows how, when the input is matched against word $n$, a subinterval of length $\ell$ ($\ell_1 < \ell < \ell_2$) is linearly expanded or contracted to the standard pattern length $J$ and the similarity is calculated frame by frame with the end point fixed. The similarity is calculated with equation (11) along a route that starts from a point T on QR and ends at P.

Therefore, all similarity calculations for one frame are carried out inside triangle PQR. Now, $\mathbf{a}_j$ in equation (11) is the $j$-th frame component after the interval of length $\ell$ has been expanded or contracted, so a corresponding input frame $i'$ exists. Using the input vectors, $d_j$ can therefore be expressed as

$$d(i', j) = \mathbf{e}_j^{(n)t}\,\mathbf{a}_{i'} \tag{12}$$

where

$$i' = i - r_\ell(j) + 1 \tag{13}$$

Here $\mathbf{e}_j^{(n)}$ is the frame-$j$ component of the standard pattern vector $\mathbf{E}_n$, and $r_\ell(j)$ is the function that relates the word length $\ell$ to $J$ under linear expansion and contraction. Therefore, if the partial similarities between each input frame and $\mathbf{e}_j^{(n)}$ have been obtained beforehand, equation (11) can be computed simply by selecting and adding the partial similarities that satisfy the relation for $i'$. Since triangle PQR moves one frame to the right at each frame, the partial similarities between $\mathbf{e}_j^{(n)}$ and $\mathbf{a}_{i'}$ can be calculated on PS, stored in memory for the region corresponding to triangle PQS, and shifted frame by frame. All the required similarities are then in memory, so most of the partial similarity computation is avoided and the amount of computation becomes very small.
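The memory scheme described above can be sketched with a sliding buffer: each incoming frame is compared with every template frame exactly once, and scoring a subinterval then needs only lookups and additions. This is an illustrative sketch for a single word; it assumes dot-product partial similarities, omits the constant that is subtracted per word, and uses a hypothetical rounding rule for the linear mapping. The names are not from the patent.

```python
from collections import deque


def partial_similarities(frame, template):
    """Partial similarity of one input frame against every template frame j (dot product)."""
    return [sum(e * a for e, a in zip(e_j, frame)) for e_j in template]


def spot_with_cache(frames, template, min_len, max_len):
    """Yield (end, best_length, best_score) for each end frame with enough history."""
    n = len(template)
    cache = deque(maxlen=max_len)  # one row of partial similarities per recent frame
    for end, frame in enumerate(frames):
        # Computed once per input frame; the oldest row is shifted out automatically.
        cache.append(partial_similarities(frame, template))
        best = None
        for length in range(min_len, min(max_len, len(cache)) + 1):
            offset = len(cache) - length  # cache row holding the subinterval's first frame
            score = 0.0
            for j in range(n):
                step = round(j * (length - 1) / (n - 1)) if n > 1 else length - 1
                score += cache[offset + step][j]  # lookup only, no new dot products
            if best is None or score > best[1]:
                best = (length, score)
        if best is not None:
            yield end, best[0], best[1]


# Usage with one-dimensional toy features and a two-frame template.
template = [[1.0], [2.0]]
frames = [[1.0], [0.0], [3.0], [1.0]]
results = list(spot_with_cache(frames, template, min_len=2, max_len=3))
```

The number of dot products per input frame is fixed by the template length, regardless of how many subinterval lengths are evaluated.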

The judgment unit 5 applies the pitch frequency change pattern extracted by the pitch extraction unit 4 (smoothed over time) to the intervals cut out by the similarity calculation unit 2 (three in this embodiment) and narrows the recognized word down to one. That is, for the cut-out intervals:

(1) If an interval contains the part where the pitch frequency is highest over the whole input, the word corresponding to that interval is taken as the recognition result.

(2) If no interval satisfies (1), and there is an interval within which the pitch frequency moves in a convex shape and the pitch frequency is sufficiently high (the peak of the convex hill is the second highest), the word corresponding to that interval is taken as the recognition result.

(3) If several intervals satisfy (1) or (2), the one with the higher similarity takes priority. If no interval qualifies, the input is rejected.
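The three rules can be sketched as a small decision function. This is a hedged illustration: the convexity test (an interior maximum) and the `high_ratio` threshold standing in for "sufficiently high" are assumptions, since the text gives no numerical criteria, and the names and toy pitch contour are hypothetical.

```python
def is_convex_peak(pitch, start, end):
    """True if the pitch maximum within [start, end] lies strictly inside the interval."""
    segment = pitch[start:end + 1]
    peak = max(range(len(segment)), key=segment.__getitem__)
    return 0 < peak < len(segment) - 1


def decide(candidates, pitch, high_ratio=0.9):
    """candidates: list of (similarity, word, start, end). Return a word, or None to reject.

    high_ratio is an illustrative stand-in for "pitch frequency sufficiently high".
    """
    global_max = max(pitch)
    # Rule (1): the interval contains the highest pitch of the whole input.
    pool = [c for c in candidates if max(pitch[c[2]:c[3] + 1]) == global_max]
    if not pool:
        # Rule (2): convex pitch movement with a sufficiently high peak.
        pool = [c for c in candidates
                if is_convex_peak(pitch, c[2], c[3])
                and max(pitch[c[2]:c[3] + 1]) >= high_ratio * global_max]
    if not pool:
        return None  # Rule (3), second half: nothing qualifies, so reject.
    # Rule (3), first half: among qualifying intervals, prefer the higher similarity.
    return max(pool, key=lambda c: c[0])[1]


# Usage: a smoothed pitch contour (Hz per frame) and three spotted candidates.
pitch = [100, 110, 105, 100, 120, 150, 125, 110, 105, 100, 98, 95]
candidates = [(0.7, "ichi", 0, 3), (0.8, "yon", 4, 6), (0.9, "hai", 8, 11)]
result = decide(candidates, pitch)
```

In this toy case "hai" has the highest similarity but a falling pitch, so it is passed over in favor of the interval that contains the overall pitch peak.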

Next, the above is illustrated with a concrete example using Fig. 3. In Fig. 3, (a) shows the utterance against time, using reply ① of Example 2, "Eeto, yon-mai desu" ("Um, four tickets"). (b) and (c) show the movement over time of the power and the pitch frequency, respectively. In the part marked by thick line 31, the pitch frequency forms a peak, and at this peak it reaches its maximum over the entire input. (d) shows the movement over time of the similarity to each standard pattern; the parts marked by thick lines 32 to 34 are the intervals cut out by the similarity calculation unit 2. (With the method described for Fig. 2, the point of high similarity falls at the end of the interval, so Fig. 3 is drawn that way as well.)

The two conditions (1) and (2) above are applied to each of these three intervals. First, 34 is rejected because its pitch frequency is not the maximum and its movement is not convex. In both 32 and 33 the pitch is convex, but the pitch frequency is at its maximum in interval 33, so the word "yon" (4) corresponding to 33 is taken as the recognition result. In this way, the correct recognition result can be obtained from among superfluous words and ambient noise.

Thus, with the method of this embodiment, the target word can be spotted correctly in noise or in simple sentence speech. Because speech intervals need not be detected beforehand, the processing is simple. Moreover, the similarity calculation requires little computation, so the method is easy to implement in hardware.

Effects of the Invention

In short, the present invention cuts recognition word candidates and their intervals out of the input speech by continuous linear expansion/contraction matching, separately obtains the pitch frequency and its change pattern, and identifies the recognized word when the pitch frequency within a recognition word candidate interval is sufficiently high and its change pattern is convex. Conventional devices had to be used in a very quiet environment, and the user was not allowed to say anything other than the required words. The invention eases these restrictions and so has the advantage of contributing to wider application and adoption of word speech recognition devices.

[Brief Description of the Drawings]

Fig. 1 is a functional block diagram of a recognition response system embodying the speech recognition method of one embodiment of the present invention. Fig. 2 is a conceptual diagram explaining how the similarity between the input speech and the word standard patterns is calculated in this embodiment. Figs. 3(a) to 3(d) are conceptual diagrams explaining the recognition method of this embodiment with a concrete example. Fig. 4 is a block diagram of a recognition response system using a conventional speech recognition method.

1: acoustic analysis unit; 2: similarity calculation unit; 3: word standard pattern unit; 4: pitch extraction unit; 5: judgment unit.

Agent: Patent attorney Toshio Nakao and one other

Claims

(1) A speech recognition method characterized by: analyzing an input signal containing noise and unnecessary speech and converting it into an input parameter time series; calculating the similarity between a subinterval of the input parameter time series and standard patterns of recognition target words prepared in advance, continuously, while shifting the subinterval one unit interval at a time from the beginning to the end of the input parameter time series; cutting out one or more recognition word candidate intervals in which the similarity is high, and taking the word names corresponding to those intervals as recognition word candidates; separately and continuously obtaining the pitch frequency and its temporal change pattern from the input signal; and determining the recognized word from among the recognition word candidates using the magnitude of the pitch frequency and its change pattern in or near the recognition word candidate intervals.

(2) The speech recognition method according to claim 1, characterized in that the similarity between the input parameter time series and a standard pattern is calculated by linearly expanding or contracting the time length of the input parameters to the time length of the standard pattern and using a statistical distance measure based on the posterior probability.

(3) The speech recognition method according to claim 1, characterized in that the recognized word is determined from the recognition word candidates by taking the word corresponding to a recognition word candidate interval as the recognized word when, in that interval, the pitch frequency stably takes the maximum value over the entire input signal interval or a value close to it, or when the change pattern of the pitch frequency in that interval is convex.