JPS6193499A

JPS6193499A - Voice pattern collation system

Info

Publication number: JPS6193499A
Application number: JP59214594A
Authority: JP
Inventors: 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1984-10-12
Filing date: 1984-10-12
Publication date: 1986-05-12

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical field]

The present invention relates to a voice pattern matching method for a speech recognition device.

[Prior art]

In speech recognition based on pattern matching, the utterances to be recognized are registered in advance, the pattern of an input utterance is compared against the registered patterns to find the most similar one, and that pattern is taken as the recognition result. Because matching an input against every registered utterance takes time, a known approach uses some other feature to decide which registered patterns should be matched, and only then performs the matching.

One feature used for this decision is whether a silent part exists within the speech section (Japanese Patent Laid-Open No. 513-142598). An utterance that contains a silent part is matched only against registered patterns that have a silent part, which reduces the number of candidates and speeds up recognition. This exploits the fact that a silent interval occurs before the /p/, /t/, and /k/ sounds and before geminate consonants (sokuon). In word speech recognition, however, words that contain a silent interval tend to be numerous, so an input classified as having a silent interval takes longer to match than one that is not.

In addition, when the presence or absence of a silent interval is used, a silent interval does not occur consistently before the /d/ and /g/ phonemes, so the set of candidates cannot be reduced correctly.

[Object]

The present invention was made to solve the drawbacks of the prior art described above. In particular, its object is to correctly discriminate words that contain silent intervals in the speech and to recognize them at high speed.

[Configuration]

To achieve the above object, the present invention concerns a speech recognition device that comprises means for converting a speech signal into an electrical signal, means for converting the electrical signal into feature parameters, and means for registering the speech patterns produced in this way, and that outputs a recognition result by comparing an input speech pattern with the patterns already registered. The invention is characterized in that the number of characteristic parts of a pattern is appended to part of the pattern; for each utterance to be registered, the appended information from a plurality of patterns is summed and registered together with the pattern; when an unknown input pattern is given, its number of the same characteristic parts is compared with the appended information of each registered pattern to determine how well they agree; and the input is matched only against standard patterns that agree to at least a certain degree. The invention is described below on the basis of embodiments.

FIG. 1 is a block diagram for explaining one embodiment of the present invention. In the figure, 1 is a microphone, 2 is a feature extraction unit, 3 is a speech section detection unit, 4 is a power detection unit, 5 is a register, 6 is a fixed-value-or-less determination unit, 7 is a counter, 8 is a silence-count pattern, 9 is a first matching unit, 10 is a word dictionary, 11 is a fixed-value-or-less determination unit, 12 is a second matching unit, 13 is a maximum similarity calculation unit, and 14 is a result output unit. The following describes the case in which silent intervals are adopted as the characteristic parts of a pattern. The present invention is not limited to silent intervals, however; any feature may be used, such as parts containing many high-frequency components, voiced parts, or unvoiced parts.

First, dictionary creation is described. For speaker-independent speech recognition, it is known to register a single dictionary pattern obtained by averaging the utterances of several speakers (Speech Study Group Materials 581-57). The number of silent intervals is appended to each such pattern as shown in FIG. 2. FIG. 2(a) shows the number of silent intervals in a certain word uttered by speaker A. The elements of the 8-bit memory M represent, in order, silent-interval counts of 0, 1, 2, ..., 7, and (a) indicates that one silent interval was present. An 8-bit pattern of this kind is attached to part of the speech pattern. For speakers B and C, the number of silent intervals was 0 even though the word was the same. Here the average of the patterns of eight speakers is used as the dictionary pattern, and when the 8-bit patterns are summed along with the speech patterns, a dictionary pattern like (i) results. Each element of (i) is represented by 3 bits; it shows that six of the eight speakers had 0 silent intervals, two had 1, and none had more. A continuous rendering of the profile of (i) is shown in (j). A pattern of this kind is registered together with the dictionary pattern.
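The summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the list of per-speaker counts are hypothetical, chosen so that six of eight speakers have 0 silent intervals and two have 1, as in the example of FIG. 2(i).

```python
from typing import List

NUM_BINS = 8  # the 8-element memory M: positions stand for counts 0..7


def one_hot_count(silent_intervals: int) -> List[int]:
    """Encode a silent-interval count as an 8-element pattern with a single 1."""
    if not 0 <= silent_intervals < NUM_BINS:
        raise ValueError("count must be in 0..7")
    return [1 if i == silent_intervals else 0 for i in range(NUM_BINS)]


def dictionary_count_pattern(counts_per_speaker: List[int]) -> List[int]:
    """Sum the per-speaker one-hot patterns element by element."""
    total = [0] * NUM_BINS
    for count in counts_per_speaker:
        for i, bit in enumerate(one_hot_count(count)):
            total[i] += bit
    return total


# Hypothetical data: speaker A has 1 interval, B and C have 0, and so on.
speaker_counts = [1, 0, 0, 0, 0, 0, 0, 1]
pattern_i = dictionary_count_pattern(speaker_counts)
print(pattern_i)  # [6, 2, 0, 0, 0, 0, 0, 0]
```

The result is simply a histogram of how many registered speakers produced each silent-interval count for the word.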

Returning to FIG. 1, features of the speech are extracted by a bank of band-pass filters or similar means, and the speech section is stored in the register. Meanwhile, the counter counts how many portions fall at or below a fixed value, that is, how many silent intervals there are, using power detection of the speech or the like, and a silence-count pattern like FIG. 2(k) is created. The fixed value mentioned above need not be 0; it may be set to a value that can be regarded as silence, taking ambient noise and the like into account. Next, part (i) of FIG. 2 from the dictionary and part (k) of the input are matched in the first matching unit. This matching only requires taking the sum of products of corresponding elements. If this value is at or below a fixed value, for example 0, then not one of the eight speakers' patterns had the same number of silent intervals, so the pattern is judged to be different without matching the whole pattern, and the next dictionary pattern is fetched. Only for dictionary patterns whose first-matching result has a sufficient value does the second matching unit match the entire pattern. All patterns registered in the dictionary are processed in this way, and the one with the greatest similarity is taken as the recognition result.
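A minimal sketch of this two-stage flow follows, assuming the input count pattern (k) is a one-hot vector and the dictionary part (i) is a histogram over counts 0 to 7. The names `first_match`, `recognize`, `toy_similarity`, and the toy dictionary are hypothetical; the full-pattern similarity used here is a placeholder, not the similarity measure of the patent.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Entry = Tuple[Sequence[float], Sequence[int]]  # (speech pattern, count histogram)


def first_match(dict_counts: Sequence[int], input_counts: Sequence[int]) -> int:
    """Sum of products of corresponding elements, as in the first matching unit."""
    return sum(d * k for d, k in zip(dict_counts, input_counts))


def recognize(
    input_pattern: Sequence[float],
    input_counts: Sequence[int],
    dictionary: Dict[str, Entry],
    full_similarity: Callable[[Sequence[float], Sequence[float]], float],
    fixed_value: int = 0,
) -> Tuple[Optional[str], List[str]]:
    """Return the best word and the list of words that reached full matching."""
    best_word, best_sim = None, float("-inf")
    fully_matched: List[str] = []
    for word, (pattern, counts) in dictionary.items():
        if first_match(counts, input_counts) <= fixed_value:
            continue  # no registered speaker had this count: skip full matching
        fully_matched.append(word)
        sim = full_similarity(input_pattern, pattern)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, fully_matched


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder similarity (assumption): negative squared distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


toy_dictionary = {
    "word_a": ([1.0, 2.0], [6, 2, 0, 0, 0, 0, 0, 0]),
    "word_b": ([1.0, 2.1], [0, 0, 8, 0, 0, 0, 0, 0]),
}
input_k = [0, 1, 0, 0, 0, 0, 0, 0]  # input had one silent interval
best, compared = recognize([1.0, 2.1], input_k, toy_dictionary, toy_similarity)
print(best, compared)
```

In this toy data `word_b` is never fully matched because its histogram has no entry for one silent interval, so only `word_a` reaches the second stage.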

Alternatively, as shown in FIG. 3, a multiplication circuit 15 may be provided so that, rather than using the number of silent intervals for preliminary selection, a new similarity is defined as the product of the similarities obtained in the first and second matching. In this case, the product need not be taken if either matching result is 0. A similarity threshold may also be set in advance so that, when the similarity exceeds it, the remaining computation is terminated and that pattern is taken as the recognition result.
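The product-based variant can be sketched as follows. The sketch assumes a non-negative full-pattern similarity so that the product is meaningful; `combined_recognize`, `toy_similarity`, and the toy entries are hypothetical names and data introduced only for illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Entry = Tuple[Sequence[float], Sequence[int]]  # (speech pattern, count histogram)


def combined_recognize(
    input_pattern: Sequence[float],
    input_counts: Sequence[int],
    dictionary: Dict[str, Entry],
    full_similarity: Callable[[Sequence[float], Sequence[float]], float],
    accept_threshold: Optional[float] = None,
) -> Tuple[Optional[str], float]:
    """Score each word by (first matching result) x (second matching result)."""
    best_word, best_score = None, 0.0
    for word, (pattern, counts) in dictionary.items():
        s1 = sum(d * k for d, k in zip(counts, input_counts))
        if s1 == 0:
            continue  # the product would be 0, so skip the second matching
        s2 = full_similarity(input_pattern, pattern)
        if s2 == 0:
            continue  # likewise, no need to take the product
        score = s1 * s2
        if accept_threshold is not None and score > accept_threshold:
            return word, score  # stop early and accept this word
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder similarity in (0, 1] (assumption): 1 / (1 + L1 distance)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


toy_dictionary = {
    "word_b": ([1.0, 2.1], [7, 1, 0, 0, 0, 0, 0, 0]),
    "word_a": ([1.0, 2.0], [6, 2, 0, 0, 0, 0, 0, 0]),
    "word_c": ([1.0, 2.1], [0, 0, 8, 0, 0, 0, 0, 0]),
}
input_k = [0, 1, 0, 0, 0, 0, 0, 0]
full_search = combined_recognize([1.0, 2.1], input_k, toy_dictionary, toy_similarity)
early_stop = combined_recognize([1.0, 2.1], input_k, toy_dictionary, toy_similarity, 0.9)
print(full_search[0], early_stop[0])
```

With these toy values the exhaustive search prefers `word_a`, while the early-termination variant accepts `word_b` as soon as its score exceeds the threshold, which illustrates the speed and accuracy trade-off that the threshold introduces.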

FIG. 4 is a block diagram for explaining another embodiment of the present invention. In the figure, 16 is a section end determination unit, 17 is a preliminary selection unit, and 18 is a matching unit; parts that function in the same way as in FIGS. 1 and 3 carry the same reference numerals as in those figures. In FIG. 4, the word dictionary records, together with the pattern of each registered word, the range over which the number of silent intervals in that word varies. For a speaker-dependent device, this range is the variation in the number of silent intervals when the user utters the same word several times; for a speaker-independent device, it is the variation when many people utter the same word. The range is registered in advance.

For example, the word 「上」 ("up") contains no silence no matter who utters it, so 0,0 is stored. In 「下」 /sita/ ("down"), there is a silent part before /t/; it is always exactly one interval regardless of the speaker, so 1,1 is stored. In 「右」 /migi/ ("right"), on the other hand, some speakers produce a silent interval before /g/ and some do not, so there may be 0 or 1, and 0,1 is stored. For a word whose count varies from one to three depending on the speaker, 1,3 is stored.
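One way to derive such a range is to take the minimum and maximum of the counts observed across training utterances. The sketch below assumes exactly that; the function name `count_range` and the observed counts are hypothetical and merely mirror the four examples in the text.

```python
from typing import Dict, List, Tuple


def count_range(observed_counts: List[int]) -> Tuple[int, int]:
    """Return (lowest, highest) silent-interval count seen for one word."""
    if not observed_counts:
        raise ValueError("at least one training utterance is required")
    return min(observed_counts), max(observed_counts)


# Hypothetical per-utterance counts for four words.
observations: Dict[str, List[int]] = {
    "ue": [0, 0, 0, 0],       # never any silence
    "sita": [1, 1, 1, 1],     # always one interval before /t/
    "migi": [0, 1, 1, 0],     # silence before /g/ only for some speakers
    "varying_word": [1, 3, 2, 1],
}
ranges = {word: count_range(counts) for word, counts in observations.items()}
print(ranges)
```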

When an unknown utterance is input from the microphone of FIG. 1, features are first extracted by means such as a bank of band-pass filters, and only the speech section is passed to the next step. Next, in the case of a band-pass filter bank, the counter is incremented by one whenever there is a portion in which the summed output of all channels, that is, the power, is at or below a fixed value. The aim is to count how many such portions exist in the whole word, so the counter does not advance while a portion at or below the fixed value continues. The fixed value may be set as a silence detection level based on ambient noise and the like. The number of silent intervals in the whole word is obtained in this way, and the word dictionary is searched for entries whose silent-interval range agrees, which are then matched. For example, if the number of silent intervals in the input word is 2, the input is matched against words whose range of variation includes 2, such as 0,2; 1,2; 2,2; 2,3; and 1,3, and is not matched against word patterns whose range does not include it. The candidate with the greatest similarity obtained through this matching is output as the recognition result.
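The counting and preliminary selection steps can be sketched as follows. The power values, the detection level, and the word labels are hypothetical; the sketch assumes the power sequence has already been limited to the detected speech section.

```python
from typing import Dict, List, Sequence, Tuple


def count_silent_intervals(power: Sequence[float], level: float) -> int:
    """Count runs of samples whose power is at or below the detection level."""
    count = 0
    in_silence = False
    for p in power:
        if p <= level:
            if not in_silence:
                count += 1  # a new silent interval starts here
                in_silence = True
            # consecutive low-power samples do not advance the counter
        else:
            in_silence = False
    return count


def preselect(count: int, ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Keep only words whose registered range contains the input count."""
    return [word for word, (lo, hi) in ranges.items() if lo <= count <= hi]


# Hypothetical summed filter-bank output within the speech section.
power = [5.0, 6.0, 0.2, 0.1, 7.0, 8.0, 0.3, 6.0]
n = count_silent_intervals(power, level=0.5)

registered = {
    "w1": (0, 2), "w2": (1, 2), "w3": (2, 2), "w4": (2, 3),
    "w5": (1, 3), "w6": (0, 1), "w7": (3, 4),
}
candidates = preselect(n, registered)
print(n, candidates)
```

Only the words returned by `preselect` would then go on to full pattern matching.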

The matching method used here is not limited; well-known DP matching (matching by dynamic programming) or the like may be used.
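For reference, a common textbook form of DP matching is dynamic time warping. The sketch below is one such form under stated assumptions, namely a Euclidean local distance and the three basic step directions; the patent does not specify these details, and the function name `dp_matching_distance` is hypothetical.

```python
import math
from typing import Sequence


def dp_matching_distance(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> float:
    """Accumulated distance along the best time alignment of two frame sequences."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b
                cost[i][j - 1],      # frame of b repeated against a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


reference = [(0.0,), (1.0,), (2.0,)]
stretched = [(0.0,), (1.0,), (1.0,), (2.0,)]  # same shape, slower in the middle
different = [(0.0,), (1.0,), (3.0,)]
print(dp_matching_distance(reference, stretched))
print(dp_matching_distance(reference, different))
```

A smaller accumulated distance corresponds to a higher similarity, so a device of this kind could, for instance, pick the preselected word with the smallest distance.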

[Effect]

As is clear from the above description, according to the present invention the number of characteristic parts is processed statistically, so that wrong candidates are no longer selected.

[Brief explanation of the drawing]

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention, FIG. 2 is a diagram for explaining the dictionary registration operation, and FIGS. 3 and 4 are block diagrams for explaining other embodiments of the present invention.

1 ... microphone, 2 ... feature extraction unit, 3 ... speech section detection unit, 4 ... power detection unit, 5 ... register, 6 ... fixed-value-or-less determination unit, 7 ... counter, 8 ... silence-count pattern, 9 ... first matching unit, 10 ... word dictionary, 11 ... fixed-value-or-less determination unit, 12 ... second matching unit, 13 ... maximum similarity calculation unit, 14 ... result output unit, 15 ... multiplication circuit, 16 ... section end determination unit, 17 ... preliminary selection unit, 18 ... matching unit.

Claims

[Claims]

(1) A voice pattern matching method for a speech recognition device that comprises means for converting a speech signal into an electrical signal, means for converting the electrical signal into feature parameters, and means for registering the speech patterns produced in this way, and that outputs a recognition result by comparing an input speech pattern with patterns already registered, the method being characterized in that the number of characteristic parts of a pattern is appended to part of the pattern; for each utterance to be registered, the appended information from a plurality of patterns is summed and registered together with the pattern; when an unknown input pattern is given, the degree to which its number of the same characteristic parts agrees is determined by comparison with the appended information of the registered patterns; and matching is performed only against standard patterns that agree to at least a certain degree.

(2) The voice pattern matching method according to claim (1), characterized in that, when an unknown input pattern is given, the degree to which its number of the same characteristic parts agrees is determined by comparison with the appended information of the registered patterns, and that degree of agreement is reflected in the similarity.

(3) The voice pattern matching method according to claim (1), characterized in that the range of variation of the number of characteristic parts is appended to part of the pattern, and when an unknown input pattern is given, matching is performed only against standard patterns whose range of variation contains the input's number of the same characteristic parts.