JPS6163899A

JPS6163899A - Monosyllabic voice recognition equipment

Info

Publication number: JPS6163899A
Application number: JP59185899A
Authority: JP
Inventors: 達伊福部; 陽一吉田; 道夫倉田
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1984-09-05
Filing date: 1984-09-05
Publication date: 1986-04-02
Also published as: JPH0339319B2

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Technical Field of the Invention) This invention relates to a monosyllabic speech recognition device that frequency-analyzes input speech (a monosyllable) with a group of band-pass filters and recognizes the input speech by matching an output pattern, generated by dividing the speech into a consonant part and a vowel part, against standard patterns registered in advance.

(Technical Background of the Invention and Its Problems) Conventionally, in the field of monosyllabic speech recognition, devices have been proposed that frequency-analyze the input speech (a monosyllable) to calculate its spectral envelope, analyze this envelope to find the frequency bands in which energy is concentrated, that is, the formant frequencies, extract these as feature parameters of the input speech, and perform matching by comparing the temporal change of the formant frequencies (the speech pattern) with standard patterns. In such a conventional monosyllabic speech recognition device, as shown in FIG. 1, a speech signal As input from a microphone 1 is amplified to a predetermined amplitude level by an amplifier 2, and the amplifier output is fed to band-pass filters 31 to 3n, each having a different bandwidth. The outputs of the band-pass filters 31 to 3n are fed through rectifiers 41 to 4n to envelope detection circuits 51 to 5n, respectively. The envelope outputs E1 to En from the envelope detection circuits 51 to 5n are input to a multiplexer 6 and also to an averaging circuit 7, whose output is sequentially switched and output together with the envelope outputs E1 to En. The output of the multiplexer 6 is digitized by an AD converter 8 and then stored in a memory 9.

Here, the output waveform a (EM) of the averaging circuit 7 is divided into three parts, as shown in FIG. 2: a consonant part C, a transition part M from the consonant to the vowel, and a vowel part V. Conventionally, as shown in FIG. 3, the output waveform a of the averaging circuit 7 was observed, and speech was judged to have been input when the waveform a remained above a preset energy level E0 continuously for a predetermined time T1 or longer.
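The input-detection rule described above can be sketched as follows. This is a minimal illustration, not the circuit in the patent: the function name `detect_onset`, the sampled-list representation of waveform a, and the choice to express T1 as a sample count are all assumptions.

```python
def detect_onset(envelope, e0, t1):
    """Return the index at which the envelope first stays above e0
    for at least t1 consecutive samples, or None if it never does.

    envelope: sampled averaged output (waveform a), assumed to be a list
    e0:       preset energy level E0
    t1:       predetermined duration T1, expressed here in samples
    """
    run = 0  # length of the current run of samples above e0
    for i, value in enumerate(envelope):
        if value > e0:
            run += 1
            if run >= t1:
                return i - t1 + 1  # start of the qualifying run
        else:
            run = 0
    return None
```

A short burst that exceeds E0 for fewer than T1 samples is ignored, which is the point of the duration condition.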

Standard patterns for registration are stored in advance at predetermined addresses in the memory 9. The input speech intensity newly written to the memory 9 and a reference pattern b for extracting the consonant part, stored in advance in the memory 9, are shifted relative to each other along the time axis, as shown by arrows P and Q in FIG. 4, and the position (point in time) at which the reference pattern b most closely matches the input speech intensity a is determined.
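The time-axis search described above can be sketched as a sliding comparison. The source does not state how closeness is measured, so the sum of absolute differences used here is an assumption, as are the function name `best_alignment` and the list representation of a and b.

```python
def best_alignment(intensity, reference):
    """Slide `reference` (pattern b) along `intensity` (waveform a) and
    return (offset, cost) for the closest match.

    The cost is the sum of absolute differences; this metric is an
    assumption, since the source only says "most closely matches".
    """
    if len(reference) > len(intensity):
        raise ValueError("reference must not be longer than intensity")
    best_offset, best_cost = 0, float("inf")
    for offset in range(len(intensity) - len(reference) + 1):
        cost = sum(abs(intensity[offset + k] - r)
                   for k, r in enumerate(reference))
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset, best_cost
```

The returned offset corresponds to the point in time at which the consonant part would be cut out.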

When the reference pattern b and the input speech intensity a match, the portion of duration T2 shown in FIG. 3 is extracted as the consonant pattern. In addition, the maximum value EP of the input speech intensity a is multiplied by a constant (for example, 0.9) to give a value Ev, and the position (point in time) at which Ev crosses the input speech intensity a is adopted as the vowel pattern. The extracted consonant and vowel patterns are then compared with the consonant and vowel patterns of standard speech stored in advance in the memory 9, and the most similar pattern is selected and output.
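The vowel-position rule can be sketched as a threshold crossing. The source does not say which crossing is used when the intensity crosses the scaled peak more than once; taking the first sample at or above the threshold is an assumption, as is the function name `vowel_point`.

```python
def vowel_point(intensity, ratio=0.9):
    """Return (index, threshold), where threshold = ratio * max(intensity)
    and index is the first sample at or above that threshold.

    ratio corresponds to the constant (e.g. 0.9) applied to the peak EP.
    Using the first crossing is an assumption of this sketch.
    """
    if not intensity:
        raise ValueError("intensity must not be empty")
    threshold = ratio * max(intensity)
    for i, value in enumerate(intensity):
        if value >= threshold:
            return i, threshold
```

Because the peak itself always satisfies the condition, the loop is guaranteed to return for any non-empty input with a ratio of at most 1.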

However, such conventional monosyllable recognition processing has the drawback that the consonant part cannot be detected accurately when similar monosyllables are input, such as "shi" and "chi" or "su" and "tsu". Furthermore, as shown in FIG. 5, when a local maximum (peak) of the input speech intensity a appears in the averaged output of the band-pass filter group and overlaps the consonant data portion, the input speech intensity a is matched to the reference pattern as indicated by the hatched portion d in FIG. 6(A) or 6(B), and the consonant part cannot be extracted accurately.

(Object of the Invention) An object of this invention is to provide a monosyllabic speech recognition device free of the drawbacks and problems described above.

(Summary of the Invention) This invention relates to a monosyllabic speech recognition device that frequency-analyzes input speech with a group of band-pass filters and recognizes the input speech as a monosyllable by computing the distance between an output pattern, generated by dividing the speech into a consonant part and a vowel part, and standard patterns registered in advance. In this device, the formants are extracted for each input speech, the band-pass filters corresponding to the formants are selected, and a speech intensity is generated from their outputs to detect the consonant part and the vowel part of the input speech. The output pattern of the band-pass filter group is then compared with the standard patterns to identify the consonant part and the vowel part of the input speech.
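The distance calculation against registered standard patterns can be sketched as a nearest-pattern search. The source does not specify the distance measure or the pattern layout; the Euclidean distance, the flat-list pattern representation, and the names `nearest_pattern` and `standards` are assumptions for illustration.

```python
import math


def nearest_pattern(pattern, standards):
    """Return (label, distance) of the registered standard pattern that
    is closest to `pattern`.

    pattern:   flat list of filter-bank output values (assumed layout)
    standards: dict mapping a label to a list of the same length
    Euclidean distance is an assumption; the source only says that a
    distance between the patterns is computed.
    """
    best_label, best_dist = None, math.inf
    for label, standard in standards.items():
        if len(standard) != len(pattern):
            raise ValueError("pattern lengths must match")
        dist = math.sqrt(sum((p - s) ** 2 for p, s in zip(pattern, standard)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```

In the device described here, this comparison would be run separately on the consonant part and the vowel part.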

(Embodiment of the Invention) As shown in FIG. 7, which corresponds to FIG. 1, this invention relates to monosyllable recognition processing in which input speech (a monosyllable) is frequency-analyzed by the band-pass filter group 31 to 3n and recognized by matching an output pattern, generated by dividing the speech into a consonant part and a vowel part, against standard patterns registered in advance. The portion X shown in FIG. 5 arises in two cases. In the first, shown in FIG. 8, a buzz sound ("Buzz-Bar") BS with a strong spectrum in a low frequency band (for example, 200 to 500 Hz) appears, as in "ba" and "da". In the second, shown in FIG. 9, multiple spectra SP appear in a certain frequency band (for example, 200 to 2000 Hz), as in "ma" and "na"; this is hereinafter called a nasal murmur. In both cases the result is the portion X of the input speech intensity shown in FIG. 5, which has interfered with extraction of the consonant part. In FIGS. 8 and 9, TS denotes the time-spectrum pattern. Furthermore, detection of the boundary RP between the consonant part and the vowel part (hereinafter called the reference point), which governs the consonant recognition rate, is difficult with the conventional approach of matching against an input speech intensity derived from the outputs of all the band-pass filters, and this lowers the recognition rate through individual differences and changes over time. An input speech intensity generated from formant information alone, which is free of these problems, is therefore required. This invention was made with this in mind.

The invention will now be explained with reference to FIG. 7.

The input speech is frequency-analyzed by the band-pass filter group 31 to 3n, and the frequency band outputs E5 to En, which exclude a certain frequency band (for example, 200 to 500 Hz), are fed to a comparator 10 from the point at which speech is judged to have been input (the stable vowel portion). A set number of channels with the strongest levels (for example, 5) is selected, and the corresponding signals are added to a counter 11. A counter 13 counts clock pulses CP of a predetermined frequency to keep time and determines whether a given amount of data, for example 150 samples, has been captured. When the 150 samples have been captured, a channel selection circuit 12 selects and outputs the data of the 5 channels with the largest values in the counter 11. At the same time, the speech data DT stored in the memory 9 is read out to an addition/averaging circuit 14, which adds and averages, at each point in time, the data of the channels selected by the channel selection circuit 12. The averaged data is then passed through a low-pass filter (LPF) 15 to remove high-frequency components, and the resulting input speech intensity ENV is sent to the multiplexer 6. The consonant part and the vowel part of the input speech (a monosyllable) are detected from the input speech intensity ENV, and the output pattern of the band-pass filter group 31 to 3n is compared with the standard speech patterns stored in advance in the memory 9 to identify the consonant part and the vowel part of the input speech.
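The channel selection and intensity generation described above can be sketched in software. Several details are assumptions of this sketch rather than statements from the source: that the strongest channels are voted on once per sample, that the excluded low band corresponds to the first four channels (since outputs E5 to En are used), and that the low-pass filter is a first-order recursive smoother with coefficient `alpha`. The function names are hypothetical.

```python
def select_channels(frames, top_k=5, skip=4):
    """Vote for the top_k strongest channels in each frame (the role of
    comparator 10 and counter 11), then return the top_k channels with
    the most votes (the role of channel selection circuit 12).

    frames: list of per-sample lists of channel levels
    skip:   number of low-band channels excluded (assumed 4, i.e. E5..En)
    """
    n = len(frames[0])
    votes = [0] * n
    for frame in frames:
        strongest = sorted(range(skip, n), key=lambda ch: frame[ch],
                           reverse=True)[:top_k]
        for ch in strongest:
            votes[ch] += 1
    ranked = sorted(range(skip, n), key=lambda ch: votes[ch], reverse=True)
    return sorted(ranked[:top_k])


def intensity_envelope(frames, channels, alpha=0.5):
    """Average the selected channels in each frame (addition/averaging
    circuit 14) and smooth the result with a first-order low-pass filter
    (standing in for LPF 15; the filter type is an assumption)."""
    env, state = [], None
    for frame in frames:
        mean = sum(frame[ch] for ch in channels) / len(channels)
        state = mean if state is None else alpha * mean + (1 - alpha) * state
        env.append(state)
    return env
```

With 150 captured frames, `select_channels(frames)` followed by `intensity_envelope(stored_frames, channels)` would give a software counterpart of ENV, built only from the channels where the formant energy is concentrated.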

(Effects of the Invention) With the above method, the reference point RP can be extracted accurately, as shown in FIGS. 10(A) and 10(B), which improves the consonant recognition rate. The method also absorbs the influence of individual differences and yields highly stable standard patterns. FIG. 10(A) shows the conventional input speech intensity and FIG. 10(B) shows the input speech intensity according to this invention, from which it is clear that pattern matching can be performed smoothly. In addition, because the hardware configuration described above can easily be implemented in computer software, high-performance speech recognition can be achieved with only an inexpensive system configuration. Furthermore, the speech recognition device of this invention can easily be applied as an input means for typewriters, computerized phototypesetting, and the like, and to the operation and control of machines and equipment.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an example of a conventional monosyllabic speech recognition device. FIG. 2 is a time chart explaining the temporal transition of speech energy. FIGS. 3 and 4 are time charts explaining the speech matching process. FIGS. 5, 6(A), and 6(B) are time charts explaining the conventional monosyllabic speech recognition process. FIG. 7 is a block diagram showing an embodiment of this invention. FIGS. 8 and 9 are diagrams showing the points of the conventional device to be improved. FIGS. 10(A) and 10(B) are diagrams showing the improvement achieved by this invention.

1: microphone; 2: amplifier; 31 to 3n: band-pass filters; 41 to 4n: rectifiers; 51 to 5n: envelope detection circuits; 6: multiplexer; 7: envelope generation circuit; 8: AD converter; 9: memory; 10: comparator; 11, 13: counters; 12: channel selection circuit; 14: addition/averaging circuit; 15: low-pass filter.

Applicant's agent: 安形雄三

Written amendment (formality), February 20, 1985 (Showa 60)

Claims

[Claims]

A monosyllabic speech recognition device that frequency-analyzes input speech with a group of band-pass filters and recognizes the input speech as a monosyllable by computing the distance between an output pattern, generated by dividing the speech into a consonant part and a vowel part, and a standard pattern registered in advance, characterized in that a formant is extracted for each input speech, the band-pass filters corresponding to the formant are selected, a speech intensity is generated from their outputs to detect the consonant part and the vowel part of the input speech, and the output pattern of the group of band-pass filters is compared with the standard pattern to identify the consonant part and the vowel part of the input speech.