JPS6315298A

JPS6315298A - Pattern generation system

Info

Publication number: JPS6315298A
Application number: JP15926886A
Authority: JP
Inventors: 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-07-07
Filing date: 1986-07-07
Publication date: 1988-01-22

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to a method for creating feature patterns for speech recognition.

Prior Art

Many methods have been developed for recognizing spoken words. Most of them rely on so-called pattern matching: the speech to be used is registered in advance, and an unknown input utterance is recognized by determining which of the registered utterances it most closely resembles.

FIG. 3 is a block diagram showing an example of the pattern matching method described above. In the figure, 1 is a microphone, 2 is a filter bank, 3 is a dictionary, 4 is a similarity calculation unit, and 5 is a result display unit. As is well known, the speech to be used is registered in the dictionary 3 in advance with the switch S set to the a side. During recognition, the switch S is set to the b side, the similarity calculation unit 4 computes the similarity between the input speech and the speech registered in the dictionary 3, and the result is shown on the result display unit 5. Pattern matching is widely used because it requires fewer operations and gives better recognition accuracy than other methods, such as those based on discriminant functions. One published pattern matching method that needs little data and only simple operations uses a binary TSP (Time-Spectrum Pattern).

FIG. 4 is a diagram for explaining an example of a word speech recognition method that performs pattern matching with a binary TSP (if necessary, see Proceedings of the Acoustical Society of Japan, October 1983, pp. 195-196). In the figure, 1 is a microphone, 2 is a filter bank, 3 is a dictionary, 4 is a similarity calculation unit, 5 is a result display unit, 6 is a binarization unit, 7 is a local peak detection unit, and 8 is a superposition unit. In this pattern matching method, speech input from the microphone is frequency-analyzed using a bank of band-pass filters or the like.

The frequency content and its change over time are expressed as a pattern (TSP). This pattern is then binarized, with "1" assigned around the peaks along the frequency axis and "0" elsewhere, to obtain a binary TSP (BTSP). The BTSPs obtained from several utterances are superimposed and registered as a standard pattern. A pattern in which only the frequency peaks are set to 1 is called a peak pattern, and a pattern in which each peak and its surroundings are set to 1 is called a broad pattern. When unknown speech is input, a BTSP is created from it by the same process used for the standard patterns and is compared with the previously registered standard patterns to obtain its similarity to each one. The similarity is determined from the degree to which the "1" elements overlap when the BTSP of the unknown speech is superimposed on the standard pattern (see FIG. 7). Details are given in the Acoustical Society of Japan proceedings cited above and are omitted here.
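The binarization and overlap scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited proceedings: the function names, the local-maximum rule for peaks, the one-channel widening for the broad pattern, and the use of a raw overlap count as the similarity are all hypothetical choices made for the example.

```python
def peak_pattern(tsp):
    """Set 1 at each local maximum along the frequency axis of every frame."""
    pattern = []
    for frame in tsp:
        row = []
        for ch, value in enumerate(frame):
            left = frame[ch - 1] if ch > 0 else float("-inf")
            right = frame[ch + 1] if ch + 1 < len(frame) else float("-inf")
            row.append(1 if value > 0 and value > left and value >= right else 0)
        pattern.append(row)
    return pattern


def broad_pattern(peaks, width=1):
    """Set 1 at each peak and at `width` channels on either side (assumed width)."""
    pattern = []
    for row in peaks:
        n = len(row)
        pattern.append([
            1 if any(row[max(0, ch - width):min(n, ch + width + 1)]) else 0
            for ch in range(n)
        ])
    return pattern


def overlap_similarity(reference, unknown):
    """Count the elements that are 1 in both patterns (assumed similarity measure)."""
    return sum(r & u
               for ref_row, unk_row in zip(reference, unknown)
               for r, u in zip(ref_row, unk_row))
```

Here a broad pattern would play the role of the registered standard pattern and a peak pattern the role of the unknown input, so a peak that shifts by one channel between speakers still scores as an overlap.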

This method has the advantage that, if the standard patterns are built well, a speaker-independent speech recognition device that can recognize anyone's voice is easy to realize. In addition, because the spectrum is expressed in binary form, the method is largely insensitive to the loudness of the speech. The drawback of this same property is that low-power consonants are hard to distinguish from high-power vowels. One possible countermeasure is to binarize the envelope of the speech power and use it, together with the spectrum pattern, as a feature pattern of the speech.

FIG. 5 shows the configuration of the speech input section used when the envelope of the speech power is binarized. In the figure, 31 is a microphone, 32 is a preamplifier circuit, 33 is a high-frequency emphasis circuit, 34 is an AGC circuit, 35 is a filter bank, and 36 is a multiplexer and A/D conversion circuit. The speech signal input from the microphone 31 is first amplified by the preamplifier circuit 32, and its high-frequency components are then emphasized by the high-frequency emphasis circuit 33 to make phonemic information such as consonants more prominent. Because the speech level varies with the speaker and, from one utterance to the next, with psychological and physiological speaking conditions, the AGC circuit 34 may correct the amplitude to normalize this variation; the signal, corrected or not, is then passed to the filter bank 35. The filter bank 35 uses, for example, 15 channels of band-pass filters (BPF) with center frequencies from 250 to 6300 Hz, spaced at 1/3-octave intervals, with a sharpness of Q = 6. To extract only the amplitude envelope component, each band-pass filter output is passed to a full-wave rectifier circuit and then to a smoothing circuit, a fourth-order Bessel-type linear-phase filter with a monotonic step response. The speech feature obtained in this way indicates the approximate shape of the speech spectrum and is called the spectral envelope.
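As a quick check of the filter bank figures, the sketch below generates 15 center frequencies spaced at exact 1/3-octave steps upward from 250 Hz and derives a bandwidth from Q = 6. The function names are hypothetical, and the use of exact powers of two (rather than the rounded nominal band labels such as 6300 Hz) and of the usual definition Q = center frequency / bandwidth are assumptions; the text does not state how the bands were actually laid out.

```python
def third_octave_centers(f_low=250.0, channels=15):
    """Center frequencies spaced at exact 1/3-octave steps upward from f_low."""
    return [f_low * 2 ** (k / 3) for k in range(channels)]


def bandwidth(center_hz, q=6.0):
    """Bandwidth implied by Q = center frequency / bandwidth (assumed definition)."""
    return center_hz / q


if __name__ == "__main__":
    for fc in third_octave_centers():
        print(f"{fc:8.1f} Hz  bandwidth {bandwidth(fc):7.1f} Hz")
```

With these assumptions the fifteenth channel lands at about 6350 Hz, which corresponds to the nominal 6300 Hz band quoted in the text.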

The spectral envelope obtained by the filter bank is converted sequentially into digital codes by the analog multiplexer (MPX) and 12-bit A/D converter (ADC) 36. Sampling these at a period of 10 ms expresses the speech features as a time series of spectral envelopes. The unknown pattern input for recognition is a peak pattern, in which peaks are represented by "1" and everything else by "0". If the patterns used to create the dictionary could also be handled in binary form, this would be an advantage in building a practical recognition device. From this standpoint, it has been proposed to create all feature patterns from binary patterns.

FIG. 6 is a block diagram for explaining an example of the BTSP (Binary Time-Spectrum Pattern) method, which performs speech recognition with the power envelope of the speech binarized as described above. In the figure, 41 is a microphone, 42 is a filter bank, 43 is a least-squares correction unit, 44 is a binarization unit, 45 is a BTSP creation unit, 46 is a unit that adds single-utterance patterns after linear expansion and contraction, 47 is a dictionary unit, 48 is a peak pattern creation unit, 49 is a unit that matches pattern lengths by linear expansion and contraction, 50 is a similarity calculation unit, and 51 is a result display unit. The system recognizes speech uttered word by word by linearly matching the input pattern, obtained by binarization, against the dictionary patterns. For speaker-independent recognition, each dictionary pattern is newly created as a superposition of the TSPs obtained from utterances by several people (for details of BTSP, if necessary, see Ricoh Technical Report No. 11, May 1984, pp. 4-12; Proceedings of the Acoustical Society of Japan, October 1983, pp. 195-196 (3-1-8); and others).

This method is robust against pattern variation in the frequency direction, that is, against differences between individual speakers, and is therefore well suited to speaker-independent recognition. However, because time variation is absorbed essentially by linear expansion and contraction, it is inferior to DP matching in this respect.
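Linear expansion and contraction of a pattern to a common length can be sketched as below. The function name and the nearest-earlier-frame resampling rule are assumptions for illustration; the text does not say how frames are chosen when the length changes.

```python
def linear_stretch(pattern, target_frames):
    """Resample a list of frames to target_frames by uniform (linear) index mapping."""
    n = len(pattern)
    return [pattern[(t * n) // target_frames] for t in range(target_frames)]
```

Because every frame is stretched by the same factor, an utterance in which only one segment is lengthened (a drawn-out vowel, for example) cannot be aligned exactly, which is the weakness relative to DP matching noted above.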

FIG. 7 shows how patterns overlap in ordinary BTSP matching, and FIG. 8 shows an example in which time variation is hard to absorb. In both figures, (a) is the broad pattern, (b) is the peak pattern, and (c) is the result of superimposing pattern (a) and pattern (b). In the example of FIG. 8, the broad pattern (a) and the peak pattern (b) are linearly expanded or contracted as indicated by the dotted lines, but, as marked by the circle d in FIG. 8(c), a portion protrudes beyond the overlap because of time variation. Time variation is thus difficult to absorb.

In a pattern created by this method, the bands of the part representing the spectrum (the BTSP) are chosen so that the second- to third-order formants of the speech are represented. One frame of the BTSP therefore contains at least as many "1"s as there are formants, whereas the power part, which represents the power, contains only a single "1" indicating the envelope. As a result, the spectrum part carries more weight than the power part in pattern matching, and differences in power could not be adequately reflected in the matching.
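The imbalance can be put in numbers with a small sketch. It assumes, hypothetically, that the similarity is a plain count of overlapping "1"s, so that the share of a frame's maximum score contributed by the power part depends only on how many "1"s each part holds; the function name is likewise illustrative.

```python
def power_share(spectrum_ones, power_ones):
    """Fraction of one frame's maximum overlap score contributed by the power part."""
    return power_ones / (spectrum_ones + power_ones)


# Three formant "1"s against a single power "1": power contributes one quarter.
print(power_share(3, 1))
# Equal numbers of "1"s in both parts: power contributes one half.
print(power_share(3, 3))
```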

Object

The present invention has been made in view of the circumstances described above. In particular, its object is to provide a method for creating patterns in which the spectrum and the power are treated with equal weight during pattern matching.

Configuration

To achieve the above object, the present invention provides a speech pattern creation method having a unit for capturing speech and means for frequency analysis, in which the result of the frequency analysis is binarized according to its level, the change over time in the loudness of the input speech is also binarized, and the two are combined into a single speech pattern, characterized in that the loudness of the speech is represented by the same number of elements as the number of pattern elements that characterize the pattern on the frequency pattern. The invention is described below on the basis of an embodiment.

FIG. 1 is an electrical block diagram for explaining an embodiment of the present invention. In the figure, 11 is a microphone, 12 is a speech section detection unit, 13 is a band-pass filter, 14 is a speech power detection unit, 15 is an A/D conversion unit, 16 is a binarization unit, 17 and 18 are registers, 19 is a counter, 20 is a peak normalization unit, 21 is a power pattern creation unit, 22 is a combination unit, and 23 is a pattern unit. Speech is picked up by the microphone 11, the section detection unit 12 separates it from noise and the like and extracts only the speech section, and the band-pass filter bank 13 performs frequency analysis. Meanwhile, the power of the same signal is measured by the power detection unit 14 and, in preparation for binarization, is normalized so that its maximum value is constant.

The power at this point can be obtained, for example, from the output of an all-pass filter. The frequency-analyzed pattern is binarized, for example by fitting a least-squares approximation line, and the counter 19 counts the number of "1"s in the binarized pattern. The count obtained by the counter 19 is then reflected in the power pattern: as many "1"s as occurred in the spectrum pattern are created in the power pattern. As shown in FIG. 2, the power pattern represents the envelope of the speech power by "1" and everything else by "0", and the number of "1"s representing the power envelope is increased in the amplitude direction according to the number of "1"s in the spectrum pattern. In silent sections, such as those before a plosive consonant or those that appear as a geminate consonant (sokuon), the speech power becomes 0 and the spectrum also vanishes; in such cases the numbers of "1"s need not be matched as described above.

The feature pattern is created by combining the power pattern produced in this way with the BTSP.
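One way to read the power pattern construction is sketched below. It is an interpretation under stated assumptions, not the circuit of FIG. 1: the function names, the number of amplitude levels (8), the quantization of the normalized power by rounding, and the choice to grow the run of "1"s downward from the envelope level (shifting it up when it would pass below level 0) are all hypothetical. What the sketch does take from the text is that the power is normalized to its maximum, that each frame's power part receives as many "1"s as its spectrum part, and that silent frames are left empty.

```python
def power_pattern(spectrum_bits, power, levels=8):
    """Build a binary power pattern whose per-frame count of 1s matches the spectrum."""
    peak = max(power, default=0.0)  # normalization reference: maximum power
    pattern = []
    for bits, p in zip(spectrum_bits, power):
        row = [0] * levels
        ones = min(sum(bits), levels)  # value of the "counter" for this frame
        if peak > 0 and p > 0 and ones > 0:  # silent frames stay all zero
            level = round(p / peak * (levels - 1))  # quantized envelope position
            start = max(0, level - ones + 1)  # run of 1s topped at the envelope
            for k in range(start, start + ones):
                row[k] = 1
        pattern.append(row)
    return pattern


def combine(spectrum_bits, power_bits):
    """Join the spectrum part and the power part of each frame into one row."""
    return [list(s) + list(p) for s, p in zip(spectrum_bits, power_bits)]
```

For example, a frame with two spectral "1"s and a power at 3/7 of the maximum gets "1"s at amplitude levels 2 and 3, so both parts can contribute equally to an overlap count.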

Effect

As is clear from the above description, according to the present invention the spectrum part and the power part are treated with equal weight during pattern matching, which makes it easy to distinguish speech patterns whose spectral distributions are similar but whose power differs.

[Brief Description of the Drawings]

FIG. 1 is an electrical block diagram for explaining an embodiment of the present invention. FIG. 2 shows an example of a power pattern according to the present invention. FIG. 3 is a block diagram showing an example of conventional speech recognition. FIG. 4 is an electrical block diagram for explaining an example of the BTSP method. FIG. 5 shows the configuration of the speech input section used when the envelope of the speech power is binarized. FIG. 6 is a diagram for explaining an example of the BTSP method that performs speech recognition with the power envelope of the speech binarized. FIG. 7 shows how patterns overlap in ordinary BTSP matching. FIG. 8 shows an example in which time variation is hard to absorb.

11: microphone, 12: speech section detection unit, 13: band-pass filter, 14: speech power detection unit, 15: A/D conversion unit, 16: binarization unit, 17, 18: registers, 19: counter, 20: peak normalization unit, 21: power pattern creation unit, 22: combination unit, 23: pattern unit.

Patent applicant: Ricoh Co., Ltd.

Claims

A speech pattern creation method having a unit for capturing speech and means for frequency analysis, in which the result of the frequency analysis is binarized according to its level, the change over time in the loudness of the input speech is also binarized, and the two are combined into a single speech pattern, characterized in that the loudness of the speech is represented by the same number of elements as the number of pattern elements that characterize the pattern on the frequency pattern.