JPS6310196A

JPS6310196A - Voice recognition equipment

Info

Publication number: JPS6310196A
Application number: JP61154658A
Authority: JP
Inventors: 潤一郎藤本; 安田　晴剛
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-07-01
Filing date: 1986-07-01
Publication date: 1988-01-16

Abstract

(57) [Summary] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical field

The present invention relates to a speech recognition device.

Prior art

Recently, as a simple method for recognizing spoken words, a scheme has been proposed in which patterns are binarized, superimposed, and registered as standard patterns, and unknown input speech is binarized in the same way and its similarity to the standard patterns is computed.

FIG. 8 is a configuration diagram for explaining an example of the binarized TSP (Binary Time-Spectrum Pattern, BTSP) method. In the figure, 21 is a microphone, 22 is a filter bank, 23 is a least-squares correction section, 24 is a binarization section, 25 is a BTSP creation section, 26 is a section that adds the patterns of n utterances using linear expansion and contraction, 27 is a dictionary section, 28 is a peak pattern creation section, 29 is a pattern length matching section using linear expansion and contraction, 30 is a similarity calculation section, and 31 is a result display section. This system recognizes speech by linearly matching dictionary patterns against an input pattern obtained by binarizing speech uttered word by word. As shown in the figure, for speaker-independent recognition the membership function of the dictionary is newly created as a superposition of BTSPs, without using the TSP frequency characteristics. (For details of BTSP, see, if necessary, Ricoh Technical Report No. 11, May 1984, pp. 4-12, and Proceedings of the Acoustical Society of Japan, October 1983, p. 195 (3-1-8), among others.) This method is robust against pattern variation in the frequency direction, that is, against differences between speakers, and is therefore well suited to speaker-independent recognition. However, because its absorption of time variation is based on linear expansion and contraction, it is inferior to DP matching in that respect.
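The linear matching step described above can be illustrated with a minimal sketch. It assumes a pattern is a list of frames, each frame a list of 0/1 channel values, and it assumes the similarity is a simple count of coinciding 1s; the source does not specify either detail, and the names `linear_resize` and `overlap` are hypothetical.

```python
def linear_resize(pattern, target_len):
    """Stretch or compress a pattern (list of frames) to target_len frames
    by mapping each output frame linearly onto an input frame index."""
    n = len(pattern)
    return [list(pattern[(t * n) // target_len]) for t in range(target_len)]


def overlap(a, b):
    """Count positions where both binary patterns are 1.
    This similarity measure is an assumption made for illustration."""
    return sum(x * y for fa, fb in zip(a, b) for x, y in zip(fa, fb))


# A 3-frame unknown input is stretched to the 6-frame dictionary length
# before the two patterns are compared frame by frame.
unknown = [[1, 0], [0, 1], [1, 0]]
dictionary = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 0]]
stretched = linear_resize(unknown, len(dictionary))
score = overlap(stretched, dictionary)
```

Because the mapping is strictly linear, a local speed-up or slow-down inside the word cannot be followed, which is the weakness relative to DP matching noted in the text.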

FIG. 9 shows the overlap of ordinary BTSP patterns, and FIG. 10 shows an example in which time variation is difficult to absorb. In both figures, (a) is the broad pattern, (b) is the peak pattern, and (c) is the result of superimposing the patterns of (a) and (b). In the example of FIG. 10, the broad pattern (a) and the peak pattern (b) are linearly expanded or contracted as indicated by the dotted lines, but, as marked by the circle d in FIG. 10(c), time variation causes part of the peak pattern to protrude beyond the broad pattern. The drawback is that such time variation is difficult to absorb.

Purpose

The present invention was made in view of the circumstances described above. In particular, its purpose is to improve the ability to absorb time variation by adding together only parts of a pattern shifted in the time direction and of the unshifted pattern.

Configuration

To achieve the above purpose, the present invention provides a speech recognition device that comprises a speech input section, an analysis section, and a pattern holding section that holds pre-registered patterns, and that recognizes input speech by means of binarized patterns, the device being characterized by the following:

(1) the input speech and a pattern obtained by shifting it in the time direction are added together and then binarized, and a standard pattern is created by superimposing the plurality of patterns input for one registered utterance;

(2) the input speech is binarized, patterns obtained by shifting it in the time direction are then added to it, and a standard pattern is created by superimposing the plurality of patterns input for one registered utterance;

(3) the input speech is binarized, the plurality of patterns input for one registered utterance are superimposed, and a standard pattern is then created by superimposing on the result patterns obtained by shifting it in the time direction;

(4) a standard pattern is created by binarizing the input speech, and when unknown speech is input at recognition time, its pattern is binarized, then shifted in the time direction and added together for recognition.

The invention is described below on the basis of embodiments.

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention. In the figure, 1 is a microphone, 2 is a speech interval detection section, 3 is a feature analysis section, 4 is a binarization section, 5 is a register, 6 is a one-frame shift section, 7 is an addition section, 8 is a binarization section, 9 is a standard pattern section, 10 is a matching section, and 11 is a result output section. In this embodiment, a speech recognition device that comprises a speech input section, a feature analysis section, and a pattern holding section that holds pre-registered patterns, and that recognizes input speech by means of binarized patterns, adds the input speech and a pattern obtained by shifting it in the time direction, binarizes the sum, and creates a standard pattern for recognition by superimposing the plurality of patterns input for one registered utterance.

The speech interval is extracted and subjected to feature analysis, for example spectral analysis, and the resulting speech feature pattern is first loaded into register 5. That pattern shifted by one frame in the time direction is added to the original pattern, and the sum is binarized to form the standard pattern. At recognition time, the feature pattern of the unknown speech and the standard pattern are aligned by linear expansion and contraction, superimposed, and compared; the similarity is computed and the recognition result is output. As a result, local peaks are less likely to protrude from the broad pattern, and the ability to absorb time variation improves. A shift of one frame has been described here, but the shift need not be exactly one frame; an optimal value should be chosen so that protrusion from the broad pattern, as in FIG. 10, is reduced. If the shift is too large, however, the pattern differences between words disappear. In this way, the ability to absorb time variation in the peak pattern is improved.
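The shift, add, and binarize sequence of the FIG. 1 embodiment can be sketched as follows. The source does not say how binarization section 8 decides between 0 and 1, so a fixed threshold is assumed here purely for illustration; the function names and the `threshold` parameter are hypothetical.

```python
def shift_frames(pattern, shift):
    """Delay a pattern (list of frames) by `shift` frames, zero-filling
    the frames that have no source. A negative shift advances it."""
    n = len(pattern)
    blank = [0] * len(pattern[0])
    return [list(pattern[t - shift]) if 0 <= t - shift < n else list(blank)
            for t in range(n)]


def add_patterns(a, b):
    """Element-wise sum of two equally sized patterns."""
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def binarize(pattern, threshold):
    """Assumed binarization rule: 1 where the value reaches the threshold."""
    return [[1 if v >= threshold else 0 for v in frame] for frame in pattern]


def shifted_standard_pattern(features, shift=1, threshold=1.0):
    """Add the feature pattern to its time-shifted copy, then binarize."""
    return binarize(add_patterns(features, shift_frames(features, shift)),
                    threshold)


# Four frames, two channels of analog feature values.
features = [[0.0, 1.2], [1.5, 0.0], [0.0, 0.0], [0.0, 0.9]]
standard = shifted_standard_pattern(features, shift=1, threshold=1.0)
```

In this example each strong element is carried one frame later in the standard pattern, which widens the region an input peak can fall into. Raising `shift` widens it further, at the cost the text mentions: patterns of different words begin to look alike.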

FIG. 2 is an electrical block diagram for explaining another embodiment of the present invention. In the figure, parts that act in the same way as in the embodiment of FIG. 1 are given the same reference numbers as in FIG. 1. In this embodiment, the input speech is binarized, patterns obtained by shifting it in the time direction are then added to it, and a standard pattern for recognition is created by superimposing the plurality of patterns input for one registered utterance. Whereas the method of FIG. 1 shifts and adds in the time direction before binarization, this method shifts and adds after binarization. Because recognition actually operates on the binarized unknown input, the time-shifting effect on the pattern shows up more clearly with this method. In this case, adding a binarized pattern to a time-shifted copy of itself yields a three-valued pattern, so the result must be passed through a binarization section again.

There are also schemes in which the patterns of one word uttered several times by the same person, or by different people, are superimposed and summed or averaged. In that case, the input speech is binarized, the plurality of patterns input for one registered utterance are superimposed, and a standard pattern for recognition is then created by superimposing on the result patterns obtained by shifting it in the time direction.
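A short sketch shows why the FIG. 2 method needs a second binarization: adding a 0/1 pattern to its shifted copy produces the values 0, 1, and 2. The re-binarization rule used here (any non-zero value becomes 1) is an assumption, since the source does not state the rule, and the function names are hypothetical.

```python
def shift_frames(pattern, shift):
    """Delay a pattern (list of frames) by `shift` frames, zero-filling
    the frames that have no source."""
    n = len(pattern)
    blank = [0] * len(pattern[0])
    return [list(pattern[t - shift]) if 0 <= t - shift < n else list(blank)
            for t in range(n)]


def shift_add_rebinarize(binary_pattern, shift=1):
    """Add a binary pattern to its shifted copy (giving values 0..2),
    then binarize again. Non-zero -> 1 is an assumed rule."""
    shifted = shift_frames(binary_pattern, shift)
    summed = [[x + y for x, y in zip(fa, fb)]
              for fa, fb in zip(binary_pattern, shifted)]
    rebinarized = [[1 if v > 0 else 0 for v in frame] for frame in summed]
    return summed, rebinarized


binary = [[1, 0], [1, 0], [0, 1], [0, 0]]
summed, rebinarized = shift_add_rebinarize(binary, shift=1)
```

Under this assumed rule the result is the logical OR of the pattern and its shifted copy, so every run of 1s is extended by `shift` frames.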

FIG. 3 is an electrical block diagram for explaining one embodiment of such a case. One option would be to add time-shifted patterns and binarize again as shown in FIG. 2, build such a pattern for each of several utterances, and then superimpose them and take a weighted average. In this embodiment, however, the patterns of the several utterances are first combined by weighted averaging, and the result is then shifted in the time direction and superimposed in superposition section 12. Because shifted patterns are added together, the values of the pattern elements change, so normalization section 13 normalizes them. Superimposing three utterances of a binarized pattern gives a four-valued pattern with values 0 to 3, and adding to it the patterns shifted one frame earlier and one frame later gives a ten-valued pattern with values 0 to 9. Normalization here therefore means converting the ten-valued pattern back into a four-valued pattern.
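The value ranges in the FIG. 3 embodiment can be checked with a small sketch. Three binary utterances are summed (values 0 to 3), the sum is added to its copies shifted one frame earlier and one frame later (values 0 to 9), and the result is mapped back to 0 to 3. The source does not give the mapping, so proportional scaling with rounding is assumed here; a plain sum stands in for the weighted average, and all function names are hypothetical.

```python
def shift_frames(pattern, shift):
    """Delay a pattern by `shift` frames (negative advances), zero-filling."""
    n = len(pattern)
    blank = [0] * len(pattern[0])
    return [list(pattern[t - shift]) if 0 <= t - shift < n else list(blank)
            for t in range(n)]


def sum_patterns(patterns):
    """Element-wise sum of any number of equally sized patterns."""
    return [[sum(vals) for vals in zip(*frames)] for frames in zip(*patterns)]


def normalize(pattern, in_max=9, out_max=3):
    """Assumed normalization: scale 0..in_max to 0..out_max, rounding half up."""
    return [[(2 * v * out_max + in_max) // (2 * in_max) for v in frame]
            for frame in pattern]


# Three utterances of the same word, one channel, five frames each.
utterances = [
    [[0], [1], [1], [0], [0]],
    [[0], [1], [1], [1], [0]],
    [[0], [0], [1], [1], [0]],
]
stacked = sum_patterns(utterances)                      # values 0..3
spread = sum_patterns([shift_frames(stacked, s) for s in (-1, 0, 1)])  # 0..9
four_valued = normalize(spread)                         # back to 0..3
```

The maximum of 9 is reached only where all three utterances are 1 in three consecutive frames, which is why the normalized pattern still distinguishes stable regions from marginal ones.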

FIG. 4 is an electrical block diagram for explaining another embodiment of the present invention. In this embodiment, the standard pattern is created by binarizing the input speech, and when unknown speech is input at recognition time, its pattern is binarized, then shifted in the time direction and added together before recognition. That is, no time shift is applied to the standard pattern; the time-shifting effect is applied to the input pattern instead. This makes some overlap more likely in cases where, under the methods described above, the standard pattern and the input pattern would not have overlapped at all.

FIG. 5 shows an example of time-shifted superposition (a one-dimensional pattern shifted by one frame). The pattern of (a) is time-shifted to give (b); the patterns of (a) and (b) are then superimposed to give the pattern of (c); and this is normalized to give the pattern of (d).

The way the pattern is shifted can also be made to differ with time. When two patterns are compared, their start and end points are aligned and the pattern lengths in between are matched by linear expansion and contraction, so the effect of time variation is small near the start and end of the pattern and large in its central portion. The shift is therefore made large where time variation is large and small at the ends. FIG. 6 shows an example of shifting a pattern in this way: the first and last thirds of the pattern are not time-shifted, and only the central third is shifted by one frame. As another way to obtain a similar effect, FIG. 7 shows a method in which a pattern shifted by a fixed time and an unshifted pattern are overlaid and added together only in the central portion. Because the number of patterns added differs from one part of the pattern to another, normalization must be carried out with care. These measures can be expected to improve the ability of linear expansion and contraction matching to absorb time variation.

Effect

As is clear from the above description, the present invention improves the ability of a matching method based on linear expansion and contraction to absorb time variation.

[Brief explanation of drawings]

FIGS. 1 to 4 are electrical block diagrams for explaining embodiments of the present invention; FIG. 5 is a diagram for explaining the principle of time-shifted superposition; FIGS. 6 and 7 are configuration diagrams of the main parts for explaining other embodiments of the present invention; FIG. 8 is a diagram for explaining an example of the BTSP method; and FIGS. 9 and 10 are diagrams for explaining the overlap of BTSP patterns.

1: microphone, 2: speech interval detection section, 3: feature analysis section, 4: binarization section, 5: register, 6: one-frame shift section, 7: addition section, 8: binarization section, 9: standard pattern, 10: matching section, 11: result output section, 12: superposition section, 13: normalization section.

Patent applicant: Ricoh Co., Ltd.

Claims

[Claims]

(1) A speech recognition device comprising a speech input section, a feature analysis section, and a pattern holding section that holds pre-registered patterns, the device recognizing input speech by means of binarized patterns, characterized in that the input speech and a pattern obtained by shifting it in the time direction are added together and then binarized, and a standard pattern is created by superimposing a plurality of patterns input for one registered utterance and is used for recognition.

(2) The speech recognition device according to claim (1), wherein the pattern is shifted in a different manner depending on time.

(3) The speech recognition device according to claim (1), wherein only part of the pattern shifted in the time direction and the part of the pattern not shifted in the time direction are added together.

(4) A speech recognition device comprising a speech input section, a feature analysis section, and a pattern holding section that holds pre-registered patterns, the device recognizing input speech by means of binarized patterns, characterized in that the input speech is binarized, patterns obtained by shifting it in the time direction are then added to it, and a standard pattern is created by superimposing a plurality of patterns input for one registered utterance and is used for recognition.

(5) The speech recognition device according to claim (4), wherein the manner in which the pattern is shifted in the time direction differs depending on time.

(6) The speech recognition device according to claim (4), wherein only part of the pattern shifted in the time direction and the part of the pattern not shifted in the time direction are added together.

(7) A speech recognition device comprising a speech input section, a feature analysis section, and a pattern holding section that holds pre-registered patterns, the device recognizing input speech by means of binarized patterns, characterized in that the input speech is binarized, a plurality of patterns input for one registered utterance are superimposed, and a standard pattern is then created by superimposing on the result patterns obtained by shifting it in the time direction and is used for recognition.

(8) The speech recognition device according to claim (7), characterized in that only part of the pattern shifted in the time direction and the part of the pattern not shifted are added together.

(9) The speech recognition device according to claim (7), wherein the pattern is shifted in a temporal direction differently depending on time.

(10) A speech recognition device comprising a speech input section, a feature analysis section, and a pattern holding section that holds pre-registered patterns, the device recognizing input speech by means of binarized patterns, characterized in that a standard pattern is created by binarizing the input speech, and, when unknown speech is input at recognition time, the pattern of that speech is binarized, then shifted in the time direction and added together for recognition.

(11) The speech recognition device according to claim (10), wherein the pattern is shifted in a temporal direction differently depending on time.

(12) The speech recognition device according to claim (10), characterized in that only a part of the pattern shifted in the time direction and a part of the pattern not shifted in the time direction are added together.