JPS6076800A

JPS6076800A - Voice recognition system

Info

Publication number: JPS6076800A
Application number: JP58185671A
Authority: JP
Inventors: 米山　正秀
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1983-10-03
Filing date: 1983-10-03
Publication date: 1985-05-01

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical Field] The present invention relates to a speech recognition method using pattern matching technology.

[Prior Art] As a new pattern matching method for recognizing spoken words, the applicant has previously proposed various methods that compute similarity from the degree of overlap between two kinds of two-dimensional patterns derived from a spectrogram whose variables are frequency and time: a thinned pattern representing the trajectories of local peaks, and a binarized broad pattern.

In more detail: for the purpose of word speech recognition, the speech signal of a word picked up by a microphone is fed to a feature extraction unit, where it is frequency-analyzed by some method at time intervals of several milliseconds to several tens of milliseconds to obtain the short-time power spectrum of the speech signal. When this spectrum is taken as the feature, the pattern obtained by displaying these features on a two-dimensional plane whose two axes are frequency and time is known as the time-spectrum pattern of the speech. The present invention concerns a method that takes this time-spectrum pattern as its basis, computes similarity by superimposing and comparing two kinds of patterns, namely a thinned pattern connecting its local peaks (FIG. 1) and a broad pattern binarized with a certain threshold (FIG. 2), and performs recognition on that basis.

When matching these two kinds of patterns, they must be compared with their time and frequency axes aligned. However, even for the same word spoken by the same speaker, the speaking rate varies from utterance to utterance, so the time-axis length of the dictionary pattern to be matched differs from that of the input pattern, and the two cannot be matched as they are. Conventionally, therefore, a linear expansion/contraction matching method was used, in which the time axis of the time-spectrum pattern of either the dictionary or the input is linearly stretched or compressed to the time-axis length of the other before the similarity is computed. However, variation in the speaking rate of a spoken word is not necessarily linear; it differs depending on the phonemes that make up the word, so linear matching did not yield a sufficient recognition rate.
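The thinned pattern and the broad pattern described above can be illustrated for a single frame of a time-spectrum pattern. This is only a minimal sketch: the function names, the exact definition of a local peak, the threshold value, and the use of a simple overlap count are assumptions for illustration and are not specified in this publication.

```python
from typing import List, Sequence


def thinned_frame(power: Sequence[float]) -> List[int]:
    """Mark local peaks along the frequency axis of one frame with 1.

    Assumption: a bin is a peak if it is strictly greater than its lower
    neighbour and not smaller than its upper neighbour.
    """
    n = len(power)
    return [
        1 if 0 < i < n - 1 and power[i] > power[i - 1] and power[i] >= power[i + 1] else 0
        for i in range(n)
    ]


def broad_frame(power: Sequence[float], threshold: float) -> List[int]:
    """Binarize one frame: 1 where the power reaches the threshold."""
    return [1 if p >= threshold else 0 for p in power]


def overlap(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the frequency bins where both binary frames are 1."""
    return sum(x & y for x, y in zip(a, b))


frame = [1.0, 3.0, 2.0, 5.0, 4.0]  # hypothetical short-time power spectrum
thin = thinned_frame(frame)
broad = broad_frame(frame, threshold=3.0)
```

Applying these functions to every frame and stacking the results along the time axis would give two-dimensional binary patterns of the kind shown in FIG. 1 and FIG. 2.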

[Object] The present invention was made in view of the circumstances described above. In particular, for each frame of the dictionary pattern and of the input pattern, it identifies whether the frame is voiced (V) or unvoiced (UV), and clusters the dictionary patterns according to the time series of V and UV, thereby shortening the computation time and improving the recognition rate.

[Configuration] The configuration of the present invention is described below on the basis of examples.

In the present invention, when the dictionary patterns of words are registered, each frame of each pattern is judged to be V (voiced) or UV (unvoiced), and V or UV is recorded in advance for every frame. Various judgment methods are conceivable. For example, as waveform processing, the judgment can exploit the fact that parts with a periodic structure are V and random parts correspond to UV. As processing in the frequency domain, frames can also be classified using the property that the slope of the straight line approximating the spectral envelope is negative for V (voiced) and zero or positive for UV (unvoiced).
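The frequency-domain criterion can be sketched as a least-squares line fit over one frame. This is an illustrative sketch only: fitting log power against the frequency-bin index, the small floor used to avoid log(0), and the function names are assumptions, since the publication does not state how the approximating line is obtained.

```python
import math
from typing import Sequence


def spectral_slope(power: Sequence[float]) -> float:
    """Least-squares slope of log power versus frequency-bin index."""
    n = len(power)
    if n < 2:
        raise ValueError("need at least two frequency bins")
    ys = [math.log(max(p, 1e-12)) for p in power]  # floor avoids log(0)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def label_frame(power: Sequence[float]) -> str:
    """Return "V" for a negative slope, "UV" for a zero or positive slope."""
    return "V" if spectral_slope(power) < 0 else "UV"
```

A frame whose energy falls off toward high frequencies, such as `[8.0, 4.0, 2.0, 1.0]`, is labeled `"V"`, while a flat or rising frame is labeled `"UV"`.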

FIG. 3 shows an example in which each dictionary pattern has been marked with V or UV as described above. In the figure, (a) is the frame number, (b) is the labeling, and (c) is the change pattern. In this way, the sequence of V and UV changes is recorded in advance, together with the dictionary pattern, as one of the features of the word. At recognition time, each frame of the input pattern is likewise classified as V or UV and labeled, and a V/UV change pattern is created. Next, a preliminary selection picks out, from the change patterns in the dictionary, those patterns that change in the same way as this change pattern. This preliminary selection shortens the similarity computation and improves the recognition rate. As the final step, matching is computed between the input pattern and the several preselected dictionary patterns, and the recognition result is output.
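The labeling, change pattern, and preliminary selection can be sketched as follows. The sketch assumes that "changes in the same way" means an identical change pattern; the publication does not define the comparison, and the function names and dictionary contents are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def change_pattern(labels: Sequence[str]) -> Tuple[str, ...]:
    """Collapse runs of equal labels, e.g. V V UV UV V -> (V, UV, V)."""
    return tuple(key for key, _ in groupby(labels))


def preselect(input_labels: Sequence[str],
              dictionary: Dict[str, Sequence[str]]) -> List[str]:
    """Return the dictionary words whose change pattern equals the input's."""
    target = change_pattern(input_labels)
    return [word for word, labels in dictionary.items()
            if change_pattern(labels) == target]


# Hypothetical dictionary: word -> per-frame V/UV labels recorded at registration.
dictionary = {
    "word_a": ["V", "V", "UV", "V", "UV"],
    "word_b": ["V", "V", "V"],
    "word_c": ["V", "UV", "UV", "V", "V", "UV"],
}
input_labels = ["V", "V", "V", "UV", "V", "V", "UV", "UV"]
candidates = preselect(input_labels, dictionary)
```

Here the input collapses to V, UV, V, UV, so only `word_a` and `word_c` remain as candidates for the full matching computation.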

[Example 1] When matching the input pattern against the several preselected dictionary patterns, either the input pattern or the dictionary pattern is linearly stretched or compressed so that its time-axis length matches that of the other pattern, and the similarity is then computed.
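Linear expansion/contraction of the time axis can be sketched as index resampling. This is a minimal sketch, assuming that frames are selected by proportional index mapping without interpolation; the publication does not specify how the stretching is carried out, and `stretch_linear` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stretch_linear(frames: Sequence[T], target_len: int) -> List[T]:
    """Resample a frame sequence to target_len frames by linear index mapping."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("frames must be non-empty and target_len positive")
    # Output frame i is taken from source position floor(i * n / target_len).
    return [frames[(i * n) // target_len] for i in range(target_len)]
```

For example, `stretch_linear(["a", "b", "c"], 6)` repeats each frame twice, and `stretch_linear([0, 1, 2, 3], 2)` keeps frames 0 and 2.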

[Example 2] Linear expansion/contraction is performed when pattern-matching the input pattern against a preselected pattern. In the similarity computation, however, as shown in FIG. 4, the similarity of a frame is computed only when the labels of the corresponding frames of the input pattern and the dictionary pattern are both V (voiced) or both UV (unvoiced). No similarity is computed for frames whose opposing labels differ, such as one label being V and the other UV. This measure avoids computing similarity between phonemes of different kinds that have been brought into correspondence by variations in speaking rate.
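The label-gated similarity computation can be sketched as follows, assuming both patterns have already been brought to the same number of frames. The per-frame similarity used here (a count of overlapping 1-bits between binary frames) and the returned pair of values are assumptions for illustration; the publication does not give the scoring formula.

```python
from typing import Sequence, Tuple

Frame = Sequence[int]  # one binary frame along the frequency axis


def frame_overlap(a: Frame, b: Frame) -> int:
    """Assumed per-frame similarity: number of bins where both frames are 1."""
    return sum(x & y for x, y in zip(a, b))


def gated_similarity(in_frames: Sequence[Frame], in_labels: Sequence[str],
                     dict_frames: Sequence[Frame], dict_labels: Sequence[str]
                     ) -> Tuple[int, int]:
    """Sum frame similarity only where both frames carry the same V/UV label.

    Returns (total score, number of frames actually compared).
    """
    if not (len(in_frames) == len(in_labels) == len(dict_frames) == len(dict_labels)):
        raise ValueError("patterns must be length-normalized before matching")
    score = 0
    compared = 0
    for fa, la, fb, lb in zip(in_frames, in_labels, dict_frames, dict_labels):
        if la != lb:
            continue  # one frame is V and the other UV: skip this frame
        score += frame_overlap(fa, fb)
        compared += 1
    return score, compared
```

Returning the number of compared frames alongside the score leaves open how, or whether, the score is normalized, which the publication does not address.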

[Example 3] When pattern-matching the input pattern against a preselected dictionary pattern, partial linear expansion/contraction is performed as shown in FIG. 5: following each change pattern, the lengths of corresponding segments bearing the same label are stretched or compressed to match, and the similarity is then computed. Viewed over the whole word, this amounts to a nonlinear expansion/contraction, and matching computations between elements of different kinds are of course also avoided.
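Partial linear expansion/contraction can be sketched by resampling each same-label segment of the input to the length of the corresponding dictionary segment. This sketch assumes the two change patterns are identical (as after preliminary selection) and uses proportional index mapping within each segment; all names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def runs(labels: Sequence[str]) -> List[Tuple[str, int]]:
    """Run-length encode labels, e.g. V V UV -> [("V", 2), ("UV", 1)]."""
    return [(key, len(list(group))) for key, group in groupby(labels)]


def stretch_linear(frames: Sequence[T], target_len: int) -> List[T]:
    """Resample a non-empty segment to target_len frames by index mapping."""
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


def align_by_segments(in_frames: Sequence[T], in_labels: Sequence[str],
                      dict_labels: Sequence[str]) -> List[T]:
    """Stretch each input segment to the length of the matching dictionary segment."""
    if len(in_frames) != len(in_labels):
        raise ValueError("one label per input frame is required")
    in_runs, dict_runs = runs(in_labels), runs(dict_labels)
    if [k for k, _ in in_runs] != [k for k, _ in dict_runs]:
        raise ValueError("change patterns differ")
    aligned: List[T] = []
    start = 0
    for (_, in_len), (_, dict_len) in zip(in_runs, dict_runs):
        aligned.extend(stretch_linear(in_frames[start:start + in_len], dict_len))
        start += in_len
    return aligned
```

Because each segment is scaled by its own ratio, the overall mapping of the word's time axis is piecewise linear, which corresponds to the nonlinear expansion/contraction described above.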

[Effect] As is clear from the above description, according to the present invention a change pattern of V (voiced) and UV (unvoiced) is created and the dictionary patterns are used according to this V/UV change pattern, so the similarity computation time can be shortened and the recognition rate improved.

[Brief explanation of drawings]

FIG. 1 shows a thinned pattern, FIG. 2 shows a broad pattern, and FIGS. 3 to 5 each show an embodiment of the present invention.

V: voiced sound; UV: unvoiced sound.

[FIG. 3: (a) frame No. 1 to N; (b) labeling, V or UV per frame; (c) change pattern: V, UV, V, UV]
[FIG. 4]
[FIG. 5: rows of per-frame V/UV labels]

Procedural Amendment

1. Indication of the case: 1983 (Showa 58) Patent Application No. 185671
2. Title of the invention: Speech recognition method
3. Person making the amendment
   Relationship to the case: Patent applicant
   Address: 1-3-6 Nakamagome, Ota-ku, Tokyo
   Name: (674) Ricoh Co., Ltd.
   Representative: (name partly illegible in the source)
4. (entry partly illegible in the source) Chatelet Inn Yokohama No. 807
5. Date of amendment order
7. Contents of amendment
   (1) "単廃" on page 2, lines 13 and 14 of the specification is corrected to "単語" (word).
   (2) "音声" (speech sound) on page 3, line 19 of the specification is corrected to "音素" (phoneme).

Claims

[Claims]

(1) A speech recognition method characterized in that a label of either voiced or unvoiced is attached to each frame of both the dictionary pattern and the input pattern, and preliminary selection of the dictionary is performed according to the voiced/unvoiced change pattern based on these labels.

(2) The speech recognition method according to claim (1), characterized in that, when matching the input pattern against a dictionary pattern, the voiced/unvoiced label of each frame is used to avoid matching computations between different phonemes.

(3) The speech recognition method according to claim (1), characterized in that, when matching the input pattern against a dictionary pattern, partial linear expansion/contraction matching is performed, in which the voiced/unvoiced label of each frame is used to linearly equalize the lengths of corresponding segments bearing the same label before matching.