JPS63104098A

JPS63104098A - Voice recognition equipment

Info

Publication number: JPS63104098A
Application number: JP61251255A
Authority: JP
Inventors: 英生瀬川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1986-10-22
Filing date: 1986-10-22
Publication date: 1988-05-09

Abstract

(57) [Abstract] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Object of the Invention] (Industrial Field of Application) The present invention relates to a speech recognition device that performs word recognition by DP (dynamic programming) matching.

(Prior Art) In word speech recognition, DP matching has long been widely used as a method with a relatively high recognition rate. It matches the phoneme sequence of the input signal against a standard pattern while absorbing variation along the time axis, so that the distance between the two is minimized. With this DP matching, however, suppose there are standard patterns AB and A'CB' for two words, and both patterns contain similar phoneme sequences (A and A', B and B'). When AB is uttered, the cumulative matching scores for the two patterns often hardly differ, and the resulting drop in recognition rate cannot be ignored.

DP matching that takes phoneme duration into account has therefore also been tried. Under this automaton control, the minimum-distance path is selected from multiple candidate matching paths in which the duration that each phoneme should properly have is shifted within the overall path. Recognition accuracy is better than with ordinary DP matching, but the number of candidate matching paths increases severalfold and recognition takes longer. This method is therefore limited to word recognition over a vocabulary of a few dozen words at most; once the vocabulary reaches hundreds or thousands of words, recognition requires an extremely long processing time.

(Problems to Be Solved by the Invention) As described above, conventional speech recognition devices using DP matching cannot perform highly accurate word recognition at high speed.

The present invention has been made in view of this problem, and its object is to provide a speech recognition device that can perform word recognition with high accuracy and at high speed.

[Structure of the Invention] (Means for Solving the Problems) The present invention is a speech recognition device comprising a feature extraction unit that analyzes an input speech signal at predetermined time intervals and extracts features, a phoneme recognition unit that matches the feature information extracted by the feature extraction unit against a predetermined phoneme dictionary to obtain a phoneme sequence, and a word recognition unit that matches the phoneme sequence obtained by the phoneme recognition unit against a preset word dictionary to obtain a word recognition result. The device is characterized in that the word recognition unit is divided into a preliminary selection unit and a main selection unit, as follows.

The preliminary selection unit matches the phoneme sequence obtained by the phoneme recognition unit against the standard phoneme sequences of the preset word dictionary by high-speed DP matching, and selects as word recognition candidates a plurality of words whose matching results are good and whose deviation from the duration expected for each phoneme is small. The main selection unit matches the standard phoneme sequences of the word recognition candidates selected by the preliminary selection unit against the phoneme sequence of the input speech by high-accuracy DP matching that takes phoneme duration into account, and obtains the word recognition result.

(Operation) In the present invention, the preliminary selection unit performs high-speed DP matching and obtains a matching result for each word in approximately real time. Among the words with good matching results, post-processing in the preliminary selection unit keeps as word recognition candidates those words in which each phoneme deviates little from the duration it should properly have. The candidates passed to the main selection unit have therefore already been screened for phoneme duration to some extent, and their number is greatly reduced. Consequently, even though the main selection unit performs high-accuracy DP matching that takes phoneme duration into account, recognition does not take a long time, because there are few word candidates.
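The two-stage arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `two_stage_recognize`, the shortlist size `n_candidates`, and the two toy scoring functions are hypothetical and do not come from the patent, which specifies DP matching for both stages.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[List[str], List[str]], float]


def two_stage_recognize(
    phonemes: List[str],
    dictionary: Dict[str, List[str]],
    fast_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 3,
) -> Tuple[str, List[str]]:
    """Return (best word, shortlist) using a cheap pass followed by a costly pass."""
    # Stage 1 (preliminary selection): score every word with the cheap matcher.
    coarse = {word: fast_score(phonemes, ref) for word, ref in dictionary.items()}
    shortlist = sorted(coarse, key=coarse.get, reverse=True)[:n_candidates]
    # Stage 2 (main selection): run the costly matcher on the shortlist only.
    best = max(shortlist, key=lambda word: precise_score(phonemes, dictionary[word]))
    return best, shortlist


def overlap(a: List[str], b: List[str]) -> float:
    """Toy cheap score (assumption): number of shared phoneme labels."""
    return float(len(set(a) & set(b)))


def exact_positions(a: List[str], b: List[str]) -> float:
    """Toy costly score (assumption): position-wise matches minus length difference."""
    return float(sum(x == y for x, y in zip(a, b)) - abs(len(a) - len(b)))


words = {
    "hai": ["H", "A", "I"],
    "haci": ["H", "A", "C", "I"],
    "ie": ["I", "E"],
    "oto": ["O", "T", "O"],
}
best, shortlist = two_stage_recognize(
    ["H", "A", "I"], words, overlap, exact_positions, n_candidates=2
)
```

The costly scorer runs `n_candidates` times regardless of dictionary size, which is the source of the speed-up the text describes.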

(Embodiment) An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 shows the configuration of the speech recognition device according to this embodiment. The A/D conversion unit 1 performs A/D conversion on a speech signal input from, for example, a speech input device (not shown) or a public telephone line, at each predetermined analysis time interval (frame). The band-pass filter group (BPF) 2 serves as the feature extraction unit: it performs spectral analysis on the input speech, extracts the output values of the filters as a feature vector X, stores the vector in an internal pattern memory, and outputs it to the phoneme recognition unit 3 and the segment management unit 4. The phoneme recognition unit 3 computes the similarity or distance between the pattern vector X obtained at each analysis time interval and the training patterns of each phoneme category in the phoneme dictionary 5, and obtains a score for each phoneme label in every frame. The similarity can be computed using the well-known Euclidean distance, a statistical distance well known in pattern recognition, the multiple similarity method, or the like.
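As a rough illustration of the per-frame scoring step, the sketch below uses the Euclidean distance, one of the options the text names. The single template vector per phoneme, the negation of the distance to turn it into a score, and the names `frame_scores` and `templates` are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def frame_scores(
    frames: List[Sequence[float]],
    templates: Dict[str, Sequence[float]],
) -> List[Dict[str, float]]:
    """For every frame, score each phoneme label as the negated distance
    to that label's template (higher score means closer)."""
    return [
        {label: -euclidean(x, ref) for label, ref in templates.items()}
        for x in frames
    ]


# Hypothetical two-channel filter-bank outputs and phoneme templates.
templates = {"A": [1.0, 0.0], "I": [0.0, 1.0]}
frames = [[0.9, 0.1], [0.2, 0.8]]
scores = frame_scores(frames, templates)
labels = [max(s, key=s.get) for s in scores]
```

Here `labels` holds the best-scoring phoneme label for each frame; the device itself keeps the full score table for the later DP matching rather than only the best label.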

A score for each phoneme is obtained here for every frame (typically 10 to 20 msec), but performing the DP matching described later on every frame would take time. The segment management unit 4 therefore treats any interval in which the change in the feature vector input from the BPF 2, that is, the spectral change, is at or below a certain threshold as belonging to the same phoneme interval, handles it as a single segment, and thereby automates segmentation.
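The segmentation rule can be sketched as below. The patent does not say how spectral change is measured, so the Euclidean distance between consecutive frames and the name `segment_frames` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def segment_frames(
    frames: List[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Group consecutive frames into segments given as (first, last) frame indices.

    A new segment starts whenever the change from the previous frame exceeds
    the threshold; changes at or below the threshold extend the current segment.
    """
    segments: List[List[int]] = []
    for i, x in enumerate(frames):
        if i == 0:
            segments.append([0, 0])
            continue
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, frames[i - 1])))
        if change > threshold:
            segments.append([i, i])
        else:
            segments[-1][1] = i
    return [(first, last) for first, last in segments]


frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 0.9]]
segments = segment_frames(frames, threshold=0.5)
```

With these five frames the DP matching would operate on two segments instead of five frames, which is the saving the text refers to.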

The word recognition unit 6 in the subsequent stage receives the resulting series of scores for each phoneme label as a phoneme sequence, matches it against the standard phoneme sequences stored in the word dictionary 7 to obtain a word recognition result, and outputs that result. The word recognition unit 6 consists of a preliminary selection unit 8 and a main selection unit 9. The preliminary selection unit 8 in turn consists of a preliminary recognition unit 10 and a post-processing unit 11. The preliminary recognition unit 10 refers to the word dictionary 7 and obtains a score for each word in real time by high-speed DP matching. The post-processing unit 11 uses the DP matching path from the preliminary recognition unit 10 to exclude words in which a phoneme deviates greatly from its expected duration, thereby limiting the number of word recognition candidates. As a result of this post-processing, the main selection unit 9 that follows receives word recognition candidates for which phoneme duration has already been taken into account to some extent. From these candidates, the main selection unit 9 obtains the word recognition result by high-accuracy DP matching that takes phoneme duration into account, and outputs it.

In the speech recognition device configured as above, suppose that the input word W is represented by the phoneme sequence [P1, P2, ..., Pn], and let Si(P) be the score of each phoneme for each segment output from the phoneme recognition unit 3 to the preliminary recognition unit 10. Here P is the category name and i is the segment number.

The preliminary recognition unit 10 uses the following DP matching to absorb variation along the time axis while selecting, for a given word, the path that maximizes the score SofW between the standard pattern and the input pattern, and takes that as the score for the word. To maximize the score SofW of each word, the optimal path is chosen according to the following recurrence.

SofW(t, j)

With this recurrence, only staircase-like paths that climb one step at a time exist. If, however, the standard pattern is represented as a network whose nodes are the phonemes contained in the word and whose edges are the permitted transitions between phonemes, paths that skip a step can also be formed (FIG. 3, step 21).
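The body of the recurrence is not legible in this record, so the sketch below is not the patent's formula. It is a generic score-maximizing DP under stated assumptions: each input segment either stays on the current reference phoneme or advances by one, the path starts on the first phoneme and ends on the last, and an optional `allow_skip` flag adds the one-step skip that the network representation permits. The names `dp_match` and `one_hot` are hypothetical.

```python
from typing import Dict, List, Tuple

NEG = float("-inf")


def dp_match(
    seg_scores: List[Dict[str, float]],
    ref: List[str],
    allow_skip: bool = False,
) -> Tuple[float, List[int]]:
    """Return (best cumulative score, reference index chosen for each segment).

    seg_scores must be non-empty; seg_scores[i][p] is the score of phoneme p
    for input segment i.
    """
    n, m = len(seg_scores), len(ref)
    S = [[NEG] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    S[0][0] = seg_scores[0][ref[0]]  # the path must start on the first phoneme
    for i in range(1, n):
        for j in range(m):
            # Predecessors: stay on j, advance from j-1, optionally skip from j-2.
            preds = [j, j - 1] + ([j - 2] if allow_skip else [])
            best = max((p for p in preds if p >= 0), key=lambda p: S[i - 1][p])
            if S[i - 1][best] > NEG:
                S[i][j] = S[i - 1][best] + seg_scores[i][ref[j]]
                back[i][j] = best
    if S[n - 1][m - 1] == NEG:
        return NEG, []  # the last phoneme cannot be reached
    path = [m - 1]  # the path must end on the last phoneme
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return S[n - 1][m - 1], path[::-1]


def one_hot(label: str, labels=("H", "A", "C", "I")) -> Dict[str, float]:
    """Toy segment scores (assumption): 1.0 for the spoken phoneme, else 0.0."""
    return {p: (1.0 if p == label else 0.0) for p in labels}


segs = [one_hot(p) for p in ["H", "A", "A", "I"]]
score_hai, path_hai = dp_match(segs, ["H", "A", "I"])
score_haci, path_haci = dp_match(segs, ["H", "A", "C", "I"])
score_skip, path_skip = dp_match(segs, ["H", "A", "C", "I"], allow_skip=True)
```

In this toy case the staircase-only match against "HACI" is forced to spend one segment on "C", while the skip-enabled network bypasses it. The returned path is what the post-processing stage would inspect for phoneme durations.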

The post-processing unit 11 takes path distortion into account in the score obtained for each word and determines the word recognition candidates (steps 22 to 33 of the same figure). For example, when the standard pattern of the phoneme sequence "HAI" is matched against the speech input "HAI", the phonemes "H", "A", and "I" each have a relatively long duration and are easy to recognize, so the path giving the maximum score (for example, t1) is long in the horizontal direction, as shown in FIG. 2(a). On the other hand, consider the case where the phoneme sequence "IAI" is given as the input pattern and compared with the standard pattern "HACI". Because the input pattern contains no phoneme corresponding to "C", the path t2 in the "C" portion becomes short, as shown in FIG. 2(b). Taking into account the path length expected in advance for each phoneme in this way, the path distortion dist is computed, for example, as follows. Here dist = 0 denotes the greatest distortion and dist = 1 the least.

(1) Sounds such as A, I, U, E, O, N, X (the moraic nasal ん), Z, M, C, and S are phonemes that are relatively easy to sustain. When one of these phonemes is included in the path (23), its minimum duration D1 is read from a file (24). If the duration d of that phoneme is shorter than D1 (25), the distortion is at its maximum, dist = 0 (26).

(2) P, T, K, and the like are phonemes that are relatively hard to sustain. When one of these phonemes is included in the path (27), its maximum duration D2 is read from a file (28). If the duration d of that phoneme is longer than D2 (29), dist = 0 (26).

(3) Within a single word, the mora (syllable) length can be regarded as roughly constant. The mean and variance of the mora durations within the word are therefore computed from the path (30). With n moras of lengths m_i and mean length m, dist = 1 - sqrt(Σ_i (m_i - m)^2 / n) / m is computed (31). Because a long vowel is in most cases shorter than twice the length of a short vowel, a suitable bias is applied so that long-vowel intervals are treated as longer than they actually are.

For a word of one mora, the distortion can be computed in the same way by replacing moras with phonemes in the description above.
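The duration checks and the mora-regularity measure can be sketched as follows. The formula in this record is damaged, so the form used here (one minus the standard deviation of mora lengths divided by their mean) is a reading of it rather than a confirmed formula, and the clipping to the range 0 to 1 is an added assumption. The function names and the duration tables are hypothetical, and the long-vowel bias is omitted.

```python
import math
from typing import Dict, List, Optional


def duration_distortion(
    phoneme: str,
    d: float,
    min_dur: Dict[str, float],
    max_dur: Dict[str, float],
) -> Optional[float]:
    """Return 0.0 (maximum distortion) if duration d violates the stored bound
    for this phoneme, otherwise None (no verdict from this check)."""
    if phoneme in min_dur and d < min_dur[phoneme]:
        return 0.0  # a sustainable phoneme that came out too short
    if phoneme in max_dur and d > max_dur[phoneme]:
        return 0.0  # a hard-to-sustain phoneme that came out too long
    return None


def mora_distortion(mora_lengths: List[float]) -> float:
    """1 - (std of mora lengths / mean mora length), clipped to [0, 1].

    Returns 1.0 when all moras have equal length (least distortion).
    """
    n = len(mora_lengths)
    mean = sum(mora_lengths) / n
    std = math.sqrt(sum((m - mean) ** 2 for m in mora_lengths) / n)
    return max(0.0, min(1.0, 1.0 - std / mean))


# Hypothetical bounds, in segments: minimum for vowels, maximum for plosives.
MIN_DUR = {"A": 2.0, "I": 2.0}
MAX_DUR = {"P": 3.0, "T": 3.0}

too_short = duration_distortion("A", 1.0, MIN_DUR, MAX_DUR)
too_long = duration_distortion("P", 5.0, MIN_DUR, MAX_DUR)
acceptable = duration_distortion("A", 3.0, MIN_DUR, MAX_DUR)
even = mora_distortion([4.0, 4.0, 4.0])
uneven = mora_distortion([2.0, 6.0])
```

Keeping the hard duration checks separate from the mora measure mirrors the numbered steps in the text, where a bound violation short-circuits to dist = 0.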

The distortion scores computed in post-processing (26, 31) and the DP matching score (32) are then summed with a suitable weighting (w) (33). The result is adopted as the score of each word, and the several highest-scoring words are output to the main selection unit 9 as word recognition candidates.
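A minimal sketch of this combination step follows. The text only says the two scores are summed with a suitable weighting, so the specific form `dp + w * dist`, the default values, and the name `rank_candidates` are assumptions.

```python
from typing import Dict, List


def rank_candidates(
    dp_scores: Dict[str, float],
    distortions: Dict[str, float],
    w: float = 0.5,
    top_n: int = 3,
) -> List[str]:
    """Combine the DP score with the weighted distortion score and return the
    top_n words, best first. A distortion score of 1 means least distortion."""
    combined = {
        word: dp_scores[word] + w * distortions[word] for word in dp_scores
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_n]


# Hypothetical scores: "haci" has the best DP score but a distorted path.
dp_scores = {"hai": 0.85, "haci": 0.88, "ie": 0.40}
distortions = {"hai": 0.9, "haci": 0.0, "ie": 0.8}
candidates = rank_candidates(dp_scores, distortions, w=0.5, top_n=2)
```

In this example the distortion term moves "hai" ahead of "haci" even though its raw DP score is slightly lower, which is the effect the post-processing is meant to have.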

The main selection unit 9 reads from the word dictionary 7 the standard patterns of only the word recognition candidates it receives, and matches these standard patterns against the input pattern by DP matching. This matching takes phoneme duration into account. That is, when the phoneme duration of a phoneme category A is To, the optimal path at time t of the input pattern is determined by including in the computation all paths within a span of 2To around time t. The computation time is therefore several times that of the DP matching in the preliminary recognition unit 10, but the recognition rate improves substantially.

Because the preliminary selection unit in this device greatly limits the number of word candidates subjected to the time-consuming DP matching in the main selection unit 9, word recognition can be performed both at high speed and with high accuracy.

[Effects of the Invention] As described above, according to the present invention, dividing the word recognition unit into a preliminary selection unit and a main selection unit reduces the number of candidates handled by the main selection unit, which is accurate but time-consuming, and thereby enables word recognition that is both fast and highly accurate.

[Brief explanation of the drawing]

FIG. 1 shows a speech recognition device according to an embodiment of the present invention.

1 ... A/D conversion unit, 2 ... band-pass filter group, 3 ... phoneme recognition unit, 4 ... segment management unit, 5 ... phoneme dictionary, 6 ... word recognition unit, 7 ... word dictionary, 8 ... preliminary selection unit, 9 ... main selection unit, 10 ... preliminary recognition unit, 11 ... post-processing unit.

Applicant's agent: Patent Attorney Takehiko Suzue

FIG. 2 (axis label: input pattern)

Claims

[Claims] A speech recognition device comprising: a feature extraction unit that analyzes an input speech signal at predetermined time intervals and extracts features; a phoneme recognition unit that matches the feature information extracted by the feature extraction unit against a predetermined phoneme dictionary to obtain a phoneme sequence of the input speech; and a word recognition unit that matches the phoneme sequence obtained by the phoneme recognition unit against a preset word dictionary to obtain a word recognition result, wherein the word recognition unit comprises: a preliminary selection unit that matches the phoneme sequence obtained by the phoneme recognition unit against the standard phoneme sequences of the preset word dictionary by high-speed DP matching and selects, as word recognition candidates, a plurality of words whose matching results are good and whose deviation from the duration expected for each phoneme is small; and a main selection unit that matches the standard phoneme sequences of the word recognition candidates selected by the preliminary selection unit against the phoneme sequence of the input speech by high-accuracy DP matching that takes phoneme duration into account, and obtains a word recognition result.