JPH0412478B2

Info

Publication number: JPH0412478B2
Application number: JP57133431A
Authority: JP
Inventors: Satoru Kabasawa; Hidekazu Tsuboka; Yoshiteru Mifune
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-07-29
Filing date: 1982-07-29
Publication date: 1992-03-04
Also published as: JPS5923398A

Description

[Detailed description of the invention]

The present invention relates to a monosyllabic speech recognition device that recognizes speech.

Conventionally proposed monosyllabic speech recognition devices divide the input speech into a consonant part and a vowel part, and recognize the input speech using the average feature pattern taken around the temporal center of each part. However, it is well known that the pattern of a consonant part is often non-stationary in time, so using such an average feature pattern to identify the consonant part obscures the non-stationary characteristics of the consonant. For identifying the consonant part in particular, it is therefore desirable to use the feature patterns of multiple frames within the consonant interval. Here, a frame refers to each sampling point at which the input speech data is sampled at fixed time intervals in order to generate a feature pattern.

On the other hand, devices that recognize the input speech using the feature patterns of all frames in the speech interval have also been proposed as monosyllabic speech recognition devices. Such a device can make up for the shortcoming of the device described above. In the vowel part, however, the feature pattern stays stable over a relatively long time (for example, on the order of one hundred and several tens of msec), so recognizing the input speech from the feature patterns of every frame in the speech interval takes more time than necessary. To shorten the time required for recognition, it is desirable to perform recognition of the vowel part using an average feature pattern as described above.

In view of the above drawbacks, the present invention provides a monosyllabic speech recognition device that operates as follows. For a monosyllabic utterance, it first averages the feature patterns over a portion that can be assumed with confidence to be the vowel part, namely several frames located several frames before the end of the speech interval, to obtain an average feature pattern. It then obtains feature patterns sequentially from the beginning of the word and calculates the distance between each of them and the average feature pattern. When the frames whose distance is smaller than a threshold amount to several frames, or when such frames continue consecutively for several frames, the analysis for obtaining feature patterns is terminated and the input speech is recognized using the feature patterns already obtained. This shortens the time required for recognition and eases the processing speed demanded of the hardware.
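The procedure summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the Euclidean distance is used as the distance measure, the frames are counted cumulatively, and for brevity all per-frame feature patterns are passed in precomputed, whereas the device described here stops generating them once the count is reached.

```python
import math

def mean_pattern(patterns):
    """Element-wise mean of equally sized feature vectors."""
    n = len(patterns)
    return [sum(column) / n for column in zip(*patterns)]

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def frames_to_keep(patterns, offset, span, threshold, target):
    """Return the leading feature patterns retained for recognition.

    patterns  : per-frame feature vectors, word beginning first
    offset    : how many frames before the end the reference span starts
    span      : number of frames averaged into the reference pattern
    threshold : distance below which a frame counts as vowel-like
    target    : number of vowel-like frames that ends the analysis
    """
    start = len(patterns) - offset
    reference = mean_pattern(patterns[start:start + span])
    count = 0
    for i, pattern in enumerate(patterns):
        if euclidean(pattern, reference) < threshold:
            count += 1
            if count == target:
                return patterns[:i + 1]   # analysis ends at this frame
    return patterns                        # target never reached: keep all frames
```

With a toy utterance of 3 consonant-like frames followed by 12 vowel-like frames, `frames_to_keep(patterns, 10, 5, 0.2, 5)` keeps the first 8 frames: the 3 consonant-like frames plus the 5 vowel-like frames needed to reach the count.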

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of a monosyllabic speech recognition device according to an embodiment of the present invention.

In FIG. 1, reference numeral 1 denotes a speech section detection unit, which detects the time points corresponding to the beginning and the end of a monosyllabic speech input a and outputs a word-beginning detection signal and a word-end detection signal. Reference numeral 2 denotes a speech holding unit, which holds the data of the monosyllabic speech input a from the time it receives the word-beginning detection signal sent from the speech section detection unit 1 until the time it receives the word-end detection signal. Reference numeral 3 denotes an average feature pattern generation unit. Of the speech data held in the speech holding unit 2, it takes the data for several frames (several sampling points), for example 5 frames, starting from a point several frames before the detected word end, for example 10 frames before it, samples that data at fixed time intervals to obtain feature patterns, and averages them to generate and output an average feature pattern. Reference numeral 4 denotes a feature pattern generation unit, which samples the speech data held in the speech holding unit 2 at fixed time intervals, sequentially from the beginning of the word, to generate and output feature patterns. Reference numeral 5 denotes a feature pattern holding unit, which holds the feature patterns output by the feature pattern generation unit 4. Reference numeral 6 denotes an inter-feature-pattern distance calculation unit, which calculates the distance between the average feature pattern sent from the average feature pattern generation unit 3 and each feature pattern sent from the feature pattern generation unit 4, and outputs the resulting distance. Reference numeral 7 denotes a threshold determination unit, which compares the distance output by the inter-feature-pattern distance calculation unit 6 with a predetermined threshold and outputs a count increase signal when the distance is smaller than the threshold. Reference numeral 8 denotes a counting unit, which increments its count value by 1 each time it receives the count increase signal output by the threshold determination unit 7. When the count value reaches a predetermined value, it outputs to the feature pattern generation unit 4 a feature pattern generation end signal that stops the generation of feature patterns and, at the same time, outputs to the feature pattern holding unit 5 a feature pattern output command signal that causes the held feature patterns to be output. Reference numeral 9 denotes a speech identification unit, which identifies the input speech using the output of the feature pattern holding unit 5 and outputs a monosyllabic speech recognition result b.

The operation of the device configured as described above is explained concretely below.

First, input speech that has been low-pass filtered with a cutoff frequency of 5 kHz is A/D converted at a sampling frequency of 10 kHz. From the resulting discrete signal, the speech section detection unit 1 detects the beginning of the word, for example by using the energy level. The speech holding unit 2 starts holding the discrete signal from the time corresponding to the beginning of the word and stops holding it when the speech section detection unit 1 detects the end of the word.

Next, the average feature pattern generation unit 3 applies a Hamming window, for example of width 12.8 msec, to the discrete signal held in the speech holding unit 2 while shifting the window by 6.4 msec at a time (the frame period is then 6.4 msec). For the 5 frames of the discrete signal from the point 70.4 msec before the end of the word (10 frames before the word-end frame) to the point 32 msec before the end of the word, it obtains 14th-order PARCOR coefficients, averages them to generate an average feature pattern, and outputs the result to the inter-feature-pattern distance calculation unit 6.

Meanwhile, the feature pattern generation unit 4, like the average feature pattern generation unit 3, applies the Hamming window of width 12.8 msec to the discrete signal held in the speech holding unit 2 while shifting it by 6.4 msec at a time, generates the PARCOR coefficients sequentially from the beginning of the word at a frame period of 6.4 msec, and outputs them to the feature pattern holding unit 5 and the inter-feature-pattern distance calculation unit 6. The feature pattern holding unit 5 holds the feature pattern generated every 6.4 msec by the feature pattern generation unit 4.

The inter-feature-pattern distance calculation unit 6 calculates, for example, the Euclidean distance between the PARCOR coefficients serving as the average feature pattern and the PARCOR coefficients obtained every 6.4 msec as a feature pattern, and outputs the result to the threshold determination unit 7. The threshold determination unit 7, with the threshold set to 0.2 for example, outputs a count increase signal that increments the count of the counting unit 8 by 1 whenever the Euclidean distance is smaller than the threshold. When the count value reaches 5, for example, the counting unit 8 outputs to the feature pattern generation unit 4 a feature pattern generation end signal that stops the generation of feature patterns and, at the same time, outputs to the feature pattern holding unit 5 a feature pattern output command signal that causes the held feature patterns to be output to the speech identification unit 9. The speech identification unit 9 identifies the monosyllabic speech input a using the feature patterns obtained in this way and outputs the monosyllabic speech recognition result b.
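The framing and feature extraction of this embodiment can be illustrated with a minimal Python sketch. The sample counts follow directly from the stated figures: at 10 kHz, 12.8 msec and 6.4 msec correspond to 128 and 64 samples. The text does not say how the PARCOR coefficients are computed, so the sketch assumes the autocorrelation method with the Levinson-Durbin recursion and one common sign convention. All function names are hypothetical.

```python
import math

FS_HZ = 10_000     # sampling frequency given in the text
FRAME_LEN = 128    # 12.8 msec window at 10 kHz
FRAME_HOP = 64     # 6.4 msec shift at 10 kHz
ORDER = 14         # PARCOR order given in the text

def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, length=FRAME_LEN, hop=FRAME_HOP):
    """Cut the signal into overlapping frames and apply the window to each."""
    window = hamming(length)
    return [
        [s * w for s, w in zip(signal[start:start + length], window)]
        for start in range(0, len(signal) - length + 1, hop)
    ]

def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag (no normalisation)."""
    return [
        sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        for lag in range(max_lag + 1)
    ]

def parcor(frame, order=ORDER):
    """Reflection (PARCOR) coefficients via the Levinson-Durbin recursion.

    Assumed convention: predictor polynomial A(z) = 1 + sum a_j z^-j and
    k_m = -(r[m] + sum a_j r[m-j]) / E_(m-1). Other texts flip the sign.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:            # silent or perfectly predicted frame
            ks.append(0.0)
            continue
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k        # prediction error shrinks at each order
        ks.append(k)
    return ks
```

Each frame then yields one 14-element vector, which plays the role of the feature pattern compared against the average feature pattern.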

FIGS. 2, 3 and 4 show, for the monosyllabic utterances "a" (ア), "sa" (サ) and "ta" (タ) respectively, the calculated Euclidean distance between the first half of the speech interval and the average feature pattern. "FRAME" indicates the frame number for each syllable, and "DIST" indicates the Euclidean distance value. In these figures, the frame at which the count value reaches 5 with the threshold set to 0.2 is underlined. The feature patterns used are those from the beginning of the word up to the underlined frame, that is, 5 frames from the beginning for "a" (FIG. 2), 20 frames for "sa" (FIG. 3) and 15 frames for "ta" (FIG. 4). Compared with using the feature patterns of all frames from the beginning to the end of the word, the amount of feature pattern data is smaller. Consequently, not only is the storage capacity required of the device reduced, but the amount of processing for identifying the input speech is also reduced, so that the recognition result can be obtained in a shorter time.
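The retained frame counts can be turned into rough data sizes with a small calculation. The sketch below assumes that the figures were produced with the 14th-order PARCOR coefficients, 12.8 msec window and 6.4 msec frame period of the embodiment, and that the first frame starts at the beginning of the word. The total number of frames per utterance is not given in the text, so no reduction ratio is computed. The names are hypothetical.

```python
FRAME_PERIOD_MS = 6.4   # frame period given in the embodiment
WINDOW_MS = 12.8        # Hamming window width given in the embodiment
ORDER = 14              # PARCOR order given in the embodiment

# Retained frame counts reported for FIGS. 2 to 4.
RETAINED_FRAMES = {"a": 5, "sa": 20, "ta": 15}

def retained_values(frames, order=ORDER):
    """Number of coefficients stored for the retained frames."""
    return frames * order

def analysed_span_ms(frames):
    """Length of signal covered by the retained frames (assumed layout)."""
    return round((frames - 1) * FRAME_PERIOD_MS + WINDOW_MS, 1)

for syllable, frames in RETAINED_FRAMES.items():
    print(syllable, retained_values(frames), analysed_span_ms(frames))
```

Under these assumptions the three syllables need 70, 280 and 210 stored coefficients, covering about 38.4, 134.4 and 102.4 msec of signal respectively.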

FIGS. 5, 6 and 7 show, for the monosyllabic utterances "a", "sa" and "ta" respectively, the change over time of the spectral envelope obtained by linear prediction when the threshold is 0.2 and the count value is 5. The spectral characteristics of the consonant part and of the vowel part of each syllable can be grasped concisely.

As described above, according to this embodiment, the feature pattern generation unit 4 obtains feature patterns from the beginning of the monosyllabic speech a, and the average feature pattern generation unit 3 obtains an average feature pattern from near the end of the monosyllabic speech a. The inter-feature-pattern distance calculation unit 6 then obtains the distance between each feature pattern and the average feature pattern. When the threshold determination unit 7 finds that frames whose distance is smaller than the predetermined threshold of 0.2 have occurred five times in succession, the analysis for obtaining feature patterns is terminated, and the speech identification unit 9 recognizes the speech from the feature patterns in the feature pattern holding unit 5. Speech recognition can thus be performed in a short time.

In this embodiment, PARCOR coefficients are used as the feature pattern and the Euclidean distance is used as the measure of distance between feature patterns. However, the feature pattern may be anything that can express the characteristics of the input speech, for example the outputs of a filter bank, and the invention is equally effective with various other distance measures, for example the city-block distance or the cosh measure.
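To make the choice of distance measure concrete, the following Python sketch places the Euclidean distance of the embodiment next to the city-block distance named as an alternative. The function names are hypothetical. The cosh measure is left out because it is defined on spectra rather than directly on coefficient vectors. Since the measures produce values on different scales, a threshold such as 0.2 chosen for one would presumably need to be re-chosen for the other.

```python
import math

def euclidean(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def city_block(u, v):
    """Sum of the absolute differences (also called Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def is_vowel_like(pattern, reference, threshold, distance=euclidean):
    """Threshold decision with a pluggable distance measure."""
    return distance(pattern, reference) < threshold
```

For the vectors `[0, 0]` and `[3, 4]`, the Euclidean distance is 5 and the city-block distance is 7, which shows how the same pair of patterns can fall on different sides of a fixed threshold.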

Furthermore, the threshold determination unit 7 may be configured to output to the counting unit 8 a count value clear signal that clears the count value of the counting unit 8 when the distance between feature patterns is not smaller than the threshold, and a count value increase signal that increments the count value of the counting unit 8 by 1 when the distance is smaller than the threshold. In that case, the monosyllabic speech input is identified using the feature patterns up to the point at which frames whose inter-pattern distance is smaller than the threshold have continued consecutively for the number of frames corresponding to the count value.
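The difference between counting with and without the clear signal can be shown with a short Python sketch. The function name and the `consecutive` flag are hypothetical, and the distance values in the usage note are made up for illustration, not taken from the figures.

```python
def stop_frame(distances, threshold, target, consecutive):
    """Index of the frame at which analysis ends, or None if never reached.

    distances   : per-frame distance to the average feature pattern
    threshold   : a frame counts when its distance is below this value
    target      : count value that ends the analysis
    consecutive : if True, a frame at or above the threshold clears the count
    """
    count = 0
    for i, d in enumerate(distances):
        if d < threshold:
            count += 1            # count increase signal
        elif consecutive:
            count = 0             # count value clear signal
        if count == target:
            return i
    return None
```

With `distances = [0.5, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1, 0.1]`, a threshold of 0.2 and a target of 3, cumulative counting stops at frame index 4, while the consecutive variant is reset by the value 0.3 and stops at frame index 6.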

As described above, the present invention is configured so that, for a part with a temporally non-stationary feature pattern such as the consonant part, all the feature patterns of the non-stationary part are obtained, whereas for a part that is more stationary than the consonant part and has a long duration, such as the vowel part, not all of the feature patterns are obtained. This not only shortens the analysis time for generating feature patterns but also shortens the recognition time in speech identification, and further reduces the amount of standard patterns needed for identification. The processing speed required of the device is thereby relaxed and the storage capacity reduced at the same time, so the industrial value of the invention is considerable.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a monosyllabic speech recognition device according to an embodiment of the present invention. FIGS. 2, 3 and 4 show the calculated Euclidean distance between the feature patterns and the average feature pattern. FIGS. 5, 6 and 7 are waveform diagrams of the spectral envelope obtained by linear prediction.

1: speech section detection unit, 2: speech holding unit, 3: average feature pattern generation unit, 4: feature pattern generation unit, 5: feature pattern holding unit, 6: inter-feature-pattern distance calculation unit, 7: threshold determination unit, 8: counting unit, 9: speech identification unit.

Claims

[Scope of Claims]

1. A monosyllabic speech recognition device comprising: speech section detecting means for detecting a monosyllabic speech section; speech holding means for holding the detected monosyllabic speech; feature pattern generating means for generating feature patterns of the held monosyllabic speech at regular intervals; average feature pattern generating means for generating an average feature pattern by averaging the feature patterns of a specific plurality of frames counted from the end of the monosyllabic speech held by the speech holding means; inter-feature-pattern distance calculating means for sequentially calculating the distance between the average feature pattern and the feature patterns of the speech; threshold determining means for comparing the distance between the feature patterns with a predetermined threshold to determine which is larger; counting means for counting the number of frames in which the distance between the feature patterns is smaller than the threshold and generating a signal when the counted value reaches a predetermined value; feature pattern holding means for holding the feature patterns generated at regular intervals by the feature pattern generating means until the signal generated by the counting means arrives; and speech identification means for identifying the input monosyllabic speech using the feature patterns in the feature pattern holding means, wherein the inter-feature-pattern distance calculating means calculates the distance sequentially from the beginning of the monosyllabic speech section detected by the speech section detecting means, the threshold determining means compares the distance with the threshold, the counting means counts the number of frames in which the distance is determined by the threshold determining means to be smaller, and, at the point when the number of frames counted by the counting means reaches the predetermined value and the signal is generated, the input monosyllabic speech is recognized using the feature patterns obtained up to that point.