JP2574242B2

JP2574242B2 - Voice input device

Info

Publication number: JP2574242B2
Application number: JP61138537A
Authority: JP
Inventors: 哲樺澤
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1986-06-13
Filing date: 1986-06-13
Publication date: 1997-01-22
Anticipated expiration: 2012-01-22
Also published as: JPS62294297A

Description

Detailed Description of the Invention

Field of Industrial Application

The present invention relates to a voice input device that copes with variations in speaking rate.

Prior Art

A conventional voice input device of this type has the configuration shown in Fig. 2, as described for example in Takeo Watanabe, "Segmentation-free Syllable Recognition in Continuously Spoken Japanese," ICASSP-83, pp. 320-323, 1983.

That is, the device consists of a voice input terminal 31; a feature extraction unit 32 that converts the input speech signal into an input pattern consisting of a sequence of feature vectors; a syllable standard pattern storage unit 33 that stores syllable standard patterns; a vowel identification unit 34 that detects and identifies the vowel portions of the input pattern; a pattern matching unit 35 that uses DP matching to compute, with time-axis warping, the inter-pattern distance between partial patterns of the input pattern and the syllable standard patterns; a syllable string determination unit 36 that determines the syllable string of the input speech by finding the sequence of syllable standard patterns that gives the minimum cumulative inter-pattern distance obtained by the pattern matching unit 35; and a recognition result output terminal 37. The device identifies the vowel portions of the input speech, computes the inter-pattern distance between the partial pattern for each vowel portion and the syllable standard patterns whose vowel matches the identified vowel, and recognizes the input speech by taking the sequence of syllable standard patterns with the minimum cumulative distance as the syllable string of the input speech.

Problems to Be Solved by the Invention

However, when input speech is recognized with a speech recognition device of this configuration, the time axis is warped by DP matching, yet if the speaking rate of the input speech is extremely slow or extremely fast, spurious vowel portions are inserted, or vowel portions go undetected and are dropped, during vowel identification, and recognition accuracy deteriorates.

Accordingly, in the present invention, when the speaking rate of the input speech is extremely slow or fast, the recognition device issues the speaker an instruction such as "Please speak a little faster" or "Please speak a little more slowly," thereby keeping variations in the speaker's speaking rate as small as possible and preventing deterioration of recognition accuracy.

Means for Solving the Problems

The technical means of the present invention for solving the above problem is the provision of an utterance speed instruction unit that gives instructions concerning speaking rate in order to control the speaking rate.

Operation

This technical means operates as follows.

That is, when the speaking rate is extremely slow or fast, the utterance speed instruction unit issues an instruction such as "Please speak a little faster" or "Please speak a little more slowly" to correct the speaking rate, so that the speaking rate can be kept constant.

As a result, the recognition device can prevent the insertion of extra syllables that arises when the speaking rate is extremely slow and the dropping of syllables that arises when the speaking rate is extremely fast, and can thus prevent the deterioration of recognition accuracy caused by variations in speaking rate.

Embodiment

An embodiment of the present invention is described below, but first a word speech recognition device based on pattern matching is described. The general configuration of such a device is as follows.

It consists of: feature extraction means that converts the input speech signal into a sequence of feature vectors by a filter bank, frequency analysis, LPC analysis, or the like; standard pattern storage means in which the feature vector sequences extracted by the feature extraction means from utterances made in advance are registered as standard patterns for all the words to be recognized; pattern comparison means that computes the similarity or distance, as feature vector sequences, between the input pattern extracted by the feature extraction means from an utterance to be recognized and every standard pattern stored in the standard pattern storage means; and decision means that outputs, as the recognition result, the word corresponding to the standard pattern with the highest similarity (smallest distance) in the pattern comparison.
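The decision step of this general configuration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `recognize_word` and `naive_distance` and the word labels are hypothetical, and `naive_distance` is a toy stand-in that does no time warping, so it is not the pattern comparison the text goes on to describe.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Pattern = List[Vector]  # a sequence of feature vectors, one per frame


def naive_distance(a: Pattern, b: Pattern) -> float:
    """Toy stand-in (assumption): frame-by-frame city-block distance
    over the shorter length, plus a penalty for the length difference."""
    n = min(len(a), len(b))
    total = sum(abs(x - y) for k in range(n) for x, y in zip(a[k], b[k]))
    return total + abs(len(a) - len(b))


def recognize_word(
    input_pattern: Pattern,
    templates: Dict[str, Pattern],
    pattern_distance: Callable[[Pattern, Pattern], float],
) -> str:
    """Return the word whose standard pattern has the smallest distance
    to the input pattern (the role of the decision means)."""
    return min(
        templates,
        key=lambda word: pattern_distance(input_pattern, templates[word]),
    )


# Hypothetical one-dimensional features for two registered words.
templates = {"word_a": [[0.0], [1.0]], "word_b": [[5.0], [6.0]]}
result = recognize_word([[0.1], [0.9]], templates, naive_distance)
```

Passing the distance function as a parameter keeps the decision logic separate from the comparison method, so a time-warping distance can be substituted without changing `recognize_word`.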

Even when the same speaker utters the same word, the duration of the utterance differs each time. When the pattern comparison means compares a standard pattern with the input pattern, the time axes of the two must therefore be stretched or compressed so that the pattern lengths match. Because the change in duration does not occur uniformly across the parts of the uttered word, the parts must be stretched or compressed non-uniformly.

Fig. 3 illustrates this. In Fig. 3(a), the horizontal axis is the i coordinate corresponding to the input pattern A = a_1, a_2, ..., a_I (a_i is the feature vector of the i-th frame of the input pattern), and the vertical axis is the j coordinate corresponding to the standard pattern R^n = r^n_1, r^n_2, ..., r^n_{J_n} (r^n_j is the feature vector of the j-th frame of the standard pattern R^n). Matching the input pattern A against the standard pattern R^n with nonlinear time-axis warping means finding, on this lattice graph, the path 1 that specifies the correspondence between the feature vectors of the two patterns under the criterion that the distance between the two patterns as sequences is minimized, and taking the distance obtained at that minimum as the distance between the two patterns. A well-known way to perform this computation efficiently uses dynamic programming and is called DP matching.

When this path is determined, constraints are imposed that reflect the nature of speech. Fig. 3(b) is an example of a path selection condition called a slope constraint. In this example, the path reaching point (i, j) can only be one of three: the path from point (i-2, j-1) through point (i-1, j), the path from point (i-1, j-1), or the path from point (i-1, j-1) through point (i, j-1). If the further condition is imposed that the start points and the end points of the input pattern and the standard pattern must correspond, the matching path is confined to the hatched region of Fig. 3(a). This constraint prevents overly extreme alignments, on the grounds that however much the time axis may stretch or shrink, the same word should not stretch or shrink that drastically.

The distance between the two sequences is defined as the weighted average, along the path, of the inter-vector distances d^n(i, j) between the input vector a_i and the standard pattern vector r^n_j. If the sum of the weights along the path is made constant regardless of which path is chosen, the DP matching technique can be used.
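The following Python sketch shows one way DP matching with a slope constraint can be computed. It rests on several assumptions that are not stated in the text: the path weights (2 for a diagonal step, 1 for a horizontal or vertical step) are those of the commonly used symmetric form, since the weights of Fig. 3(b) are not reproduced here; the third predecessor is taken as (i-1, j-2) via (i, j-1), the mirror image of the first path; and the inter-vector distance is a city-block distance. The names `city_block` and `dp_matching` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def city_block(a: Vector, r: Vector) -> float:
    """Assumed inter-vector distance d(i, j): sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, r))


def dp_matching(A: List[Vector], R: List[Vector], dist=city_block) -> float:
    """Distance between input pattern A and standard pattern R under a
    slope constraint, with the start and end points forced to correspond.

    With the assumed weights, the weight sum is I + J on every allowed
    path, so dividing by I + J gives the weighted average along the path.
    Returns math.inf if no allowed path joins the two end points.
    """
    I, J = len(A), len(R)
    d = [[dist(A[i], R[j]) for j in range(J)] for i in range(I)]
    D = [[math.inf] * J for _ in range(I)]
    D[0][0] = 2 * d[0][0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i >= 1 and j >= 1:  # diagonal step from (i-1, j-1)
                best = min(best, D[i - 1][j - 1] + 2 * d[i][j])
            if i >= 2 and j >= 1:  # from (i-2, j-1) through (i-1, j)
                best = min(best, D[i - 2][j - 1] + 2 * d[i - 1][j] + d[i][j])
            if i >= 1 and j >= 2:  # from (i-1, j-2) through (i, j-1)
                best = min(best, D[i - 1][j - 2] + 2 * d[i][j - 1] + d[i][j])
            D[i][j] = best
    return D[I - 1][J - 1] / (I + J)
```

Under these assumptions a pattern more than twice as long as the other cannot be aligned at all, which is the effect the slope constraint is meant to have: `dp_matching([[0.0], [0.0], [1.0], [1.0]], [[0.0], [1.0]])` returns `math.inf`.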

Fig. 4 illustrates the matching between an input pattern and a word standard pattern formed by concatenating monosyllable speech standard patterns. In the figure, R^{q(1)}, R^{q(2)}, R^{q(3)} denote the standard patterns of the monosyllables q(1), q(2), q(3), and the example shows the standard pattern of a word consisting of the monosyllables q(1), q(2), q(3) being matched against the input pattern. Following the description above, the matching path becomes, for example, path 2.
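Building a word standard pattern from monosyllable standard patterns amounts to concatenating their frame sequences in syllable order. A minimal Python sketch, in which the function name `build_word_pattern`, the syllable labels, and the feature values are all hypothetical:

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Pattern = List[Vector]


def build_word_pattern(
    syllables: Sequence[str],
    syllable_patterns: Dict[str, Pattern],
) -> Pattern:
    """Concatenate the standard patterns R^{q(1)}, R^{q(2)}, ... of the
    syllables q(1), q(2), ... into one word standard pattern."""
    word: Pattern = []
    for q in syllables:
        word.extend(syllable_patterns[q])
    return word


# Hypothetical monosyllable standard patterns with 1-D features.
syllable_patterns = {
    "q1": [[0.0], [0.1]],
    "q2": [[1.0]],
    "q3": [[2.0], [2.1], [2.2]],
}
word_pattern = build_word_pattern(["q1", "q2", "q3"], syllable_patterns)
```

The resulting sequence can then be matched against the input pattern like any other standard pattern.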

An embodiment of the present invention that uses the pattern matching technique described above is now described.

Fig. 1 is a block diagram showing an embodiment of the present invention. In the figure, 1 is an input terminal for the speech signal, and 2 is a feature extraction unit, built from a filter bank or the like, that converts the input speech signal into a sequence of feature vectors. 3 is a syllable standard pattern storage unit, which stores the standard pattern of each syllable converted into a sequence of feature vectors. Here the speech standard patterns may be defined as monosyllable standard patterns only, or, to take account of the coarticulation that arises when monosyllables are uttered continuously (the phenomenon in which the feature vectors of a continuously uttered monosyllable change, relative to those of the same monosyllable uttered in isolation, under the influence of the preceding and following sounds), as monosyllable standard patterns plus VCV syllable standard patterns (V: vowel, C: consonant). The following description assumes monosyllable standard patterns only. When both monosyllable and VCV syllable standard patterns are defined, monosyllable standard patterns alone suffice for recognizing monosyllables, but for word recognition VCV syllables can be used in addition to monosyllable standard patterns, which resolves the coarticulation problem.

4 is an inter-vector distance calculation unit, which calculates the inter-vector distance d^n(i, j) between the vector r^n_j constituting the standard pattern R^n in the syllable standard pattern storage unit 3 and the vector a_i constituting the input pattern A. Writing a_i = (a_i1, a_i2, ..., a_il), d^n(i, j) is most simply given by [equation not reproduced in the source text]. 5 is an inter-vector distance storage unit, which stores the results calculated by the inter-vector distance calculation unit 4. 6 is a cumulative distance calculation unit. For example, in Fig. 5, at the i-th frame and for n = 1, 2, ..., N (N is the number of syllable standard patterns), it reads from the inter-vector distance storage unit 5 the inter-vector distances d^n(i, j) between each vector r^n_j of the speech standard patterns and the vector a_i of the i-th frame of the input pattern A = a_1, a_2, ..., a_I, and computes the cumulative inter-vector distance between the concatenated pattern of R^{q(1)}, R^{q(2)}, R^{q(3)} and a_i. Adopting Fig. 3(b) as the constraint on the matching path, and taking the numbers attached to the paths in that figure as the weighting coefficients along each path, the cumulative distance D^n(i, j) for the standard pattern R^n at coordinates (i, j) is given as follows: [equation not reproduced in the source text].

7 is a syllable string determination unit, which outputs the syllable string giving the minimum final cumulative distance as the recognition result from the recognition result output terminal 9 via the mode selection unit 8. 8 is a mode selection unit, which switches between two modes, a voice input mode and a correction mode. 9 is the recognition result output terminal.

10 is a correction unit, consisting for example of a keyboard. When the speaker finds an error in the recognition result, the speaker switches the mode selection unit 8 to the correction mode and corrects the erroneous part. The corrected result is output from the recognition result output terminal 9 via the mode selection unit 8 and is also output to the instruction activation unit 13. The mode selection unit 8, for example, sounds a buzzer if speech is input in the correction mode, prompting the speaker to switch to the voice input mode. That is, the mode selection unit 8 is switched to the correction mode only when the recognition result needs correction, and is normally in the voice input mode. When switched to the correction mode, the mode selection unit 8 issues a correction mode signal to the instruction activation unit 13.

11 is a standard utterance duration storage unit, which stores standard utterance durations corresponding to numbers of syllables. 12 is a speech duration calculation unit, which obtains the speech duration t_1 by, for example, applying a threshold to the time series of energy levels obtained from the feature vector sequence: it first detects the start of the input speech and begins counting the speech duration, then stops counting when it detects the end. The speech duration t_1 is sent to the instruction activation unit 13.

13 is an instruction activation unit. While the correction mode signal is being input, it counts the number of syllables in the recognition result corrected by the correction unit 10 (the corrected result), reads from the standard utterance duration storage unit 11 the standard utterance duration t_2 corresponding to that number of syllables, and computes the ratio α (= t_2/t_1) to the speech duration t_1. For thresholds TH_l and TH_h (TH_l < TH_h), if α < TH_l it outputs to the utterance speed instruction unit 14 a signal S_1 that triggers the instruction "Please speak a little more slowly," and if α > TH_h it outputs to the utterance speed instruction unit 14 a signal S_2 that triggers the instruction "Please speak a little faster." If TH_l ≤ α ≤ TH_h, no signal is generated.

14 is an utterance speed instruction unit. When the signal S_1 is input from the instruction activation unit 13, it outputs the instruction "Please speak a little more slowly" from the instruction output terminal 15, and when the signal S_2 is input from the instruction activation unit 13, it outputs the instruction "Please speak a little faster" from the instruction output terminal 15. 15 is the instruction output terminal.
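The duration measurement and the rate check can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the function names, the energy values, the frame period, the table of standard durations, and the thresholds are all hypothetical. The sketch follows the wording of the claim, taking the ratio as measured speech duration over standard utterance duration, so that a ratio above the range (speech too long) prompts faster speech; the embodiment text writes α = t_2/t_1, so the orientation of the ratio should be treated as an assumption.

```python
from typing import Dict, Optional, Sequence


def speech_duration(energies: Sequence[float], threshold: float,
                    frame_period_ms: int) -> int:
    """Duration in ms between the first and last frame whose energy
    exceeds the threshold (a simple start/end detector)."""
    voiced = [k for k, e in enumerate(energies) if e > threshold]
    if not voiced:
        return 0
    return (voiced[-1] - voiced[0] + 1) * frame_period_ms


def rate_instruction(measured_ms: float, standard_ms: float,
                     th_low: float, th_high: float) -> Optional[str]:
    """Return an instruction when measured/standard leaves [th_low, th_high]."""
    if standard_ms <= 0:
        return None
    ratio = measured_ms / standard_ms  # assumed orientation, per the claim
    if ratio > th_high:
        return "Please speak a little faster"
    if ratio < th_low:
        return "Please speak a little more slowly"
    return None


# Hypothetical table: standard utterance duration (ms) per syllable count.
standard_duration_ms: Dict[int, int] = {1: 200, 2: 400, 3: 600}

energies = [0.0, 0.0, 5.0, 6.0, 7.0, 0.0, 0.0]
t1 = speech_duration(energies, threshold=1.0, frame_period_ms=10)
message = rate_instruction(t1, standard_duration_ms[1], 0.5, 2.0)
```

Here the measured duration is 30 ms against a 200 ms standard, a ratio of 0.15, so the sketch asks the speaker to slow down. In the device described above this check runs only after a recognition result has been corrected in the correction mode, using the syllable count of the corrected result.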

As described above, according to this embodiment, when the recognition result has been corrected by the correction unit, the utterance speed instruction unit gives the speaker an instruction concerning speaking rate as needed. This controls the speaker's speaking rate and keeps it as constant as possible, prevents the insertion of extra syllables that arises when the speaking rate is extremely slow and the dropping of syllables that arises when the speaking rate is extremely fast, and thus prevents the deterioration of recognition accuracy caused by variations in speaking rate.

In this embodiment, the instruction activation unit 13 issues an instruction activation signal to the utterance speed instruction unit 14 when, in the correction mode, the ratio α of the speech duration to the standard utterance duration does not lie between the thresholds TH_l and TH_h. Alternatively, a vowel steady-part detection unit that detects the number (β) of vowel steady parts from the feature vectors may be provided, and the instruction activation unit may compare β with the number of syllables (γ) in the corrected result obtained from the correction unit 10, generating the signal S_1 when β < γ and the signal S_2 when β > γ. In that case the speech duration calculation unit 12 and the standard utterance duration storage unit 11 are unnecessary, but the vowel steady-part detection unit is required.
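The alternative check reduces to a comparison of two counts. A minimal Python sketch, in which the function name and the signal labels as strings are hypothetical and the detection of vowel steady parts itself is outside the sketch:

```python
from typing import Optional


def signal_from_vowel_count(beta: int, gamma: int) -> Optional[str]:
    """Compare the number of detected vowel steady parts (beta) with the
    number of syllables in the corrected result (gamma).

    Fewer steady parts than syllables suggests vowels were dropped, so
    S1 (speak more slowly) is returned; more suggests spurious vowels,
    so S2 (speak faster) is returned. Equal counts give no signal.
    """
    if beta < gamma:
        return "S1"
    if beta > gamma:
        return "S2"
    return None
```

For example, `signal_from_vowel_count(2, 3)` yields `"S1"`, matching the case in which a three-syllable word was spoken so quickly that only two vowel steady parts were detected.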

The functions of the components of the embodiment described above can also be implemented in software.

Effects of the Invention

In the voice input device of the present invention, when a recognition result has been corrected by the correction unit in the correction mode and the standard utterance duration for the corrected result differs greatly from the speech duration of the input speech, the utterance speed instruction unit gives the speaker an instruction concerning speaking rate. This controls the speaker's speaking rate, prevents the insertion of extra syllables that arises when the speaking rate is extremely slow and the dropping of syllables that arises when the speaking rate is extremely fast, and thus prevents the deterioration of recognition accuracy caused by speaking rate.

[Brief description of the drawings]

Fig. 1 is a block diagram showing an embodiment of the present invention, Fig. 2 is a block diagram showing a conventional example, Figs. 3(a) and 3(b) are diagrams explaining the principle of DP matching, and Fig. 4 is a diagram explaining the principle of recognizing input speech using syllable standard patterns in the embodiment of the present invention.

2: feature extraction unit; 3: syllable standard pattern storage unit; 4: inter-vector distance calculation unit; 5: inter-vector distance storage unit; 6: cumulative distance calculation unit; 7: syllable string determination unit; 8: mode selection unit; 10: correction unit; 11: standard utterance duration storage unit; 12: speech duration calculation unit; 13: instruction activation unit; 14: utterance speed instruction unit.

Claims

(57) [Claims]

A voice input device comprising:
feature extraction means for converting an input speech signal into an input pattern A composed of a sequence of feature vectors (a_1, a_2, ..., a_i, ..., a_I);
syllable standard pattern storage means for storing syllable standard patterns R^n = (r^n_1, r^n_2, ..., r^n_j, ..., r^n_{J_n}) (n = 1, 2, ..., N);
inter-vector distance calculation means for calculating the inter-vector distance d^n(i, j) between each feature vector r^n_j (j = 1, 2, ..., J_n) constituting the standard pattern R^n and the feature vector a_i of the i-th frame of the input pattern A;
inter-vector distance storage means for storing the distances obtained by the inter-vector distance calculation means in association with each syllable standard pattern;
cumulative distance calculation means for calculating the cumulative distance of the inter-vector distances d^n(i, j) between the vectors constituting each standard pattern sequence, expressed as a concatenation of the standard patterns, and the input pattern;
syllable string determination means for determining a syllable string based on the result of the cumulative distance calculation means;
correction means for correcting the result of the syllable string determination means;
mode selection means for enabling switching between a voice input mode and a correction mode;
standard utterance duration storage means for storing in advance standard utterance durations corresponding to numbers of syllables;
speech duration calculation means for detecting the start and the end of the input speech and obtaining the duration between them;
an utterance speed instruction unit that issues an utterance speed instruction prompting "faster speech" when the ratio of the speech duration to the standard utterance duration is larger than a predetermined range, and an utterance speed instruction prompting "slower speech" when the ratio is smaller than the range; and
an instruction activation unit that, when the result of the syllable string determination means has been corrected by the correction means in the correction mode selected by the mode selection means, reads from the standard utterance duration storage means the standard utterance duration corresponding to the number of syllables of the corrected result and prompts an utterance speed instruction when the ratio of the speech duration to the standard utterance duration is not within the predetermined range.