JP3029654B2

JP3029654B2 - Voice recognition device

Info

Publication number: JP3029654B2
Application number: JP02243954A
Authority: JP
Inventors: 洋一竹林; 宏之坪井; 博史金澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1990-09-17
Filing date: 1990-09-17
Publication date: 2000-04-04
Anticipated expiration: 2015-04-04
Also published as: JPH04125597A

Description

DETAILED DESCRIPTION OF THE INVENTION [Purpose of the Invention] (Industrial application field) The present invention relates to a speech recognition device that can effectively improve recognition performance for word speech and continuous speech in practical use.

(Prior Art) Speech recognition is one of the key technologies for realizing a natural man-machine interface, and it has long been the subject of research and development.

However, when a person speaks to a machine (a speech recognition device) as freely as in natural conversation between people, the input includes utterances that are not directly related to the user's intention, such as the interjections "e~", "n~", and "ano~", or tongue clicks. Such utterance components, other than the speech that represents the linguistic information the user intends, are defined here as unnecessary words. It is known that when unnecessary words are attached before or after the word or sentence speech to be input, the recognition rate of current speech recognition drops markedly. Because these utterances fall outside the recognition targets, they have seldom been evaluated at the laboratory level, and they are one of the serious problems in practical use even for isolated-word speech recognition devices. That is, hesitations, and unnecessary non-linguistic sounds produced because the user is unfamiliar with the system, are often attached before or after the word the user is trying to input, which makes it difficult to detect the start and end points of the speech (the word boundaries). The same applies to phrase-level units, phrases, and continuous utterances.

The word spotting method has been proposed as one of the effective means of dealing with this problem. After analyzing the input speech, this method does not detect word boundaries as the conventional method does; instead, it continuously checks for the presence of a word at every point in time. It therefore requires far more pattern matching between input speech patterns and the recognition dictionary than the conventional method, but computing power has increased to the point where it can be implemented even on inexpensive LSIs, workstations, and personal computers.

In addition, as a way to perform speech recognition stably and accurately in noisy environments, a noise immunity learning method has been developed that adds noise artificially to the word spotting method described above and then performs learning. Unlike conventional word spotting, this method applies continuous pattern matching (word spotting) during learning as well as during recognition. It requires an even larger amount of computation than recognition does, but learning need not run in real time. By controlling the noise environment and gradually lowering the signal-to-noise ratio (S/N), speech with superimposed noise is simulated, and the resulting patterns make noise-robust speech recognition possible.

However, this approach is not effective against the unnecessary words addressed by the present invention, namely careless utterances, non-linguistic sounds, sighs, and tongue clicks, in other words speech information that is not needed for the meaning actually required. Even with the word spotting method described above, users still cannot perform speech input casually.

(Problems to be Solved by the Invention) As described above, conventional speech recognition methods do not adequately deal with utterances that the user did not intend to input, that is, unnecessary words attached before or after the word or sentence speech the user meant to input. This hinders the practical use of spoken dialogue, the most natural means of human communication.

The present invention has been made in view of the above. Its object is to provide a robust means of reducing the degradation in recognition performance caused by utterance components in the user's speech other than the speech representing the intended linguistic information (interjections, tongue clicks, hesitations, and non-linguistic sounds). This reduces the burden on the user and makes speech input easy even when the user speaks more naturally, so that dialogue between a person and a computer can proceed smoothly and efficiently.

[Configuration of the invention]

(Means for Solving the Problems) To achieve the above object, the present invention is characterized in that speech with unnecessary words (careless utterances) attached is generated artificially and used as learning speech data, and learning is performed to improve word spotting performance.

Unnecessary words come in many variations: they differ from person to person and vary with culture, ethnicity, and region. Conventional speech recognition has dealt with utterances whose content can be described as spoken language, whereas the range of sounds addressed by the present invention is extremely wide and diverse. When generating the learning speech data with unnecessary words attached, it is therefore also necessary to consider factors such as duration and the interval between the unnecessary word and the utterance to be recognized.

Accordingly, the speech recognition device of the present invention is intended to reduce the misrecognitions and spurious word detections (insertions and additions) that attached unnecessary words cause under the word spotting method. Using the simulation capability and processing power of a computer, it generates a large volume of synthesized data from a clean speech database and an unnecessary word database, performs word spotting on the synthesized data to extract learning data, and generates the recognition dictionary used for word spotting.

(Operation) According to the present invention, the speech recognition device operates stably and robustly even for user utterances with unnecessary words attached, which could not be handled before. The user can therefore interact with the computer naturally, smoothly, and comfortably, without having his or her train of thought interrupted. Because the burden on the user is lower than in conventional speech recognition applications, the range of applications, such as dialogue systems and data entry systems, expands.

(Example) An embodiment of the present invention is described below with reference to the drawings. Before describing the system, unnecessary words are briefly explained.

FIG. 1 shows a speech pattern with an unnecessary word attached: in the utterance "ano~ Sapporo", the unnecessary word "ano~" is attached in addition to "Sapporo", the word the user intended. Moreover, the interval between "ano~" and "Sapporo" is shorter than the duration of the geminate consonant (sokuon) within "Sapporo". As a result, not only do the start and end of the speech become unclear, but "ano~sa", formed by joining "ano~" with the "sa" of "Sapporo", may be treated as if it were a single word and cause a misrecognition. In short, the various unnecessary words that may be attached before or after an utterance have a large effect on speech recognition.

FIG. 2 is a schematic block diagram of a speech recognition device according to the present invention. The device consists of a recognition section, which includes a speech analysis unit that performs speech analysis and a word spotting unit that performs word spotting, and a learning section that trains the word spotting dictionary so that the device operates stably in the presence of unnecessary words. The device operates in one of two modes, recognition mode or learning mode.

As stated above, the core of the present invention is the handling of unnecessary words that the user utters unintentionally when a speech recognition device is put to practical use, but the types of unnecessary words and the ways they are uttered are diverse and vary widely. Sounds with no linguistic content at all, interjections, and the like are also difficult to express as the phoneme sequences used in ordinary speech recognition, or as a limited set of categories, so conventional speech recognition techniques cannot deal with them. The present invention exploits the simulation capability of a computer: it artificially simulates situations in which unnecessary words are attached before or after the intended speech, performs learning on the result, and automatically generates a recognition dictionary that operates robustly and stably against unnecessary words when recognition is performed by word spotting.

Next, the speech recognition system is described. As shown in FIG. 2, the system consists of a speech recognition device that includes the recognition section and a learning speech recognition device that includes the learning section required for learning. Learning is described first.

As shown in FIG. 2, a large number of utterances of the words to be recognized are collected in advance as learning speech data and held in the speech database 23, and unnecessary word speech is collected separately and prepared in advance as the unnecessary word database 24. The system is in learning mode, with the mode switch 19 connected to the lower terminal.

The speech data synthesis unit 25 then synthesizes speech data in which the unnecessary words used for learning are attached before the word to be recognized, after it, or both. Care is needed at this point: the utterances of the word to be recognized and of the unnecessary word should be controlled so that they do not overlap, and for unnecessary words that can only be attached before the intended word, only data that meets that condition should be synthesized. "Ano~", "eeto", and the like are unnecessary words attached only before a word; conversely, words attached only after are likewise attached only after the word to be recognized when the learning speech data is synthesized.
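The placement rules above can be sketched as follows. This is a minimal illustration, not the synthesis unit 25 itself: signals are plain lists of samples, and the names attach_unnecessary_word, synthesize_variants, and the side labels "before", "after", and "both" are assumptions introduced here.

```python
def attach_unnecessary_word(word, filler, side, gap):
    """Join the samples so that the word and the unnecessary word never overlap.

    gap is the number of silent samples placed between the two (hypothetical parameter).
    """
    silence = [0.0] * gap
    if side == "before":
        return filler + silence + word
    if side == "after":
        return word + silence + filler
    raise ValueError("side must be 'before' or 'after'")


def synthesize_variants(word, fillers, gap):
    """fillers is a list of (samples, allowed_side) pairs.

    allowed_side is 'before', 'after', or 'both'; only permitted placements
    are synthesized, mirroring the condition described in the text.
    """
    variants = []
    for samples, allowed in fillers:
        sides = ("before", "after") if allowed == "both" else (allowed,)
        for side in sides:
            variants.append(attach_unnecessary_word(word, samples, side, gap))
    return variants
```

For a prefix-only unnecessary word such as "ano~", the entry would carry the label "before", so no variant with the word placed after the target is ever generated.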

It is also effective to synthesize the learning data by controlling the interval between the unnecessary word and the word to be recognized, using knowledge obtained beforehand by observing spoken dialogues. For example, when "ano-" is attached to an attempt to input the city name "Sapporo", "ano-" is placed before "Sapporo" as shown in FIG. 3, and the learning speech data is synthesized with the start of the unnecessary word section ranging from -1.5 seconds to -0.2 seconds. That is, letting the utterance "Sapporo" be S_i and "ano-" be N_i, the synthesized data S_i′ is as follows.

S_i′ = S_i + α N_{i−p}

Here, α is a gain constant that sets the utterance level of the superimposed unnecessary word, and p is a variable that defines its position on the time axis. In the example above, values are chosen as appropriate within the range from -1.5 seconds to -0.2 seconds to synthesize the learning speech data.
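A minimal sketch of this superposition, assuming that i is a sample index, that p is the onset of the unnecessary word in samples relative to the start of the word (negative means earlier), and that samples outside either signal are zero. The function names and the use of a uniform random onset are assumptions for illustration; only the 12 kHz sampling rate and the -1.5 s to -0.2 s range come from the text.

```python
import random

SAMPLE_RATE = 12_000  # Hz, the sampling frequency given in the embodiment


def superimpose(word, filler, alpha, p):
    """Return S'_i = S_i + alpha * N_(i-p) over the span covered by both signals."""
    start = min(0, p)
    end = max(len(word), p + len(filler))
    out = []
    for i in range(start, end):
        s = word[i] if 0 <= i < len(word) else 0.0
        n = filler[i - p] if 0 <= i - p < len(filler) else 0.0
        out.append(s + alpha * n)
    return out


def random_onset(rng, earliest=-1.5, latest=-0.2):
    """Pick the onset of the unnecessary word in samples, within the range from the text."""
    return round(rng.uniform(earliest, latest) * SAMPLE_RATE)
```

Drawing p repeatedly with random_onset gives many training utterances that differ only in the gap between the unnecessary word and the target word.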

This processing requires knowledge of how to synthesize speech data with unnecessary words attached artificially. Simple knowledge of the kind described above is sufficient, but a more detailed simulation is also effective.

Next, an embodiment of the learning process that uses the synthesized speech data is described.

A feature of the present invention is that, because the unnecessary words are attached artificially, the learning section can use information about the relative positions of the word speech and the unnecessary word speech on the time axis. This differs markedly from conventional learning for speech recognition, which uses speech data collected in a real environment. The noise-robust speech recognition device based on the word spotting method performed learning while controlling the word speech level and the noise level, that is, the signal-to-noise ratio (S/N); the distinguishing point here is the use of information on the time axis.

The speech data obtained by the synthesis described above undergoes word spotting in the word spotting unit 22 during learning. The method is essentially the learning-type word spotting method: even in learning mode, the synthesized speech data is recognized by word spotting.

Various methods can be applied to the pattern matching in this word spotting. The embodiment described here uses the composite similarity method, which is robust against pattern deformation.

Recognition in learning mode proceeds as shown in FIG. 6. For a word to be recognized, the composite similarity value S_ij is computed for each start point i in the start-point candidate section and each end-point candidate j. When S_ij exceeds a certain threshold and takes its maximum value max S_ij within a certain section, that word is output as the recognition result.
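The decision rule can be sketched as below, assuming the similarity values S_ij for one word have already been computed and are held in a dictionary keyed by (i, j). The function name spot_word and the data layout are assumptions for illustration.

```python
def spot_word(similarities, threshold):
    """Return ((i, j), S_ij) for the maximum similarity if it exceeds the threshold.

    similarities maps (start i, end j) to the similarity S_ij of one word.
    Returns None when nothing exceeds the threshold, i.e. the word is not detected.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    if similarities[best] > threshold:
        return best, similarities[best]
    return None
```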

That is, as shown in FIG. 7, when the word recognition result for an input word A is correct in learning mode, the overall maximum similarity coincides with the maximum similarity of the input word. The learning feature vector extraction unit 26 in FIG. 2 cuts out the word feature vector corresponding to the start point t_s^A and end point t_e^A that gave this maximum similarity, and uses it in learning as a learning word feature vector.

When the word recognition result is incorrect, on the other hand, the overall maximum similarity differs from the maximum similarity of the input word A. In this case the word feature vector corresponding to the start point t_s^A and end point t_e^A that give the maximum similarity for the input word A is cut out as a learning word feature vector, and the word feature vector corresponding to the start point t_s^B and end point t_e^B that gave the maximum similarity for word B, the misrecognized word (the word with the overall maximum similarity), is also used for learning. This is shown in FIG. 8.
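The two cases (correct and incorrect recognition) can be sketched as one selection routine. It assumes that word spotting has already produced, for every dictionary word, its best similarity and the segment that gave it. The function name and the role labels "target" and "confused" are hypothetical.

```python
def select_training_segments(true_word, results):
    """Choose the segments whose feature vectors are used for learning.

    results maps each word to (max similarity, t_s, t_e) for its best segment.
    The segment of the true word is always kept. If another word scored higher
    (a misrecognition), that word's best segment is kept as well.
    """
    top_word = max(results, key=lambda w: results[w][0])
    _, t_s, t_e = results[true_word]
    picked = [(true_word, t_s, t_e, "target")]
    if top_word != true_word:
        _, t_s_b, t_e_b = results[top_word]
        picked.append((top_word, t_s_b, t_e_b, "confused"))
    return picked
```

Because the training speech is synthesized, true_word is always known, which is what makes this automatic labeling possible.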

Based on the learning feature vectors extracted in this way, the recognition dictionary creation unit 27 creates a recognition dictionary for the composite similarity method. The key point is learning that improves the detection of words to be recognized and strengthens the rejection of unnecessary words, so that the device operates stably. The dictionary created by the recognition dictionary creation unit 27 is stored in the recognition dictionary 28, which yields a dictionary robust against unnecessary words.

Next, the speech recognition device that uses the recognition dictionary 28 built in learning mode as described above is explained.

In the block diagram of FIG. 2, the device of this embodiment consists of a speech input unit 20, a speech analysis unit 21, a word spotting unit 22, and the recognition dictionary 28. The speech input unit 20 converts a speech signal input through a microphone or the like into a digital signal and outputs it to the speech analysis unit 21.

The speech input unit 20 consists of, for example, a low-pass filter that removes frequency components at or above 5.4 kHz and an A/D converter that digitizes the signal at a sampling frequency of 12 kHz with 16-bit quantization. The speech analysis unit 21 then applies a Hamming window 24 ms long to the digitized input speech signal and computes the discrete Fourier transform using the FFT (fast Fourier transform) algorithm. The power spectrum obtained by the FFT is smoothed and converted to a logarithmic scale to give a 16-dimensional frequency spectrum along the frequency axis. This processing is performed with an analysis frame period of 8 ms, so a 16-dimensional feature parameter representing the frequency spectrum is obtained every 8 ms.
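The figures in this paragraph work out to a 288-sample window (24 ms at 12 kHz) advanced by 96 samples (8 ms). The sketch below follows those numbers with a plain DFT for clarity instead of an FFT. How the power spectrum is smoothed into 16 values is not specified in the text, so averaging over 16 equal-width bands is an assumption, as are all function names.

```python
import cmath
import math

SAMPLE_RATE = 12_000                      # Hz
FRAME_LEN = SAMPLE_RATE * 24 // 1000      # 24 ms window -> 288 samples
FRAME_SHIFT = SAMPLE_RATE * 8 // 1000     # 8 ms frame period -> 96 samples
N_BANDS = 16                              # dimensions along the frequency axis


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Power at each of the first n/2 DFT bins (direct DFT, O(n^2))."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2)
    ]


def analyze(signal):
    """Return one 16-dimensional log-spectrum vector per 8 ms frame."""
    window = hamming(FRAME_LEN)
    features = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = [s * w for s, w in zip(signal[start:start + FRAME_LEN], window)]
        power = power_spectrum(frame)
        width = len(power) // N_BANDS     # 144 bins / 16 bands = 9 bins per band
        bands = [sum(power[b * width:(b + 1) * width]) / width for b in range(N_BANDS)]
        features.append([math.log(p + 1e-10) for p in bands])  # floor avoids log(0)
    return features
```

With these settings each band is 375 Hz wide, so the 16 bands together cover 0 to 6 kHz.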

The speech analysis method is not limited to the FFT analysis described above; other analysis methods such as LPC analysis, cepstrum analysis, and filter analysis can also be used.

Using the speech feature parameters described above, the word spotting unit 22 performs pattern matching against the recognition dictionary continuously along the time axis, without fixing the start and end points. The feature parameters have 16 dimensions along the frequency axis. The word feature vector used for pattern matching is formed by assuming a start point t_s and an end point t_e of the word and taking the speech feature parameters at 12 points on the time axis obtained by dividing the interval between the two end points equally. That is, a vector of 16 × 12 = 192 dimensions (16 along the frequency axis and 12 along the time axis) is used as the word feature vector, and pattern matching against the recognition dictionary 28 is performed continuously at every frame period (for example, every 8 ms).
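A minimal sketch of building the 192-dimensional word feature vector from a sequence of 16-dimensional frames. Taking the nearest frame at each of the 12 equally spaced positions is an assumption; the text only says that the interval is divided equally. The function name is hypothetical.

```python
def word_feature_vector(frames, t_s, t_e, n_points=12):
    """Concatenate the frames at n_points equally spaced positions from t_s to t_e.

    frames is a list of per-frame feature vectors (16 values each in the embodiment),
    so the result has 16 * 12 = 192 values.
    """
    vector = []
    for k in range(n_points):
        t = t_s + round(k * (t_e - t_s) / (n_points - 1))
        vector.extend(frames[t])
    return vector
```

Because the number of sampling points is fixed, words of different durations all map to vectors of the same length, which is what allows a single dictionary pattern per word.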

The word feature vector used here is a time-frequency spectrum, and word spotting by the composite similarity method based on it has already been shown to operate stably in noise. For each provisionally set end point, the word spotting unit 22 provisionally sets multiple start points that satisfy certain speech section conditions, extracts a word feature vector from each provisional speech section defined by a start point and the end point, and matches each vector against the dictionary to obtain a similarity value. This processing is repeated while shifting along the time axis by one frame period at a time.
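The search over provisional sections can be sketched as a double loop: every frame is tried as an end point, and every start point that gives an allowed duration is scored. Expressing the "speech section conditions" as a minimum and maximum length in frames is an assumption, and score is a hypothetical callback standing in for feature extraction plus dictionary matching.

```python
def word_spotting(n_frames, score, min_len, max_len):
    """Return (similarity, t_s, t_e) for the best provisional section, or None.

    score(t_s, t_e) returns the similarity of the section from frame t_s to t_e.
    """
    best = None
    for t_e in range(n_frames):                      # shift the end point frame by frame
        for length in range(min_len, max_len + 1):   # candidate start points
            t_s = t_e - length + 1
            if t_s < 0:
                break
            s = score(t_s, t_e)
            if best is None or s > best[0]:
                best = (s, t_s, t_e)
    return best
```

The cost grows with the number of frames times the number of allowed durations, which is the large amount of computation the description refers to.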

The similarity values obtained for the feature vectors are then compared, and the recognition target category with the maximum similarity, together with its speech section (start and end point information), is taken as the recognition result for the input speech. Because speech recognition by word spotting as described above does not detect the start and end points in advance, it reduces the problems caused by start and end point detection errors in conventional methods. This requires computing the similarity many times without fixed start and end points, but the resulting gain in performance is large.

Next, an example is described in which a recognition dictionary that deals with unnecessary words actively, through learning, is created so that performance against unnecessary words improves dramatically.

In the learning process of learning mode in this embodiment, as described above, unnecessary word data is attached to synthesize the learning speech data, word spotting is performed, and the learning feature vector extraction unit 26 automatically extracts the learning speech feature vectors.

When the recognition dictionary creation unit 27 creates a recognition dictionary for the composite similarity method, learning is performed through the covariance matrix K.

Here, the covariance matrix of each category is defined as follows.

[Equation image not reproduced.]

In this definition, the index denotes the category name and t denotes transposition.

The covariance matrix of each category is subjected to K-L expansion (principal component analysis) to obtain the eigenvalues λ_k and the corresponding eigenvectors, which are used for pattern matching.
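The similarity formula itself is not reproduced in this text. A form commonly given for this method (also known in English as the multiple similarity method) is S = Σ_k (λ_k / λ_1) (x · φ_k)² / ‖x‖², where φ_k are the eigenvectors and λ_1 is the largest eigenvalue. The sketch below assumes that form and assumes the eigenpairs are already available, sorted by decreasing eigenvalue; it is an illustration, not the patent's exact definition.

```python
def composite_similarity(x, eigenvalues, eigenvectors):
    """Similarity of feature vector x to one category.

    eigenvalues are sorted in decreasing order; eigenvectors[k] is the unit
    eigenvector paired with eigenvalues[k].
    """
    norm_sq = sum(v * v for v in x)
    lam1 = eigenvalues[0]
    total = 0.0
    for lam, phi in zip(eigenvalues, eigenvectors):
        projection = sum(a * b for a, b in zip(x, phi))
        total += (lam / lam1) * projection * projection
    return total / norm_sq
```

Under this form the value is 1 when x lies along the first eigenvector and falls toward 0 as x moves away from the subspace spanned by the stored eigenvectors.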

The covariance matrix K is then updated successively by an update process that uses the learning feature vectors automatically extracted from the learning speech data with unnecessary words attached before and after, and the recognition dictionary 28 is enriched through this learning. Here, α is the coefficient for the update.

[Equation image not reproduced.]
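Since the update equation is not reproduced in this text, the sketch below assumes a rank-one update of the form K ← K + α x xᵗ, which is one common way to fold a new feature vector x into a covariance matrix with an update coefficient α. Both the form and the function name are assumptions.

```python
def update_covariance(K, x, alpha):
    """Return K + alpha * x x^t as a new matrix (lists of lists)."""
    n = len(x)
    return [[K[r][c] + alpha * x[r] * x[c] for c in range(n)] for r in range(n)]
```

After a batch of such updates, the eigenvalues and eigenvectors of K would be recomputed to refresh the dictionary entry for that category.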

When the recognition dictionary creation unit 27 updates the covariance matrix, it is effective to use the recognition result information described above (the correct/incorrect decision and the start and end point information). Learning as shown in FIG. 4 is performed using the learning word feature vector obtained for input speech of category A, which strengthens the rejection capability. In particular, when the category of the input speech is A but it is misrecognized as B because of an unnecessary word, it is effective to perform learning with the feature vectors extracted in that case and update the recognition dictionary. It is especially important that the artificial synthesis of the learning speech patterns be set up to match the actual conditions of use closely.

For example, unnecessary words reflect considerable speaker-specific habits (verbal tics), so tuning the system for a specific speaker is also effective.

Combining this with noise superimposition is also effective, and can be done with the configuration shown in FIG. 5, which adds a noise database to the configuration of FIG. 2. In this case, both the attachment of unnecessary words and the superimposition of noise are simulated when the speech data is synthesized, and learning is then performed.

That is, for noise superimposition, the speech and noise data are overlaid on the time axis and the S/N is lowered gradually to generate synthesized speech data, while unnecessary words are placed before and after the speech so that they do not overlap it on the time axis, and the learning speech data is synthesized.
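The noise side of this combined synthesis can be sketched as scaling the noise so that the mixture has a chosen S/N, then repeating with progressively lower values. The specific S/N steps and the function names are assumptions for illustration; the text only says that the S/N is lowered gradually.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech so that the result has the requested S/N in dB.

    noise must be at least as long as speech; only the overlapping part is used.
    """
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def noisy_versions(speech, noise, snr_steps_db=(30, 20, 10, 5)):
    """Generate training copies with gradually decreasing S/N (hypothetical schedule)."""
    return [mix_at_snr(speech, noise, snr) for snr in snr_steps_db]
```

Unnecessary words would then be placed before or after each noisy copy without overlapping the target word.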

The learning process is the same as before, and it yields speech recognition that is stable and robust against both noise and unnecessary words.

The present invention is not limited to the embodiment described above. For example, the embodiment took isolated words as the input speech, but the invention can also be applied to the recognition of connected words, phrases, and continuous speech. Likewise, although the embodiment was described using the composite similarity method for recognition and learning, other statistical methods such as the Mahalanobis distance, the subspace method, neural networks, and HMMs (hidden Markov models) can be applied.

Furthermore, the number of dimensions of the speech feature vector used for recognition is not particularly limited.

The essence of the present invention lies in artificially simulating the careless, unintended utterances a user makes when an actual speech recognition device is used as a man-machine interface, and in learning with speech data to which those unnecessary words have been attached, in order to create the recognition dictionary for the word spotting method, that is, continuous pattern matching without fixed start and end points.

The invention can be implemented with various modifications without departing from its gist.

[Effect of the Invention]

As described above, according to the present invention, when speech recognition by word spotting is applied, the speech recognition dictionary is trained with learning speech data synthesized artificially by attaching unnecessary word speech data before and after the speech data of the words to be recognized, so that unnecessary words are rejected and stable operation is obtained. Because the attachment of hesitations, tongue clicks, breath sounds, interjections, and other highly variable sounds can be simulated artificially, a highly accurate recognition dictionary can be created even when the words to be recognized change, taking advantage of the continuing growth in computing power.

In addition, by combining this with learning for background noise, a practical speech recognition device that is robust against various disturbances can be realized, enabling a man-machine interface that makes full use of the naturalness of speech input. The practical benefit is considerable.

[Brief description of the drawings]

FIG. 1 is a diagram showing an example of a speech pattern with an unnecessary word attached.
FIG. 2 is a diagram showing the configuration according to the present invention.
FIG. 3 is a diagram explaining the simulation of attaching unnecessary words to synthesize learning speech data.
FIG. 4 is a diagram showing the procedure for creating the recognition dictionary.
FIG. 5 is a diagram showing a configuration according to the present invention that performs learning for robustness against both noise and unnecessary words.
FIG. 6 is a diagram explaining recognition in learning mode.
FIG. 7 is a diagram explaining how feature vectors are cut out when the word recognition result is correct.
FIG. 8 is a diagram explaining how feature vectors are cut out when the word recognition result is incorrect.

Continuation of front page
(56) References
JP-A-1-182898 (JP, A)
JP-A-1-158498 (JP, A)
JP-A-62-65093 (JP, A)
JP-A-59-119400 (JP, A)
JP-A-58-23097 (JP, A)
IEICE Technical Report [Speech], Vol. 89, No. 90, SP89-19, "Speaker-independent word speech recognition in noisy environments by the learning-type word spotting method", pp. 51-58 (published June 23, 1989)
IEICE Technical Report [Speech], Vol. 91, No. 96, SP91-22, "Word detection from continuous speech containing unnecessary words", pp. 33-39 (published June 21, 1991)
Proceedings of the Acoustical Society of Japan 1990 Autumn Meeting, 2-8-8, "Effects of unnecessary words on speech recognition by word spotting in noisy environments", pp. 61-62 (presented September 20, 1990)
Proceedings of the Acoustical Society of Japan 1991 Spring Meeting, 2-5-5, "A study on word spotting from dialogue speech", pp. 63-64 (published March 27, 1991)
(58) Fields searched (Int. Cl.⁷, DB name)
G10L 15/06, G10L 15/20, G10L 15/10
JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device that obtains feature parameters of input speech, performs similarity matching between the feature parameters and the data of a recognition dictionary stored in advance, and outputs a recognition result for the speech based on the matching result, characterized in that word spotting is performed using synthesized data obtained by adding unnecessary word data to data to be recognized, learning data is extracted from the synthesized data, and the learning data is used for learning of the recognition dictionary.