JPH08110790A - Sound recognizing device

Info

Publication number: JPH08110790A
Application number: JP6245365A
Authority: JP
Inventors: Yoshio Nakadai; 芳夫中台; Yutaka Nishino; 豊西野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-10-11
Filing date: 1994-10-11
Publication date: 1996-04-30

Abstract

PURPOSE: To provide a word speech recognition device that supports speaker-independent recognition and requires no registration means other than the speech input equipment, even when spoken registration is limited to a single utterance. CONSTITUTION: A spoken word is converted into a character string by speaker-independent recognition in speech recognition unit 2 and registered in advance in word dictionary 5. At recognition time, the spoken input is converted into a character string by the same procedure, this string is compared with the character strings registered in word dictionary 5, and the string with the highest similarity is output from display 8 and output terminal 9 as the recognition result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that builds a word dictionary of character strings or speech patterns by learning and recognizes input speech using that word dictionary.

[0002]

[Prior Art] Research and development on speech recognition has long been active, but few systems have succeeded commercially. One reason is that the recognition performance obtained is not high enough to justify the effort of registering the speech dictionary used for recognition. There are two ways to generate a speech dictionary, that is, a set of templates. In the first, a pattern acquired in the same way as the input speech pattern is used directly as the speech dictionary (structural pattern matching). In the second, many patterns acquired in the same way as the input speech pattern are collected to build a statistical speech model, and the input speech pattern is tested for how well it fits that model (statistical pattern matching). DP matching is a typical example of structural pattern matching; recognition methods based on HMMs (hidden Markov models) and neural networks are examples of statistical pattern matching.

[0003] The advantage of structural pattern matching is that an input speech pattern can be used directly as the recognition dictionary. The device can therefore be used for a given utterance (for example, a spoken word) immediately after that utterance has been registered once. However, because the quality of the registered speech pattern is reflected directly in recognition performance, the speaker must take great care when speaking during registration, and a failed registration causes a sharp drop in the recognition rate. In addition, a speech dictionary designed to extend to speaker-independent recognition becomes large.

[0004] Statistical pattern matching, by contrast, has the speaker input the speech to be recognized many times and learns from it statistically, so that a small number of speech patterns can cover most variations in speech (fluctuations in timing, in frequency characteristics, and over time). It therefore maintains high recognition performance and extends easily to speaker-independent recognition. However, high recognition performance cannot be obtained when the speech to be recognized has been registered only a few times.

[0005] What is desired, as a technology that users can readily adopt, is therefore a registration procedure that achieves recognition performance comparable to statistical pattern matching with a registration operation as simple as that of structural pattern matching. In other words, a speech recognition device is needed that is based on statistical pattern matching and can be used immediately after a single registration operation. One example is the device presented in Hirayama, Yoshida, and Hattori, "Speaker-independent word recognition by DSP," Proceedings of the 1994 IEICE Spring Conference, paper A-261. This device has a built-in speaker-independent speech recognition dictionary in half-syllable (demisyllable) units based on statistical pattern matching. When a label name corresponding to the speech pattern to be recognized, for example a word, is entered as a character string, a speaker-independent speech pattern is generated automatically from the character string and used as the word speech pattern. With such a device, a speaker-independent recognition dictionary with high recognition accuracy can be built from a single character-string entry.

[0006]

[Problem to be Solved by the Invention] However, this technique is difficult to apply to equipment that has no means of entering character strings, for example an ultra-compact speech recognition device, or to services in which the recognition vocabulary is registered remotely over a telephone line. If, however, the character string that serves as the source of the recognition speech pattern could itself be generated automatically from the result of speaker-independent recognition performed in units of syllables, phonemes, or shorter time units, then the technique described above, which generates speech patterns automatically from character strings, could be applied. That is, the speech recognition function would become usable after a single spoken registration, and high recognition performance would be obtained for unspecified speakers.

[0007] In this case, the speech segments used when generating a speech pattern from a character string can be shared with the speech dictionary used for syllable- or phoneme-level recognition. The recognition dictionary can also be made up of the character strings themselves rather than of speech patterns. Furthermore, suppose the character string produced by syllable or phoneme recognition at registration is in a notation that the speaker can understand visually or aurally, for example when the spoken input "nakadai", or "na", "ka", "da", "i" spoken separately, yields a string that can be written as "なかだい" or "NAKADAI". Then, when the word speech pattern for "nakadai" is created from the syllable recognition result and used for word recognition, the speaker can also confirm the recognition result visually or aurally as the character string "なかだい" or "NAKADAI".

[0008] The object of the present invention is to provide a word speech recognition device that automatically generates the speech dictionary for word recognition from the results of syllable- or phoneme-level speaker-independent recognition, and thereby achieves high recognition performance even when spoken registration is limited to a single utterance, can handle unspecified speakers, and requires no registration means other than the speech input equipment.

[0009]

[Means for Solving the Problem] The invention of claim 1 comprises: means for inputting speech; means for performing syllable- or phoneme-level recognition on the input speech; a speech dictionary used for the recognition; means for converting the recognition result into a character string; means for storing the converted character string as a word dictionary; means for comparing the character strings stored in the word dictionary with an input character string; and means for outputting the result of the comparison.

[0010] The invention of claim 2 adds to the invention of claim 1: means for generating speech patterns for word recognition from the word dictionary and the speech dictionary; means for storing the generated speech patterns as a dictionary; and means for performing word-level recognition between the stored speech patterns and input speech. In place of the means for outputting the comparison result, it has means for outputting the result of word speech recognition.

[0011] The invention of claim 3 adds, to the invention of claim 1 or 2, means for presenting to the speaker a character string output as part of the recognition result.

[0012]

[Operation] According to the invention of claim 1, an isolated spoken word is converted into a character string by syllable- or phoneme-level speaker-independent recognition and compared with the word dictionary, whose entries were converted from speech to character strings by the same procedure. The dictionary entry judged to be the same or a similar character string is output as the recognition result.

[0013] According to the invention of claim 2, the word dictionary for recognition is obtained, as in the invention of claim 1, by converting speech into character strings, which are then further converted into word speech patterns and stored as a dictionary. An isolated spoken word is matched against these word speech patterns, and the character string from which the most similar word speech pattern was generated is output as the recognition result.

[0014] In both the invention of claim 1 and that of claim 2, the speaker can also confirm the character string of the recognition result visually or aurally.

[0015]

[Embodiments] FIG. 1 shows an embodiment based on the invention of claim 1. In FIG. 1, speech input unit 1 is the device through which the speaker inputs speech, for example an audio microphone. Speech recognition unit 2 A/D-converts the speech received from speech input unit 1, detects the speech section, analyzes the detected section, performs speaker-independent recognition in syllable or phoneme units, and outputs a character string as the recognition result. The input speech is either a normally spoken word such as "nakadai" or speech divided into single syllables such as "na", "ka", "da", "i". The A/D conversion is of the kind commonly used in speech recognition, for example a 10 kHz sampling frequency with 12-bit quantization. The speech section is detected, for example, by cutting it out based on observed short-time power; the speech analysis is, for example, LPC cepstrum analysis; and the phoneme recognition method is, for example, the syllable spotting algorithm for speaker-independent speech described in Nakagawa, "Speech Recognition by Probabilistic Models," IEICE, and elsewhere.

Speech dictionary unit 3 stores the standard syllable or phoneme speech patterns used by speech recognition unit 2, each with its corresponding label name (for example, the label "a" for the pattern of the sound "a"). These speech patterns are prepared by, for example, an HMM, one of the statistical pattern matching methods described in the reference cited above.

Language processing unit 4 cleans up the recognition-result character string output by speech recognition unit 2, forms it into a word, and sends it to word dictionary unit 5 and comparison unit 6. When the input is a whole spoken word (continuous syllables) such as "nakadai", the top-ranked syllable recognition result alone tends to contain spurious added syllables or dropped syllables, so the output of speech recognition unit 2 may take a form that allows several recognition candidates for each phoneme section, for example a phoneme lattice. Language processing unit 4 therefore applies linguistic information to put the output of speech recognition unit 2 in order.

Word dictionary unit 5 stores the words (character strings) output by language processing unit 4 as a dictionary for word recognition. Comparison unit 6 compares the words stored in word dictionary unit 5 with the output of language processing unit 4 as character strings and outputs the result to display 8 and output terminal 9. Switch 7 switches the operation of comparison unit 6 between learning (registration) and recognition. Display 8 presents the recognition result to the speaker and is, for example, a character display or a speech synthesizer. Output terminal 9 outputs the final recognition result to external equipment and is, for example, a computer interface.
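The speech-section detection step in speech recognition unit 2 can be illustrated with a minimal sketch. The text only says that the section is cut out based on observed short-time power at, for example, a 10 kHz sampling frequency; the frame length, the threshold value, and the function names below are assumptions made for illustration and are not taken from the patent.

```python
from typing import List, Optional, Tuple

SAMPLE_RATE_HZ = 10_000   # example sampling frequency given in the text
FRAME_LEN = 200           # assumed frame length: 20 ms at 10 kHz
POWER_THRESHOLD = 1.0e4   # assumed threshold on mean squared amplitude


def short_time_power(samples: List[int], frame_len: int = FRAME_LEN) -> List[float]:
    """Return the mean squared amplitude of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


def detect_speech_section(
    samples: List[int],
    frame_len: int = FRAME_LEN,
    threshold: float = POWER_THRESHOLD,
) -> Optional[Tuple[int, int]]:
    """Return (start_sample, end_sample) from the first to the last frame
    whose short-time power exceeds the threshold, or None if no frame does."""
    powers = short_time_power(samples, frame_len)
    active = [i for i, p in enumerate(powers) if p > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


if __name__ == "__main__":
    # 0.1 s of silence, 0.1 s of a loud square wave, 0.1 s of silence.
    signal = [0] * 1000 + [500, -500] * 500 + [0] * 1000
    print(detect_speech_section(signal))
```

A practical detector would typically add hangover frames and an adaptive noise floor; this sketch only shows the thresholding idea.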

[0016] The operation of the word speech recognition device shown in FIG. 1 is described below. It has two modes, registration and recognition.

(1-1) Registration

During registration, the device registers the input spoken word in word dictionary unit 5 as a word (character string). The speaker first sets comparison unit 6 to registration operation with switch 7 and then speaks into speech input unit 1 in a way that can be recognized in syllable units. Speech recognition unit 2 performs speaker-independent recognition in syllable units, language processing unit 4 converts the result into a word (character string), and the word is sent to comparison unit 6. Comparison unit 6 compares the input word with the words (character strings) already stored in word dictionary unit 5, which were converted from speech to character strings in the same way. If a stored word is judged to be identical or very similar to the input word, comparison unit 6 decides that the current input cannot be registered, discards the input word, shows on display 8 that registration failed, and prompts the speaker to speak again. The reasoning is that the speaker normally knows which words are already registered in word dictionary unit 5, so a match with an existing word is taken to mean that the utterance was poorly spoken. If, on the other hand, the input word differs from every word stored in word dictionary unit 5, comparison unit 6 registers the input word (character string) in word dictionary unit 5 as a new word and notifies the speaker through display 8 that registration is complete. If the recognized word (character string) is in a notation that the speaker can understand visually or aurally, for example when the spoken input "nakadai" yields a result that can be written as "なかだい" or "NAKADAI", display 8 can show the character string as the registration result.

(1-2) Recognition

During recognition, the device recognizes the input spoken word and outputs the result to output terminal 9. The speaker first sets comparison unit 6 to recognition operation with switch 7 and then speaks the word into speech input unit 1. As in registration, speech recognition unit 2 performs speaker-independent recognition in syllable units, language processing unit 4 converts the result into a word (character string), and the word is sent to comparison unit 6. Comparison unit 6 compares the input word with the words (character strings) stored in word dictionary unit 5, which were converted from speech to character strings in the same way. If a stored word is judged to be identical or very similar to the input word, comparison unit 6 outputs that word from output terminal 9 as the recognition result. As in registration, if the recognized word (character string) is in a notation that the speaker can understand visually or aurally, display 8 can show the character string at the same time as it is output from output terminal 9.
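The registration and recognition behavior of comparison unit 6 and word dictionary unit 5 can be sketched as follows. The patent does not say how "identical or very similar" is judged; this sketch assumes a normalized Levenshtein (edit) distance with a fixed threshold of 0.75. The class name, method names, and threshold are all hypothetical.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


class WordDictionary:
    """Plays the roles of word dictionary unit 5 and comparison unit 6."""

    def __init__(self, threshold: float = 0.75) -> None:  # threshold is assumed
        self.words: List[str] = []
        self.threshold = threshold

    def _best_match(self, text: str) -> Tuple[Optional[str], float]:
        best, best_score = None, 0.0
        for word in self.words:
            score = similarity(text, word)
            if best is None or score > best_score:
                best, best_score = word, score
        return best, best_score

    def register(self, text: str) -> bool:
        """Registration mode: reject input that matches an existing entry."""
        word, score = self._best_match(text)
        if word is not None and score >= self.threshold:
            return False  # the speaker would be prompted to speak again
        self.words.append(text)
        return True

    def recognize(self, text: str) -> Optional[str]:
        """Recognition mode: return the matching entry, or None."""
        word, score = self._best_match(text)
        return word if word is not None and score >= self.threshold else None
```

Here `text` stands for the character string produced by language processing unit 4. With this choice of measure, a registered "nakadai" would still be returned for a slightly misrecognized string such as "nakadae", since only one of seven characters differs.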

[0017] A second embodiment of the invention is described next with reference to FIG. 2. In FIG. 2, speech input unit 1, speech recognition unit 2, speech dictionary unit 3, language processing unit 4, word dictionary unit 5, comparison unit 6, display 8, and output terminal 9 have the same functions as the identically numbered units in FIG. 1. In FIG. 2, however, the speech patterns registered in speech dictionary unit 3 are also used by word speech generation unit 10 to generate word speech patterns, and output terminal 9 outputs the recognition result of word speech recognition unit 12 rather than the output of comparison unit 6. Switch 7 switches the destination of the output of speech input unit 1 between learning (registration) and recognition. Word speech generation unit 10 generates speaker-independent word speech patterns for word recognition from the words (character strings) stored in word dictionary unit 5 and the speech patterns stored in speech dictionary unit 3. Word speech dictionary unit 11 stores the speaker-independent word speech patterns generated by word speech generation unit 10. Word speech recognition unit 12 A/D-converts the input spoken word, detects the speech section, analyzes the detected section, and performs word-level recognition against the speech patterns stored in word speech dictionary unit 11. The A/D conversion, speech section detection, and speech analysis methods are the same as in speech recognition unit 2.

[0018] The operation of the embodiment shown in FIG. 2 is described below. As in the embodiment of FIG. 1, it has two modes, registration and recognition.

(2-1) Registration

During registration, the device converts the input spoken word into a word (character string) and then into a word speech pattern. The speaker first uses switch 7 to route the output of speech input unit 1 to speech recognition unit 2 and then speaks the word into speech input unit 1. As in the embodiment of FIG. 1, speech recognition unit 2 performs speaker-independent recognition in syllable units, language processing unit 4 converts the result into a word (character string), and the word is sent to comparison unit 6. Comparison unit 6 compares the input word with the words (character strings) stored in word dictionary unit 5, which were converted from speech to character strings in the same way. If a stored word is judged to be identical or very similar to the input word, comparison unit 6 decides that the current input cannot be registered, discards the input word, shows on display 8 that registration failed, and prompts the speaker to speak again. If the input word differs from every word stored in word dictionary unit 5, comparison unit 6 registers the input word (character string) in word dictionary unit 5 as a new word and notifies the speaker through display 8 that registration is complete. As in the first embodiment, if the recognized word (character string) is in a notation that the speaker can understand visually or aurally, display 8 can show the character string as the registration result. After this, word dictionary unit 5 sends the newly registered word (character string) to word speech generation unit 10. Word speech generation unit 10 generates a speaker-independent word speech pattern from the character string, based on the speech patterns registered in speech dictionary unit 3. The generated word speech pattern is a concatenation of syllable speech patterns, but after concatenation the pattern can also be reshaped as word speech (concatenation training). The word speech pattern generated in this way is labeled with the word (character string) from which it was generated and stored in word speech dictionary unit 11.

(2-2) Recognition

During recognition, the device recognizes the input spoken word and outputs the result to output terminal 9. The speaker first uses switch 7 to route the output of speech input unit 1 to word speech recognition unit 12 and then speaks the word into speech input unit 1. Word speech recognition unit 12 analyzes the speech in the same way as speech recognition unit 2 and pattern-matches it, in word units, against the word speech patterns stored in word speech dictionary unit 11. The label attached to the word speech pattern judged most similar is output from output terminal 9 as the word recognition result. As in the first embodiment, if the label (character string) attached to the recognized word speech pattern is in a notation that the speaker can understand visually or aurally, display 8 can show the character string at the same time as it is output from output terminal 9.
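The second embodiment's flow, in which a word speech pattern is built by concatenating syllable patterns and input speech is then matched against the stored word patterns, can be sketched as below. This is an illustration under stated assumptions, not the method of the patent: the embodiment describes HMM-based statistical patterns, whereas the sketch swaps in plain feature-frame templates and DP matching (dynamic time warping) to stay short. The two-dimensional toy features and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Tuple[float, ...]  # one feature vector, e.g. cepstral coefficients


def concatenate_templates(
    syllables: Sequence[str], syllable_templates: Dict[str, List[Frame]]
) -> List[Frame]:
    """Role of word speech generation unit 10: join syllable patterns."""
    pattern: List[Frame] = []
    for syllable in syllables:
        pattern.extend(syllable_templates[syllable])
    return pattern


def frame_distance(x: Frame, y: Frame) -> float:
    """Euclidean distance between two feature frames."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Length-normalized DP-matching (DTW) cost between two sequences."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def recognize_word(
    features: Sequence[Frame], word_patterns: Dict[str, List[Frame]]
) -> str:
    """Role of word speech recognition unit 12: return the best label."""
    return min(word_patterns, key=lambda w: dtw_distance(features, word_patterns[w]))
```

In this sketch the dictionary key plays the role of the label attached to each word speech pattern in word speech dictionary unit 11, so the value returned by `recognize_word` corresponds to the character string sent to output terminal 9.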

[0019] In FIGS. 1 and 2, the parts that perform processing and analysis, that is, everything other than speech dictionary unit 3, word dictionary unit 5, and word speech dictionary unit 11, are implemented by a computer, a DSP (digital signal processor), or the like, and the units need not be separate pieces of hardware.

[0020]

[Effects of the Invention] As described above, in a word speech recognition device using statistical pattern matching, the present invention automatically generates the word speech patterns for the recognition dictionary either as character strings obtained from syllable- or phoneme-level speaker-independent recognition or as speech patterns generated from those character strings. This has the following effects: (1) with a single spoken input from one particular person, the device can also be used as a word recognition device for unspecified speakers; (2) word speech patterns, which previously had to be created starting from character entry on a character input device, can be created from speech input alone; and (3) when the character string generated by syllable- or phoneme-level speaker-independent recognition at dictionary creation is in a notation the speaker can understand visually or aurally, the character string from which the word speech pattern was generated can be presented to the speaker as the recognition result for confirmation.

[Brief description of drawings]

[FIG. 1] Block diagram showing a first embodiment of the present invention.

[FIG. 2] Block diagram showing a second embodiment of the present invention.

Claims

[Claims]

1. A speech recognition device comprising: means for inputting speech; means for performing syllable- or phoneme-level recognition on the input speech; a speech dictionary that stores syllable- or phoneme-level speech patterns and is used for the recognition; means for converting the recognition result into a character string; means for storing the converted character string as a word dictionary; means for comparing the character strings stored in the word dictionary with the character string obtained by recognizing input speech; and means for outputting the result of the comparison, wherein an isolated spoken word is subjected to syllable- or phoneme-level recognition and converted into a character string, the converted character string is compared with the contents of the word dictionary, and the result is output.

2. The speech recognition device according to claim 1, further comprising: means for generating speech patterns for word recognition from the character strings stored in the word dictionary and the speech patterns stored in the speech dictionary; a word speech dictionary that stores, as a dictionary, each generated speech pattern paired with the character string from which it was generated; and means for performing word-level recognition between the speech patterns stored in the word speech dictionary and input speech, wherein an isolated spoken word is subjected to syllable- or phoneme-level recognition and converted into a character string, the word speech dictionary for word speech recognition is generated from the converted character string, recognition is performed between the generated word speech dictionary and newly input word speech, and the result is output.

3. The speech recognition device according to claim 1, further comprising means for presenting to the speaker a character string output as part of the recognition result.