JP2003029776A

JP2003029776A - Voice recognition device

Info

Publication number: JP2003029776A
Application number: JP2001211921A
Authority: JP
Inventors: Kenji Nakamura; 賢二中村; Yoshiyuki Ogata; 芳幸緒方; Masakazu Tateyama; 雅一立山; Hiroto Nishida; 博人西田; Yoshiaki Kuroki; 義明黒木; Yasuyuki Nishioka; 靖幸西岡; Tatsuhiro Goshima; 龍宏五島
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2001-07-12
Filing date: 2001-07-12
Publication date: 2003-01-31

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that keeps recognition performance close to that of a speaker-dependent system while improving convenience by conducting the training automatically. SOLUTION: When a recognition word is registered, its text information is input to a voice synthesis processing section 15, and the synthesized voice signal that is output is converted into synthesized-sound acoustic data by a synthesized voice converting section 19. This data is registered in a word acoustic data storage section 21 in place of the speaker's utterance used in conventional training. In the voice recognition process performed after registration, the matching process in a word recognition section 26 is carried out against the synthesized-sound acoustic data. Thus, a speaker-dependent voice recognition device is realized at low cost, with convenience similar to that of a speaker-independent voice recognition device, which does not require training.

Description

Detailed Description of the Invention

[Field of the Invention] The present invention relates to a voice recognition device and, more specifically, to a speaker-dependent voice recognition device characterized by performing voice recognition without a training procedure by the speaker.

[Description of the Related Art] In recent years, information processing devices such as telephones, fax machines, and car navigation systems that can be operated by voice input, using so-called voice recognition technology, have been commercialized. Voice recognition methods are broadly divided into two types: speaker-independent methods, which do not restrict the speaker, and speaker-dependent methods, which do.

A speaker-independent method extracts the linguistic features contained in speech and estimates what the speaker said by applying pattern recognition techniques, of which neural networks are typical. However, each person's speech has its own voice quality, and securing a stable recognition rate for arbitrary speakers requires complex processing, which raises the cost of the product. A speaker-dependent method, by contrast, achieves a good recognition rate with an inexpensive system by limiting the speakers it handles. With this method, the speaker's own voice must be registered (trained) when the device is first used, which takes corresponding effort.

The basic operation of voice recognition is to identify, from a group of words stored in advance in the device as a database, the word that corresponds to the word the speaker uttered, and to return the result to the speaker. The operation of a conventional speaker-dependent voice recognition device is outlined below with reference to the drawings. FIG. 9 is a block diagram of a conventional speaker-dependent voice recognition device, FIG. 10 is a detailed diagram of the voice recognition processing unit in FIG. 9, and FIG. 11 is a detailed diagram of the word acoustic data storage unit in FIG. 10.

A word uttered by the speaker is converted into an electric signal by the microphone 1, converted by the signal processing unit 2 into a voice signal in a format suited to the later processing, and sent to the voice recognition processing unit 4. The acoustic processing unit 6 in the voice recognition processing unit 4 extracts acoustic features from this voice signal, and the word identification unit 8 searches the acoustic data held in the word acoustic data storage unit 7 for the entry that best matches the input acoustic data. The word identifier associated with the matched acoustic data is returned to the signal processing unit 2 as identification information, which lets the signal processing unit 2 recognize the uttered word and carry out the appropriate processing control.

This basic recognition flow is common to the speaker-independent and speaker-dependent methods. The basic difference between them lies in how the word acoustic data in the word acoustic data storage unit 7 is generated. As noted above, in the speaker-dependent method the word acoustic data is generated by training. In the initial state of the device the word acoustic data is therefore undefined, and training is indispensable before voice recognition can be performed. Training is the process in which the speaker utters every word to be recognized and registers it in the word acoustic data storage unit. During training, a recognition target word uttered by the speaker is input through the microphone 1 and converted into a voice signal by the signal processing unit 2, and at this point a word identifier that distinguishes the individual recognition target word is attached. In the voice recognition processing unit 4, the acoustic processing unit 6 converts this voice signal into acoustic data and supplies it, together with the word identifier, to the word acoustic data storage unit 7, which stores the acoustic data and the word identifier in association with each other. Voice recognition becomes possible only after the same training has been repeated for every recognition target word (see FIG. 7).

In the speaker-independent method, on the other hand, creating the word acoustic data in the word acoustic data storage unit requires no utterance by the speaker, so the data can be set up in advance, before the speaker uses voice recognition, and the speaker bears no burden. However, generating word acoustic data requires complex computation, so letting the speaker add new recognition words is difficult to realize on a small-scale voice recognition device. In this respect, the speaker-dependent method has the advantage that acoustic data is generated easily by the same acoustic processing unit 6 that is used for recognition, so the speaker can add new recognition words easily even on a small-scale recognition device.

The voice synthesis processing unit 3 in the figure converts text into voice. It is not directly related to voice recognition processing, but it is commonly provided alongside voice recognition in devices that have a voice recognition function, and because this patent is characterized by making active use of the voice synthesis function, it is shown here for the sake of explanation. Text to be output as sound is sent as text information from the signal processing unit 2 to the voice synthesis processing unit 3, and the resulting synthesized voice signal is returned. This synthesized voice signal is output from the loudspeaker 12 so that it reaches the speaker as voice.
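The conventional speaker-dependent flow described above (training stores one acoustic template per word identifier, and recognition returns the identifier of the best match) can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent specifies neither the acoustic features nor the matching measure, so fixed-length lists of floats and Euclidean distance are assumed here, and the class and method names are hypothetical.

```python
import math


class TemplateRecognizer:
    """Sketch of a speaker-dependent recognizer: one template per word identifier."""

    def __init__(self):
        # Plays the role of the word acoustic data storage unit:
        # word identifier -> acoustic data (assumed: fixed-length feature vector).
        self.templates = {}

    def train(self, word_id, features):
        """Register the acoustic data of one uttered word under its identifier."""
        self.templates[word_id] = list(features)

    def recognize(self, features):
        """Return the identifier whose stored acoustic data is closest to the input."""
        if not self.templates:
            # In the initial state the word acoustic data is undefined,
            # so recognition is impossible until training has been done.
            raise RuntimeError("no word acoustic data registered; training is required")
        return min(
            self.templates,
            key=lambda word_id: math.dist(self.templates[word_id], features),
        )
```

With two trained words, an input close to the first template returns the first identifier; with nothing trained, `recognize` fails, which mirrors why training is mandatory in the conventional device.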

[Problems to Be Solved by the Invention] As described above, a speaker-dependent voice recognition device requires training. In a system with many recognition target words, the time the speaker must spend on training is long, and this burden on the speaker reduces the convenience of the system. In a speaker-independent device, on the other hand, adding new recognition words is difficult on small-scale devices such as embedded equipment, which limits the extensibility and applicability of the system. Raising the computing power of the device far enough to allow recognition words to be added increases the cost, so that the device can no longer be applied to small-scale systems.

[Means for Solving the Problems] To solve the above conventional problems, the present invention is a voice recognition device that is based on a speaker-dependent voice recognition device, which can be realized at low cost, and that makes use of the voice synthesis function commonly provided alongside voice recognition, using the voice signal generated by that function for training in place of the speaker's utterance. By automatically performing inside the device the training that would otherwise burden the speaker, it can provide the same convenience as a speaker-independent voice recognition device, which needs no training.

[Embodiments of the Invention] The invention according to claim 1 is a voice recognition device comprising: a microphone as a voice input device; a loudspeaker as a voice output device; an input device such as a keyboard; a display device that displays recognition results; a signal processing unit to which the microphone, the loudspeaker, the input device, and the display device are connected and which controls the processing of the entire voice recognition device; a voice synthesis processing unit that receives text information from the signal processing unit and outputs a synthesized voice signal; and a voice recognition processing unit that compares a voice signal input from the signal processing unit with a plurality of internally held acoustic data and outputs the match result to the signal processing unit as the voice recognition result. The device further comprises a synthesized voice conversion unit that converts the synthesized voice signal from the voice synthesis processing unit into synthesized-sound acoustic data. The synthesized-sound acoustic data is held in the voice recognition processing unit and used for recognition in place of the acoustic data that was conventionally generated through training by the speaker. The device is thus characterized by requiring no training that burdens the speaker, and it can provide the same convenience as a speaker-independent voice recognition device, which needs no training.

The invention according to claim 2 is the voice recognition processing unit of claim 1 further having an acoustic data selection unit that selects either the synthesized-sound acoustic data output from the synthesized voice conversion unit or the acoustic data output from the acoustic processing unit. Synthesized-sound acoustic data is stored in the word acoustic data storage unit in the initial state of the device, and when a recognition word is recognized, the synthesized-sound acoustic data is replaced with the acoustic data corresponding to the speaker's utterance of that word, so that after the first recognition the word is matched against acoustic data of the actual pronunciation. This prevents the recognition rate from dropping even when the synthesized voice and the actual pronunciation differ.

The invention according to claim 3 is the word acoustic data storage unit of claim 2 holding both the synthesized-sound acoustic data output from the synthesized voice conversion unit and the acoustic data of the speaker's utterance output from the acoustic processing unit, and outputting the identification information of the word when the input matches either of the two. This further improves the recognition rate.

The invention according to claim 4 is the word acoustic data storage unit of claim 1 configured to receive from the synthesized voice conversion unit, and store, a plurality of pronunciations for one word, so that the word is recognized correctly whichever pronunciation the speaker uses.

The invention according to claim 5 is the word acoustic data storage unit of claim 4 in which, when the speaker utters a word, the matched acoustic data is kept and the other acoustic data is deleted. Unnecessary identification processing is omitted from the next time onward, so recognition can be performed faster.

The invention according to claim 6 adds to the word acoustic data storage unit of claim 5 a mechanism that holds, for each of the plurality of synthesized-sound acoustic data of a word, the frequency with which the speaker's utterance matched it. When the word is recognized, only the acoustic data whose match frequency is at or below a threshold level is deleted. Unnecessary identification processing is omitted from the next time onward, so recognition can be performed faster.

The invention according to claim 7 is the invention of claim 1 in which, when the synthesized-sound acoustic data for each recognition target word is to be held in the voice recognition processing unit, the synthesized sound is played to the speaker through the loudspeaker, the speaker confirms whether it is the intended sound, and the training procedure by the speaker is performed only when it is not. This reduces how often the speaker has to perform training.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1) FIG. 1 shows a basic configuration example of a voice recognition device to which the present invention is applied. FIG. 2 is an internal configuration diagram of the voice recognition processing unit 16 in FIG. 1. The basic configuration in FIG. 1 is the same as that of the conventional speaker-dependent voice recognition device of FIG. 9, but it is characterized by the addition of a synthesized voice conversion unit 19. In FIG. 1, 13 is a microphone, a voice input device that converts the voice uttered by the speaker into an electric signal. 18 is a loudspeaker, and 23 is a display unit. 17 is an input device used for the key input by which the speaker confirms recognition results and for controlling the entire device. 14 is a signal processing unit, 15 is a voice synthesis processing unit, 16 is a voice recognition processing unit, and 19 is the synthesized voice conversion unit. Within the voice recognition processing unit 16 shown in FIG. 2, 20 is an acoustic processing unit that extracts acoustic features from the input voice signal; 21 is a word acoustic data storage unit that holds word acoustic data for each recognition word, the recognition words being the names of all parties in the telephone directory; and 22 is a word identification unit that searches the word acoustic data storage unit 21 for the entry that best matches the input acoustic data.

A word uttered by the speaker is converted into an electric signal by the microphone 13 and input to the signal processing unit 14, which converts the input voice signal into a format suited to processing in the voice recognition processing unit 16. Within the voice recognition processing unit 16 shown in FIG. 2, the acoustic processing unit 20 extracts acoustic features from the voice signal output by the signal processing unit 14 and outputs them as acoustic data to the word identification unit 22. The word identification unit 22 searches the acoustic data held in the word acoustic data storage unit 21 for the entry that best matches the input acoustic data. The word identifier associated with the matched acoustic data is returned to the signal processing unit 14 as identification information. In FIG. 1, the signal processing unit 14 recognizes the uttered word from this identification information and, based on it, carries out the appropriate processing control of the device or feeds the recognition result back to the speaker via the display device 23.

The following describes, for a telephone to which the voice recognition device of the present invention is applied, how acoustic data is registered in the word acoustic data storage unit by voice synthesis, in place of training, to support voice-driven telephone directory search. In a telephone, the telephone directory is a kind of database held inside the telephone that stores the names of the parties the speaker communicates with, together with personal information such as their telephone numbers and e-mail addresses. Once a party's information is registered in the directory, the speaker can place a call easily without entering the telephone number each time. In telephones that incorporate a voice recognition device, this is often used as a so-called voice dialing function: the speaker says the other party's name, and the telephone automatically looks up the number in the directory and places the call. To realize voice dialing, the names of all parties in the directory must be treated as recognition words, and their word acoustic data must be held in the word acoustic data storage unit 21. The word identifier held with each entry is either the party's name in text form or the entry number in the directory.

In a conventional speaker-dependent voice recognition device, the acoustic data has to come from the speaker's utterance, so the speaker had to utter the names of all parties in the directory. In the present invention, the corresponding processing is carried out implicitly, without involving the speaker, using the voice synthesis processing unit 15, which generates a synthesized voice signal, and the synthesized voice conversion unit 19, which converts the synthesized voice into acoustic data.

As a concrete example, consider newly registering 100 names in the telephone directory as follows. The directory data is held in the internal memory of the signal processing unit 14, but as stated above, acoustic data must also be registered in the word acoustic data storage unit 21 for voice dialing. Conventionally, when the speaker enters the name "adam" and its telephone number "111-2222" using the keys of the input device 17, these are registered in entry 1 of the directory in the signal processing unit 14. To have the acoustic processing unit 20 generate the acoustic data needed as input to the word acoustic data storage unit 21, however, the speaker actually utters "adam" into the microphone 13. At the same time, the corresponding entry number "1" is associated with the acoustic data as the word identifier and registered in the word acoustic data storage unit 21. Registering acoustic data by utterance takes, in addition to the time of the utterance itself, time to synchronize the utterance with the registration process, amounting to several seconds to several tens of seconds per entry. Because the procedure had to be repeated 100 times, the load on the speaker was heavy.

In the present invention, immediately after the name and telephone number are keyed in as before, the name "adam" is sent as text information to the voice synthesis processing unit 15 under the control of the signal processing unit 14. The voice synthesis processing unit 15 generates the synthesized voice signal "adamu" (アダム), the standard pronunciation corresponding to "adam". From this signal, the synthesized voice conversion unit 19 generates synthesized-sound acoustic data, that is, its feature data. This synthesized-sound acoustic data is stored in the word acoustic data storage unit 21 within the voice recognition processing unit 16 together with the word identifier "1" from the signal processing unit 14. Because no utterance by the speaker is involved, the whole acoustic data registration sequence is carried out automatically within a few seconds. Registration of acoustic data in the word acoustic data storage unit 21 is complete once the user has repeated the key entry of name and telephone number 100 times. Unlike acoustic data created by converting voice from the microphone, the synthesized-sound acoustic data is unaffected by noise around the microphone. The user still keys in the name and telephone number for each entry, but need not speak into the microphone for training or spend time synchronizing utterances with the registration process, so the burden on the user is small. Table 1 shows a configuration example of the telephone directory of the telephone of the present invention.
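The registration flow of Embodiment 1 (key entry, then text to synthesized voice, then synthesized voice to acoustic data, then storage under the entry number) can be sketched roughly as below. The two helper functions are hypothetical stand-ins for the voice synthesis processing unit 15 and the synthesized voice conversion unit 19; they produce toy numeric data rather than real audio or real acoustic features, since the patent does not specify either. All names are assumptions made for illustration.

```python
def synthesize(text):
    """Hypothetical stand-in for voice synthesis unit 15: text -> toy 'voice signal'."""
    return [float(ord(ch)) for ch in text.lower()]


def to_acoustic_data(signal, size=8):
    """Hypothetical stand-in for conversion unit 19: pad or truncate to a fixed length."""
    return (list(signal) + [0.0] * size)[:size]


class VoiceDialDirectory:
    """Sketch of directory registration without any utterance by the speaker."""

    def __init__(self):
        self.phone_book = {}  # entry number -> (name, telephone number)
        self.word_data = {}   # entry number (word identifier) -> synthesized-sound acoustic data

    def register(self, name, number):
        """Key entry of one party; acoustic data is produced by synthesis, not by training."""
        entry = len(self.phone_book) + 1
        self.phone_book[entry] = (name, number)
        self.word_data[entry] = to_acoustic_data(synthesize(name))
        return entry
```

Calling `register("adam", "111-2222")` on an empty directory returns entry number 1 and stores acoustic data for that entry with no microphone input, which is the point of the embodiment.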

【表１】登録完了後、使用者が例えば「john」に電話をかけたい
場合、相手の名前である「ジョン」をマイク１３から入
力する。この音声は信号処理部１４を経由して音声信号
として音声認識処理部１６に送られる。音響処理部２０
はこの「ジョン」を音響データに変換し単語識別部２２
へ送る。単語識別部２２はこの音響データを単語音響デ
ータ格納部の１００件の音響データと比較し、結果とし
て一致したエントリの単語識別子「３」を識別情報とし
て信号処理部１４へ返す。信号処理部１４では内部の電
話帳データを検索し、識別情報である「３」からエント
リ番号３の「john」の電話番号「１２３−３３４４」を
得ることが出来る。通常はこの電話番号に対して話者の
確認を行った後、電話がかけられる。以上のように本実
施の形態によれば、低コストの特定話者方式の音声認識
装置でありながら、話者に負担となるトレーニングを装
置内部で自動的に行うことで、見かけ上はトレーニング
のない不特定話者方式の音声認識装置と同様の利便性の
音声認識装置を得ることができるという効果が生じる。（実施の形態２）実施の形態１で説明したように単語音
響データ格納部の音響データをすべて合成音音響データ
とすることによって、話者をトレーニングから解放する
ことができた。しかし、この方法では、方言等の影響に
よって話者の発声する単語が音声合成変換部で生成され
る標準的な発音と非常に異っている場合には、話者の単
語が認識が困難になることが考えられる。実施の形態２
はこの課題を解決するために、基本的な構成は形態１と
同様であるが、図３に示すように音声認識処理部に音響
データ選択部２７を設けることによって、単語音響デー
タ格納部２５に格納する認識単語の音響データを合成音
音響データかあるいは音響処理部２４からの音響データ
かの選択を可能とした。以下に形態１と同様の電話機で
のボイスダイアリングを例にしてその動作を説明する。
この発明においては１００件の電話帳登録に伴う単語音
響データ格納部２５への音響データ登録の際には、音響
データ選択部２７では合成音音響データが選択される。
したがって、登録完了時点では単語音響データ格納部２
５の登録内容は形態１と同一となる。しかし、エントリ
番号２の「Henry」に対する合成音声信号が「ヘンリ
ー」であるのに、実際の読みが「アンリ」だった場合を
考える。ボイスダイアリングにおいて話者が発する「ア
ンリ」に対して単語識別部２６が正しくエントリ番号
「２」を識別情報として信号処理部に返す割合（いわゆ
る音声認識率）は、単語音響データ格納部２５に正しい
読みである「アンリ」が登録されている場合よりも小さ
くなってしまう。本発明では、このような合成音声と実
際の発声音との相違がある場合の音声認識率の低下を防
ぐために、単語識別部２６からの識別情報に基づいて、
単語音響データ格納部２５の該当音響データを音響処理
部２４が出力する音響データに置き換える機構を設け
る。この場合には、話者が「アンリ」と発声し、単語識
別部２６から識別情報としてエントリ番号「２」が信号
処理部に戻された時点で、音響処理部２４に保持されて
いた「アンリ」に相当する音響データが音響データ選択
２７によって選択され、同時に信号処理部からは単語識
別子「２」が単語音響データ格納部２５に入力される。
これらのデータから単語音響データ格納部のエントリ２
は「ヘンリー」から「アンリ」に相当する音響データに
置き換えられる。この処理は話者に対しては暗黙的に実
行されるため、話者の負担は生じない。こうして、実際
の音声がその合成音声と多少異なる場合でも、識別され
た単語に対する音響データは実際の話者の発声音に相当
する音響データに常に更新されていくため、合成音声と
実発音の違いに基づく永続的な音声認識率の低下を防ぐ
効果がある。（実施の形態３）形態２においては認識が行われた場合
には、該当単語の音響データを発声された音響データに
置き換えて、合成音声が実発音と異なる場合の音声認識
率の低下を防ぐことを目的とした。しかし、置き換えと
して登録される音声が常に適切なものとは限らない。た
とえば、発音の異なる複数の話者が音声認識装置を共用
している場合や、同一話者の発声においても周囲のノイ
ズが混入した場合など、一度特殊な発声の音響データが
該当単語の音響データとして単語音響データ格納部に登
録されてしまうと、標準的な発声を行っても音声認識率
が低下してしまうことが考えられる。実施の形態３では
この課題を解決するために、図４に示すように単語音響
データ格納部を認識対象単語毎に合成音音響データ２９
と発声音響データ３０を保持できるようにする。これに
より標準的な合成音声に近い発声においても、また話者
独特の標準的でない発声の場合においても、いづれかの
音響データに一致することで音声認識率の低下を防ぐこ
とができる。電話帳の先例を用いて動作を説明する。エ
ントリ番号２の「Henry」に対する合成音声信号が「ヘ
ンリー」であるのに、実際の読みが「アンリ」だった場
合を考える。まず、単語音響データ格納部において、電
話帳の登録時に単語識別子２８にエントリ番号「２」が
格納され、それに関連する合成音音響データ２９に「ヘ
ンリー」が格納される。この後、実際の話者の発生「ア
ンリ」によって認識が成功し、エントリ「２」が識別情
報として信号制御部に返された時に「アンリ」がエント
リ「２」に関連する発生音響データ３０へ格納される。
この一連の処理は形態２において音響データの書き換え
が起こらない点を除いて同一である。こうして、電話帳
の単語「Henry」について２つの音響データが単語音響
データ格納部に保持されることになる。こうして次のボ
イスダイアリングにおいて話者が「アンリ」と発声した
場合においても、別の話者が「ヘンリー」と発声した場
合においても、どちらも音声認識率を低下させることな
く音声認識処理が実施できるという効果がある。発声音
響データ３０については、形態２と同様に、エントリ
「２」への認識が成功するたびに話者の発声に対応する
音響データへと更新される。（実施の形態４）実施の形態１においては1つの認識単
語に対して登録できる音響データは1つであった。しか
し、電話帳登録の例での「Henry」のように、あらかじ
め複数の発音が存在することがわかっている場合には、
電話帳の登録時にその全てを登録するほうが音声認識率
を上げることができる。実施の形態４は、基本的な構成
は形態１と同一であるが、図５にあるように1つの認識
単語に対して複数の合成音音響データを登録できる単語
音響データ格納部を備えることを特徴とし、話者の発声
がいづれかの合成音音響データに一致した場合には、該
当単語の認識を可能にしたものである。たとえば、「He
nry」の例では、単語識別子３１にエントリ番号「２」
が格納され、それに関連する合成音音響データＡ３２に
「ヘンリー」に該当する音響データが、合成音音響デー
タＢ３３に「アンリ」に該当する合成音音響データが格
納される。この２つの合成音音響データは合成音声変換
部において一般的な読み方の知識を用いて自動的に生成
される。こうして、話者が「ヘンリー」あるいは「アン
リ」と発声した場合においても認識率を低下させること
なく正しく「Henry」が認識される。（実施の形態５）実施の形態４において、1つの認識単
語に対して複数の合成音音響データを持つことを可能に
することで異なる発声に対する認識率の低下を改善でき
た。しかし、認識の際に単語識別部が比較する音響デー
タが増大するため音声認識処理の時間を増加させるとい
う問題がある。実施の形態５では、単語音響データ格納
部内のそれぞれの単語の合成音音響データについて、認
識単語のうち話者の発声にもっとも一致した合成音音響
データを残して、他の合成音音響データを削除する機構
を備えることによって、次回の認識処理から不要な識別
処理を省略し、認識処理をより高速に行うことを特徴と
する。（実施の形態６）実施の形態５では、認識された単語の
複数の合成音音響データのうち、もっとも一致したもの
を除いて他の全てが削除された。しかし、ある認識単語
に対して１つ以上の発音が同程度に発生することも考え
られ、この場合には残された発音以外の音声認識率が低
くなってしまう。実施の形態６ではこれを解決する手段
として、図６の単語音響データ格納部において、各単語
の複数の合成音音響データそれぞれに対して該当単語の
認識が行われた際に一致した頻度情報を記録する機構を
追加する。この頻度情報が一定のスレッシュレベルを下
回った場合に限り、その合成音音響データを削除する。
このようにして、発声頻度の少ない合成音音響データを
音声認識のたびに徐々に削除することで、発声可能性の
ある合成音音響データを残すことで音声認識率の低下を
防ぎつつ、不要な音響データによる音声認識時間の増大
を防ぐ効果がある。電話帳における「Henry」の例を用
いて動作を説明する。図６の単語音響データ格納部にお
いて、電話帳登録時に単語識別子３５にエントリ番号
「２」が格納され、それに関連する合成音音響データＡ
３６に「ヘンリー」に該当する音響データ、合成音音響
データＢ３８に「アンリ」に該当する合成音音響デー
タ、合成音音響データＣ４０に「ヘンリ」に該当する合
成音音響データの３種類の可能性のある発音が格納され
たとする。同時にこの登録時にはそれぞれの合成音音響
データの頻度情報として、初期値１０が頻度情報Ａ３
７、頻度情報Ｂ３９および頻度情報Ｃ４１に格納され
る。まず音声認識において話者が「ヘンリー」と発声し
「Henry」が認識された場合には、「ヘンリー」に該当
する頻度情報Ａの数値は１加算されるが最大値は初期値
１０を超えないため１０のままとなり、「ヘンリー」以
外の頻度情報すなわち頻度情報Ｂおよび頻度情報Ｃの数
値が１減算され、それぞれ９となる。この「Henry」に
対する認識が「ヘンリー」の発音でさらに７回続いた場
合には、頻度情報Ａ、Ｂ、Ｃはそれぞれ１０、２、２と
なる。次に「アンリ」による認識が起こった場合、頻度
情報Ａ、Ｂ、Ｃはそれぞれ９、３、１となる。次に「ヘ
ンリー」による認識が起こった場合、頻度情報Ａ、Ｂ、
Ｃはそれぞれ１０、２、０となる。この時点で「ヘン
リ」に対応する頻度情報はスレッシュレベル１を下回り
音声音音響データＣ４０は削除される。これ以降は「ヘ
ンリ」に対する比較処理は行われない。このようにして
発声頻度の小さい合成音音響データのみが音声認識が進
むにつれて削除されることになる。（実施の形態７）実施の形態１においては、認識単語の
音響データとして合成音音響データを登録するが、話者
の実際の単語の読みと合成音声がまったく異なる場合に
は、該当単語の認識が困難になるという課題がある。こ
の問題を解決するために、発明の構成は形態２と同様で
あるが、登録する合成音声信号を音声合成処理部が生成
した際に、その信号を合成音声変換部に入力すると同時
に信号処理部へも返し、信号処理部ではスピーカーを用
いて話者にフィードバックした後、話者からの確認入力
を入力装置を通じて得るような機構を設ける。これによ
って、話者は合成音声が意図した読みに近い場合にはＯ
Ｋの確認入力を行い、合成音声が意図する読みと全く異
なる場合にはＮＧの確認入力を行うようにする。確認入
力がＮＧの場合には、信号処理部においてマイクを通じ
て該当単語に対する話者の発声を入力するようにし、そ
の音声信号を音響処理部、音響データ選択部を単語音響
データ登録部に格納するような制御を行う。以上のよう
にして、認識単語の音響データ登録時に話者への確認処
理を行うことによって、話者のトレーニングによる処理
を最小限にし、かつ意図しない合成音声が登録されて音
声認識率が低下することを防ぐことができる。

[Table 1]

After registration is complete, a user who wants to call "john", for example, speaks the name "John" into the microphone 13. The utterance is passed as a voice signal through the signal processing unit 14 to the voice recognition processing unit 16. The acoustic processing unit 20 converts "John" into acoustic data and sends it to the word identification unit 22. The word identification unit 22 compares this acoustic data with the 100 pieces of acoustic data stored in the word acoustic data storage unit and returns the word identifier "3" of the matching entry to the signal processing unit 14 as identification information. From the identification information "3", the signal processing unit 14 searches its internal telephone directory data and obtains the telephone number "123-3344" of the entry "john". Normally, the call is placed after the speaker has confirmed this telephone number. As described above, although the device of this embodiment is a low-cost, specific-speaker voice recognition device, the training that would otherwise burden the speaker is carried out automatically inside the device. Training therefore appears unnecessary to the speaker, and the device offers the same convenience as an unspecified-speaker (speaker-independent) voice recognition device.

(Embodiment 2) As described in Embodiment 1, registering synthetic sound acoustic data for every entry in the word acoustic data storage unit frees the speaker from training. With this method, however, a word may be hard to recognize when the speaker's pronunciation differs greatly from the standard pronunciation generated by the synthetic speech conversion unit, for example because of dialect. To address this, Embodiment 2 has the same basic configuration as Embodiment 1 but adds an acoustic data selection unit 27 to the voice recognition processing unit, as shown in FIG. 3, so that the acoustic data stored for a recognition word can be selected from either the synthetic sound acoustic data or the acoustic data from the acoustic processing unit 24. The operation is described below using the same telephone voice dialing example as in Embodiment 1.
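The voice dialing flow referred to here (register acoustic data derived from the entry text, then map a matching utterance to a word identifier and a telephone number) can be sketched roughly as follows. This is only an illustration under loud assumptions: `toy_acoustic_data`, `distance`, `VoiceDialer` and the `max_distance` cutoff are hypothetical names, the letter histogram is a toy stand-in for real acoustic features, and the number given for "Henry" is a placeholder that does not come from the source.

```python
from collections import Counter


def toy_acoustic_data(reading):
    """Toy stand-in for voice synthesis plus conversion to acoustic data.

    Returns a letter histogram. Real acoustic data would be feature vectors;
    this is an assumption made only so the sketch can run.
    """
    return Counter(ch for ch in reading.lower() if ch.isalpha())


def distance(a, b):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))


class VoiceDialer:
    def __init__(self, max_distance=2):
        self.directory = {}        # word identifier -> (name, number)
        self.acoustic_store = {}   # word identifier -> acoustic data
        self.max_distance = max_distance  # hypothetical acceptance cutoff

    def register(self, identifier, name, number):
        # Registration uses data derived from the text, not a spoken sample.
        self.directory[identifier] = (name, number)
        self.acoustic_store[identifier] = toy_acoustic_data(name)

    def recognize(self, utterance):
        """Return the word identifier of the closest entry, or None."""
        heard = toy_acoustic_data(utterance)
        scored = [(distance(data, heard), ident)
                  for ident, data in self.acoustic_store.items()]
        if not scored:
            return None
        best_distance, best_id = min(scored)
        return best_id if best_distance <= self.max_distance else None

    def dial(self, utterance):
        identifier = self.recognize(utterance)
        return None if identifier is None else self.directory[identifier][1]


dialer = VoiceDialer()
dialer.register(2, "Henry", "000-0000")  # placeholder number, not from the source
dialer.register(3, "john", "123-3344")   # entry 3 as in the example above
```

Calling `dialer.dial("John")` looks up identifier 3 and returns its number, mirroring the role of the identification information passed back to the signal processing unit.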
In this embodiment, when acoustic data is registered in the word acoustic data storage unit 25 together with the 100 telephone directory entries, the acoustic data selection unit 27 selects the synthetic sound acoustic data. When registration is complete, the contents of the word acoustic data storage unit 25 are therefore the same as in Embodiment 1. Now consider the case where the synthesized voice signal for "Henry" in entry number 2 is "Henrī" (ヘンリー) but the speaker actually reads the name as "Anri" (アンリ). In voice dialing, the rate at which the word identification unit 26 correctly returns entry number "2" to the signal processing unit for the speaker's "Anri" (the voice recognition rate) is lower than it would be if the correct reading "Anri" were registered in the word acoustic data storage unit 25. To prevent this drop in the recognition rate when the synthesized voice and the actual utterance differ, a mechanism is provided that, based on the identification information from the word identification unit 26, replaces the corresponding acoustic data in the word acoustic data storage unit 25 with the acoustic data output by the acoustic processing unit 24. In this example, when the speaker utters "Anri" and entry number "2" is returned from the word identification unit 26 to the signal processing unit as identification information, the acoustic data for "Anri" held in the acoustic processing unit 24 is selected by the acoustic data selection unit 27, and at the same time the word identifier "2" is input from the signal processing unit to the word acoustic data storage unit 25. Using these data, the acoustic data of entry 2 in the word acoustic data storage unit is replaced with the acoustic data for "Anri". Because this processing runs implicitly, it places no burden on the speaker. In this way, even when the actual utterance differs somewhat from the synthesized voice, the acoustic data for an identified word is continually updated to match the actual speaker's utterance, which prevents a lasting reduction in the recognition rate caused by that difference.

(Embodiment 3) In Embodiment 2, once a word is recognized, its acoustic data is replaced with the uttered acoustic data, with the aim of preventing a drop in the recognition rate when the synthesized voice differs from the actual pronunciation. However, the utterance registered as the replacement is not always appropriate. For example, when several speakers with different pronunciations share one voice recognition device, or when ambient noise is mixed into an utterance even from the same speaker, the acoustic data of one particular utterance, once registered in the word acoustic data storage unit as the acoustic data for that word, may lower the recognition rate even for a standard utterance. To solve this problem, in Embodiment 3 the word acoustic data storage unit can hold both synthetic sound acoustic data 29 and uttered sound acoustic data 30 for each recognition target word, as shown in FIG. 4. An utterance then matches one of the two pieces of acoustic data whether it is close to the standard synthesized voice or is a nonstandard pronunciation peculiar to the speaker, which prevents a drop in the recognition rate. The operation is described using the earlier telephone directory example. Consider the case where the synthesized voice signal for "Henry" in entry number 2 is "Henrī" (ヘンリー) but the actual reading is "Anri" (アンリ). First, when the telephone directory is registered, entry number "2" is stored in the word identifier 28 of the word acoustic data storage unit, and "Henrī" is stored in the associated synthetic sound acoustic data 29. Later, when the speaker's actual utterance "Anri" is successfully recognized and entry "2" is returned to the signal processing unit as identification information, "Anri" is stored in the uttered sound acoustic data 30 associated with entry "2".
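The two-slot storage of Embodiment 3 might look like the sketch below, offered only as an illustration. The class and field names, the use of short numeric tuples as acoustic data, the Euclidean distance, and the `accept_distance` cutoff are all assumptions; the source does not say how acoustic data is represented or compared.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Vector = Tuple[float, ...]  # assumed representation of acoustic data


@dataclass
class Entry:
    synthetic: Vector                 # role of synthetic sound acoustic data 29
    uttered: Optional[Vector] = None  # role of uttered sound acoustic data 30


class WordAcousticStore:
    def __init__(self, accept_distance=1.5):
        self.entries = {}  # word identifier -> Entry
        self.accept_distance = accept_distance  # hypothetical cutoff

    def register(self, identifier, synthetic):
        self.entries[identifier] = Entry(synthetic)

    def _score(self, identifier, heard):
        entry = self.entries[identifier]
        candidates = [entry.synthetic]
        if entry.uttered is not None:
            candidates.append(entry.uttered)
        # An utterance may match either slot; the closer one counts.
        return min(math.dist(c, heard) for c in candidates)

    def recognize(self, heard):
        if not self.entries:
            return None
        best = min(self.entries, key=lambda i: self._score(i, heard))
        if self._score(best, heard) > self.accept_distance:
            return None
        # The synthetic slot is kept; only the uttered slot is refreshed.
        self.entries[best].uttered = heard
        return best


store = WordAcousticStore()
store.register(2, (1.0, 0.0))   # stands in for the synthesized reading of entry 2
store.register(3, (0.0, 5.0))   # some other entry
first = store.recognize((0.5, 0.5))   # speaker's own reading, close enough to match
second = store.recognize((1.0, 0.0))  # a standard reading still matches slot 29
```

Overwriting `synthetic` instead of filling `uttered` would give the replacement behaviour of Embodiment 2, which is the design difference between the two embodiments.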
This sequence is identical to that of Embodiment 2 except that the existing acoustic data is not overwritten. As a result, the word acoustic data storage unit holds two pieces of acoustic data for the telephone directory word "Henry". In subsequent voice dialing, recognition therefore proceeds without a drop in the recognition rate whether the speaker utters "Anri" (アンリ) or another speaker utters "Henrī" (ヘンリー). As in Embodiment 2, the uttered sound acoustic data 30 is updated to the acoustic data of the speaker's utterance each time entry "2" is successfully recognized.

(Embodiment 4) In Embodiment 1, only one piece of acoustic data can be registered per recognition word. However, when a word is known in advance to have several pronunciations, as with "Henry" in the telephone directory example, registering all of them when the directory is registered gives a higher recognition rate. Embodiment 4 has the same basic configuration as Embodiment 1 but, as shown in FIG. 5, includes a word acoustic data storage unit in which several pieces of synthetic sound acoustic data can be registered for one recognition word; the word is recognized when the speaker's utterance matches any one of them. In the "Henry" example, entry number "2" is stored in the word identifier 31, acoustic data for "Henrī" is stored in the associated synthetic sound acoustic data A32, and synthetic sound acoustic data for "Anri" is stored in synthetic sound acoustic data B33. Both are generated automatically in the synthetic speech conversion unit using knowledge of common readings. "Henry" is thus recognized correctly, with no drop in the recognition rate, whether the speaker says "Henrī" or "Anri".

(Embodiment 5) Embodiment 4 reduces the drop in recognition rate for differing pronunciations by allowing several pieces of synthetic sound acoustic data per recognition word. However, because the word identification unit must compare against more acoustic data, recognition takes longer. Embodiment 5 adds a mechanism that, for each word in the word acoustic data storage unit, keeps the synthetic sound acoustic data that best matched the speaker's utterance and deletes the rest. Unnecessary comparisons are skipped from the next recognition onward, so recognition runs faster.

(Embodiment 6) In Embodiment 5, all synthetic sound acoustic data of the recognized word except the best match is deleted. However, more than one pronunciation of a word may be used about equally often, and in that case the recognition rate for any pronunciation other than the one kept becomes low. To solve this, Embodiment 6 adds to the word acoustic data storage unit of FIG. 6 a mechanism that records, for each piece of synthetic sound acoustic data of each word, frequency information on how often it matched when that word was recognized. A piece of synthetic sound acoustic data is deleted only when its frequency information falls below a set threshold level.
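The counter bookkeeping behind this threshold rule can be traced with a short sketch. It assumes the update rule used in this embodiment's "Henry" walkthrough: the matched pronunciation gains 1 up to a cap equal to the initial value 10, every other pronunciation loses 1, and a pronunciation is dropped once its counter falls below the threshold 1. The class name and the romanized keys are hypothetical labels chosen for the example.

```python
INITIAL_COUNT = 10  # initial value, which also serves as the maximum
THRESHOLD = 1       # a pronunciation is deleted when its count falls below this


class PronunciationCounters:
    """Frequency information for the pronunciations of one recognition word."""

    def __init__(self, readings):
        self.counts = {reading: INITIAL_COUNT for reading in readings}

    def record_match(self, matched):
        """Update all counters after the word was recognized via `matched`."""
        if matched not in self.counts:
            raise KeyError(matched)
        for reading in list(self.counts):
            if reading == matched:
                # Increment, but never above the initial value.
                self.counts[reading] = min(self.counts[reading] + 1, INITIAL_COUNT)
            else:
                self.counts[reading] -= 1
                if self.counts[reading] < THRESHOLD:
                    # Rarely used pronunciation: stop comparing against it.
                    del self.counts[reading]


# Keys stand for ヘンリー (A), アンリ (B) and ヘンリ (C) of entry 2.
entry2 = PronunciationCounters(["henrii", "anri", "henri"])
for _ in range(8):                 # one recognition, then seven more, via ヘンリー
    entry2.record_match("henrii")
after_eight = dict(entry2.counts)  # A, B, C = 10, 2, 2
entry2.record_match("anri")
after_anri = dict(entry2.counts)   # A, B, C = 9, 3, 1
entry2.record_match("henrii")      # C reaches 0 and is deleted
```

Capping the counter at its initial value means a long run of one pronunciation cannot build up credit, so a pronunciation that falls out of use is pruned after a bounded number of recognitions.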
In this way, synthetic sound acoustic data for rarely used pronunciations is deleted gradually as recognition proceeds, while data for pronunciations that may still be used is kept. This prevents a drop in the recognition rate while also preventing unnecessary acoustic data from lengthening recognition time. The operation is described using the "Henry" example from the telephone directory. Suppose that, at registration, entry number "2" is stored in the word identifier 35 of the word acoustic data storage unit of FIG. 6, and three candidate pronunciations are stored with it: acoustic data for "Henrī" (ヘンリー) in synthetic sound acoustic data A36, for "Anri" (アンリ) in synthetic sound acoustic data B38, and for "Henri" (ヘンリ) in synthetic sound acoustic data C40. At the same time, the initial value 10 is stored as the frequency information of each, in frequency information A37, frequency information B39 and frequency information C41. First, when the speaker says "Henrī" and "Henry" is recognized, frequency information A for "Henrī" would be incremented by 1, but because the maximum is the initial value 10 it stays at 10; the frequency information for the other pronunciations, B and C, is decremented by 1 to 9 each. If "Henry" is recognized seven more times with the pronunciation "Henrī", frequency information A, B and C becomes 10, 2 and 2. If recognition by "Anri" then occurs, A, B and C become 9, 3 and 1. If recognition by "Henrī" occurs next, A, B and C become 10, 2 and 0. At this point the frequency information for "Henri" falls below the threshold level of 1, and synthetic sound acoustic data C40 is deleted. From then on, no comparison against "Henri" is performed. In this way, only synthetic sound acoustic data for infrequently used pronunciations is deleted as recognition proceeds.

(Embodiment 7) In Embodiment 1, synthetic sound acoustic data is registered as the acoustic data of a recognition word, but when the speaker's actual reading of the word is completely different from the synthesized voice, the word becomes difficult to recognize. To solve this, the configuration is the same as in Embodiment 2, but when the voice synthesis processing unit generates the synthesized voice signal to be registered, the signal is input to the synthetic speech conversion unit and simultaneously returned to the signal processing unit. The signal processing unit plays it back to the speaker through the loudspeaker and then obtains a confirmation input from the speaker through the input device. The speaker enters OK if the synthesized voice is close to the intended reading and NG if it is completely different. If the confirmation is NG, the signal processing unit takes the speaker's utterance of the word through the microphone, and the resulting voice signal is passed through the acoustic processing unit and the acoustic data selection unit and stored in the word acoustic data storage unit. By confirming with the speaker when the acoustic data of a recognition word is registered, speaker training is kept to a minimum while unintended synthesized voices are kept from being registered and lowering the recognition rate.
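The OK/NG confirmation at registration time in Embodiment 7 can be sketched as a small control routine. Everything here is a hypothetical illustration: the function name, the callback parameters and the tagged tuples stored per entry are assumptions, and the callbacks merely stand in for the voice synthesis processing unit, loudspeaker, input device, microphone and conversion to acoustic data.

```python
def register_word(identifier, text, store, synthesize, play,
                  confirm_ok, record_utterance, to_acoustic_data):
    """Register one recognition word, asking the speaker to confirm the reading.

    All callbacks are hypothetical stand-ins for the device's units.
    Returns "synthetic" or "uttered" to show which source was stored.
    """
    signal = synthesize(text)   # synthesized voice signal for the entry text
    play(signal)                # fed back to the speaker through the loudspeaker
    if confirm_ok():
        # OK: the synthesized reading is close to the intended one.
        source, data = "synthetic", to_acoustic_data(signal)
    else:
        # NG: fall back to a single spoken sample for this word only.
        source, data = "uttered", to_acoustic_data(record_utterance())
    store[identifier] = (source, data)
    return source


# Minimal stand-ins so the routine can be exercised without any hardware.
played = []
store = {}
kept = register_word(
    3, "john", store,
    synthesize=lambda text: "synth:" + text,
    play=played.append,
    confirm_ok=lambda: True,
    record_utterance=lambda: "mic:john",
    to_acoustic_data=lambda signal: signal.upper(),
)
redone = register_word(
    2, "Henry", store,
    synthesize=lambda text: "synth:" + text,
    play=played.append,
    confirm_ok=lambda: False,
    record_utterance=lambda: "mic:anri",
    to_acoustic_data=lambda signal: signal.upper(),
)
```

Only the entries answered with NG require the speaker to say anything, which is how the training effort stays minimal.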

【発明の効果】本発明は低コストで実現できる特定話者
の音声認識装置を基本として、一般的に音声認識と併用
されることの多い音声合成機能を利用し、それから生成
される音声信号を話者による発声の代わりに装置内部で
自動的に行うことで、見かけ上は話者に負担となるトレ
ーニングのない音声認識装置を提供することができる。[Effects of the Invention] The present invention takes as its base a specific-speaker voice recognition device, which can be realized at low cost, and uses the voice synthesis function that commonly accompanies voice recognition. The voice signal generated by that function is registered automatically inside the device in place of the speaker's utterances. This provides a voice recognition device that, from the speaker's point of view, requires no burdensome training.

[Brief description of drawings]

【図１】本発明の音声認識装置を搭載した電話機の基本
的構成を示すブロック図FIG. 1 is a block diagram showing a basic configuration of a telephone equipped with a voice recognition device of the present invention.

【図２】本発明の実施の形態１における音声認識処理部
の構成を示すブロック図FIG. 2 is a block diagram showing a configuration of a voice recognition processing unit according to the first embodiment of the present invention.

【図３】本発明の実施の形態２における音声認識処理部
の構成を示すブロック図FIG. 3 is a block diagram showing a configuration of a voice recognition processing unit according to the second embodiment of the present invention.

【図４】本発明の音声認識装置に必要な単語音響データ
格納部の一例を示すブロック図FIG. 4 is a block diagram showing an example of a word acoustic data storage unit necessary for the voice recognition device of the present invention.

【図５】本発明の音声認識装置に必要な単語音響データ
格納部の他の例を示すブロック図FIG. 5 is a block diagram showing another example of a word acoustic data storage unit necessary for the voice recognition device of the present invention.

【図６】本発明の音声認識装置に必要な単語音響データ
格納部の他の例を示すブロック図FIG. 6 is a block diagram showing another example of a word acoustic data storage unit necessary for the voice recognition device of the present invention.

【図７】トレーニングを伴う従来の単語音響データの登
録処理手順を示すフローチャートFIG. 7 is a flowchart showing a conventional registration process procedure of word acoustic data accompanied by training.

【図８】本特許によるトレーニングを伴わない単語音響
データの登録処理手順を示すフローチャートFIG. 8 is a flowchart showing a procedure for registering word acoustic data without training according to the present patent.

【図９】一般的な特定話者法式の音声認識装置の構成を
示すブロック図FIG. 9 is a block diagram showing the configuration of a general specific speaker method speech recognition device.

【図１０】同音声認識装置の音声認識処理部の構成を示
すブロック図FIG. 10 is a block diagram showing a configuration of a voice recognition processing unit of the voice recognition device.

【図１１】同音声認識装置の単語音響データ格納部の構
成を示すブロック図FIG. 11 is a block diagram showing a configuration of a word acoustic data storage unit of the voice recognition device.

[Explanation of symbols]

１３マイク１４信号処理部１５音声合成処理部１６音声認識処理部１７入力装置１８スピーカー１９合成音声変換部２０音響処理部２１単語音響データ格納部２２単語識別部２３表示装置２７音響データ選択部２８単語識別子２９合成音音響データ３０発声音響データ 13: microphone; 14: signal processing unit; 15: voice synthesis processing unit; 16: voice recognition processing unit; 17: input device; 18: speaker; 19: synthetic speech conversion unit; 20: acoustic processing unit; 21: word acoustic data storage unit; 22: word identification unit; 23: display device; 27: acoustic data selection unit; 28: word identifier; 29: synthetic sound acoustic data; 30: uttered sound acoustic data

フロントページの続き (72)発明者立山雅一大阪府門真市大字門真1006番地松下電器産業株式会社内 (72)発明者西田博人大阪府門真市大字門真1006番地松下電器産業株式会社内 (72)発明者黒木義明大阪府門真市大字門真1006番地松下電器産業株式会社内 (72)発明者西岡靖幸大阪府門真市大字門真1006番地松下電器産業株式会社内 (72)発明者五島龍宏大阪府門真市大字門真1006番地松下電器産業株式会社内Ｆターム(参考） 5D015 AA03 HH04 KK04 5D045 AB30 Continued from front page: (72) Inventor Masakazu Tateyama, c/o Matsushita Electric Industrial Co., Ltd., 1006 Kadoma, Kadoma-shi, Osaka; (72) Inventor Hiroto Nishida, c/o Matsushita Electric Industrial Co., Ltd., 1006 Kadoma, Kadoma-shi, Osaka; (72) Inventor Yoshiaki Kuroki, c/o Matsushita Electric Industrial Co., Ltd., 1006 Kadoma, Kadoma-shi, Osaka; (72) Inventor Yasuyuki Nishioka, c/o Matsushita Electric Industrial Co., Ltd., 1006 Kadoma, Kadoma-shi, Osaka; (72) Inventor Tatsuhiro Goshima, c/o Matsushita Electric Industrial Co., Ltd., 1006 Kadoma, Kadoma-shi, Osaka. F-term (reference): 5D015 AA03 HH04 KK04; 5D045 AB30

Claims

[Claims]

1. A voice recognition device comprising: a microphone as a voice input device; a speaker as a voice output device; an input device such as a keyboard; a display device for displaying a recognition result; a signal processing unit to which the microphone, the speaker, the input device and the output device are connected and which controls the processing of the entire voice recognition device; a voice synthesis processing unit that receives text information from the signal processing unit and outputs a synthesized voice signal; and a voice recognition processing unit that compares a voice signal input from the signal processing unit with a plurality of acoustic data held internally and outputs the matching result to the signal processing unit as the voice recognition result, characterized in that a synthetic speech conversion unit is provided that converts the synthesized voice signal from the voice synthesis processing unit into synthetic sound acoustic data, and the synthetic sound acoustic data is stored in the voice recognition processing unit in place of the acoustic data conventionally generated through training by the speaker and used for voice recognition, so that training, which burdens the speaker, is not performed.

2. The voice recognition device according to claim 1, wherein the voice recognition processing unit has an acoustic data selection unit that selects either the synthetic sound acoustic data output from the synthetic speech conversion unit or the acoustic data output from the acoustic processing unit; synthetic sound acoustic data is stored in the word acoustic data storage unit in the initial state of the device, and when a recognition word is recognized, the synthetic sound acoustic data is replaced with the acoustic data of the utterance of that word, so that recognition from then on uses acoustic data of the actual pronunciation and a decrease in the voice recognition rate is prevented even when the synthesized voice and the actual pronunciation differ.

3. The voice recognition device according to claim 2, wherein the word acoustic data storage unit holds both the synthetic sound acoustic data output from the synthetic speech conversion unit and the acoustic data of the speaker's utterance output from the acoustic processing unit, and the identification information of the corresponding word is output when the input matches either of them, thereby improving the voice recognition rate.

4. The voice recognition device according to claim 1, wherein the word acoustic data storage unit is configured to receive from the synthetic speech conversion unit and store a plurality of pronunciations for one word, so that the corresponding word is recognized correctly whichever of those pronunciations the speaker uses.

5. The voice recognition device according to claim 4, wherein, when the speaker utters a word, the word acoustic data storage unit keeps the acoustic data that matched the utterance and deletes the other acoustic data, so that unnecessary identification processing is omitted from the next time onward and recognition processing is performed at a higher speed.

6. The voice recognition device according to claim 5, wherein a mechanism is added to the word acoustic data storage unit that holds, for each of the plurality of synthetic sound acoustic data of a word, the frequency with which the speaker's utterance has matched it, and, when the corresponding word is recognized, only the acoustic data whose matching frequency is equal to or lower than a threshold level is deleted.

7. The voice recognition device according to claim 1, wherein, when the synthetic sound acoustic data for each word to be identified is stored in the voice recognition processing unit, the synthesized sound is played back to the speaker through the speaker so that the speaker can confirm whether it is the intended synthesized sound, and the training procedure by the speaker is performed only when it is not the intended one.