JPH11282485A - Voice input device

Info

Publication number: JPH11282485A
Application number: JP10100678A
Authority: JP
Inventors: Masaru Takano; 優高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-03-27
Filing date: 1998-03-27
Publication date: 1999-10-15
Anticipated expiration: 2018-03-27
Also published as: JP3302923B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice input device that addresses the drop in recognition rate caused by chance fluctuations in an individual utterance at the time of voice input. SOLUTION: The device comprises a speech analysis unit 101 that calculates a feature amount at fixed time intervals for each of a plurality of input speeches; a dictionary unit 102 that holds the speech patterns (word models) of all words assumed as input; a matching unit 103 that performs pattern matching between the feature amounts and the word models in the dictionary unit, and calculates and outputs the coarse likelihood of each input speech for each word model; a coarse likelihood storage unit 104 that holds the coarse likelihoods between each word model and all of the input speeches; and an output means 105 that outputs a predetermined number of words in descending order of the weighted average of the coarse likelihoods between the corresponding word models and the input speeches. Using a plurality of inputs eliminates the adverse influence of chance factors. Performing recognition from utterances made in different styles also reduces the influence of the utterance style.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice input device, and more particularly to a voice registration and recognition device that performs word recognition from a plurality of voice inputs.

[0002]

[Prior Art] For a voice registration and recognition device that performs word recognition from a plurality of voice inputs, see, for example, Reference 1 (Kiyohiro Shikano, "Fundamentals of Speech Recognition", May 17, 1996, pp. 32-34, 39-43).

[0003] Word input by a voice input device is still not 100% accurate, so the input device needs a means of raising the reliability of speech recognition.

[0004]

[Problems to Be Solved by the Invention] Human speech can vary greatly from one utterance to the next depending on the environment and the speaker's condition, even when the same person utters the same word. In particular, when recognition is attempted from a single utterance, any disturbance in that one utterance strongly affects the speech waveform, and misrecognition is likely.

[0005] For this reason, among the various situations in which a voice input device is used, there are situations in which the number of voice inputs can be set high in order to obtain high recognition performance.

[0006] Accordingly, an object of the present invention is to provide a voice input device that, in such situations, achieves speech recognition with a high recognition rate by using a plurality of input speeches.

[0007]

[Means for Solving the Problems] To achieve the above object, the present invention combines the information from two or more utterances. This reduces the influence of the utterance-specific "fluctuation" that tends to occur with a single utterance, and realizes highly accurate voice input. The present invention also removes the "fluctuation" of the speech waveform that depends on an individual utterance style by accepting inputs made in different utterance styles.

[0008] More specifically, the present invention comprises: speech analysis means for calculating a feature amount at fixed time intervals for each of a plurality of input speeches; a dictionary unit that holds the speech patterns (referred to as "word models") of all words assumed as input; matching means for performing pattern matching between the feature amounts and the word models in the dictionary unit, and for calculating and outputting the likelihood (referred to as "coarse likelihood") of the input speech for each of the word models; a coarse likelihood storage unit that holds the coarse likelihoods between each word model and all of the input speeches; and output means for outputting a predetermined number of the words in descending order of the weighted average of the coarse likelihoods between the corresponding word model and the input speeches.
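The ranking performed by the output means can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the score values, and the use of log-scale scores are assumptions made for the example. Uniform weights give the plain average, and the geometric weights correspond to the weighting described in claim 6 of this document (common ratio chosen arbitrarily here).

```python
def rank_by_weighted_average(coarse, weights, n_best):
    """coarse maps each word to its coarse likelihoods, one per input utterance."""
    total = sum(weights)
    averaged = {
        word: sum(w * s for w, s in zip(weights, scores)) / total
        for word, scores in coarse.items()
    }
    ranked = sorted(averaged, key=averaged.get, reverse=True)
    return ranked[:n_best], averaged


def geometric_weights(n_utterances, ratio):
    """Weights 1, r, r^2, ... in chronological order of the utterances."""
    return [ratio ** i for i in range(n_utterances)]


# Hypothetical log-scale scores for three utterances (not taken from the patent).
coarse = {
    "sato": [-10.0, -30.0, -12.0],
    "kato": [-20.0, -22.0, -21.0],
    "goto": [-35.0, -33.0, -40.0],
}
top_uniform, avg_uniform = rank_by_weighted_average(coarse, [1, 1, 1], n_best=2)
top_geo, avg_geo = rank_by_weighted_average(coarse, geometric_weights(3, 0.5), n_best=2)
```

With these made-up scores both weightings rank "sato" first and "kato" second; a ratio below 1 simply gives earlier utterances more influence.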

[0009]

[Mode for Carrying Out the Invention] An embodiment of the present invention will now be described. In speech recognition, the likelihood between the input speech and each existing word model is normally calculated by pattern matching, and the word model with the highest likelihood in a predefined set of word models is taken as the recognition result. However, a disturbed utterance often causes the likelihood of the word model corresponding to the word actually uttered to be calculated as low.

[0010] In the present invention, the average or the maximum of the likelihoods of a plurality of utterances is used as the likelihood, which reduces the influence of a likelihood drop in an individual utterance caused by such chance factors.

[0011] A likelihood drop caused by the utterance style is also conceivable. One example is an acoustic change such as slurring ("namake", lazy articulation) that appears when particular pairs of syllables are joined. To reduce the influence of such style-dependent likelihood drops, the present invention obtains the average or the maximum described above from the likelihoods of utterances made in a plurality of utterance styles, and uses it as the likelihood.

[0012] In a preferred embodiment, the voice input device of the present invention comprises: speech analysis means (101 in FIG. 1) that takes a plurality of speeches as input and calculates and outputs acoustic feature amounts for each frame of each input speech; a dictionary unit (102 in FIG. 1) that holds the word models; matching means (103 in FIG. 1) that takes as input the feature amounts of each input speech output by the speech analysis means, obtains the likelihood of the input speech for each word model, and outputs it as a coarse likelihood; a coarse likelihood storage unit (104 in FIG. 1) that holds the coarse likelihoods for every combination of input speech and word model; and output means (105 in FIG. 1) that, for each word model, calculates the average of its coarse likelihoods over the input speeches, orders the word models in descending order of that value, and outputs a fixed number of word models in that order.

[0013] The operation of the embodiment of the present invention will now be described.

[0014] The speech analysis means performs frequency analysis on each fixed-length interval (frame) of the input speech, calculates the feature amounts for that interval, and outputs them. The feature amounts used include the speech power, the change in power, the cepstrum, and the change in the cepstrum.
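As a rough illustration of per-frame feature calculation, the sketch below computes two of the listed feature amounts, the frame power and its change from the previous frame. The frame length, hop size, log scale, and test signal are assumptions for the example, and the cepstral features are omitted to keep the sketch free of external libraries.

```python
import math


def frame_log_power(samples, frame_len, hop):
    """Split the signal into frames and return the log power of each frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0) on silence
    return powers


def delta(values):
    """Frame-to-frame change; the first frame has no predecessor, so use 0.0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical input: 50 quiet samples followed by 50 louder ones.
signal = [0.01 * math.sin(0.3 * n) for n in range(50)] + \
         [0.50 * math.sin(0.3 * n) for n in range(50, 100)]
power = frame_log_power(signal, frame_len=20, hop=10)
delta_power = delta(power)
features = list(zip(power, delta_power))  # one (power, delta power) pair per frame
```

A practical front end would append cepstral coefficients and their deltas to each frame's feature vector in the same way.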

[0015] The dictionary unit holds word models for all the words to be recognized, that is, all the words assumed in advance.

[0016] The matching means performs pattern matching between the input feature amounts and all the word models in the dictionary, and calculates and outputs the coarse likelihood of the input speech for each word model.

[0017] The coarse likelihood storage unit stores, for each input speech, the likelihood of each word model output by the matching means.

[0018] The output means sums the coarse likelihoods stored in the coarse likelihood storage unit for each word model and obtains their average. It then outputs the words represented by the word models, up to a fixed number, in descending order of that average. The matching means and the output means may be implemented under the control of a program executed by a processor or the like that constitutes the voice input device.

[0019]

[Embodiments] To describe the above embodiment in more detail, embodiments of the present invention are described below with reference to the drawings.

[0020] The first embodiment of the present invention will now be described. FIG. 1 shows the configuration of the first embodiment. Referring to FIG. 1, the first embodiment comprises: a speech analysis unit 101 that calculates and outputs acoustic feature amounts for each frame of each input speech; a dictionary unit 102 that holds the word models; a matching unit 103 that takes as input the feature amounts of each input speech output by the speech analysis unit, obtains the likelihood of the input speech for each word model, and outputs it as a coarse likelihood; a coarse likelihood storage unit 104 that holds the coarse likelihoods for every combination of input speech and word model; and an output unit 105 that, for each word model, calculates the average of its coarse likelihoods over the input speeches, orders the word models in descending order of that value, and outputs a fixed number of word models in that order.

[0021] FIG. 2 is a flowchart showing the operation of the first embodiment. The operation of the first embodiment is described below with reference to FIGS. 1 and 2.

[0022] Assume that the dictionary unit 102 holds three word models: "Sato", "Kato", and "Goto". Assume also that the output unit 105 outputs two words, and that the word "Sato" is uttered three times as the input speech.

[0023] First, in step 1, the speech analysis unit 101 calculates and outputs the feature amount of each frame of the first utterance of "Sato".

[0024] Next, in step 2, for each word model stored in the dictionary unit 102, the matching unit 103 performs pattern matching against the feature amounts of the frames using the Viterbi method, as described in Reference 1 above, and obtains the coarse likelihood of each word model for the first "Sato" input speech.
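The text names the Viterbi method but does not spell out the model, so the following is only a simplified sketch of Viterbi scoring. It assumes a left-to-right model with one scalar Gaussian per state, unit variance, and equal stay/advance probabilities; the state means and frame values are invented for the example. Real word models would use feature vectors and trained parameters.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var=1.0):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_likelihood(frames, state_means,
                           log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log likelihood of the frames under a left-to-right model."""
    n = len(state_means)
    score = [NEG_INF] * n
    score[0] = log_gauss(frames[0], state_means[0])  # path must start in state 0
    for x in frames[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            stay = score[s] + log_stay
            move = score[s - 1] + log_next if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + log_gauss(x, state_means[s])
        score = new
    return score[-1]  # path must end in the final state


# Hypothetical one-dimensional "word models" and input frames.
word_models = {"sato": [1.0, 5.0, 9.0], "kato": [3.0, 5.0, 9.0], "goto": [7.0, 5.0, 9.0]}
frames = [1.1, 0.9, 5.2, 4.8, 9.1, 8.9]
coarse = {word: viterbi_log_likelihood(frames, means) for word, means in word_models.items()}
best_word = max(coarse, key=coarse.get)
```

Each value in `coarse` plays the role of one coarse likelihood: the score of one input utterance against one word model.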

[0025] Next, in step 3, the matching unit 103 stores the coarse likelihood of each word model for the first "Sato" input speech in the coarse likelihood storage unit 104.

[0026] Steps 1 to 3 are then repeated for the second and third "Sato" input speeches, which yields the coarse likelihoods to be stored in the coarse likelihood storage unit 104, as shown in FIG. 3.

[0027] Once the coarse likelihoods have been stored for every combination of input speech and word model, in step 4 the output unit 105 obtains, for each word model, the average of its coarse likelihoods over all the utterances, as shown in FIG. 4.

[0028] Then, in step 5, the output unit 105 outputs the two words with the largest average coarse likelihood, "Sato" and "Kato".
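Steps 1 to 5 can be traced with a small sketch. The values of FIG. 3 and FIG. 4 are not reproduced in the text, so the scores below are hypothetical, and the dictionary `store` merely stands in for the coarse likelihood storage unit 104.

```python
from statistics import mean

# Hypothetical coarse likelihoods (log scale), one dict per utterance of "sato".
matcher_output = [
    {"sato": -12.0, "kato": -18.0, "goto": -30.0},  # 1st utterance
    {"sato": -25.0, "kato": -20.0, "goto": -28.0},  # 2nd utterance, disturbed
    {"sato": -11.0, "kato": -19.0, "goto": -31.0},  # 3rd utterance
]

store = {}  # (utterance index, word) -> coarse likelihood
for i, scores in enumerate(matcher_output):  # steps 1 to 3, repeated per utterance
    for word, score in scores.items():
        store[(i, word)] = score

words = sorted({word for _, word in store})
# Step 4: average each word model's coarse likelihood over all utterances.
averages = {w: mean(s for (_, word), s in store.items() if word == w) for w in words}
# Step 5: output the two words with the largest average.
top_two = sorted(averages, key=averages.get, reverse=True)[:2]
second_alone = max(matcher_output[1], key=matcher_output[1].get)
```

In this made-up data the second utterance on its own would favor "kato", while the average over three utterances still ranks "sato" first.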

[0029] This embodiment of the present invention compensates for the weakness that pattern matching on a single utterance does not yield a stable likelihood.

[0030] Another embodiment of the present invention will now be described. If the aim is only to prevent the influence of the chance likelihood drop described above, it suffices to select, from the plurality of utterances, only the utterance with the highest likelihood and to recognize from it. The second embodiment of the present invention is configured by replacing the average taken in the output unit 105 of the first embodiment shown in FIG. 1 with the maximum. This, too, prevents the influence of a likelihood drop caused by chance factors.
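The difference between the two aggregation rules can be seen in a short sketch. The scores are hypothetical and chosen so that one utterance of "sato" is badly disturbed; the helper name `rank` is likewise an assumption for illustration.

```python
def rank(coarse, aggregate, n_best):
    """Rank words by an aggregate (e.g. max) of their per-utterance coarse likelihoods."""
    combined = {word: aggregate(scores) for word, scores in coarse.items()}
    return sorted(combined, key=combined.get, reverse=True)[:n_best], combined


# Hypothetical scores: the second utterance of "sato" is badly disturbed.
coarse = {
    "sato": [-12.0, -60.0, -11.0],
    "kato": [-18.0, -20.0, -19.0],
    "goto": [-30.0, -28.0, -31.0],
}
by_max, max_scores = rank(coarse, max, n_best=2)
by_mean, mean_scores = rank(coarse, lambda s: sum(s) / len(s), n_best=2)
```

With these values the maximum ignores the disturbed utterance and keeps "sato" first, whereas the average is pulled down far enough that "kato" overtakes it.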

[0031] Next, the third embodiment of the present invention will be described. A likelihood drop can be caused not only by chance factors but also by differences in utterance style. One way to reduce the influence of the utterance style is to use a plurality of utterance styles. The third embodiment of the present invention changes the operation of the matching unit 103 of the embodiment shown in FIG. 1.

[0032] In the third embodiment of the present invention, the matching unit 103 changes the matching process according to the style of the input speech. When the input speech is a word utterance such as "Sato", it performs the operation described in the first embodiment. When the input speech is a syllable-by-syllable utterance such as "sa ... to ... u", it performs syllable recognition on each syllable of the input and obtains the coarse likelihood by averaging the likelihoods of the syllables.

[0033] The fourth embodiment of the present invention is obtained by adding, to the contents of the dictionary unit 102 of the embodiment shown in FIG. 1, a syllable model for every syllable contained in each word. For example, the syllable models "sa", "to", and "u" are added for the word model "Sato".

[0034] The operation of the fourth embodiment of the present invention will now be described, following the flowchart in FIG. 2 as for the first embodiment.

[0035] In addition to the word models "Sato", "Kato", and "Goto", the dictionary unit 102 holds the five syllable models that appear in those words: "sa", "to", "u", "ka", and "go".

[0036] As in the first embodiment, the output unit 105 outputs two words. Assume that the input speech consists of one word utterance, "Sato", and one syllable-by-syllable utterance, "sa ... to ... u".

[0037] For the word utterance, the operations of steps 1 to 3 are the same as in the first embodiment.

[0038] For the syllable utterance, steps 1 and 3 are likewise the same, but step 2 differs. In step 2 for the syllable utterance, the matching unit 103 matches the first syllable of the input against the syllable models "sa", "ka", and "go", which correspond to the first syllables of the word models stored in the dictionary unit 102, again using the Viterbi method described in Reference 1, and obtains the likelihood of the first input syllable for each of these syllable models.

[0039] For the second syllable, the second input syllable is matched in the same way against the syllable model "to", which is the second syllable of every word.

[0040] The third syllable is matched in the same way. The matching unit 103 then calculates and outputs, as the coarse likelihood of each word, the average of the likelihoods of the input syllables with respect to the corresponding syllable models of that word.
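The syllable-based coarse likelihood can be sketched as below. The syllable-level scores are hypothetical stand-ins for the output of the syllable matching, and the function and variable names are assumptions for illustration.

```python
def syllable_coarse_likelihood(word_syllables, syllable_scores):
    """Average, over positions, the likelihood of each input syllable under the
    syllable model that the word has at that position."""
    per_position = [syllable_scores[i][syl] for i, syl in enumerate(word_syllables)]
    return sum(per_position) / len(per_position)


words = {"sato": ["sa", "to", "u"], "kato": ["ka", "to", "u"], "goto": ["go", "to", "u"]}

# Hypothetical log likelihoods for the input "sa ... to ... u".
# syllable_scores[i][m] is the score of the i-th input syllable under syllable model m.
syllable_scores = [
    {"sa": -3.0, "ka": -6.0, "go": -12.0},
    {"to": -4.0},
    {"u": -2.0},
]
coarse = {w: syllable_coarse_likelihood(s, syllable_scores) for w, s in words.items()}
```

Because the three words share their second and third syllables, only the first-syllable score separates them in this example.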

[0041] Step 4 and the subsequent steps are then performed in the same way, and the two words with the largest average coarse likelihood are output.

[0042] With this embodiment, as in the example given, information is obtained from two utterance styles: the word utterance and the syllable utterance. This reduces the influence of errors caused by the utterance style, such as slurring between syllables within a word or disturbed articulation at the beginning of a word.

[0043] Misrecognitions that cannot be prevented even with these means may be corrected manually. The fifth embodiment of the present invention adds a manual correction means to the embodiments described above.

[0044] Referring to FIG. 5, the fifth embodiment of the present invention comprises: a speech analysis unit 101 that calculates and outputs acoustic feature amounts for each frame of each input speech; a dictionary unit 102 that holds the word models; a matching unit 103 that takes as input the feature amounts of each input speech output by the speech analysis unit, obtains the likelihood of the input speech for each word model, and outputs it as a coarse likelihood; a coarse likelihood storage unit 104 that holds the coarse likelihoods for every combination of input speech and word model; an output unit 105 that, for each word model, calculates the average of its coarse likelihoods over the input speeches, orders the word models in descending order of that value, and outputs a fixed number of word models in that order; an input means 106 through which a person enters a selection from among that fixed number of candidates; and a final selection unit 107 that determines and outputs the word model that becomes the final output, based on the output of the output unit 105 and the output of the input means 106.

[0045] FIG. 6 is a flowchart showing the operation of the fifth embodiment. The operation of the fifth embodiment is described below with reference to FIGS. 5 and 6.

[0046] The dictionary unit 102, the speech analysis unit 101, the matching unit 103, the coarse likelihood storage unit 104, and the output unit 105 operate as in the embodiments described above, except that the output of the output unit 105 is also sent to the final selection unit 107. Assume that the input means 106 is an input device that accepts the numbers 1 to 3. It may be something like a keyboard or a switch device, or it may be a separate voice input device. The person operating the input means 106 need not be the person who uttered the input speech. The input means 106 outputs the entered number to the final selection unit 107.

[0047] The final selection unit 107 retains the outputs of the output unit 105 together with their order, and outputs the word model whose position in that order equals the number output by the input means 106. Assume here that the input given to the input means 106 is the number 1.

[0048] First, steps 1 to 5 are performed in the same way as in the embodiments described above.

[0049] In step 6, the input means 106 outputs the input it was given, the number 1, to the final selection unit 107.

[0050] In step 7, the final selection unit 107 outputs, as the final output, only the first candidate, "Sato", which corresponds to the number 1.
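Steps 6 and 7 amount to a 1-based lookup into the ordered candidate list. The sketch below is illustrative only; in particular, the text does not say what happens when the entered number has no matching candidate, so raising an error is an assumption of this example.

```python
def final_select(candidates, choice):
    """Return the candidate at the 1-based position chosen by the user."""
    if not 1 <= choice <= len(candidates):
        # Assumed behavior: the patent text does not specify this case.
        raise ValueError(f"choice must be between 1 and {len(candidates)}")
    return candidates[choice - 1]


candidates = ["sato", "kato"]  # ordered output of the output unit 105
selected = final_select(candidates, 1)
```

Here `selected` is "sato", matching the walkthrough in which the number 1 is entered.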

[0051] In this embodiment, the final selection increases the certainty of the input. Final selection is also used with ordinary single-input speech recognition, but in this embodiment the probability that the correct answer is among the top word models up to the fixed number is very high. The probability that the final selection settles the input is therefore also high, and the probability of inflicting on the user the annoyance of, for example, having to choose among candidates that do not include the correct answer is markedly lower than with ordinary single-input use.

[0052]

[Effects of the Invention] As described above, the present invention provides the following effects.

[0053] The first effect of the present invention is that it resolves the problem that pattern matching on a single utterance does not yield a stable likelihood, and prevents the influence of a likelihood drop caused by chance factors.

[0054] This is because the present invention uses the average or the maximum of the likelihoods of a plurality of utterances as the likelihood, which reduces the influence of a likelihood drop in an individual utterance caused by such chance factors.

[0055] The second effect of the present invention is that it reduces the influence of errors caused by the utterance style, such as slurring between syllables within a word or disturbed articulation at the beginning of a word.

[0056] This is because, to reduce the influence of a likelihood drop caused by such utterance styles, the present invention obtains the average or the maximum described above from the likelihoods of utterances made in a plurality of utterance styles, and uses it as the likelihood.

[0057] Thus, the present invention realizes a voice input device that is robust against errors caused by chance factors or by the utterance style.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of one embodiment of the present invention.

FIG. 3 is a table listing the coarse likelihoods between each word model and each input speech in one embodiment of the present invention.

FIG. 4 is a table listing the coarse likelihoods between each word model and each input speech, together with their averages, in one embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of another embodiment of the present invention.

FIG. 6 is a flowchart showing the operation of another embodiment of the present invention.

[Description of Reference Signs]

101 speech analysis unit
102 dictionary unit
103 matching unit
104 coarse likelihood storage unit
105 output unit
106 input means
107 final selection unit

Claims

[Claims]

1. A voice input device comprising: speech analysis means for calculating a feature amount at fixed time intervals for each of a plurality of input speeches; a dictionary unit that holds the speech patterns (referred to as "word models") of all words assumed as input; matching means for performing pattern matching between the feature amounts and the word models in the dictionary unit, and for calculating and outputting the likelihood (referred to as "coarse likelihood") of the input speech for each of the word models; a coarse likelihood storage unit that holds the coarse likelihoods between each word model and all of the input speeches; and output means for outputting a predetermined number of the words in descending order of the weighted average of the coarse likelihoods between the corresponding word model and the input speeches.

2. A voice input device comprising: speech analysis means for calculating a feature amount at fixed time intervals for each of a plurality of input speeches; a dictionary unit that holds the speech patterns (referred to as "word models") of all words assumed as input; matching means for performing pattern matching between the feature amounts and the word models in the dictionary unit, and for calculating and outputting the likelihood (referred to as "coarse likelihood") of the input speech for each of the word models; a coarse likelihood storage unit that holds the coarse likelihoods between each word model and all of the input speeches; and output means for ordering the words by the maximum of the coarse likelihoods between the corresponding word model and the input speeches, and for outputting a predetermined number of the words in that order.

3. The voice input device according to claim 1, wherein the dictionary unit further holds, for all words assumed as the input, the speech patterns (referred to as "syllable models") of all syllables appearing in those words; and wherein the matching means, only when the input speech is a word utterance, performs pattern matching between the feature amounts and the word models in the dictionary unit and outputs the likelihood of the input speech for each of the word models as the coarse likelihood, and, only when the input speech is a syllable utterance, performs pattern matching between the feature amounts and the syllable models in the dictionary unit and outputs the likelihood of each syllable in the input speech for the syllable models as the coarse likelihood.

4. The voice input device according to claim 3, wherein the likelihood that the matching means outputs for input speech that is a syllable utterance is the average of the likelihoods of the syllables in the input speech with respect to the corresponding syllable models.

5. The voice input device according to claim 1, wherein each weight that the output means uses to calculate the weighted average is 1.

6. The voice input device according to claim 1, wherein the weights that the output means uses to calculate the weighted average form a geometric progression, in chronological order of the input speech, whose common ratio is a predetermined constant.

7. The voice input device according to claim 1, wherein the number of words output by said output means is one.

8. The voice input device according to any one of claims 1 to 6, further comprising: input means with which a user selects and inputs the correct answer from among the predetermined number of word models; and final selection means for selecting and outputting one optimal word model based on the input from the input means and the output from the output means.

9. A recording medium on which is recorded a program for causing a computer constituting a voice input device to function as each of the following means (a) and (b): (a) matching means that, for a plurality of input speeches whose feature amounts have been calculated at fixed time intervals by speech analysis means, performs pattern matching between the calculated feature amounts and the word models in a dictionary unit that holds the speech patterns (referred to as "word models") of all words assumed as input, calculates the likelihood (referred to as "coarse likelihood") of the input speech for each of the word models, and outputs it to a coarse likelihood storage unit that holds the coarse likelihoods between each word model and all of the input speeches; and (b) output means that outputs a predetermined number of the words in descending order of the weighted average of the coarse likelihoods between the corresponding word model and the input speeches.

10. A recording medium on which is recorded a program for causing a computer constituting a voice input device to function as each of the following means (a) and (b): (a) matching means that, for a plurality of input speeches whose feature amounts have been calculated at fixed time intervals by speech analysis means, performs pattern matching between the calculated feature amounts and the word models in a dictionary unit that holds the speech patterns (referred to as "word models") of all words assumed as input, calculates the likelihood (referred to as "coarse likelihood") of the input speech for each of the word models, and outputs it to a coarse likelihood storage unit that holds the coarse likelihoods between each word model and all of the input speeches; and (b) output means that orders the words by the maximum of the coarse likelihoods between the corresponding word model and the input speeches, and outputs a predetermined number of the words in that order.