JP2004012713A

JP2004012713A - Device and method for recognizing speech

Info

Publication number: JP2004012713A
Application number: JP2002164795A
Authority: JP
Inventors: Kazuhide Okada; 岡田　一秀
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2002-06-05
Filing date: 2002-06-05
Publication date: 2004-01-15

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that can achieve a high recognition rate.

SOLUTION: The speech recognition device comprises storage means for storing a database that associates question vocabulary candidates and response vocabulary candidates with the probabilities with which the two are linked; identification processing means for identifying an acquired spoken word against the vocabulary candidates using the storage means; and correction means for correcting the probabilities in the database stored in the storage means based on the result of the identification processing. A high recognition rate can thus be achieved by exploiting the relationship among the question vocabulary candidates, the response vocabulary candidates, and the probabilities with which they are linked.

COPYRIGHT: (C)2004, JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition device and a speech recognition method.
[0002]
[Prior art]
Speech recognition devices that acquire speech as data and process that data to recognize what was uttered are in practical use. For example, applications that accept text input by dictation and in-vehicle navigation systems operated by voice contain a speech recognition engine. In current speech recognition, analyzing the acquired speech directly (analysis using an acoustic model: phoneme analysis) yields a recognition rate of only 60% to 80%. Supplementing this with analysis of grammar and of the connections between words (phonemes) (analysis using a language model) raises the recognition rate to 85% to 95%.
[0003]
Methods using a probabilistic model such as the hidden Markov model (HMM) are well known for the acoustic-model analysis described above. For the language-model analysis, there are methods such as bigrams and trigrams. These estimate a single word within a single utterance by a single speaker, predicting the whole word from its first two or three sounds (phonemes).
[0004]
[Problems to be solved by the invention]
In view of this state of speech recognition, the inventor conducted intensive research aimed at further improving the recognition rate and invented a new method that is effective as a language-model analysis technique. That is, an object of the present invention is to provide a speech recognition device and method capable of achieving a high recognition rate.
[0005]
[Means for Solving the Problems]
The speech recognition device according to claim 1 comprises: storage means storing a database that associates question vocabulary candidates, response vocabulary candidates, and the probabilities with which the two are linked; identification processing means that uses the storage means to identify the words of acquired speech against the vocabulary candidates; and correction means that corrects the probabilities in the database stored in the storage means based on the result of the identification processing by the identification processing means.
[0006]
The invention according to claim 2 is the speech recognition device according to claim 1, wherein the database has a short-term conversation data set and a long-term conversation data set, and the correction means corrects the short-term conversation data set and the long-term conversation data set respectively, based on the result of the identification processing.
[0007]
The speech recognition method according to claim 3 identifies the words of acquired speech using a database that associates question vocabulary candidates, response vocabulary candidates, and the probabilities with which the two are linked, and corrects the probabilities in the database based on the result of the identification processing.
[0008]
The invention according to claim 4 is the speech recognition method according to claim 3, wherein the database has a short-term conversation data set and a long-term conversation data set, and the short-term conversation data set and the long-term conversation data set are each corrected based on the result of the identification processing.
[0009]
Note that "probability" here is not limited to probability in the narrow sense used in academic probability and statistics (where a one-in-ten chance of occurrence is written as 1/10 or 10%). It also covers probability in a broad sense, meaning a relative indication of how likely an event is to occur, which includes the narrow sense.
[0010]
[Embodiments of the invention]
As described above, speech recognition combines acoustic-model analysis with language-model analysis to improve the recognition rate. Conventional language-model analysis uses the bigrams, trigrams, and similar methods mentioned above, but these methods do not reflect the exchange of dialogue between people at all. In other words, conventional language-model analysis is a micro-level prediction that predicts an entire word from the part of the utterance heard so far. The inventor found that, when trying to improve the recognition rate by emphasizing the grammar of a language (for example, Japanese) and empirical rules of speech, it is effective to use not only such micro-level prediction but also prediction that focuses on dialogue, that is, on the response to a question. This concept could be called macro-level prediction in contrast to micro-level prediction. The present invention is based on this finding.
[0011]
The speech recognition device of the present invention has an input unit, an arithmetic processing unit (the identification processing means and the correction means), and a storage unit (the storage means). The arithmetic processing unit and the storage unit are configured as an electronic control unit (ECU) comprising a CPU, ROM, RAM, and the like. The storage unit may instead be a replaceable hard disk or an optical disc such as a CD-ROM or DVD-ROM, and any of these may be externally attached. The database and other data described later are stored in the storage unit. A microphone serving as the input unit is connected to the ECU. The CPU in the ECU performs the various calculations, and the data used during those calculations is held in the CPU's cache memory or in the RAM.
[0012]
The speech recognition device of this embodiment is built into a car navigation system and is intended to recognize the operator's response to a question issued by the car navigation system. For this reason, no recognition processing is needed for the question sentence (word) issued by the car navigation system: the device itself knows the question sentence (word) from the outset.
[0013]
FIG. 1 schematically shows the structure of a database used for speech recognition according to the present embodiment. FIG. 2 is a flowchart illustrating a speech recognition process in the speech recognition apparatus and method according to the present embodiment.
[0014]
First, the structure of the database used for speech recognition in this embodiment is described with reference to FIG. 1. As shown in FIG. 1, the database has two matrix-like data sets in which the question vocabulary candidates are arranged in the row direction and the response vocabulary candidates are arranged in the column direction. The probability that each response vocabulary candidate appears for each question vocabulary candidate is stored at their intersection.
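The matrix-like structure described above can be sketched as a nested mapping. This is only an illustrative sketch: the vocabulary entries and the names `QUESTIONS`, `RESPONSES`, and `make_data_set` are hypothetical, since the patent does not list actual vocabulary or prescribe a storage layout. The initial value of 100 is the one given in paragraph [0022].

```python
# Hypothetical vocabulary lists; the patent does not enumerate actual entries.
QUESTIONS = ["Where to?", "Search nearby?"]
RESPONSES = ["home", "station", "yes", "no"]
INITIAL_VALUE = 100  # initial value given in paragraph [0022]


def make_data_set(questions, responses, initial=INITIAL_VALUE):
    """Build one data set: one row per question, one column per response."""
    return {q: {r: initial for r in responses} for q in questions}


# Two data sets with identical layout; only the stored values will diverge.
short_term = make_data_set(QUESTIONS, RESPONSES)
long_term = make_data_set(QUESTIONS, RESPONSES)
```

Looking up `short_term["Where to?"]["home"]` corresponds to reading the intersection of a question row and a response column.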
[0015]
As noted above, there are two data sets: a short-term conversation data set and a long-term conversation data set. In the short-term conversation data set, the probability is stored as the probability of appearing within a relatively short, continuous conversation. In the long-term conversation data set, the probability is defined as the probability of appearing in a conversation whenever a conversation takes place (this is called "long-term" here in contrast to "short-term"). The two data sets differ only in the probabilities stored at the intersections (which may temporarily coincide); the matrix arrangement of the vocabulary candidates is identical.
[0016]
For example, within a series of dialogues conducted over a short period, exactly the same question (and the response to it) is rarely repeated (the probability is low). Over the long run, however, given the speaker's personality and interests (hobbies and tastes), the same question is likely to be asked again (the probability is high). The recognition rate can therefore be improved by taking into account the difference between the short-term and long-term probabilities. For this reason, two data sets are prepared here and used together to improve the recognition rate. How they are used is described later.
[0017]
Next, the process is described with reference to the flowchart in FIG. 2. First, speech is acquired with the microphone described above or a similar device (step 200). The acquired speech is first analyzed with an acoustic model (step 205). Examples of acoustic-model analysis include the hidden Markov model and the Island analysis proposed by the present inventor. Next, it is determined whether the recognition rate of the acoustic-model analysis is sufficiently high (step 210). Here, a recognition rate greater than α is regarded as sufficiently high.
[0018]
If the recognition rate is sufficiently high, that is, if step 210 is affirmative, the speech is deemed recognized by acoustic-model analysis alone, the recognition is confirmed (step 215), and the process exits the flowchart of FIG. 2. If step 210 is negative, it is next determined whether the acoustic-model recognition rate is too low (step 220). Here, a recognition rate smaller than β is regarded as too low. In that case, the acquired speech is deemed difficult to recognize (difficult even if language-model analysis is added to the acoustic-model analysis), a re-speak instruction is issued (step 225), and the process exits the flowchart of FIG. 2.
[0019]
The re-speak instruction may take the form of an error sound (beep) or an error display indicating that recognition failed, or the device's speaker may output "Please re-enter" as synthesized or recorded speech. The re-uttered speech is then acquired, and speech recognition is performed again from step 200. If, on the other hand, step 220 is negative, the recognition rate is higher than β but does not reach α. In this case, analysis is performed using the language model that takes the dialogue (question and response) described above into account (step 230).
[0020]
It is then determined whether the resulting recognition rate is sufficiently high (step 235). If it is, that is, if step 235 is affirmative, the recognition is confirmed (step 240). If step 235 is negative, the acquired speech is deemed difficult to recognize (difficult even after language-model analysis), a re-speak instruction is issued (step 225), and the process exits the flowchart of FIG. 2.
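The branching in steps 210 to 240 can be sketched as a small decision function. This is a hedged illustration, not the patented implementation: the names `Outcome` and `recognize` are hypothetical, the default values for α and β are arbitrary assumptions (the patent gives no numbers), and the language-model check of steps 230 and 235 is passed in as a callable.

```python
from enum import Enum


class Outcome(Enum):
    CONFIRMED_ACOUSTIC = "confirmed by acoustic model alone (step 215)"
    CONFIRMED_LANGUAGE = "confirmed after language model (step 240)"
    RESPEAK = "re-speak instruction (step 225)"


def recognize(acoustic_rate, language_model_accepts, alpha=0.9, beta=0.5):
    """Follow the decision flow of FIG. 2.

    alpha and beta are assumed example thresholds, not values from the patent.
    language_model_accepts is a zero-argument callable standing in for the
    dialogue-based analysis and its acceptance test (steps 230 and 235).
    """
    if acoustic_rate > alpha:        # step 210 affirmative
        return Outcome.CONFIRMED_ACOUSTIC
    if acoustic_rate < beta:         # step 220 affirmative: too low
        return Outcome.RESPEAK
    if language_model_accepts():     # steps 230 and 235
        return Outcome.CONFIRMED_LANGUAGE
    return Outcome.RESPEAK
```

The language model is consulted only in the band between β and α, which matches the description: clear results skip it, and hopeless results go straight to the re-speak instruction.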
[0021]
Although this partly repeats the step descriptions above, the language-model analysis from step 230 onward is described in more detail below.
[0022]
When the device is used for the first time, every probability in the two data sets shown in FIG. 1 is reset to the initial value of 100. Starting from this initial state, the two data sets are corrected each time speech recognition is performed, and the corrected data sets are saved (as described later, only the long-term conversation data set is saved) and used in subsequent speech recognition. In this embodiment, the navigation system first utters a question sentence (or word; the same applies below) using synthesized or recorded speech [a in FIG. 1]. This sentence is stored as a candidate among the question vocabulary candidates in FIG. 1.
[0023]
The speaker then utters a response sentence. The device acquires the uttered words with the microphone or a similar device (step 200) and first analyzes them with the acoustic model (step 205). As a result, a response vocabulary candidate is detected. This candidate may not exactly match any of the response vocabulary candidates in FIG. 1, in which case the most similar one is selected [b in FIG. 1]. In this way, the element (a, b) of the matrix is determined (step 230). If the probability at (a, b) is within the top 10% of the probabilities of all elements in row a of the short-term conversation table (step 235), the response to question a is recognized as b (step 240).
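The acceptance test of step 235 can be sketched as a rank check on row a. The patent does not say how "top 10%" is computed or how ties are handled, so the interpretation below is an assumption: an entry qualifies if fewer than ceil(10% of the row size) entries have a strictly higher value. The function name `in_top_percent` and the sample row are hypothetical.

```python
def in_top_percent(row, response, top_percent=10):
    """Return True if row[response] ranks within the top `top_percent` of the row.

    Assumption: rank is the number of entries with a strictly higher value,
    so tied entries share the better rank.
    """
    value = row[response]
    higher = sum(1 for v in row.values() if v > value)
    # Integer ceiling of top_percent% of the row size, with at least one slot.
    slots = max(1, (top_percent * len(row) + 99) // 100)
    return higher < slots


# Hypothetical row a of the short-term table with 20 response candidates.
row_a = {f"resp{i}": 100 + i for i in range(20)}
```

Under this interpretation, a freshly initialized row in which every value is 100 accepts any candidate, because no entry is strictly higher than another.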
[0024]
After (a, b) is confirmed, the pair of data sets is corrected. First, in the short-term conversation data set, the value stored at (a, b) is changed by -10, and every other value in row a is changed by +3. As described above, an exchange that has already taken place is unlikely to be repeated within a short-term conversation, so the probability at (a, b) is lowered. For the other entries in row a, vocabulary not yet used in the short-term conversation is more likely to appear than vocabulary already used, so their probabilities are raised. Examples of short-term conversations between the navigation system and the operator include destination setting, destination search, and obtaining information about the area around the destination.
[0025]
In the long-term conversation data set, on the other hand, the value stored at (a, b) is changed by +5, and every other value in row a is changed by -2. As described above, over a long span similar topics tend to recur, and an exchange that has been recognized once is likely to take place again, so the probability at (a, b) is raised. For the other entries in row a, vocabulary belonging to conversations that seldom take place among the conversations envisaged is unlikely to appear in future conversations, so their probabilities are lowered.
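The two correction rules can be sketched as one helper applied with different increments. The increments (-10/+3 for short-term, +5/-2 for long-term) come from the description; the function names `apply_correction` and `correct_both` and the nested-dictionary layout are assumptions made for illustration.

```python
def apply_correction(data_set, question, response, hit_delta, other_delta):
    """Adjust row `question`: hit_delta at (a, b), other_delta elsewhere in the row."""
    row = data_set[question]
    for candidate in row:
        row[candidate] += hit_delta if candidate == response else other_delta


def correct_both(short_term, long_term, question, response):
    """Apply the short-term and long-term corrections after (a, b) is confirmed."""
    apply_correction(short_term, question, response, hit_delta=-10, other_delta=3)
    apply_correction(long_term, question, response, hit_delta=5, other_delta=-2)


# Hypothetical single-row example starting from the initial value of 100.
short_term = {"Where to?": {"home": 100, "station": 100, "office": 100}}
long_term = {"Where to?": {"home": 100, "station": 100, "office": 100}}
correct_both(short_term, long_term, "Where to?", "station")
# short_term row is now {"home": 103, "station": 90, "office": 103}
# long_term row is now {"home": 98, "station": 105, "office": 98}
```

Only row a is touched, so rows belonging to other questions keep their values.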
[0026]
The two data sets are corrected in this way (step 245). As described above, the decision in step 235 on whether recognition succeeded, made from the probability at (a, b), is based on the short-term conversation data set rather than the long-term conversation data set because the two data sets have the respective properties described above. Next, it is determined whether the topic has changed (step 250). If the topic has not changed, the flowchart of FIG. 2 ends as is. If the topic has changed, the short-term conversation data set used up to that point is discarded (step 255), and the long-term conversation data set as of that moment is copied in its place (step 260).
[0027]
Immediately after the copy, the two data sets are identical, but as recognition is performed again each data set is weighted differently, which contributes to improving the recognition rate. Various methods can be used to determine whether the topic has changed. For example, the topic may be judged to have changed when the interval between utterances is long, or when the words that appear frequently in the dialogue change. Alternatively, the topic may be judged to have changed when a word that often appears at a change of topic (for example, "Now then,") is detected. In this embodiment, the exchange is assumed to consist of questions from the navigation system and responses from the operator, so it is easy for the navigation system to keep track of topic changes.
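Steps 250 to 260 can be sketched as follows. The reset itself (discard the short-term data set and copy the long-term one) follows the description; the utterance-gap test is just one of the options the text mentions, and the 30-second threshold and the names `topic_changed` and `reset_short_term` are assumptions made for illustration.

```python
import copy


def topic_changed(gap_seconds, threshold_seconds=30.0):
    """Judge a topic change from the pause between utterances (assumed threshold)."""
    return gap_seconds > threshold_seconds


def reset_short_term(long_term):
    """Steps 255 and 260: replace the short-term data set with a copy of the long-term one."""
    # A deep copy keeps later short-term corrections from altering the long-term rows.
    return copy.deepcopy(long_term)


# Hypothetical long-term data set after some use.
long_term = {"Where to?": {"home": 105, "station": 98}}
short_term = {"Where to?": {"home": 90, "station": 103}}
if topic_changed(gap_seconds=45.0):
    short_term = reset_short_term(long_term)
```

A deep copy is used because the rows are nested dictionaries; a shallow copy would share the row objects, and the two data sets could then no longer diverge.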
[0028]
The present invention is not limited to the embodiment described above. For example, the device of the embodiment is integrated into a navigation system and knows the question vocabulary from the outset. However, the present invention can also be applied where both the questioning side and the responding side are to be recognized by speech recognition. That said, as described above, recognizing the response to a question issued by the device itself allows the question vocabulary to be identified with certainty, so it is a form in which the present invention can be applied more effectively.
[0029]
[Effects of the invention]
According to the speech recognition device and method of claims 1 and 3, a high recognition rate can be obtained by performing speech recognition using the relationship among the question vocabulary, the response vocabulary candidates, and the probabilities with which the two are linked. In particular, when the target is specialized for dialogue, a recognition rate higher than before can be achieved.
[0030]
According to the speech recognition device and method of claims 2 and 4, a long-term conversation data set and a short-term conversation data set are prepared in the database and speech recognition is performed using both. This exploits the difference between how often a particular word is uttered over a short span and how often it is uttered over a long span, making a still higher recognition rate possible.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram schematically showing the database structure in one embodiment of the speech recognition device and method of the present invention.
FIG. 2 is a flowchart showing the speech recognition process in one embodiment of the speech recognition device and method of the present invention.

Claims

1. A speech recognition device comprising:
storage means storing a database that associates question vocabulary candidates, response vocabulary candidates, and the probabilities with which the two are linked;
identification processing means that uses the storage means to identify the words of acquired speech against the vocabulary candidates; and
correction means that corrects the probabilities in the database stored in the storage means based on the result of the identification processing by the identification processing means.

2. The speech recognition device according to claim 1, wherein the database has a short-term conversation data set and a long-term conversation data set, and
the correction means corrects the short-term conversation data set and the long-term conversation data set respectively, based on the result of the identification processing.

3. A speech recognition method comprising identifying the words of acquired speech using a database that associates question vocabulary candidates, response vocabulary candidates, and the probabilities with which the two are linked, and correcting the probabilities in the database based on the result of the identification processing.

4. The speech recognition method according to claim 3, wherein the database has a short-term conversation data set and a long-term conversation data set, and
the short-term conversation data set and the long-term conversation data set are each corrected based on the result of the identification processing.