JP2965529B2

JP2965529B2 - Voice recognition device

Info

Publication number: JP2965529B2
Application number: JP9161243A
Authority: JP
Inventors: 淳河井; 由実脇田
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1996-12-20
Filing date: 1997-06-18
Publication date: 1999-10-18
Anticipated expiration: 2017-06-18
Also published as: JPH10232693A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Technical Field of the Invention: The present invention relates to a speech recognition apparatus that recognizes speech, based on the speech signal of an utterance, by referring to a statistical language model.

[0002]

Description of the Related Art: In continuous speech recognition apparatuses, statistical language models based on a statistical technique called N-gram are widely used (see, for example, Prior Art Document 1: L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 179-190, 1983). A continuous speech recognition apparatus using an N-gram learns in advance, from large-scale training data, the transition probability from the immediately preceding N-1 words to the next word. At recognition time it uses the learned transition probabilities to predict the next word, thereby improving the speech recognition rate. In general, the larger N is, the more accurately the next word is predicted, but the number of distinct word chains also grows, so a large amount of training data is needed to obtain reliable transition probabilities. In practice, therefore, N is usually set to about 2 (bi-gram) or 3 (tri-gram). However, an analysis of continuous speech recognition results obtained with word bi-grams or word tri-grams shows that, even when local word chains of two or three words are natural, the sentence as a whole is often an unnatural misrecognition. More global linguistic constraints are therefore considered necessary.

[0003] Language models that allow more global constraints by using a grammar, such as a context-free grammar, or dependency relations between words have been proposed. However, given the diversity of structures and dependency relations in spontaneous utterances, constructing the rules and dependency relations is not easy, and the amount of processing becomes enormous. On the other hand, a method that resolves the syntactic ambiguity of a sentence by an example-driven approach (hereinafter, the conventional example) was proposed in Prior Art Document 2: Eiichiro Sumita et al., "Example-Driven Disambiguation of English Prepositional Phrase Attachment", IEICE Transactions (D-II), J77-D-II, No. 3, pp. 557-565, March 1994. This method extracts examples from a corpus, calculates the semantic distance between the expressions of the input sentence and the examples according to a thesaurus, and selects the syntax for which the final semantic distance is smallest. Its effectiveness has also been confirmed in tasks such as selecting translation equivalents (see Prior Art Document 3: Osamu Furuse et al., "Transfer-Driven Machine Translation Utilizing Empirical Knowledge", Transactions of the Information Processing Society of Japan, Vol. 35, No. 3, pp. 414-423, March 1994).

[0004]

Problem to Be Solved by the Invention: However, a speech recognition apparatus using the conventional method has the problem that, when a syntax that is unnatural with respect to the learned examples is input, the semantic distance to every example becomes large, and the speech recognition rate is therefore relatively low.

[0005] An object of the present invention is to solve the above problem and to provide a speech recognition apparatus that can remove ill-formed misrecognition results, can output sentences that are well-formed both locally and globally, and can achieve a higher speech recognition rate than the conventional example.

[0006]

Means for Solving the Problem: The speech recognition apparatus according to claim 1 of the present invention comprises speech recognition means that, based on the speech signal of an input uttered sentence consisting of a word string, performs speech recognition processing on the speech with reference to a predetermined statistical language model. The speech recognition means calculates, for each speech recognition candidate, the value of a predetermined ill-formed sentence judgment function that represents the degree to which the candidate is ill-formed. When the calculated value exceeds a predetermined threshold, it removes that candidate and carries out the speech recognition processing.

[0007] In the speech recognition apparatus according to claim 2, which depends on claim 1, the value of the ill-formed sentence judgment function is obtained by calculating the sum of the semantic distances corresponding to the examples used in the speech recognition processing, multiplying that sum by the number of morphemes contained in the speech recognition candidate being processed, and dividing by the number of examples used in the speech recognition processing.

In the speech recognition apparatus according to claim 3, which depends on claim 1, the value of the judgment function is obtained by calculating the sum of the semantic distances corresponding to the examples used in the speech recognition processing, dividing that sum by the number of examples used to obtain the average semantic distance, multiplying the average by the number of morphemes contained in the speech recognition candidate being processed, and dividing by the number of examples used.

In the speech recognition apparatus according to claim 4, which depends on claim 1, the value of the judgment function is obtained by calculating the average semantic distance in the same way and dividing it by the number of examples that, among the examples used in the speech recognition processing at the stage where a predetermined number of morphemes have been processed, contain at least a predetermined plural number of morphemes.
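Read literally, the three variants in claims 2 to 4 can be written compactly as below. This is only a sketch of the claim wording: m (number of morphemes), R (number of examples used) and d(r_i) (semantic distance of the i-th example) follow the symbols used later in the description, while M, the number of used examples that contain at least the predetermined number of morphemes, is a symbol introduced here for illustration.

```latex
\begin{align*}
\text{Claim 2:}\quad F_{\mathrm{error}}(m) &= \frac{m}{R}\sum_{i=1}^{R} d(r_i) \\
\text{Claim 3:}\quad F_{\mathrm{error}}(m) &= \frac{m}{R}\cdot\frac{1}{R}\sum_{i=1}^{R} d(r_i) \\
\text{Claim 4:}\quad F_{\mathrm{error}}(m) &= \frac{1}{M}\cdot\frac{1}{R}\sum_{i=1}^{R} d(r_i)
\end{align*}
```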

[0008] In the speech recognition apparatus according to claim 5, which depends on any one of claims 1 to 4, the threshold is preferably a constant value. In the speech recognition apparatus according to claim 6, which depends on any one of claims 1 to 4, the threshold is preferably varied depending on the number of morphemes contained in the partial sentence being processed.

[0009]

Embodiments of the Invention: Embodiments of the present invention are described below with reference to the drawings.

[0010] FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to one embodiment of the present invention. The apparatus comprises a microphone 1; a feature extraction unit 2; a buffer memory 3; a phoneme matching unit 4 that, based on the input uttered speech data, performs phoneme matching with reference to hidden Markov models (hereinafter, HMMs), which are acoustic models stored in a hidden Markov model memory (hereinafter, HMM memory) 5, and outputs phoneme data; and a One-pass DP speech recognition unit (hereinafter, speech recognition unit) 6. Based on the phoneme data from the phoneme matching unit 4, the speech recognition unit 6 performs speech recognition using the One-pass DP (Viterbi search) algorithm with reference to the statistical language model in a statistical language model memory 7 and the database of examples and distances (hereinafter, the database) in an example-and-distance database memory (hereinafter, database memory) 8.

The speech recognition unit 6 is characterized in that it calculates, for each speech recognition candidate, the value of a predetermined ill-formed sentence judgment function (Equation 1, described in detail later) that represents the degree to which the candidate is ill-formed, and, when the calculated value exceeds a predetermined threshold Fth, removes that candidate before performing recognition. The value of the judgment function is preferably obtained by calculating the sum of the semantic distances corresponding to the examples used to determine the syntax of the candidate, multiplying the sum by the number of morphemes contained in the candidate being processed, and dividing by the number of examples used to determine the syntax of the candidate. The threshold Fth is preferably either a constant or a value that varies with the number of morphemes contained in the partial sentence being processed. A morpheme is the smallest unit of a character sequence that carries meaning, such as a stem, prefix, or suffix; it is essentially the same as a word or slightly smaller.

[0011] First, the method by which the speech recognition unit 6 detects ill-formed sentences is described. Misrecognitions in speech recognition using an N-gram have the following characteristics.

(a) Judged over a chain of N or more words, grammatically and semantically inappropriate word combinations occur. Example of a misrecognition: "電話番号が２１０７号室ですか" ("Is the telephone number room 2107?").

(b) The sentence structure does not hold together as a larger unit. That is, grammatical rules cannot be applied, and the sentence can only be judged locally. Example of a misrecognition: "三名様までのえーまでシングルの一泊の……" (roughly, "up to three guests, uh, up to, a single, one night's ...").

[0012] To resolve misrecognitions with the above characteristics, the consistency between words and the well-formedness of the syntax must be judged from a more global standpoint than an N-gram provides. Meanwhile, the example-driven speech translation method (see Prior Art Document 2 and Prior Art Document 4: O. Furuse et al., "Incremental Translation Utilizing Constituent Boundary Patterns", Proceedings of Coling '96, 1996) determines the syntax from left to right using example-based translation knowledge. In this process, to resolve syntactic ambiguity, the semantic distance between the input sentence and the examples is calculated using a thesaurus, and the syntax corresponding to an example with a small distance is selected. The present inventors consider this syntax determination method well suited to removing the misrecognitions produced by a conventional N-gram language model, for the following reasons.

(a) Because the method is example-driven, it can easily handle syntax that is difficult to process with a conventional grammar, such as that found in conversational sentences.

(b) Because the method obtains semantic distances based on the syntax, it can judge the consistency between non-adjacent words.

(c) Because both speech recognition and the syntax determination method process the input from left to right, intermediate results up to a given point can potentially be judged incrementally.

[0013] Ill-formed sentences are therefore detected from the global consistency of the semantic distances and the well-formedness of the analyzed syntax. Specifically, the judgment is made as follows. First, inconsistency of the semantic distances in a partial sentence is judged from the semantic distance values used in the syntax determination method above: when the sum of the semantic distances of a partial sentence reaches or exceeds a certain value, the sentence is judged to be a misrecognition. Next, the well-formedness of the syntax is treated as follows. A natural sentence consisting of at least a certain number of morphemes is assumed to have a coherent syntax whose structure is complex to some degree. Consider the ratio m/R of the number m of morphemes contained in a partial sentence to the number R of context-free grammar rules or examples used to determine its syntax. A partial sentence without a coherent syntax does not form a hierarchical structure, so the number R of rules used is small relative to the number of morphemes m, and m/R is large. Conversely, the more complex and hierarchical the syntax, the smaller m/R becomes. The ill-formed sentence judgment function F_error(m) is therefore defined by the following equation.

[0014]

(Equation 1)

$$F_{error}(m) = \frac{m}{R} \sum_{i=1}^{R} d(r_i)$$

[0015] Here, d(r_i) is the semantic distance or similarity distance corresponding to each of the examples or rules r_i; m is the number of morphemes contained in the partial sentence of the speech recognition candidate being processed; and R is the number of examples or rules used to determine the syntax of that partial sentence during speech recognition processing. The semantic distance or similarity distance is the semantic distance between a speech recognition candidate of the input uttered sentence and an example. It is defined, for instance, by Equation (1) on p. 559 of Prior Art Document 2 and calculated using a thesaurus. In this embodiment it is determined by looking up the distance for the example in the database that matches the partial sentence of the candidate.

A thesaurus here is a dictionary that represents the hypernym-hyponym relations between concepts as a tree and assigns to each lowest-level concept, corresponding to a leaf, the words having that concept. The semantic distance between words is defined by the semantic distance between their concepts in the thesaurus, and the distance between concepts is set to a value from 0 to 1 according to the position of their most specific common superordinate concept in the thesaurus. A value of 0 means that the two concepts are identical, and a value of 1 means that they are unrelated.

The judgment function F_error(m) is a function of the number of morphemes m and is calculated over the partial sentence of the speech recognition candidate from the beginning of the sentence to the m-th morpheme. When F_error(m) exceeds the predetermined threshold Fth, the speech recognition unit 6 judges the partial sentence of that candidate to be a misrecognition and removes it from the speech recognition candidates. Equation 1 is preferably applied when m ≥ 5. When the number of rules R in Equation 1 is 0, the function value is set to 1, and the candidate is judged to be a misrecognition and removed from the speech recognition candidates.
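The judgment described above (sum of the distances, multiplied by m and divided by R, with the special case R = 0 and the preferred condition m ≥ 5) can be sketched in a few lines of Python. The function names, the default threshold of 0.65 (picked from inside the 0.6 to 0.7 range given later in the description), and the choice to leave partial sentences shorter than five morphemes unjudged are assumptions made for illustration, not details stated in the patent.

```python
MIN_MORPHEMES = 5  # the description prefers applying the function when m >= 5


def f_error(m, distances):
    """Judgment value for a partial sentence of m morphemes.

    distances holds d(r_i) for each of the R examples or rules used
    to determine the syntax of the partial sentence.
    """
    r = len(distances)
    if r == 0:
        # No rule or example applied: the description fixes the value at 1.
        return 1.0
    return m / r * sum(distances)


def is_misrecognition(m, distances, fth=0.65):
    """True when the candidate should be removed, i.e. F_error(m) > Fth."""
    if m < MIN_MORPHEMES:
        # Assumption: shorter partial sentences are simply not judged yet.
        return False
    return f_error(m, distances) > fth
```

With m = 6 and distances 0.5, 0.5 and 10^-5, the values the description later uses for FIG. 3(c), `f_error` gives about 2.00002, so the candidate is flagged.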

[0016] In the preferred embodiment of FIG. 1, the database generation unit 10 generates the database from the examples in the example memory 11 and the word sets in the word set memory 12 using predetermined similarity rules, and stores it in the database memory 8. Tables 1 and 2 show examples of context-free grammar rules, and Table 3 shows an example of the similarity rules.

[0017]

[Table 1] Example 1
─────────
X の Y ("Y of X")
─────────
僕の子供 (my child)
あなたの会社 (your company)
………
─────────

[0018]

[Table 2] Example 2
─────────
X が Y ("X is Y")
─────────
僕が先生 (I am the teacher)
………
─────────

[0019]

[Table 3] Similarity rules
───────────────────────────────────
(I) When the sentence generated from a combination of word sets is identical to an example, distance = 0.
(II) When the sentence generated from a combination of word sets has the same function word as an example (for example, "の" or "が") and its words belong to word sets in the same similarity category as the words of the example, distance = 10^-5.
(III) When the sentence generated from a combination of word sets is a combination of words not found in the examples, distance = 0.5.
───────────────────────────────────

[0020] FIG. 2 shows an example of the word sets S1, S2, S3, and S4 in the speech recognition apparatus for Japanese, together with the distances between word sets when predetermined function words are used. In FIG. 2, for example, "あなた" (you; word set S1) が "先生" (teacher; word set S2) has a distance of 10^-5, "あなた" (S1) の "子供" (child; word set S3) has a distance of 10^-5, and "あなた" (S1) の "会社" (company; word set S4) has a distance of 10^-5. In contrast, "先生" (S2) の "会社" (S4) has a distance of 0.5.

[0021] Using the examples in Tables 1 and 2 and the similarity rules in Table 3, the database generation unit 10 generates the database as follows. It generates partial sentences from each combination of word sets: for the partial sentence "あなたの会社" (your company) the distance is 0, for "私の学校" (my school) the distance is 10^-5, and for "子供が先生" (the child is the teacher) the distance is 0.5. The resulting database of partial-sentence examples and distances is stored in the database memory 8.
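The database generation just described can be sketched as follows. Only the placement of あなた (S1), 先生 (S2), 子供 (S3) and 会社 (S4) comes from the description; the remaining word-set members, the `WORD_SETS` and `EXAMPLES` structures, and the function names are assumptions chosen so that the three distances quoted above come out as stated. It is one possible reading of the rules in Table 3, not the patent's implementation.

```python
from itertools import product

# Hypothetical word sets. The description only places あなた (S1), 先生 (S2),
# 子供 (S3) and 会社 (S4); the other members are assumptions.
WORD_SETS = {
    "S1": ["僕", "あなた", "私"],
    "S2": ["先生"],
    "S3": ["子供"],
    "S4": ["会社", "学校"],
}

# Examples from Tables 1 and 2, stored as (X, function word, Y).
EXAMPLES = [("僕", "の", "子供"), ("あなた", "の", "会社"), ("僕", "が", "先生")]

CATEGORY = {w: name for name, words in WORD_SETS.items() for w in words}


def distance(x, func, y):
    """Apply similarity rules (I) to (III) of Table 3 to the phrase x-func-y."""
    if (x, func, y) in EXAMPLES:
        return 0.0  # rule (I): identical to an example
    for ex_x, ex_func, ex_y in EXAMPLES:
        if (func == ex_func
                and CATEGORY.get(x) == CATEGORY[ex_x]
                and CATEGORY.get(y) == CATEGORY[ex_y]):
            return 1e-5  # rule (II): same function word, same categories
    return 0.5  # rule (III): combination not covered by the examples


def build_database():
    """Enumerate every word pair with each function word and store its distance."""
    words = list(CATEGORY)
    return {x + f + y: distance(x, f, y)
            for x, f, y in product(words, ["の", "が"], words)}
```

Calling `build_database()` returns a dictionary keyed by partial sentence, which is the shape the later lookup of d(r_i) needs.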

[0022] The statistical language model, for example a word bi-gram model, is generated by a known method from text data of uttered sentences and stored in the statistical language model memory 7.
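The description leaves the "known method" for building the bi-gram model open. As a minimal sketch, assuming plain maximum-likelihood estimation without smoothing and hypothetical sentence-boundary markers `<s>` and `</s>`, the transition probabilities could be estimated like this:

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(next word | previous word) by relative frequency.

    sentences is an iterable of word lists. No smoothing is applied,
    which is an assumption; a practical model would need some.
    """
    history_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        history_counts.update(tokens[:-1])             # every word that has a successor
        pair_counts.update(zip(tokens[:-1], tokens[1:]))
    return {pair: count / history_counts[pair[0]]
            for pair, count in pair_counts.items()}
```

For the two training sentences "the train starts" and "the bus leaves", the model assigns probability 0.5 to "train" following "the".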

[0023] Next, the configuration and operation of the speech recognition apparatus of this embodiment, which uses the statistical language model, are described.

[0024] In FIG. 1, the speaker's utterance is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3. Each HMM in the HMM memory 5, which is connected to the phoneme matching unit 4, consists of a plurality of states and arcs representing the transitions between the states, and each arc has a transition probability between states and an output probability for the input code. The phoneme matching unit 4 performs phoneme matching on the input data and outputs phoneme data to the speech recognition unit 6.
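The 34 dimensions are 17 static values (log power plus 16 cepstrum coefficients) and their 17 Δ counterparts. The description does not say how the Δ terms are computed; the sketch below assumes a simple first difference between successive frames, and `add_deltas` is a hypothetical helper name.

```python
def add_deltas(frames):
    """Append first-difference deltas to each 17-dimensional static frame.

    frames is a list of 17-element lists [log_power, c1, ..., c16].
    Returns a list of 34-element feature vectors. The delta of the
    first frame is taken as zero (an assumption).
    """
    if not frames:
        return []
    features = []
    previous = frames[0]
    for current in frames:
        delta = [c - p for c, p in zip(current, previous)]
        features.append(list(current) + delta)
        previous = current
    return features
```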

[0025] The statistical language model memory 7, which stores the statistical language model in advance, is connected to the speech recognition unit 6. Referring to the statistical language model in the statistical language model memory 7 and the database in the database memory 8, the speech recognition unit 6 processes the input phoneme data from left to right without backtracking, using a predetermined One-pass DP algorithm, and recognizes words with higher occurrence probabilities as speech recognition candidates. For each candidate it calculates the judgment function value F_error(m) using Equation 1. For d(r_i) in Equation 1, the examples matching the candidate are looked up in the database, and the distances corresponding to the retrieved examples are used as the semantic distances. When the calculated value F_error(m) exceeds the predetermined threshold Fth, the speech recognition unit 6 judges the partial sentence of that candidate to be a misrecognition and removes it from the speech recognition candidates. The remaining candidate is then determined to be the speech recognition result (character string data) and output.
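The removal step can be illustrated with a simplified sketch that filters a list of finished candidates rather than pruning inside the One-pass DP search. The candidate dictionary keys, the `select_result` name, and the default threshold of 0.65 are assumptions for illustration; the judgment value follows the description (sum of distances times m divided by R, fixed at 1 when R is 0).

```python
def f_error(m, distances):
    """(m / R) * sum of d(r_i); fixed at 1 when no example was used."""
    r = len(distances)
    return 1.0 if r == 0 else m / r * sum(distances)


def select_result(candidates, database, fth=0.65):
    """Drop candidates whose judgment value exceeds fth, return the best survivor.

    Each candidate is a dict with hypothetical keys:
      "text"     - recognized string
      "score"    - occurrence probability from the recognizer (higher is better)
      "m"        - number of morphemes
      "examples" - partial sentences used to determine its syntax
    Returns None when every candidate is removed.
    """
    survivors = []
    for cand in candidates:
        # Look up d(r_i) for every example that is present in the database.
        distances = [database[e] for e in cand["examples"] if e in database]
        if f_error(cand["m"], distances) <= fth:
            survivors.append(cand)
    if not survivors:
        return None
    return max(survivors, key=lambda c: c["score"])["text"]
```

In this sketch a candidate with a high recognizer score is still discarded when its examples lie far from the database, which is the behavior the paragraph describes.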

[0026] FIG. 3 is an operation diagram showing the operation of the speech recognition apparatus for Japanese configured as described above. It shows an input sentence, the recognition result sentence with its syntax tree and scores, and the parsing result sentence with its syntax tree and scores. As shown in FIG. 3(a), the input utterance is "私のエットー学校がね" (roughly, "my, um, school, you know"), and, as shown in FIG. 3(b), the recognition result sentence obtained is "私の江藤学校がね". That is, the interjection "エットー" ("um") was misrecognized as the noun "江藤" (Eto). FIG. 3(b) shows the scores between the words of the recognition result sentence. When the recognition result sentence is then parsed, the syntax tree of the parsing result is obtained on the basis of the smaller scores, as shown in FIG. 3(c), together with the scores at that point. Applying Equation 1 to the case of FIG. 3(c) gives the following value of the judgment function F_error(m).

[0027]

(Equation 2)

$$F_{error}(m) = \frac{6}{3}\left(0.5 + 0.5 + 10^{-5}\right) = 2 \times 1.00001 = 2.00002$$

[0028] In this example, the threshold Fth for judging an ill-formed sentence is preferably 0.6 to 0.7. Since the function value 2.00002 calculated by Equation 2 exceeds the threshold Fth, the corresponding speech recognition candidate is removed from the speech recognition candidates. The threshold Fth may be a constant, or it may vary depending on the number of morphemes m contained in the partial sentence being processed.

[0029] In the speech recognition apparatus configured as described above, the feature extraction unit 2, the phoneme matching unit 4, the speech recognition unit 6, and the database generation unit 10 are implemented by, for example, a computer such as a digital computer. The buffer memory 3, the HMM memory 5, the statistical language model memory 7, and the database memory 8 are implemented by, for example, storage devices such as hard disk memories.

[0030] Next, an example of a speech recognition apparatus for English is described. Tables 4 and 5 show examples of context-free grammar rules for English. As the similarity rules, those of Table 3, for example, are used as they are.

[0031]

[Table 4] Example 11
────────────────
X at Y
────────────────
start at 7:30
leave at 6 p.m.
………
────────────────

[0032]

[Table 5] Example 12
─────────────────
Z·X
─────────────────
the train starts
………
─────────────────

【００３３】FIG. 5 shows an example of the word sets S11 (X), S12, S13 (Z), and S14 (Y) in the speech recognition apparatus for English processing, together with the distances between word sets when a predetermined function word is used. In FIG. 5, for example, the distance is 10^-5 for "train leaves" and 0.5 for "leave train". The distance is 10^-5 for "leave Kyoto" and also 10^-5 for "leave at 6 p.m.". The database generation unit 10 performs the database generation process using the examples of Tables 4 and 5 and the similarity rules of Table 3 as follows. A partial sentence is generated for each pair of word sets: for the partial sentence "the train starts" the distance is 0, for "the bus leaves" the distance is 10^-5, and for "leave yacht" the distance is 0.5. The database of partial-sentence examples and distances generated in this way is stored in the database memory 8.
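The database of partial-sentence examples and distances described above can be pictured as a simple lookup table. The following is a minimal sketch only: the dictionary `EXAMPLE_DISTANCES` and the helper `lookup_distance` are hypothetical names, and the normalization of case and spacing is an assumption not stated in the description. Only the three distances quoted in the text are used.

```python
# Hypothetical in-memory stand-in for database memory 8.
# Keys are partial sentences; values are the distances quoted in the text.
EXAMPLE_DISTANCES = {
    "the train starts": 0.0,
    "the bus leaves": 1e-5,
    "leave yacht": 0.5,
}


def lookup_distance(partial_sentence, default=None):
    """Return the stored distance for a partial sentence.

    Case and repeated whitespace are normalized before the lookup
    (an assumption made for this sketch); unknown entries yield `default`.
    """
    key = " ".join(partial_sentence.lower().split())
    return EXAMPLE_DISTANCES.get(key, default)
```

A small distance marks a word combination close to a stored example, while a large one such as 0.5 marks a combination that no example supports well.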

【００３４】FIG. 6 is an operation diagram showing the operation of the speech recognition apparatus for English processing configured as described above. It shows an input sentence, a recognition result sentence with its syntax tree and scores, and a parsing result sentence with its syntax tree and scores. Consider the case where, as shown in FIG. 6(a), the speech of the input sentence "The bus leaves Kyoto at 11 a.m." is input and, as shown in FIG. 6(b), the recognition result sentence "The bus leaves yacht at 11 a.m." is obtained; that is, the place name "Kyoto", a proper noun, is misrecognized as the noun "yacht". FIG. 6(b) shows the scores between words in the recognition result sentence. When the recognition result sentence is then parsed, a syntax tree is obtained on the basis of the smaller scores, as shown in FIG. 6(c), together with the scores at that point. Applying the case of FIG. 6(c) to Equation 1 above, the function value F_error(m) of the ineligible sentence determination function is given by the following equation.

【００３５】[0035]

【数３】[Equation 3] F_error(m) = (5/4) × (10^-5 + 0.5 + 0.5 + 10^-5) = 1.25 × 1.00002 = 1.250025

【００３６】In this example, the function value 1.250025 calculated by Equation 3 above exceeds the threshold value Fth, so the corresponding speech recognition candidate is removed from the speech recognition candidates.
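The arithmetic of Equation 3 and the threshold test can be reproduced in a few lines. This is a sketch under stated assumptions: the form (m / R) × (sum of distances) is read from Equation 3 and from the wording of claim 2, with m = 5 morphemes and R = 4 examples inferred from the factor 5/4; the function names `f_error` and `is_ineligible` are hypothetical, and 0.7 is simply the upper end of the preferred range 0.6 to 0.7 given for Fth.

```python
import math


def f_error(num_morphemes, distances):
    """Return (m / R) * sum(distances), where R is the number of distances.

    Each distance is the semantic distance of one example used when the
    syntax of the candidate was determined.
    """
    num_examples = len(distances)
    if num_examples == 0:
        raise ValueError("at least one example distance is required")
    return (num_morphemes / num_examples) * sum(distances)


def is_ineligible(value, threshold):
    """A candidate is removed when its function value exceeds the threshold."""
    return value > threshold


# Case of FIG. 6(c): five morphemes and four example distances.
value = f_error(5, [1e-5, 0.5, 0.5, 1e-5])
assert math.isclose(value, 1.250025)
assert is_ineligible(value, threshold=0.7)
```

The two distances of 0.5 dominate the sum, which is why the misrecognized candidate lands well above the preferred threshold range.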

【００３７】[0037]

【実施例】[Examples] To evaluate the effectiveness of a speech recognition apparatus equipped with the ineligible sentence detection method described above, the present inventors conducted the following experiment. The experiment checked whether the ineligible sentence determination function F_error can distinguish misrecognized sentences from correct sentences in a recognition experiment using an N-gram language model. Specifically, F_error was calculated for misrecognized result sentences produced by a recognition system using a bi-gram and for correct sentences, and the difference between the function values of the two groups was examined. For correct sentences, it can be expected that the larger the number of morphemes m, that is, the longer the partial sentence, the more complex the sentence structure and the lower the structural ambiguity, so that the function value F_error becomes smaller and the sentence becomes easier to distinguish from misrecognized sentences. For efficient recognition processing, however, it is preferable to make the ineligibility decision as early as possible, that is, on partial sentences of speech recognition candidates while m is still small, and to remove ineligible sentences from the result candidates as misrecognized sentences. To find the number of morphemes m needed to obtain a reliable function value F_error, F_error was calculated for the partial sentence of a speech recognition candidate up to the m-th morpheme of each misrecognized or correct sentence, and the change in F_error as m varied was also examined. Table 6 shows the speech recognition and data conditions of the experiment.

【００３８】[0038]

【表６】[Table 6] Speech recognition and data conditions
───────────────────────────────────
Task                        Spoken dialogue database for travel guidance
───────────────────────────────────
Acoustic model              Speaker-independent HM-net, 401 states, 10 mixture distributions
───────────────────────────────────
Language model              Word bi-gram
───────────────────────────────────
Speech recognition method   One-pass DP, N-best search
───────────────────────────────────
Bi-gram training data       3363 sentences, 222954 words
───────────────────────────────────
Evaluation data             44 sentences included in the training data, 4 speakers
───────────────────────────────────

【００３９】Speech recognition used a word bi-gram as the statistical language model, with a one-pass DP algorithm and an N-best search type speech recognition system. The evaluation data shown in Table 6 were used as the correct sentences. As the misrecognized sentences, the 94 misrecognized sentences obtained by recognizing the evaluation data with recognition systems using the three kinds of N-gram shown in Table 6 were used. FIG. 4 shows, for each number of morphemes m, the average and maximum of the function value F_error for the correct sentences and the function value F_error for the misrecognized sentences. The following can be seen from FIG. 4.
(a) For correct sentences, both the average and the maximum of the function value F_error decrease as the number of morphemes m increases.
(b) For misrecognized sentences, the function value F_error likewise tends to decrease as the number of morphemes increases, but the degree of decrease is smaller than for correct sentences.

【００４０】This shows that, in a left-to-right speech recognition system, at the beginning of a sentence where only a few morphemes have been processed, there is no difference between the function values F_error of correct and misrecognized sentences, and ineligible sentences are difficult to detect. As the number of processed morphemes increases, however, a difference arises between the function values F_error of correct and misrecognized sentences, so ineligible sentences can be detected by setting the threshold value Fth appropriately. It can also be seen that it is more effective to define the threshold value Fth not as a constant but as a function of the number of morphemes m. For example, when the maximum value in FIG. 4 is taken as the threshold value Fth, a sentence whose function value F_error is equal to or greater than Fth can be judged ineligible while the corresponding number of morphemes m is being processed. In this experiment, the proportion of sentences that could be judged to be misrecognitions from their intermediate results was 47.9% (= 45/94) of all misrecognized sentences. These results are summarized as follows.
(a) The two parameters used for ineligible sentence detection, namely (1) the semantic distance between input phrases and examples and (2) the complexity of the sentence structure, expressed as the number of rules relative to the number of morphemes, are effective for judging ineligible sentences, and the proposed ineligible sentence determination function F_error is effective for detecting them.
(b) The performance of ineligible sentence detection depends on the number of morphemes m; the larger m is, the better the detection performance.
(c) Ineligible sentences can be detected more efficiently when the threshold value Fth of the ineligible sentence determination function F_error is varied according to the number of morphemes m.
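A threshold that depends on the number of morphemes, taken as the maximum F_error observed for correct sentences at each m, could be built and applied to partial sentences along the following lines. This is a sketch, not the procedure used in the experiment: the function names are hypothetical, the input format is an assumption, and a strict comparison is used, following the "exceeds" wording of the claims.

```python
from collections import defaultdict


def thresholds_from_correct(samples):
    """Build {m: threshold} from (m, f_error) pairs of correct sentences.

    The threshold for each m is the largest function value seen for a
    correct partial sentence of that length. Function values are assumed
    to be non-negative.
    """
    table = defaultdict(float)
    for m, value in samples:
        table[m] = max(table[m], value)
    return dict(table)


def first_rejection(prefix_values, thresholds):
    """Return the smallest m at which a candidate is judged ineligible.

    prefix_values[k] is the candidate's function value after k + 1
    morphemes. Returns None when no prefix exceeds its threshold or
    when no threshold is known for that length.
    """
    for m, value in enumerate(prefix_values, start=1):
        limit = thresholds.get(m)
        if limit is not None and value > limit:
            return m
    return None
```

Because the per-m maximum shrinks as m grows for correct sentences, a candidate whose value stays high is rejected at a later prefix even if it passes the early ones.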

【００４１】As described above, the present invention provides a method that sequentially detects the ineligibility of misrecognized result sentences produced by speech recognition with a conventional statistical language model, by using a syntax determination technique that determines the syntax while resolving syntactic ambiguity through the semantic distance from examples. The method uses two factors to judge an ineligible sentence: the semantic distance between the phrases contained in a partial sentence of the recognition result and previously learned examples, and the syntactic complexity of that partial sentence. When ineligible sentence detection was applied to the results of recognition systems using bi-grams of various words and parts of speech, it was found that about half of the misrecognized sentences can be detected as ineligible sentences if the threshold value Fth for discriminating misrecognized sentences from correct sentences is set appropriately.

【００４２】Accordingly, the speech recognition unit 6 calculates, for a speech recognition candidate, the function value of a predetermined ineligible sentence determination function representing the degree of ineligibility of that candidate, and removes the candidate from speech recognition when the calculated function value exceeds a predetermined threshold value. Ineligible misrecognition results can therefore be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

【００４３】In the embodiment described above, Equation 1 is used as the ineligible sentence determination function. The present invention is not limited to this, and the ineligible sentence determination function of Equation 4 or Equation 5 below may be used instead.

【数４】 (Equation 4)

【数５】 (Equation 5)

【００４４】Here, compared with the ineligible sentence determination function F_error(m) of Equation 1, the ineligible sentence determination function F_error'(m) of Equation 4 is characterized in that the sum of the semantic distances corresponding to the examples used to determine the syntax of the speech recognition candidate is calculated, and the average semantic distance, obtained by dividing that sum by the number of examples used to determine the syntax of the candidate, is computed. In Equation 5, M denotes the number of rules that contain at least a predetermined number m_a of morphemes, among the example rules used to determine the syntax of the speech recognition candidate at the stage when a predetermined number m of morphemes have been processed; m is preferably 5 or more, and m_a is preferably 3. Compared with the ineligible sentence determination function F_error'(m) of Equation 4, the ineligible sentence determination function F_error''(m) of Equation 5 is characterized in that the reciprocal of the rule count M is used in place of (m/R). By performing speech recognition with the ineligible sentence determination function of Equation 4 or Equation 5, ineligible misrecognition results can be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.
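Equations 4 and 5 are not reproduced in this text, so the following is only a reconstruction from the verbal description and from claims 2 to 4, not the published formulas. The symbol d_i for the semantic distance of the i-th example is an assumption; m, R, and M are used as in the description.

```latex
% Reconstruction from the verbal description; d_i is an assumed symbol for
% the semantic distance of the i-th example used to determine the syntax.
\begin{align}
F_{error}(m)   &= \frac{m}{R} \sum_{i=1}^{R} d_i
  && \text{(Equation 1, form implied by Equation 3)} \\
F_{error}'(m)  &= \frac{m}{R} \cdot \frac{1}{R} \sum_{i=1}^{R} d_i
  && \text{(Equation 4: average distance replaces the sum)} \\
F_{error}''(m) &= \frac{1}{M} \cdot \frac{1}{R} \sum_{i=1}^{R} d_i
  && \text{(Equation 5: } 1/M \text{ replaces } m/R\text{)}
\end{align}
```

Read this way, each variant keeps the distance term and changes only how it is normalized by sentence length and rule count.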

【００４５】[0045]

【発明の効果】[Effects of the Invention] As described in detail above, the speech recognition apparatus according to claim 1 of the present invention comprises speech recognition means for performing speech recognition processing on speech by referring to a predetermined statistical language model, based on the speech signal of an input uttered speech sentence consisting of a word string. The speech recognition means calculates, for a speech recognition candidate, the function value of a predetermined ineligible sentence determination function representing the degree of ineligibility of that candidate and, when the calculated function value exceeds a predetermined threshold value, removes the candidate and performs the speech recognition processing. Ineligible misrecognition results can therefore be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

【００４６】In the speech recognition apparatus according to claim 2, which is the apparatus of claim 1, the function value of the ineligible sentence determination function is the value obtained by calculating the sum of the semantic distances corresponding to the examples used in the speech recognition processing, multiplying that sum by the number of morphemes contained in the speech recognition candidate being processed, and dividing by the number of examples used in the speech recognition processing. The function value can therefore be calculated simply, ineligible misrecognition results can be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

【００４７】In the speech recognition apparatus according to claim 3, which is the apparatus of claim 1, the function value of the ineligible sentence determination function is obtained as follows. The sum of the semantic distances corresponding to the examples used in the speech recognition processing is calculated; the average semantic distance is calculated by dividing that sum by the number of examples used in the speech recognition processing; and the average is multiplied by the number of morphemes contained in the speech recognition candidate being processed and divided by the number of examples used in the speech recognition processing. The function value can therefore be calculated simply, ineligible misrecognition results can be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

【００４８】In the speech recognition apparatus according to claim 4, which is the apparatus of claim 1, the function value of the ineligible sentence determination function is obtained as follows. The sum of the semantic distances corresponding to the examples used in the speech recognition processing is calculated; the average semantic distance is calculated by dividing that sum by the number of examples used in the speech recognition processing; and the average is divided by the number of examples that contain at least a predetermined number of morphemes, among the examples used in the speech recognition processing at the stage when a predetermined number of morphemes have been processed. The function value can therefore be calculated simply, ineligible misrecognition results can be removed, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

【００４９】In the speech recognition apparatus according to claim 5 or 6, which is the apparatus of any one of claims 1 to 4, the threshold value is preferably either a constant or a value varied according to the number of morphemes contained in the partial sentence being processed. Ineligible misrecognition results can therefore be removed more effectively, sentences that are well formed both locally and globally can be output, and a speech recognition apparatus with a higher speech recognition rate than the conventional example can be provided.

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

【図２】FIG. 2 is a diagram showing the relationship between Japanese word sets and distances in the speech recognition apparatus of FIG. 1.

【図３】FIG. 3 is an operation diagram showing the Japanese processing operation of the speech recognition apparatus of FIG. 1, showing an input sentence, a recognition result sentence with its syntax tree and scores, and a parsing result sentence with its syntax tree and scores.

【図４】FIG. 4 is a graph of simulation results for the speech recognition apparatus of FIG. 1, showing the determination function value F_error against the number of input morphemes.

【図５】FIG. 5 is a diagram showing the relationship between English word sets and distances in the speech recognition apparatus of FIG. 1.

【図６】FIG. 6 is an operation diagram showing the English processing operation of the speech recognition apparatus of FIG. 1, showing an input sentence, a recognition result sentence with its syntax tree and scores, and a parsing result sentence with its syntax tree and scores.

[Explanation of symbols]

1: microphone, 2: feature extraction unit, 3: buffer memory, 4: phoneme collation unit, 5: hidden Markov model memory (HMM memory), 6: One-pass DP speech recognition unit, 7: statistical language model memory, 8: database memory of examples and distances (database memory), 10: database generation unit, 11: example memory, 12: word set memory.

Continued from the front page
(56) References
Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J77-D-II, No. 3, March 1994, "Example-driven ambiguity resolution of English prepositional phrase attachment", pp. 566-572, published March 25, 1994
Japanese Society for Artificial Intelligence, Technical Report of the Special Interest Group on Spoken Language Understanding and Dialogue Processing, Vol. 1, "A method of identifying the correct parts of speech recognition results using semantic similarity and its application to speech translation", pp. 7-12, 1997
Transactions of the Information Processing Society of Japan, Vol. 35, No. 3, "Transfer-driven machine translation utilizing empirical knowledge", pp. 414-425, published March 15, 1994
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983, "A Maximum Likelihood Approach to Continuous Speech Recognition", pp. 179-190
(58) Fields searched (Int. Cl.6, DB name)
G10L 3/00 537
G10L 3/00 531
JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition apparatus comprising speech recognition means for performing speech recognition processing on speech by referring to a predetermined statistical language model, based on a speech signal of an input uttered speech sentence consisting of a word string, wherein the speech recognition means calculates, for a speech recognition candidate, a function value of a predetermined ineligible sentence determination function representing a degree of ineligibility of the speech recognition candidate and, when the calculated function value exceeds a predetermined threshold value, removes the speech recognition candidate and performs the speech recognition processing.

2. The speech recognition apparatus according to claim 1, wherein the function value of the ineligible sentence determination function is a value obtained by calculating a sum of semantic distances corresponding to the examples used in the speech recognition processing, multiplying the calculated sum by the number of morphemes contained in the speech recognition candidate subjected to the speech recognition processing, and dividing by the number of examples used in the speech recognition processing.

3. The speech recognition apparatus according to claim 1, wherein the function value of the ineligible sentence determination function is a value obtained by calculating a sum of semantic distances corresponding to the examples used in the speech recognition processing, calculating an average semantic distance by dividing the calculated sum by the number of examples used in the speech recognition processing, multiplying the calculated average semantic distance by the number of morphemes contained in the speech recognition candidate subjected to the speech recognition processing, and dividing by the number of examples used in the speech recognition processing.

4. The speech recognition apparatus according to claim 1, wherein the function value of the ineligible sentence determination function is a value obtained by calculating a sum of semantic distances corresponding to the examples used in the speech recognition processing, calculating an average semantic distance by dividing the calculated sum by the number of examples used in the speech recognition processing, and dividing the calculated average semantic distance by the number of examples that contain at least a predetermined number of morphemes, among the examples used in the speech recognition processing at the stage when a predetermined number of morphemes have been processed.

5. The speech recognition apparatus according to any one of claims 1 to 4, wherein the threshold value is a constant value.

6. The speech recognition apparatus according to any one of claims 1 to 4, wherein the threshold value is varied according to the number of morphemes contained in the partial sentence subjected to the speech recognition processing.