JPH06342298A

JPH06342298A - Speech recognition system

Info

Publication number: JPH06342298A
Application number: JP5130434A
Authority: JP
Inventors: Ryosuke Isotani; 亮輔磯谷; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK; ATR JIDO HONYAKU DENWA
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK; ATR JIDO HONYAKU DENWA
Priority date: 1993-06-01
Filing date: 1993-06-01
Publication date: 1994-12-13

Abstract

PURPOSE: To provide a speech recognition system that performs high-precision recognition by using a language model that can express syntactic and semantic relations over a wider range than word N-grams and can easily be constructed from a text database. CONSTITUTION: Speech uttered with pauses between phrases is input to a phrase candidate output part 1, which outputs a phrase lattice, the result of phrase-by-phrase recognition, to a sentence candidate selection part 2. A language model parameter storage part 3 stores chains of two adjunct words (adjunct-word bigrams) and the corresponding appearance probabilities as parameters of the language model. The sentence candidate selection part 2 selects an optimum phrase candidate sequence from the phrase lattice output by the phrase candidate output part 1, taking into account the appearance probabilities of the adjunct-word bigrams stored in the language model parameter storage part 3, and outputs that sequence as the sentence recognition result.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of industrial application] This invention relates to a speech recognition method, and in particular to a speech recognition method that recognizes uttered speech using a score obtained from an acoustic model and a score obtained from a language model.

【０００２】[0002]

[Prior art] Language models have long been used to improve the performance of speech recognition for sentence utterances and the like. In particular, the N-gram model, which uses the probability that a chain of N words appears, such as a pair of words (bigram) or a triple of words (trigram), is one of the widely used models because it is easy to construct and use and can be integrated naturally with a probabilistic acoustic model. The word N-gram model is considered well suited to expressing, for example, the connection relations between words within a phrase (bunsetsu) in Japanese. Speech recognition using the word N-gram model is described in, for example, IEICE Technical Research Report SP87-23.

[0003] Other language models use syntactic rules written as a context-free grammar or a regular grammar. Regular and context-free grammars can describe more complex linguistic phenomena than the word N-gram model. Speech recognition using syntactic rules is described in, for example, IEICE Technical Research Report SP90-73.

【０００４】[0004]

[Problems to be solved by the invention] The word N-gram model has the advantage that, given a large text database, the model can be constructed from it automatically. However, because of constraints such as the estimation accuracy of the model parameters, N is limited in practice to about 3, so the information that can be expressed is local, and the model lacks the ability to express syntactic or semantic relations between phrases.

[0005] The approach based on syntactic rules, on the other hand, can describe syntactic and semantic relations to some extent, but rule writing has to rely mainly on manual work. Creating and maintaining a grammar for a large-scale task is laborious, and the approach cannot adapt flexibly to changes such as a change of speech recognition task.

[0006] The main object of this invention is therefore to provide a speech recognition method that achieves high-accuracy recognition by using a language model that can express more global syntactic and semantic relations than word N-grams and that can easily be constructed from a text database.

【０００７】[0007]

[Means for solving the problems] The invention according to claim 1 is a speech recognition method that recognizes uttered speech using a score obtained from an acoustic model and a score obtained from a language model, in which the score obtained from the language model is a value determined for the chain of words obtained when attention is restricted to words belonging to a predetermined specific category.

[0008] The inventions according to claims 2 to 6 use, as the category of interest, words belonging to a specific part of speech, adjunct words, the last word of a phrase, independent words, or the first word of a phrase, respectively.

[0009] The invention according to claim 7 uses, as the score obtained from the language model, the sum of the value determined for the chain of words when adjunct words or phrase-final words are the category of interest and the value determined for the chain of words when independent words or phrase-initial words are the category of interest.

【００１０】[0010]

[Operation] For example, if we focus on particles, ignore independent words and everything else, and consider the chain of particles within a sentence, Japanese shows tendencies such as "まで" (made, "until") readily appearing after "から" (kara, "from"), and the case particle "が" (ga) almost never being followed immediately by another case particle "が". Exploiting this for speech recognition, recognition performance can be improved by using a language model that assigns each chain of particles a score according to how likely it is to occur. For instance, suppose the candidates for a recognition result are "私は日本語がわかりません" (watashi wa nihongo ga wakarimasen, "I do not understand Japanese") and "私が日本語がわかりません" (watashi ga nihongo ga wakarimasen), and the acoustic model scores do not distinguish them. The language model then gives the former a higher score than the latter, so the former can be selected as the more plausible candidate. A language model that scores particle chains can be regarded as expressing information that corresponds to global syntactic relations, which word N-grams have difficulty expressing adequately.
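The effect described in this paragraph can be illustrated with a minimal Python sketch. The probability values, the `floor` value for unseen pairs, and the names `PARTICLE_BIGRAM` and `particle_chain_score` are assumptions made for illustration only; none of them come from the patent.

```python
import math

# Hypothetical particle bigram probabilities P(next | prev); illustrative values only.
PARTICLE_BIGRAM = {
    ("<start>", "は"): 0.5,
    ("<start>", "が"): 0.3,
    ("は", "が"): 0.4,
    ("が", "が"): 0.01,   # "が" directly followed by another "が" is rare
    ("が", "<end>"): 0.3,
}

def particle_chain_score(particles, floor=1e-6):
    """Sum of log P(next | prev) over the particle chain of one sentence.

    Pairs missing from the table get the probability `floor` (an assumption).
    """
    chain = ["<start>", *particles, "<end>"]
    return sum(math.log(PARTICLE_BIGRAM.get(pair, floor))
               for pair in zip(chain, chain[1:]))

# Particle chains of the two competing candidates (independent words ignored).
score_wa_ga = particle_chain_score(["は", "が"])  # 私は日本語がわかりません
score_ga_ga = particle_chain_score(["が", "が"])  # 私が日本語がわかりません
```

With these illustrative values `score_wa_ga` exceeds `score_ga_ga`, so when the acoustic scores tie, the first candidate would be preferred.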

[0011] Similarly, if we focus on independent words such as nouns and verbs, ignore adjunct words, and consider the chain within a sentence, information such as the following can be reflected in speech recognition as a linguistic score: "会議" (kaigi, "meeting") tends to be followed by words such as "参加（する）" (sanka suru, "participate") and "発表（する）" (happyō suru, "present"), and "用紙" (yōshi, "form") tends to be followed by words such as "記入（する）" (kinyū suru, "fill in") and "送付（する）" (sōfu suru, "send"). This can be regarded as expressing information that corresponds to semantic relations between the words appearing in a sentence.

[0012] For both particles and independent words, considering longer chains captures information over an even wider range. Scores for such word chains can be obtained statistically from a large text database as their frequencies of occurrence in it.

【００１３】[0013]

[Embodiment] FIG. 1 is a schematic block diagram showing an embodiment of this invention. In FIG. 1, the speech recognition method according to this invention is realized by a phrase candidate output unit 1, a sentence candidate selection unit 2, and a language model parameter storage unit 3. Speech uttered with pauses between phrases is input to the phrase candidate output unit 1, which passes the phrase lattice obtained by phrase-by-phrase recognition to the sentence candidate selection unit 2. The sentence candidate selection unit 2 selects one candidate sequence using the scores given by the language model supplied from the language model parameter storage unit 3, and outputs it as the sentence recognition result. As the language model stored in the language model parameter storage unit 3, the appearance probabilities of chains of two adjunct words (particles and auxiliary verbs), that is, adjunct-word bigrams, are used here.

[0014] FIG. 2 shows an example of the output of the phrase candidate output unit in an embodiment of this invention, and FIG. 3 shows an example of the language model parameters.

[0015] Next, the operation of an embodiment of this invention is described with reference to FIGS. 1 to 3. The speech input to the phrase candidate output unit 1 is processed phrase by phrase using an acoustic model and, as shown in FIG. 2, the top R candidates for each phrase are output together with their scores (acoustic scores). Each phrase candidate is annotated with information on the adjunct word it contains. When a phrase contains more than one adjunct word, the adjunct word appearing last in the phrase is chosen; when a phrase contains no adjunct word, it is regarded as carrying a virtual adjunct word "φ". The speech recognition method described in, for example, IEICE Technical Research Report SP92-33 can be used as the phrase candidate output unit 1.
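The rule for attaching an adjunct word to each phrase candidate can be sketched as follows. This is a minimal illustration that assumes each candidate already carries a morphological segmentation with an adjunct flag per word; the `PhraseCandidate` type, its field names, and the segmentations in the examples are hypothetical and are not specified in the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

PHI = "φ"  # virtual adjunct word for phrases that contain no adjunct word

@dataclass
class PhraseCandidate:
    # Each word is (surface form, is_adjunct); the segmentation is assumed given.
    words: List[Tuple[str, bool]]
    acoustic_score: float

def final_adjunct(candidate: PhraseCandidate) -> str:
    """Return the last adjunct word in the phrase, or PHI if there is none."""
    for surface, is_adjunct in reversed(candidate.words):
        if is_adjunct:
            return surface
    return PHI

one_adjunct = PhraseCandidate([("日本語", False), ("が", True)], -2.0)
two_adjuncts = PhraseCandidate([("行き", False), ("まし", True), ("た", True)], -1.5)
no_adjunct = PhraseCandidate([("明日", False)], -1.0)
```

Here `final_adjunct` yields "が" for the first candidate, "た" (the last of two adjunct words) for the second, and "φ" for the third.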

[0016] The language model parameter storage unit 3 stores, as the parameters of the language model, adjunct-word bigrams and their appearance probabilities, as shown in FIG. 3. Specifically, it stores the value of the conditional probability P(ω_j | ω_i) that the adjunct word ω_j appears when the immediately preceding adjunct word is ω_i. "φ" (no adjunct word), the sentence start (<start>), and the sentence end (<end>) are each also treated as an adjunct word.

[0017] The sentence candidate selection unit 2 selects the optimum phrase candidate sequence from the phrase lattice output by the phrase candidate output unit 1, taking into account the appearance probabilities of the adjunct-word bigrams stored in the language model parameter storage unit 3, and outputs it as the sentence recognition result. Specifically, the phrase sequence that maximizes the weighted sum of the acoustic scores and the linguistic scores (the logarithms of the above probabilities) is found by dynamic programming according to the following procedure.

[0018] Let d(i, r) (1 ≤ i ≤ n, 1 ≤ r ≤ R) be the acoustic score of the r-th candidate for the i-th phrase output by the phrase candidate output unit 1, let ω^t_i(r) be the phrase-final adjunct word of the r-th candidate for the i-th phrase, and let ω_lang be the weight of the language model score relative to the acoustic score. Using dynamic programming, the following quantities are computed. Initial condition: for r = 1, …, R,

【００１９】[0019]

【数１】 [Equation 1]

[0020] Recurrence: for i = 2, …, n and r = 1, …, R,

【００２１】[0021]

【数２】 [Equation 2]

[0022] Termination: for r = 1, …, R,

【００２３】[0023]

【数３】 [Equation 3]
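The images for Equations 1 to 3 are not reproduced in this text. The following is one reconstruction that is consistent with the definitions in [0017] and [0018], offered as a sketch rather than the patent's exact notation. The accumulated score G(i, r), the back-pointer B(i, r), and the final score G_end(r) are symbol names assumed here.

```latex
\begin{align*}
\text{Initial condition } (r = 1,\dots,R):\quad
  & G(1,r) = d(1,r) + \omega_{lang}\,\log P\bigl(\omega^{t}_{1}(r) \mid \langle\mathrm{start}\rangle\bigr) \\
\text{Recurrence } (i = 2,\dots,n;\ r = 1,\dots,R):\quad
  & G(i,r) = d(i,r) + \max_{1 \le r' \le R}\Bigl[\,G(i-1,r') + \omega_{lang}\,\log P\bigl(\omega^{t}_{i}(r) \mid \omega^{t}_{i-1}(r')\bigr)\Bigr] \\
  & B(i,r) = \operatorname*{arg\,max}_{1 \le r' \le R}\Bigl[\,G(i-1,r') + \omega_{lang}\,\log P\bigl(\omega^{t}_{i}(r) \mid \omega^{t}_{i-1}(r')\bigr)\Bigr] \\
\text{Termination } (r = 1,\dots,R):\quad
  & G_{end}(r) = G(n,r) + \omega_{lang}\,\log P\bigl(\langle\mathrm{end}\rangle \mid \omega^{t}_{n}(r)\bigr)
\end{align*}
```

Under this reading, the best final candidate is the r that maximizes G_end(r), and following B(i, r) backwards from it recovers the phrase sequence.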

[0024] After these quantities are computed, backtracking yields the optimum phrase sequence, which is the sentence recognition result.
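The dynamic programming procedure above (initial condition, recurrence, termination, backtracking) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the function name `select_sentence`, the lattice layout, the `floor` probability for unseen adjunct pairs, and all numeric values are hypothetical.

```python
import math

def select_sentence(lattice, bigram, lm_weight=1.0, floor=1e-6):
    """Pick the phrase sequence maximizing acoustic score + lm_weight * log P.

    lattice: one list per phrase of (text, acoustic_score, final_adjunct).
    bigram:  dict mapping (previous_adjunct, adjunct) -> probability.
    Unseen pairs get probability `floor` (an assumption, not from the patent).
    """
    def lm(prev, cur):
        return lm_weight * math.log(bigram.get((prev, cur), floor))

    # Initial condition: first phrase, preceded by the sentence start.
    scores = [d + lm("<start>", adj) for _, d, adj in lattice[0]]
    backptrs = []
    # Recurrence: extend the best-scoring predecessor for every candidate.
    for i in range(1, len(lattice)):
        prev_phrase = lattice[i - 1]
        new_scores, ptrs = [], []
        for _, d, adj in lattice[i]:
            best_prev = max(range(len(prev_phrase)),
                            key=lambda q: scores[q] + lm(prev_phrase[q][2], adj))
            new_scores.append(d + scores[best_prev]
                              + lm(prev_phrase[best_prev][2], adj))
            ptrs.append(best_prev)
        scores = new_scores
        backptrs.append(ptrs)
    # Termination: add the transition to the sentence end.
    finals = [s + lm(adj, "<end>") for s, (_, _, adj) in zip(scores, lattice[-1])]
    r = max(range(len(finals)), key=finals.__getitem__)
    best = finals[r]
    # Backtracking: follow the stored predecessors from the last phrase.
    path = [r]
    for ptrs in reversed(backptrs):
        r = ptrs[r]
        path.append(r)
    path.reverse()
    return best, [lattice[i][r][0] for i, r in enumerate(path)]

# Illustrative lattice: the first phrase has two acoustically tied candidates.
lattice = [
    [("私は", -1.0, "は"), ("私が", -1.0, "が")],
    [("日本語が", -2.0, "が")],
    [("わかりません", -1.5, "ません")],
]
bigram = {
    ("<start>", "は"): 0.5, ("<start>", "が"): 0.3,
    ("は", "が"): 0.4, ("が", "が"): 0.01,
    ("が", "ません"): 0.2, ("ません", "<end>"): 0.6,
}
best, words = select_sentence(lattice, bigram)
```

In this toy lattice the adjunct bigram scores break the acoustic tie, and `words` starts with "私は" rather than "私が".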

[0025] The values of the language model parameters P(ω_j | ω_i) can be estimated from a large text database as follows. The phrase-final adjunct words are extracted from the text data, and the frequency of occurrence c(ω_i, ω_j) of each chain (ω_i, ω_j) is counted. The absence of an adjunct word, the sentence start, and the sentence end are each treated as a virtual adjunct word. The estimate of P(ω_j | ω_i) is given by Equation 4:

【００２６】[0026]

【数４】 [Equation 4]

[0027] In particular, when the number of parameters to be estimated is large relative to the amount of text data, for example when triples of words (conditional probabilities P(ω_k | ω_i, ω_j)) are used as the language model, better parameter estimates can be obtained by smoothing the parameters. Exactly the same techniques as for ordinary word N-grams can be used for this smoothing. Concrete methods are introduced in, for example, the book "確率モデルによる音声認識" (Speech Recognition by Probabilistic Models) by Seiichi Nakagawa.
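The counting procedure can be sketched in Python as below. The image for Equation 4 is not reproduced in this text, so the sketch assumes the usual relative-frequency estimate, c(ω_i, ω_j) divided by the total count of chains that start with ω_i. The optional additive smoothing parameter `k` is only one common N-gram smoothing technique, chosen here for illustration; the patent does not name a specific method. The function name and the toy data are hypothetical.

```python
from collections import Counter

def estimate_bigram(sentences, k=0.0):
    """Estimate P(cur | prev) from chains of phrase-final adjunct words.

    sentences: list of sentences, each a list of phrase-final adjunct words
               (with "φ" for phrases that have no adjunct word).
    k:         additive smoothing constant; k=0 gives the relative frequency.
    """
    pair_counts = Counter()
    vocab = set()
    for adjuncts in sentences:
        chain = ["<start>", *adjuncts, "<end>"]
        vocab.update(chain[1:])            # everything that can follow a context
        pair_counts.update(zip(chain, chain[1:]))
    context_counts = Counter()
    for (prev, _), count in pair_counts.items():
        context_counts[prev] += count
    probs = {}
    for prev, total in context_counts.items():
        for cur in vocab:
            count = pair_counts[(prev, cur)]
            if count or k:
                probs[(prev, cur)] = (count + k) / (total + k * len(vocab))
    return probs

# Toy corpus: each inner list is the adjunct chain of one sentence.
corpus = [["は", "が", "φ"], ["は", "を", "た"], ["が", "φ"]]
mle = estimate_bigram(corpus)
smoothed = estimate_bigram(corpus, k=1.0)
```

In this toy corpus "は" is followed once by "が" and once by "を", so the unsmoothed estimate of P(が | は) is 0.5, while the smoothed table also gives unseen pairs such as ("が", "が") a small nonzero probability.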

[0028] A language model using adjunct-word bigrams has been described here as an example, but the case of independent words is exactly the same. Adjunct-word bigrams and independent-word bigrams can also be used together. Extension to chains of three or more adjunct words or independent words is also easy. In the trigram case, for example, the sentence candidate selection unit 2 can treat each pair of adjacent phrases as a unit, which allows the optimum solution to be selected by dynamic programming in the same way as above.

[0029] Furthermore, a hidden Markov model can be used to model the chain of words belonging to a specific category, instead of the appearance probabilities of N-word chains described above. A language model using a hidden Markov model is described in the Proceedings of the Autumn Meeting of the Acoustical Society of Japan, fiscal year Heisei 4 (1992), paper 2-Q-11. If only the words belonging to a specific category are considered instead of all the words in a sentence, a language model can be created in exactly the same way, and it is also easy to use for speech recognition.

【００３０】[0030]

[Effects of the invention] As described above, according to this invention, the score obtained from the language model is a value determined for the chain of words obtained when attention is restricted to words belonging to a predetermined specific category. By using a language model that can thereby express global syntactic and semantic relations and can easily be constructed from a text database, speech can be recognized with high accuracy.

[Brief description of drawings]

FIG. 1 is a schematic block diagram of an embodiment of this invention.

FIG. 2 is a diagram showing an example of the output of the phrase candidate output unit in an embodiment of this invention.

FIG. 3 is a diagram showing an example of the language model parameters in an embodiment of this invention.

[Explanation of symbols]

1: phrase candidate output unit; 2: sentence candidate selection unit; 3: language model parameter storage unit

Claims

[Claims]

1. A speech recognition method for recognizing uttered speech using a score obtained from an acoustic model and a score obtained from a language model, characterized in that a value determined for a chain of words, when attention is paid to words belonging to a predetermined specific category, is used as the score obtained from the language model.

2. The speech recognition method according to claim 1, wherein words belonging to a specific part of speech are used as the category of interest.

3. The speech recognition method according to claim 1, wherein adjunct words are used as the category of interest.

4. The speech recognition method according to claim 1, wherein the last word of a phrase is used as the category of interest.

5. The speech recognition method according to claim 1, wherein independent words are used as the category of interest.

6. The speech recognition method according to claim 1, wherein the first word of a phrase is used as the category of interest.

7. The speech recognition method according to claim 1, wherein the sum of a value determined for the chain of words when adjunct words or phrase-final words are used as the category of interest and a value determined for the chain of words when independent words or phrase-initial words are used as the category of interest is used as the score obtained from the language model.