JP2010078877A - Speech recognition device, speech recognition method, and speech recognition program

Info

Publication number: JP2010078877A
Application number: JP2008246783A
Authority: JP
Inventors: Hajime Kobayashi; 載小林; Ikuo Fujita; 育雄藤田
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2008-09-25
Filing date: 2008-09-25
Publication date: 2010-04-08

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method that recognizes an utterance including a predetermined keyword and accurately extracts the keyword.

SOLUTION: A speech recognition device performs speech recognition using an acoustic model, a language model, and a dictionary. When an utterance is given, a plurality of candidate patterns, each composed of a plurality of words, are generated from the utterance content. The candidate pattern with the largest total score is determined as the recognition result, and a keyword included in that candidate pattern is extracted with reference to the dictionary. In the language model, the appearance probability of a word string and the coefficient for a keyword-similar word are set so that the language score of any candidate pattern including the keyword-similar word takes the minimum value. Such a candidate pattern is therefore never selected as the recognition result. Consequently, the candidate pattern including the keyword becomes the recognition result, and the keyword is correctly extracted from it.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech recognition technique using a language model.

Speech recognition methods using a language model are known. A language model is a database that stores an appearance probability for each combination of adjacent words, and the N-gram model is a well-known representative of statistical language models. An N-gram language model gives the appearance probability for a combination of N consecutive words. A speech recognition method using a language model derives the recognition result while taking into account the appearance probabilities of combinations of words.

An example of a speech recognition device using a language model is described in Patent Document 1. In the method of Patent Document 1, language patterns of character strings defined by a descriptive grammar (network grammar) are mixed into a statistical language model, and the back-off coefficient and the 1-gram probability of the words in the character strings defined by the network grammar are set to 0 (zero). This prohibits connections between words defined by the network grammar and words that are not, so that character strings conforming to the network grammar can be correctly detected even when the statistical language model is applied.

Patent Document 1: Japanese Patent No. 3950957

However, with the above method, recognition accuracy may decrease for language patterns that contain a word defined by the network grammar but are not themselves defined by the network grammar.

The above is one example of the problems to be solved by the present invention. An object of the present invention is to provide a speech recognition method capable of recognizing an utterance including a predetermined keyword and accurately extracting the keyword.

The invention according to claim 1 is a speech recognition device comprising: an acoustic model storage unit that stores an acoustic model; a dictionary storage unit that stores a dictionary having a plurality of words including keywords; a language model storage unit that stores a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word; a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation unit that, for each candidate pattern, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score; recognition result determination means for determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and keyword extraction means for extracting, with reference to the dictionary, a keyword from the candidate pattern determined as the recognition result, wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

The invention according to claim 5 is a speech recognition method using an acoustic model, a dictionary having a plurality of words including keywords, and a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word, the method comprising: a candidate pattern creation step of creating, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation step of calculating, for each candidate pattern, an acoustic score based on the acoustic model and a language score based on the language model, and calculating a total score based on the acoustic score and the language score; a recognition result determination step of determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and a keyword extraction step of extracting, with reference to the dictionary, a keyword from the candidate pattern determined as the recognition result, wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

The invention according to claim 6 is a speech recognition program executed by a computer, the program causing the computer to function as: an acoustic model storage unit that stores an acoustic model; a dictionary storage unit that stores a dictionary having a plurality of words including keywords; a language model storage unit that stores a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word; a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation unit that, for each candidate pattern, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score; recognition result determination means for determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and keyword extraction means for extracting, with reference to the dictionary storage unit, a keyword from the candidate pattern determined as the recognition result, wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

In a preferred embodiment of the present invention, a speech recognition device comprises: an acoustic model storage unit that stores an acoustic model; a dictionary storage unit that stores a dictionary having a plurality of words including keywords; a language model storage unit that stores a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word; a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation unit that, for each candidate pattern, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score; recognition result determination means for determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and keyword extraction means for extracting, with reference to the dictionary, a keyword from the candidate pattern determined as the recognition result. In the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

The above speech recognition device performs speech recognition using an acoustic model, a language model, and a dictionary. The language model has, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word. The dictionary stores a plurality of words including keywords. When an utterance is made, candidate patterns composed of combinations of words are created from the utterance content. Usually, several candidate patterns are created for one utterance. An acoustic score and a language score are then calculated for each candidate pattern, and a total score is calculated from them. The candidate pattern with the maximum total score is taken as the recognition result, and the keywords included in that candidate pattern are extracted with reference to the dictionary.

In the language model, the appearance probability of the word string and the coefficient for a keyword-similar word have values that make the language score of a candidate pattern including that keyword-similar word the minimum value. The language score of such a candidate pattern therefore takes the minimum value, and the pattern is never determined to be the recognition result. As a result, the candidate pattern including the keyword becomes the recognition result, and the keyword is correctly extracted from it.

In one aspect of the above speech recognition device, the score calculation unit calculates the product of the acoustic score and the language score as the total score, and the appearance probability of the word string and the coefficient for the keyword-similar word are 0. In this aspect, because both values are 0, a candidate pattern including the keyword-similar word has a total score of 0 and is not determined to be the recognition result.

In a preferred example, the language model is an N-gram language model, the appearance probability of the word string is a 1-gram probability, and the coefficient is a back-off coefficient. For keyword-similar words, both the 1-gram probability and the back-off coefficient are set to 0.

In another preferred example, the keyword-similar words include words that form part of a keyword and words that contain part or all of a keyword.

In another embodiment of the present invention, a speech recognition method uses an acoustic model, a dictionary having a plurality of words including keywords, and a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word. The method comprises: a candidate pattern creation step of creating, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation step of calculating, for each candidate pattern, an acoustic score based on the acoustic model and a language score based on the language model, and calculating a total score based on the acoustic score and the language score; a recognition result determination step of determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and a keyword extraction step of extracting, with reference to the dictionary, a keyword from the candidate pattern determined as the recognition result. In the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

In the above speech recognition method as well, speech recognition is performed using an acoustic model, a language model, and a dictionary. The candidate pattern with the maximum total score is taken as the recognition result, and the keywords included in that candidate pattern are extracted with reference to the dictionary. In the language model, the appearance probability of the word string and the coefficient for a keyword-similar word have values that make the language score of a candidate pattern including that keyword-similar word the minimum value. The language score of such a candidate pattern therefore takes the minimum value, and the pattern does not become the recognition result. As a result, the candidate pattern including the keyword becomes the recognition result, and the keyword is correctly extracted from it.

In another embodiment of the present invention, a speech recognition program executed by a computer causes the computer to function as: an acoustic model storage unit that stores an acoustic model; a dictionary storage unit that stores a dictionary having a plurality of words including keywords; a language model storage unit that stores a language model having, for a plurality of words, the appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word; a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words; a score calculation unit that, for each candidate pattern, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score; recognition result determination means for determining, as the recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and keyword extraction means for extracting, with reference to the dictionary storage unit, a keyword from the candidate pattern determined as the recognition result. In the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that make the language score of a candidate pattern including the keyword-similar word the minimum value.

The speech recognition device described above can be realized by executing the above speech recognition program on a computer. The speech recognition program can be used in a state recorded on a recording medium.

Preferred embodiments of the present invention are described below with reference to the drawings.

[Basic explanation]
Before describing a specific embodiment, the speech recognition method according to the present invention is explained. FIG. 1 is a schematic configuration diagram of a speech recognition device using a language model. Such a device recognizes a user's utterance as a combination of words. The process of recognizing an utterance as a combination of words is called "dictation". Recognizing an utterance as a combination of words makes it possible to recognize sentences other than those prepared in advance, that is, sentences formed by arbitrarily combining a plurality of words.

As shown in FIG. 1, the speech recognition device includes a dictation unit 20 that performs dictation, a language model database 24 that stores a language model (hereinafter, "database" may be abbreviated as "DB"), an acoustic model DB 25 that stores an acoustic model, a dictionary DB 26 that stores a dictionary, and a keyword extraction unit 30.

The dictionary DB 26 stores a large number of words. By referring to the dictionary, the utterance content is recognized as a combination of words. As described later, the words stored in the dictionary include keywords.

The acoustic model DB 25 is a database that stores sound features in units of syllables or phonemes. The sound features of each word in the utterance are determined by comparison with the sound features stored in the acoustic model, and the result is calculated as an acoustic score. In other words, the "acoustic score" is the degree of similarity between the general sound features for a combination of words and the sound features of the utterance, where the general sound features are generated from the acoustic model.

FIG. 2 shows an example of a context-dependent acoustic model, which is one type of acoustic model. A "context-dependent acoustic model" is an acoustic model in which the sound features of a given syllable change depending on the syllables before and after it. In the example of FIG. 2(a), the same character "shi" (し) is recognized as having different sound features because the preceding and following syllables differ. Consequently, as shown in FIG. 2(b), even the same sound "mitakashino" (みたかしの) yields different acoustic scores depending on the combination of words.

The language model DB 24 is a database that stores the appearance probabilities of combinations of adjacent words. The present invention uses an N-gram language model, which is one type of language model. An N-gram language model is a database that stores the probability (the N-gram probability) that a word W appears given that a combination of N-1 words immediately precedes it. A language score is calculated using the language model. The "language score" is a value indicating the appearance probability (appearance frequency) of a combination of adjacent words.

FIG. 3 shows a configuration example of a 2-gram language model. A 2-gram language model stores the probability that a word W appears given that a single word X immediately precedes it. The word combinations stored in the language model and their appearance probabilities are created from a large number of example sentences.
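The text does not say how the probabilities are derived from the example sentences. As a minimal sketch, assuming plain relative-frequency estimation (an assumption, not the patent's stated procedure), 2-gram probabilities could be counted as follows. The function name `estimate_bigrams`, the toy corpus, and the word "eki" are hypothetical.

```python
from collections import Counter

def estimate_bigrams(sentences):
    """Estimate P(cur | prev) as count(prev, cur) / count(prev) over the corpus."""
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        # <s> and </s> mark the sentence head and sentence end.
        seq = ["<s>", *words, "</s>"]
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical two-sentence corpus, romanized for readability.
corpus = [["mitakashi", "no", "shogakko"], ["mitakashi", "no", "eki"]]
probs = estimate_bigrams(corpus)
```

A practical model would also apply smoothing so that unseen pairs can be handled through back-off coefficients; that step is omitted here.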

In FIG. 3, the "1-gram probability P(w1)" is the appearance probability of the word w1. The "2-gram probability P(w1 | w2)" is the appearance probability of the word w1 given that the word w2 immediately precedes it. The "back-off coefficient Bo(w2)" is a coefficient provided for obtaining the appearance probability of a word combination that has no entry among the 2-gram probabilities. Specifically, when the combination in which w1 follows w2, that is, P(w1 | w2), has no 2-gram entry, P(w1 | w2) is obtained from the back-off coefficient by the following equation.

P(w1 | w2) = P(w1) × Bo(w2)
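The back-off rule can be sketched in a few lines. The dictionary-based tables and the function name `bigram_prob` are illustrative assumptions. The value 0.0001478 is the 2-gram probability quoted in this document for "no" following "Mitaka-shi", while the 1-gram probability and back-off coefficient below are made-up numbers.

```python
def bigram_prob(w_prev, w, unigram, bigram, backoff):
    """Return P(w | w_prev); back off to P(w) * Bo(w_prev) if no 2-gram entry exists."""
    if (w_prev, w) in bigram:
        return bigram[(w_prev, w)]
    # Missing words are treated as having probability / coefficient 0 (an assumption).
    return unigram.get(w, 0.0) * backoff.get(w_prev, 0.0)

bigram = {("mitakashi", "no"): 0.0001478}   # value quoted in the text
unigram = {"takashino": 0.00001}            # hypothetical 1-gram probability
backoff = {"mi": 0.5}                       # hypothetical back-off coefficient

p_stored = bigram_prob("mitakashi", "no", unigram, bigram, backoff)
p_backed_off = bigram_prob("mi", "takashino", unigram, bigram, backoff)
```

Here `p_stored` comes straight from the 2-gram table, whereas `p_backed_off` is the product of the 1-gram probability of the second word and the back-off coefficient of the first.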
Next, an example of calculating a language score using the language model is described. FIG. 4 shows the language score calculated when the utterance "mitakashi no shogakko" (みたかしのしょうがっこう) is judged to be the word combination "Mitaka-shi / no / shogakko" (三鷹市／の／小学校, "elementary school in Mitaka City"). Such a word combination is hereinafter called a "candidate pattern". In the following description, the symbol "/" marks a boundary where the utterance is divided into words. In the configuration example of the 2-gram language model, "<s>" is a code indicating the head of a sentence and "</s>" is a code indicating the end of a sentence.

The dictation unit 20 divides the utterance "Mitaka-shi no shogakko" (三鷹市の小学校) into the words "Mitaka-shi", "no", and "shogakko", and obtains the language score for each adjacent pair from the 2-gram language model. The product of these language scores is the language score for the utterance. FIG. 4 is an example in which a 2-gram probability exists for every word pair. Referring to the 2-gram probability column in the configuration example of the 2-gram language model, the language scores for the individual word pairs are as follows.

・"Mitaka-shi" at the head of the sentence: 0.0000028
・"Mitaka-shi" → "no": 0.0001478
・"no" → "shogakko": 0.0003822
・"shogakko" at the end of the sentence: 0.1060425

Therefore, as shown in FIG. 4, the language score for the candidate pattern "Mitaka-shi / no / shogakko" (三鷹市／の／小学校) is the product of these individual language scores.
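As a worked illustration, the product of the four values listed above can be computed as follows. The table layout, the romanized keys, and the function name `language_score` are assumptions made for this sketch; only the four probabilities come from the text.

```python
import math

# 2-gram probabilities quoted in the text (keys romanized for readability).
BIGRAM = {
    ("<s>", "mitakashi"): 0.0000028,
    ("mitakashi", "no"): 0.0001478,
    ("no", "shogakko"): 0.0003822,
    ("shogakko", "</s>"): 0.1060425,
}

def language_score(words):
    """Multiply the 2-gram probabilities of all adjacent pairs, including <s> and </s>."""
    seq = ["<s>", *words, "</s>"]
    return math.prod(BIGRAM[pair] for pair in zip(seq, seq[1:]))

score = language_score(["mitakashi", "no", "shogakko"])
print(f"{score:.3e}")  # roughly 1.68e-14
```

The very small magnitude of this product is one practical reason such scores are often handled as logarithms.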

FIG. 5 shows another example: the language score calculated when the same utterance "mitakashi no shogakko" is judged to be the candidate pattern "mi / Takashino / shogakko" (見／高篠／小学校). As in the example of FIG. 4, the dictation unit 20 basically obtains the appearance probabilities of consecutive word pairs from the 2-gram language model and takes their product as the language score. In the example of FIG. 5, however, no 2-gram probability exists for "Takashino" following "mi", so this probability is obtained by multiplying the back-off coefficient of "mi" by the 1-gram probability of "Takashino". In this way, the language score for a combination of words is calculated using the language model.

The recognition result of dictation is the combination of words that maximizes the product of the acoustic score and the language score. Specifically, as shown in FIG. 6(a), an acoustic score is calculated for each of the words "Mitaka-shi", "no", and "shogakko", and a language score is calculated for their combination. The product of the acoustic score and the language score is then calculated as the total score. One utterance is recognized as a plurality of candidate patterns, and a total score is calculated for each of them. The candidate pattern with the maximum total score is output as the recognition result.
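The selection step can be sketched as below. The `Candidate` class, the function name `recognize`, and all score values are hypothetical and chosen only to illustrate picking the candidate pattern with the largest product.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    words: tuple
    acoustic_score: float
    language_score: float

    @property
    def total_score(self) -> float:
        # Total score is the product of the acoustic and language scores.
        return self.acoustic_score * self.language_score

def recognize(candidates):
    """Return the candidate pattern with the maximum total score."""
    return max(candidates, key=lambda c: c.total_score)

# Hypothetical scores for two segmentations of the same sound.
candidates = [
    Candidate(("mitakashi", "no"), acoustic_score=0.6, language_score=1e-4),
    Candidate(("mi", "takashino"), acoustic_score=0.7, language_score=2e-5),
]
best = recognize(candidates)
```

In this toy case the second candidate has the better acoustic score, yet the first wins because its language score is higher.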

Next, keyword extraction is described. As shown in FIG. 1, in the present invention the speech recognition device extracts the keywords contained in the recognition result of an utterance. The keywords are determined in advance, and information distinguishing keywords from non-keywords is stored in the dictionary. For example, among the many words stored in the dictionary, each keyword carries a keyword flag identifying it as a keyword. Preferably, operation commands of the device to which the speech recognition method of the present invention is applied are set as keywords.

As shown in FIG. 1, the keyword extraction unit 30 refers to the dictionary, extracts the keyword contained in the recognition result "Mitaka-shi / no / shogakko" obtained by the dictation unit 20 ("Mitaka-shi" in this example), and outputs it. The output keyword is used to operate the device. If no keyword can be output, the speech recognition device must ask the user to speak again. The speech recognition device is therefore required to extract keywords with high accuracy so that the user does not have to repeat utterances.
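A minimal sketch of flag-based keyword extraction follows. Representing the dictionary as a mapping from word to a boolean keyword flag is an assumption for illustration, as are the names `DICTIONARY` and `extract_keywords`.

```python
# Hypothetical dictionary: word -> keyword flag (True marks a keyword).
DICTIONARY = {"mitakashi": True, "no": False, "shogakko": False,
              "mi": False, "takashino": False}

def extract_keywords(recognized_words, dictionary):
    """Return the words of the recognition result that are flagged as keywords."""
    return [w for w in recognized_words if dictionary.get(w, False)]

keywords = extract_keywords(["mitakashi", "no", "shogakko"], DICTIONARY)
missed = extract_keywords(["mi", "takashino", "shogakko"], DICTIONARY)

if not missed:
    # With the wrong segmentation no keyword is found,
    # so the device would have to ask the user to speak again.
    print("No keyword found; please repeat.")
```

The second call shows why segmentation matters: the same sound yields no keyword when it is split into non-keyword words.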

To raise keyword extraction accuracy, it is preferable that a candidate pattern containing the keyword as a single independent word be obtained as the recognition result. FIG. 6(b) shows an example of keyword extraction. It shows how the total score is calculated when the utterance "mitakashino" (みたかしの) is segmented into candidate pattern 1, "Mitaka City / no" (三鷹市／の), and when it is segmented into candidate pattern 2, "Mi / Takashino" (見／高篠). As described above, the dictation unit 20 outputs the candidate pattern with the highest total score as the recognition result. In the example of FIG. 6(b), candidate pattern 1 should therefore receive a higher total score than candidate pattern 2. To achieve this, the present invention sets the language score of "Mi / Takashino" to the minimum value, namely 0 (zero). The total score of candidate pattern 2 then becomes 0, the minimum value, and candidate pattern 1 is obtained as the recognition result.

When the total score is calculated by multiplying the acoustic score and the language score as described above, the minimum score is 0. In contrast, when the acoustic score and the language score are logarithmic values, as in the embodiment described later, the minimum score is the negative value with the largest absolute value.
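As a minimal sketch of the point above, the following Python snippet compares two candidate patterns in both the probability domain and the log domain. All scores are made-up illustrative numbers, and `LOG_ZERO = -99.0` is an assumed stand-in for the most negative log value; neither comes from the patent.

```python
import math

LOG_ZERO = -99.0  # assumed stand-in for log(0), i.e. the "minimum value"

# Hypothetical (acoustic, language) probabilities for two candidate patterns.
candidates = {
    "mitakashi/no": (0.020, 0.30),
    "mi/takashino": (0.025, 0.0),  # language score forced to zero
}

def total_score(acoustic, language):
    """Total score as the product of acoustic and language scores."""
    return acoustic * language

def log_total_score(acoustic, language):
    """Same score in the log domain; LOG_ZERO replaces log(0)."""
    log_lm = math.log10(language) if language > 0 else LOG_ZERO
    return math.log10(acoustic) + log_lm

best_linear = max(candidates, key=lambda k: total_score(*candidates[k]))
best_log = max(candidates, key=lambda k: log_total_score(*candidates[k]))
```

In both domains the pattern whose language score is suppressed loses, even though its acoustic score is the higher of the two.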

In the example of FIG. 6(b), "Mitaka City" is a commonly used word, and the example sentences used to build the language model include the word string "Mitaka City / no". As shown in FIG. 4, the language model therefore contains a 2-gram probability for "Mitaka City / no". By contrast, "Mi Takashino" is a word string that is not normally used, and as shown in the example of FIG. 5, the language model contains no 2-gram probability for "Mi / Takashino". The language score for "Mi / Takashino" is therefore formed as the product of a 1-gram probability and a back-off coefficient, as shown in FIG. 5. To make the language score of "Mi / Takashino" 0, both the 1-gram probability and the back-off coefficient of the word "Takashino" are set to 0. The language score of "Mi / Takashino" then becomes 0, and candidate pattern 2 is no longer chosen as the recognition result. In other words, to extract the keyword "Mitaka City" accurately, the 1-gram probability and the back-off coefficient of the non-keyword "Takashino", which resembles the word "Mitaka City", are set to 0. Hereinafter, a non-keyword that resembles a keyword in this way is called a "keyword-similar word".

For a keyword-similar word, both the 1-gram probability and the back-off coefficient must be set to 0, for the following reason. When the candidate pattern for an utterance contains the word string "Mi / Takashino" and no corresponding 2-gram probability exists, the language score of "Mi / Takashino" is calculated as
language score of "Mi / Takashino" = back-off coefficient of "Mi" × 1-gram probability of "Takashino".
Setting the 1-gram probability of the keyword-similar word "Takashino" to 0 therefore makes the language score of "Mi / Takashino" 0.

On the other hand, when the candidate pattern for an utterance contains the word string "Takashino / Mi" and no corresponding 2-gram probability exists, the language score of "Takashino / Mi" is calculated as
language score of "Takashino / Mi" = back-off coefficient of "Takashino" × 1-gram probability of "Mi".
Unless the back-off coefficient of the keyword-similar word "Takashino" is also set to 0, the language score of the word string "Takashino / Mi" cannot be made 0. Thus, to make the language score 0 both for word strings in which the keyword-similar word comes first and for word strings in which it comes second, both the 1-gram probability and the back-off coefficient of the keyword-similar word must be set to 0.
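The two cases can be checked with a toy 2-gram model. This is only an illustrative sketch: the probabilities and back-off coefficients are invented, and `bigram_score` and `suppress` are hypothetical helper names, not part of the patent.

```python
# Toy 2-gram model in the probability domain; all numbers are hypothetical.
unigram = {"mi": 0.01, "takashino": 0.002, "mitakashi": 0.005, "no": 0.05}
backoff = {"mi": 0.4, "takashino": 0.3, "mitakashi": 0.5, "no": 0.6}
bigram = {("mitakashi", "no"): 0.2}

def bigram_score(prev, word):
    """Return P(word | prev), backing off to backoff(prev) * P(word)."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    return backoff[prev] * unigram[word]

def suppress(word):
    """Zero both the 1-gram probability and the back-off coefficient."""
    unigram[word] = 0.0
    backoff[word] = 0.0

# Zeroing only the 1-gram probability leaves "takashino / mi" non-zero.
unigram["takashino"] = 0.0
after_unigram_only = (
    bigram_score("mi", "takashino"),
    bigram_score("takashino", "mi"),
)

# Zeroing both values removes the word in either position.
suppress("takashino")
after_both = (
    bigram_score("mi", "takashino"),
    bigram_score("takashino", "mi"),
)
```

With only the 1-gram probability zeroed, the string in which "takashino" comes first still scores above zero through its back-off coefficient, which is why both values are rewritten.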

FIG. 7 shows a configuration example of a 2-gram language model. FIG. 7(a) is an example to which the present invention is not applied; the 1-gram probability and the back-off coefficient of the keyword-similar word "Takashino" both have some value. FIG. 7(b) is an example to which the present invention is applied; as indicated by reference numeral 90, both the 1-gram probability and the back-off coefficient of the keyword-similar word "Takashino" are set to 0.

Next, keyword-similar words are described. A keyword-similar word is a non-keyword that resembles a keyword: for example, a word that forms part of a keyword, or a word that contains part or all of a keyword. A word that forms part of a keyword is, for example, "Mitaka" (三鷹) for the keyword "Mitaka City" (三鷹市, read "mitakashi"). A word that contains part of a keyword is, as in the example above, "Takashino" (高篠), which contains "takashi", a part of the reading "mitakashi" of the keyword. A word that contains all of a keyword is the keyword with some word (α) attached, covering both the case where a word is added before the keyword ("α + keyword") and the case where a word is added after it ("keyword + α"). The word α may be, for example, a number.

As described above, in the present invention the speech recognition device first segments an utterance into words to create candidate patterns, and takes as the recognition result the candidate pattern whose total score, the product of the acoustic scores of the words and the language score of the word combination, is largest. The speech recognition device then extracts and outputs the keyword contained in the recognition result. For a keyword-similar word that resembles a predetermined keyword, both the 1-gram probability and the back-off coefficient, which are the appearance probability values used to calculate the language score, are set to the minimum value, namely 0. The total score of a candidate pattern containing a keyword-similar word therefore becomes 0, and such a pattern is never chosen as the recognition result. As a result, a candidate pattern containing the keyword is more readily obtained as the recognition result, and the keyword in the utterance can be extracted with high accuracy.

[Embodiment]
Next, a specific embodiment of a speech recognition device to which the present invention is applied is described. In the following description, appearance probability values in the language model are given as logarithms.

(Language model)
FIG. 8 is an example of a 2-gram language model. The format of the language model in FIG. 8 is called the ARPA format and is widely used. An ARPA-format language model lists the number of 1-gram entries and the number of 2-gram entries at the top, followed by the 1-gram information and then the 2-gram information.

The 1-gram information lists, for each word and in this order, the logarithm of the appearance probability, the word itself, and the logarithm of the back-off coefficient. As described earlier, the back-off coefficient is a value used to output a 2-gram probability for a word combination that does not occur in the actual example sentences. The 2-gram information lists, in this order, the logarithm of each 2-gram probability, the preceding word, and the following word.
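To make the layout concrete, here is a small sketch of a reader for the 1-gram and 2-gram sections of an ARPA-style file. The sample model, its numbers, and the function name `parse_arpa` are illustrative assumptions; FIG. 8 itself is not reproduced in this text.

```python
def parse_arpa(text):
    """Parse the 1-gram and 2-gram sections of an ARPA-style model.

    Returns (unigrams, bigrams): unigrams maps word -> (logprob, backoff),
    bigrams maps (preceding word, following word) -> logprob.
    """
    unigrams, bigrams = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("ngram ") or line == "\\data\\":
            continue  # skip blanks and the header with the entry counts
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            section = int(line[1:line.index("-")])
            continue
        fields = line.split()
        if section == 1:
            # logprob, word, optional log back-off coefficient
            backoff = float(fields[2]) if len(fields) > 2 else 0.0
            unigrams[fields[1]] = (float(fields[0]), backoff)
        elif section == 2:
            # logprob, preceding word, following word
            bigrams[(fields[1], fields[2])] = float(fields[0])
    return unigrams, bigrams

# Hypothetical model; words carry the keyword flag suffix used in the text.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-1.2 mitakashi+1 -0.3
-0.8 no+0 -0.4
-2.5 mi+0 -0.2

\\2-grams:
-0.5 mitakashi+1 no+0

\\end\\
"""

unigrams, bigrams = parse_arpa(SAMPLE)
```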

(Language model creation process)
FIG. 9 is a block diagram of the language model creation unit. The language model creation process is described below with reference to FIG. 9.

The input to the language model creation unit is text data for language model training. This is text data in which as many example sentences as possible that might be spoken to the speech recognizer have been written out. FIG. 10(a) shows an example of the training text data.

The morphological analyzer 11 performs morphological analysis on the training text data and outputs analyzed text. "Morphological analysis" means decomposing a sentence into meaningful units such as parts of speech. FIG. 10(b) shows an example of the text after the language model training text data has been morphologically analyzed and segmented into words. A sentence-start symbol <s> is inserted at the beginning of each sentence and a sentence-end symbol </s> at the end. This is called the "analyzed text". In FIG. 10(b), the values "+1" and "+0" after each word are keyword flags: "+1" marks a keyword and "+0" a non-keyword.

The vocabulary extractor 12 extracts the vocabulary used in the analyzed text and generates a used-vocabulary list. FIG. 10(c) shows an example of the used-vocabulary list. The language model generator 13 creates a language model from the analyzed text and the used-vocabulary list. The model generated at this stage is a provisional language model.

The rewrite-target vocabulary list DB 15 lists all the vocabulary in the provisional language model whose 1-gram probability and back-off coefficient are to be set to 0 as described above, that is, all the keyword-similar words. FIG. 10(d) shows an example of the rewrite-target vocabulary list. In this example, the keyword-similar word, that is, the word whose 1-gram probability and back-off coefficient are to be set to 0, is "Mi+0" (ミ＋０).

The back-off & 1-gram probability forced rewriter 14 forcibly rewrites to 0 the 1-gram probabilities and back-off coefficients held by the provisional language model. Specifically, it sets to 0 the 1-gram probability and the back-off coefficient of each vocabulary entry listed in the rewrite-target vocabulary list DB 15. The language model output here is called the "countermeasure language model". This countermeasure language model is used in dictation speech recognition.
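The rewriting step can be sketched in a few lines of Python. This is an illustration under assumptions: the table layout, the function name `rewrite_model`, the sample values, and the constant `LOG_MIN = -99.0` (a stand-in for the minimum value in the log domain) are all hypothetical.

```python
LOG_MIN = -99.0  # assumed stand-in for the "minimum value" in the log domain

def rewrite_model(unigrams, rewrite_targets, log_min=LOG_MIN):
    """Return a copy of the 1-gram table with the target words suppressed.

    unigrams maps word -> (log 1-gram probability, log back-off coefficient).
    Words in rewrite_targets get both values replaced by log_min.
    """
    patched = dict(unigrams)
    for word in rewrite_targets:
        if word in patched:
            patched[word] = (log_min, log_min)
    return patched

# Hypothetical provisional model and rewrite-target list.
provisional = {
    "mitakashi+1": (-1.2, -0.3),
    "no+0": (-0.8, -0.4),
    "mi+0": (-2.5, -0.2),
}
countermeasure = rewrite_model(provisional, ["mi+0"])
```

Returning a patched copy keeps the provisional model intact, which makes it easy to compare the two tables entry by entry.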

FIG. 11(a) shows an example of a provisional language model. FIG. 11(b) shows the countermeasure language model obtained by rewriting this provisional language model using the information in the rewrite-target vocabulary list of FIG. 10(d). In the countermeasure language model of FIG. 11(b), the 1-gram probability and the back-off coefficient for the word "Mi+0", which is listed in the rewrite-target vocabulary list, differ from those in the provisional language model (see box 92). Because the ARPA format used in this embodiment gives appearance probability values as logarithms, the 1-gram probability and the back-off coefficient for the word "Mi+0" are rewritten not to 0 but to the minimum value, that is, the negative value with the largest absolute value.

The rewrite-target vocabulary list DB 15 and the back-off & 1-gram probability forced rewriter 14 are the characteristic components of the present invention. That is, the keyword-similar words in a language model containing keywords and non-keywords are prepared as a rewrite-target vocabulary list, and the back-off & 1-gram probability forced rewriter 14 refers to this list and rewrites the 1-gram probability and the back-off coefficient of each keyword-similar word to the minimum value. The countermeasure language model used in the present invention is created in this way. The created countermeasure language model is used as the language model in FIG. 1.

(Dictation process)
Next, the dictation process is described. FIG. 12 is a block diagram showing the configuration of the dictation unit 20. The dictation unit 20 includes a speech segment detection unit 21, a feature parameter calculation unit 22, and a matching processing unit 23.

"Utterance data" refers to an input signal that contains speech. For example, for a speech recognition device installed in a car navigation device, the utterance data is the input signal recorded from the microphone during a fixed period after the user presses the talk button.

The speech segment detection unit 21 detects the speech segment in the utterance data and outputs the speech data within that segment. "Speech data" thus refers to only the portion of the utterance data that corresponds to speech.

The feature parameter calculation unit 22 divides the speech data detected by the speech segment detection unit 21 into unit-time frames, calculates feature parameters for each frame, and supplies them to the matching processing unit 23.

The dictionary DB 26 stores the words that can be recognized, specifically the words that appear in the analyzed text. Because the speech recognition method of the present invention assumes that keywords are extracted from the recognition result of the dictation process, keywords must be distinguished from non-keywords. The dictionary DB 26 therefore stores, for each registered word, flag data that identifies it as a keyword or a non-keyword. That is, among the words registered in the dictionary DB 26, keywords carry the keyword flag "+1" and non-keywords carry the keyword flag "+0".

The acoustic model DB 25 stores the acoustic features of sub-word units (syllables, phonemes, and so on), each expressed as feature parameters.

The language model DB 24 is an N-gram language model corresponding to the combinations of words registered in the dictionary. In the present invention, the countermeasure language model illustrated in FIG. 11(b) is used as this language model.

The matching processing unit 23 predicts the content of the utterance using the language model DB 24, the acoustic model DB 25, and the dictionary DB 26.

Next, the dictation operation is described with reference to FIG. 12.

First, when utterance data is input, the speech segment detection unit 21 detects the speech data in it. The feature parameter calculation unit 22 then divides the speech data into unit-time frames and calculates feature parameters for each frame.

Next, the matching processing unit 23 performs matching, in which the feature parameters obtained for each unit time are applied to the language model DB 24, the acoustic model DB 25, and the dictionary DB 26 to output a recognition result.

Specifically, the matching processing unit 23 searches, in time order from the start of the speech data, for the combination of words registered in the dictionary DB 26 that best fits the speech data. This search produces a plurality of candidate patterns. The search also applies pruning: combinations that score poorly after matching from the start of the speech data up to an intermediate point are excluded from further matching.

The matching processing unit 23 calculates the acoustic score and the language score for each of the candidate patterns to obtain its total score, and outputs the candidate pattern with the largest total score as the recognition result. The recognition result output here is a sentence composed of a combination of words registered in the dictionary DB 26. In this embodiment, as illustrated in FIG. 11(b), the 1-gram probability and the back-off coefficient of each keyword-similar word are set to the minimum value in the countermeasure language model, so the total score of any candidate pattern containing a keyword-similar word takes the minimum value. As a result, a recognition result from which the keyword can be extracted is obtained.

In the above configuration, the acoustic model DB 25 corresponds to the acoustic model storage unit of the present invention, the language model DB 24 to the language model storage unit, and the dictionary DB 26 to the dictionary storage unit. The dictation unit 20, and in particular the matching processing unit 23, functions as the candidate pattern creation unit, the score calculation unit, and the recognition result determination means of the present invention.

(Speech recognition processing)
FIG. 13 shows the speech recognition processing according to the present invention. As described above, the recognition result produced by the dictation unit 20 is supplied to the keyword extraction unit 30. The keyword extraction unit 30 thus obtains the recognition result
"Mitakashi+1 No+0 Shougakkou+0" (ミタカシ＋１ノ＋０ショウガッコウ＋０).
The number appended to each word is the keyword flag. As described above, "1" is assigned to keywords and "0" to non-keywords.

The keyword extraction unit 30 refers to the dictionary DB 26, extracts the keyword from the words contained in the candidate pattern output as the recognition result, and supplies it to an operation control unit that uses the keyword as a command or the like. For example, when the speech recognition device of this embodiment is applied to a navigation device, the keyword extracted by the keyword extraction unit 30 is sent to the operation control unit of the navigation device as an operation command.
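The extraction step amounts to a dictionary lookup on the recognized words. A minimal sketch follows, assuming the dictionary is available as a mapping from word to keyword flag; the function name `extract_keywords` and the romanized entries are illustrative only.

```python
def extract_keywords(recognized_words, dictionary):
    """Return the words of the recognition result flagged as keywords.

    dictionary maps word -> keyword flag (1 for keyword, 0 for non-keyword).
    Words missing from the dictionary are treated as non-keywords.
    """
    return [w for w in recognized_words if dictionary.get(w, 0) == 1]

# Hypothetical dictionary with keyword flags.
dictionary = {"mitakashi": 1, "no": 0, "shougakkou": 0, "mi": 0, "takashino": 0}

result_1 = extract_keywords(["mitakashi", "no", "shougakkou"], dictionary)
result_2 = extract_keywords(["mi", "takashino", "shougakkou"], dictionary)
```

An empty result, as for the second word sequence, corresponds to the case in which the device would have to ask the user to speak again.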

(Language score calculation method)
Next, the method of calculating the language score using an N-gram language model is described in detail. In the following description, the section of the utterance that the word w_i in a candidate pattern matches is simply called the "section of w_i". The acoustic score for the section of w_i is called the "acoustic score of w_i". The product of the acoustic score and the language score is called the total score.

The language score is assumed to be calculated under the following conditions.

(1) An N-gram language model is used.

(2) The candidate pattern contains R words (N < R).

(3) The candidate pattern consists of w_1, …, w_R.

(w_i denotes the i-th word contained in the candidate pattern.)
The total score S(X) for a candidate pattern X is expressed by the following equation (1).
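The formula image for equation (1) did not survive in this text. Based on the surrounding definitions (the total score is the product of the acoustic scores and the language score, and the language score is built from N-gram probabilities), one plausible reconstruction is given below. The symbol P_AM(w_i) for the acoustic score of w_i is an assumed notation, and the exact index ranges in the original may differ.

```latex
S(X) = \prod_{i=1}^{R} P_{\mathrm{AM}}(w_i)
       \times \prod_{j=1}^{R} P\left(w_j \mid w_{j-N+1}, \ldots, w_{j-1}\right)
\tag{1}
```

In this reading, for j < N the history is shorter than N−1 words and is truncated at the start of the candidate pattern.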

P(w_j | w_{j-N+1}, …, w_{j-1}) is the N-gram probability for the word string w_{j-N+1}, …, w_j contained in the candidate pattern, and is obtained from the N-gram language model.

The flow for obtaining the N-gram probability of a word string w_1, …, w_n from the N-gram language model is described with reference to FIG. 14.

First, step S201 checks whether an N-gram probability exists for the word string w_1, …, w_n. If it exists, this N-gram probability is output.

If it does not exist, steps S202 and S203 check whether the (N-1)-gram probability for the word string w_2, …, w_n and the (N-1)-gram probability for the word string w_1, …, w_{n-1} exist. If both exist, the product of the (N-1)-gram probability for w_2, …, w_n and the back-off coefficient for w_1, …, w_{n-1} is output as the N-gram probability.

If either one does not exist, the input word string is replaced with w_2, …, w_n and the existence of an (N-1)-gram probability is checked again.
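The lookup flow described for FIG. 14 can be sketched as a short loop. This is an illustrative reading of the prose, not the patent's implementation: the tables `prob` and `backoff`, their values, and the function name `ngram_probability` are assumptions, the check for the history entry is simplified to the presence of a back-off coefficient, and the fallback of 0.0 for a completely unknown word is not specified in the text.

```python
def ngram_probability(words, prob, backoff):
    """Look up the probability of the last word given the preceding ones.

    prob maps word tuples to n-gram probabilities; backoff maps word tuples
    to back-off coefficients. Both tables are hypothetical.
    """
    words = tuple(words)
    while words:
        if words in prob:  # step S201: the n-gram itself exists
            return prob[words]
        shorter, history = words[1:], words[:-1]
        # steps S202/S203: lower-order entry and back-off coefficient exist
        if shorter in prob and history in backoff:
            return prob[shorter] * backoff[history]
        words = shorter  # otherwise retry with w_2 .. w_n
    return 0.0  # assumed fallback when nothing matches

prob = {
    ("mitakashi", "no"): 0.2,
    ("takashino",): 0.002,
    ("mi",): 0.01,
    ("no",): 0.05,
}
backoff = {("mi",): 0.4}

p_seen = ngram_probability(["mitakashi", "no"], prob, backoff)
p_backoff = ngram_probability(["mi", "takashino"], prob, backoff)
p_retry = ngram_probability(["takashino", "no"], prob, backoff)
```

The three calls exercise the three branches: a stored 2-gram, a backed-off 2-gram, and a retry with the shortened word string when no back-off coefficient is available.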

(Effects of this embodiment)
In general, a dictation unit outputs a combination of words registered in the dictionary as the recognition result. Depending on the utterance, several correct answers are therefore conceivable. In addition, when a context-dependent acoustic model is used, the acoustic score changes with the word combination.

Take the aforementioned "Mitaka City elementary school" (三鷹市の小学校) as an example. Assume that "Mitaka City" is a keyword and that "Takashino" and "Mi" are also registered in the dictionary. In this case, the following two candidate patterns are conceivable as correct answers.

Candidate pattern 1: Mitakashi+1 No+0 Shougakkou+0
Candidate pattern 2: Mi+0 Takashino+0 Shougakkou+0
When candidate pattern 1 is taken as the recognition result, the keyword "Mitaka City" can be extracted. A device to which the speech recognition device of the present invention is applied, such as a navigation device, can therefore extract "Mitaka City" as a keyword. When candidate pattern 2 is taken as the recognition result, however, the recognition result contains no keyword, and the device cannot be operated.

Here, suppose that in the analyzed text
・no sentence has "Takashino" following "Mi", and
・some sentence has "no" following "Mitaka City".
Under these conditions the language score is basically higher for candidate pattern 1. Depending on the utterance, however, the acoustic score may be higher for candidate pattern 2. As a result, the total score of candidate pattern 2 may also come out higher. In that case, "Takashino Elementary School" might be recognized even though "Mitaka City" should be recognized.

A specific example is explained using the language model of FIG. 8. The values of the language model shown in FIG. 8 are logarithms.

First, the log language scores P_LM1 and P_LM2 of candidate pattern 1 and candidate pattern 2 are calculated.

Here, candidate pattern 2 has no 2-gram probability for "Takashino" following "Mi". This 2-gram probability is therefore the product of the back-off coefficient of "Mi" and the 1-gram probability of "Takashino" (a sum in the log domain). The language score itself is thus higher for candidate pattern 1. However, there are cases in which the acoustic score is higher for candidate pattern 2. For example, suppose the log acoustic scores P_AM1 and P_AM2 of candidate pattern 1 and candidate pattern 2 are P_AM1 = −20.0 and P_AM2 = −19.0, respectively. In this case, the respective log total scores P_R1 and P_R2 are:

With these values, candidate pattern 2, which does not occur in the analyzed text, ends up with the higher total score, so the keyword cannot be extracted.

By contrast, FIG. 15 shows an example of a 2-gram language model created according to the present invention. As described above, the back-off coefficient and the 1-gram probability of the keyword-similar word are set to the minimum value (here, the negative value with the largest absolute value). Specifically, the back-off coefficient and the 1-gram probability of the non-keyword "Mi", which forms part of the correct keyword, are set to the minimum value.

After this countermeasure is applied, the log language score P_LM2' of candidate pattern 2 is calculated.

The log total score P_R2' then becomes:

Candidate pattern 1 does not use a back-off coefficient, so its log total score remains P_R1 = −33.77021. Therefore, when the countermeasure language model created according to the present invention is used, candidate pattern 1 is taken as the recognition result and the keyword "Mitaka City" is correctly extracted.
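The numeric results of the missing equations cannot be recovered here, but the values quoted in the text already constrain them. As a worked check, assuming that log scores simply add with no extra weighting and using only P_AM1 = −20.0, P_AM2 = −19.0 and P_R1 = −33.77021:

```latex
\begin{align*}
P_{R1} = P_{AM1} + P_{LM1}
  &\;\Rightarrow\; P_{LM1} = -33.77021 - (-20.0) = -13.77021 \\
P_{R2} = P_{AM2} + P_{LM2} > P_{R1}
  &\;\Longleftrightarrow\; P_{LM2} > -33.77021 - (-19.0) = -14.77021
\end{align*}
```

So before the countermeasure, candidate pattern 2 wins whenever its log language score is less than 1.0 below that of candidate pattern 1. Once the minimum value is substituted for the back-off coefficient of "Mi", P_LM2' falls far below this threshold.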

[Field of application]
The present invention can be applied to various devices that perform speech recognition processing, for example devices with a voice input function such as car navigation devices, mobile phones, personal computers, AV equipment, and home appliances.

FIG. 1 is a schematic explanatory diagram of a speech recognition method using a language model.
FIG. 2 shows an example of a context-dependent acoustic model as one type of acoustic model.
FIG. 3 shows a configuration example of a 2-gram language model.
FIG. 4 shows an example of language score calculation.
FIG. 5 shows an example of language score calculation.
FIG. 6 is a diagram explaining how the total score is calculated.
FIG. 7 shows a configuration example of a 2-gram language model.
FIG. 8 shows an example of a language model.
FIG. 9 is a block diagram showing the language model creation process.
FIG. 10 shows examples of training text, analyzed text, a used-vocabulary list, and a rewrite-target vocabulary list.
FIG. 11 shows an example of a language model.
FIG. 12 is a block diagram showing the configuration of the dictation unit.
FIG. 13 shows the flow of the speech recognition processing.
FIG. 14 is a flowchart showing the language score calculation method.
FIG. 15 shows an example of a language model.

Explanation of symbols

11 Morphological analyzer
12 Used-vocabulary extractor
13 Language model generator
14 Back-off & 1-gram probability forced rewriter
15 Rewrite-target vocabulary list DB
20 Dictation unit
21 Speech segment detection unit
22 Feature parameter calculation unit
23 Matching processing unit
24 Language model DB
25 Acoustic model DB
26 Dictionary DB

Claims

A speech recognition apparatus comprising:
an acoustic model storage unit that stores an acoustic model;
a dictionary storage unit that stores a dictionary having a plurality of words including a keyword;
a language model storage unit that stores a language model having, for each of a plurality of words, an appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word;
a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words;
a score calculation unit that, for each of the candidate patterns, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score;
a recognition result determination means for determining, as a recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and
a keyword extraction means for referring to the dictionary and extracting a keyword from the candidate pattern determined as the recognition result,
wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that minimize the language score of a candidate pattern including the keyword-similar word.

The speech recognition apparatus according to claim 1, wherein the score calculation unit calculates the product of the acoustic score and the language score as the total score, and
the appearance probability of the word string and the coefficient for the keyword-similar word are zero.

The speech recognition apparatus according to claim 2, wherein the language model is an N-gram language model, the appearance probability of the word string is a 1-gram probability, and the coefficient is a back-off coefficient.
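The effect of setting the 1-gram probability and the back-off coefficient of a keyword-similar word to zero can be sketched with a toy back-off 2-gram model. This is only an illustration under assumptions: all probabilities and coefficients below are invented, the function names are hypothetical, and it is assumed that no explicit 2-gram entry contains the keyword-similar word.

```python
def bigram_prob(prev, word, unigram, backoff, bigram):
    """P(word | prev) under a back-off 2-gram model."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    # Unseen pair: back off to the 1-gram probability of `word`,
    # scaled by the back-off coefficient of `prev`.
    return backoff.get(prev, 0.0) * unigram.get(word, 0.0)


def language_score(words, unigram, backoff, bigram):
    """1-gram probability of the first word times the 2-gram
    probability of each following word."""
    score = unigram.get(words[0], 0.0)
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word, unigram, backoff, bigram)
    return score


def force_zero(word, unigram, backoff):
    """Force the 1-gram probability and back-off coefficient of `word` to zero."""
    unigram[word] = 0.0
    backoff[word] = 0.0


# Toy values, invented for illustration only.
unigram = {"三鷹市": 0.2, "三鷹": 0.1, "市": 0.1, "の": 0.3}
backoff = {"三鷹市": 0.5, "三鷹": 0.4, "市": 0.5, "の": 0.5}
bigram = {("三鷹市", "の"): 0.6}

similar_first = ["三鷹", "市", "の"]   # keyword-similar word at the start
similar_middle = ["の", "三鷹", "市"]  # keyword-similar word after another word

before = language_score(similar_first, unigram, backoff, bigram)
force_zero("三鷹", unigram, backoff)  # "三鷹" treated as keyword-similar
after_first = language_score(similar_first, unigram, backoff, bigram)
after_middle = language_score(similar_middle, unigram, backoff, bigram)
keyword_score = language_score(["三鷹市", "の"], unigram, backoff, bigram)
```

In this sketch the zero 1-gram probability removes any backed-off transition into the keyword-similar word, and the zero back-off coefficient removes any backed-off transition out of it, so a candidate containing it has a language score of zero while the keyword candidate is unaffected.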

The speech recognition apparatus according to any one of claims 1 to 3, wherein the keyword-similar word includes a word that forms part of the keyword and a word that includes part or all of the keyword.
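One way to build a list of keyword-similar words along these lines is sketched below. It is a rough illustration only: comparing surface character strings, the `min_len` overlap threshold, and the function names are assumptions that do not appear in the text.

```python
def shares_part(word, keyword, min_len=2):
    """True if some substring of `keyword` with at least `min_len`
    characters occurs in `word` (covers 'includes part or all')."""
    return any(
        keyword[i:i + min_len] in word
        for i in range(len(keyword) - min_len + 1)
    )


def is_keyword_similar(word, keywords, min_len=2):
    """True if `word` is not itself a keyword but forms part of a keyword
    or includes part or all of a keyword."""
    if not word or word in keywords:
        return False
    return any(
        word in kw or shares_part(word, kw, min_len)
        for kw in keywords
    )


keywords = {"三鷹市"}  # hypothetical keyword set
vocabulary = ["三鷹市", "三鷹", "市", "三鷹市役所", "三鷹駅", "地図"]
rewrite_targets = [w for w in vocabulary if is_keyword_similar(w, keywords)]
```

The resulting `rewrite_targets` list plays the role of a rewrite-target vocabulary list: the words whose 1-gram probability and back-off coefficient would be forced to zero.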

A speech recognition method using an acoustic model, a dictionary having a plurality of words including a keyword, and a language model having, for each of a plurality of words, an appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word, the method comprising:
a candidate pattern creation step of creating, from utterance content, candidate patterns each composed of a combination of a plurality of words;
a score calculation step of, for each of the candidate patterns, calculating an acoustic score based on the acoustic model and a language score based on the language model, and calculating a total score based on the acoustic score and the language score;
a recognition result determination step of determining, as a recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and
a keyword extraction step of referring to the dictionary and extracting a keyword from the candidate pattern determined as the recognition result,
wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that minimize the language score of a candidate pattern including the keyword-similar word.

A speech recognition program executed by a computer, the program causing the computer to function as:
an acoustic model storage unit that stores an acoustic model;
a dictionary storage unit that stores a dictionary having a plurality of words including a keyword;
a language model storage unit that stores a language model having, for each of a plurality of words, an appearance probability of a word string including the word and a coefficient for calculating the appearance probability of a word string including the word;
a candidate pattern creation unit that creates, from utterance content, candidate patterns each composed of a combination of a plurality of words;
a score calculation unit that, for each of the candidate patterns, calculates an acoustic score based on the acoustic model and a language score based on the language model, and calculates a total score based on the acoustic score and the language score;
a recognition result determination means for determining, as a recognition result, the candidate pattern having the maximum total score among the plurality of candidate patterns; and
a keyword extraction means for referring to the dictionary storage unit and extracting a keyword from the candidate pattern determined as the recognition result,
wherein, in the language model, the appearance probability of the word string and the coefficient for a keyword-similar word, which is a non-keyword similar to a keyword, have values that minimize the language score of a candidate pattern including the keyword-similar word.

A recording medium on which the speech recognition program according to claim 6 is recorded.