JP2003345383A

JP2003345383A - Method, device, and program for voice recognition

Info

Publication number: JP2003345383A
Application number: JP2002152645A
Authority: JP
Inventors: Hajime Kobayashi; 載小林; Soichi Toyama; 聡一外山
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2002-05-27
Filing date: 2002-05-27
Publication date: 2003-12-03
Anticipated expiration: 2022-05-27
Also published as: JP4226273B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that reduces the amount of similarity calculation in matching processing so as to recognize voice quickly and accurately.

SOLUTION: A voice recognition device 100 is provided with a voice analysis part 105 that extracts the feature quantity of a voice signal for each frame; a keyword model database 106 that stores keyword HMMs representing the feature quantity patterns of a plurality of keywords to be recognized; a similarity calculation part 107 that calculates the similarity of the feature quantity of each frame on the basis of the extracted feature quantity of each frame and the keyword HMMs; a matching processing part 108 that performs matching processing on the basis of the calculated similarity between each frame and each keyword HMM and preset similarities of unnecessary words, which do not constitute keywords; and a discrimination part 109 that discriminates the keyword included in the spoken voice on the basis of the matching processing.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD: The present invention belongs to the technical field of voice recognition using the HMM (Hidden Markov Models) method and, more specifically, to the technical field of recognizing keywords in uttered voice.

[0002]

DESCRIPTION OF THE RELATED ART: Voice recognition devices that recognize speech uttered by a person have been developed. When a person utters a predetermined phrase, such a device recognizes the phrase from the input signal.

[0003] When such a voice recognition device is applied to various devices, such as a vehicle-mounted navigation device or a personal computer, the device can accept various kinds of information without manual operation of a keyboard or switches.

[0004] Therefore, even in a working environment in which a person needs both hands, such as using a navigation device while driving an automobile, the operator can input desired information into the device.

[0005] A typical example of such voice recognition is a method that uses a probabilistic model called an HMM (Hidden Markov Model) (hereinafter simply referred to as voice recognition).

[0006] This voice recognition works by matching the feature quantity pattern of the uttered voice against the feature quantity patterns of speech representing recognition candidate phrases prepared in advance as keywords (hereinafter referred to as recognition target words (keywords)).

[0007] Specifically, this voice recognition analyzes the uttered voice (input signal) input at each predetermined time interval and extracts its feature quantity. It then calculates the degree of matching between the feature quantity of the input signal and the feature quantity data of each recognition target word represented by an HMM stored in a database in advance, that is, the probability that the feature quantity of the input signal corresponds to that feature quantity data (hereinafter referred to as similarity). The similarities are accumulated over the whole uttered voice, and the recognition target word with the highest accumulated similarity is fixed as the recognition result.
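
The accumulate-and-select step described in [0007] can be sketched as follows. This is a minimal illustration, not the method of any particular recognizer: it assumes the per-frame similarities for each recognition target word have already been computed (a real HMM decoder would obtain them through state alignment), and the function name `recognize` and the example words are hypothetical. Similarities are summed in the log domain so that products of many small probabilities do not underflow.

```python
import math


def recognize(frame_similarities):
    """Pick the recognition target word with the highest accumulated similarity.

    frame_similarities maps each word to a list of per-frame similarities,
    each a probability in the range (0, 1]. (Hypothetical input format.)
    """
    totals = {}
    for word, similarities in frame_similarities.items():
        # Accumulate over all frames of the utterance in the log domain.
        totals[word] = sum(math.log(p) for p in similarities)
    best_word = max(totals, key=totals.get)
    return best_word, totals


# Illustrative values only; they do not come from the source.
best, scores = recognize({
    "tokyo": [0.9, 0.8, 0.7],
    "kyoto": [0.2, 0.3, 0.9],
})
```

Here `best` is the word whose accumulated log similarity is largest, which corresponds to the "highest accumulated similarity" criterion in the text.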

[0008] As a result, this voice recognition can recognize a predetermined phrase from the input signal, which is the uttered voice.

[0009] An HMM is a statistical signal source model expressed as a set of states with transitions between them, and it represents the feature quantities of speech to be recognized, such as keywords. The HMM is generated from multiple speech data samples collected in advance.

[0010] In such voice recognition, how to extract the keyword portion, that is, the recognition target word contained in the uttered voice, is important.

[0011] In addition to keywords, uttered voice usually contains unnecessary words, which are known in advance and are not needed for recognition (words such as "er" or "desu" added before and after the recognition target word). In that case the uttered voice basically consists of unnecessary words and a keyword sandwiched between them.

[0012] Conventionally, voice recognition is often performed by a technique called word spotting, which recognizes the keywords that are the targets of recognition (hereinafter simply referred to as word-spotting voice recognition).

[0013] In word-spotting voice recognition, HMMs representing models of unnecessary words (hereinafter referred to as garbage models) are prepared in addition to HMMs representing keyword models, and the uttered voice to be recognized is recognized by finding the keyword model, garbage model, or combination of them whose feature quantities have the highest similarity.

[0014] That is, word-spotting voice recognition identifies, on the basis of the accumulated similarity, the keyword model, garbage model, or combination of them with the highest feature quantity similarity and, when the uttered voice contains a keyword, outputs that keyword as the recognition result.

[0015] When voice recognition is performed in this way, one method of constructing the unnecessary word model is to use a probabilistic model called a filler model (hereinafter simply referred to as a filler model).

[0016] As shown in FIG. 5, a filler model expresses as a network the connection relationships of all vowels and consonants that can be connected, so as to model all speech. To realize word spotting with this filler model, a filler model must be connected before and after each keyword model.

[0017] That is, the filler model covers every recognizable pattern. Specifically, it determines the sequence of phonemes in the uttered voice by matching the feature quantity of the uttered voice against the feature quantity of each phoneme, and it recognizes the unnecessary word on the basis of the cumulative similarity (cumulative likelihood) obtained by accumulating the similarity of each subword, a unit that makes up a word such as a phone, phoneme, or syllable, over every connection pattern that is possible in the language.
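
As a rough sketch of why a filler model is costly, the cumulative similarity of an unconstrained filler network can be approximated by taking, for every frame, the best-scoring subword unit and summing those scores. The sketch assumes one log-likelihood per unit per frame and ignores transition probabilities and unit durations; `filler_cumulative_log_likelihood` and the unit labels are hypothetical names used only for illustration.

```python
def filler_cumulative_log_likelihood(frame_unit_log_likelihoods):
    """Approximate the cumulative likelihood of a free subword loop.

    frame_unit_log_likelihoods is a list with one dict per frame, mapping
    each subword unit (e.g. a syllable label) to its log-likelihood.
    Every unit must be scored for every frame, which is the expensive part.
    """
    total = 0.0
    for unit_scores in frame_unit_log_likelihoods:
        # Any unit may follow any other, so the best unit is taken per frame.
        total += max(unit_scores.values())
    return total


# Two frames, two candidate units; the values are illustrative only.
frames = [
    {"a": -1.0, "ka": -2.0},
    {"a": -3.0, "ka": -0.5},
]
filler_score = filler_cumulative_log_likelihood(frames)
```

Even in this simplified form, the work grows with the number of frames multiplied by the number of subword units.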

[0018]

PROBLEMS TO BE SOLVED BY THE INVENTION: However, to recognize unnecessary words, such a voice recognition device matches the feature quantity of the uttered voice against the feature quantity data of every element, such as a phoneme, that can make up an unnecessary word. The amount of calculation therefore becomes enormous and places a heavy load on processing.

[0019] Specifically, the matching process calculates the similarity (likelihood), which indicates how closely the feature quantity of the uttered voice resembles the feature quantity data that can make up an unnecessary word, and judges the unnecessary word whose feature quantity data gives the highest similarity to be the one to recognize. With a Japanese filler model, the similarity between the feature quantity of the uttered voice and the feature quantity data of every syllable, in the a-row, ka-row, sa-row, ta-row, and so on, must therefore be calculated.

[0020] Consequently, the voice recognition device described above has the problem that calculating the similarity for each item of feature quantity data makes the amount of calculation enormous and burdens processing.

[0021] In particular, the voice recognition device described above assumes a language model in which filler models for recognizing unnecessary words are connected before and after the keyword to be recognized. Matching between the feature quantity of the uttered voice and each item of feature quantity data is therefore performed both before and after the keyword, which requires even more calculation.
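
The extra cost can be made concrete with a small count of similarity evaluations. The figures below are illustrative assumptions only (200 frames, 30 keyword HMM states, 100 filler subword units); the source gives no such numbers, and `similarity_evaluations` is a hypothetical helper.

```python
def similarity_evaluations(num_frames, keyword_states, filler_units=0):
    """Count per-frame similarity evaluations for one utterance.

    Each frame is scored against every keyword HMM state and, when a
    filler model is used, against every filler subword unit as well.
    """
    return num_frames * (keyword_states + filler_units)


# Assumed sizes, chosen only to show the order of magnitude.
with_filler = similarity_evaluations(200, 30, filler_units=100)
without_filler = similarity_evaluations(200, 30)
```

Under these assumed sizes the filler units account for most of the evaluations, which is the load the invention aims to remove.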

[0022] The present invention has been made in view of the above problems, and its object is to provide a voice recognition device that reduces the amount of similarity calculation in the matching process and performs voice recognition quickly and accurately.

[0023]

MEANS FOR SOLVING THE PROBLEMS: To solve the above problems, the invention according to claim 1 is a voice recognition device that recognizes a keyword contained in uttered voice, comprising: extraction means for extracting an uttered voice feature quantity, which is the feature quantity of the speech component of the uttered voice, by analyzing the uttered voice; first storage means for storing in advance keyword feature quantity data representing the feature quantities of the speech components of one or more keywords; calculation means for calculating, on the basis of the uttered voice feature quantity extracted from at least a part of the voice section of the uttered voice and the keyword feature quantity data stored in the storage means, each keyword probability, which indicates the probability that the uttered voice feature quantity is the keyword; second storage means for storing a preset value as an unnecessary word probability, which indicates the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute a keyword; and determination means for determining the keyword to be recognized contained in the uttered voice on the basis of each calculated keyword probability and the unnecessary word probability.

[0024] With this configuration, the invention according to claim 1 calculates the keyword probability, which indicates that the uttered voice feature quantity is the keyword represented by each item of keyword feature quantity data, and determines the keyword to be recognized contained in the uttered voice on the basis of the calculated keyword probabilities and the preset unnecessary word probability.

[0025] Therefore, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the unnecessary word similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly.
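
The idea of replacing the computed unnecessary word similarity with a preset value can be sketched as below. This is only an illustration under stated assumptions: the keyword probabilities for a voice section are taken as already computed and expressed as log probabilities, and the constant `PRESET_UNNECESSARY_WORD_LOG_PROB`, its value, and the function `determine_keyword` are hypothetical and not taken from the source.

```python
import math

# Hypothetical preset value standing in for the stored unnecessary word
# probability; the source does not state a concrete number.
PRESET_UNNECESSARY_WORD_LOG_PROB = math.log(0.3)


def determine_keyword(keyword_log_probs,
                      unnecessary_log_prob=PRESET_UNNECESSARY_WORD_LOG_PROB):
    """Return the best keyword, or None if the section looks like an
    unnecessary word.

    keyword_log_probs maps each keyword to the log probability that the
    voice section is that keyword. No unnecessary word model is evaluated;
    the preset constant is used instead.
    """
    best = max(keyword_log_probs, key=keyword_log_probs.get)
    if keyword_log_probs[best] > unnecessary_log_prob:
        return best
    return None


hit = determine_keyword({"tokyo": math.log(0.6), "kyoto": math.log(0.1)})
miss = determine_keyword({"tokyo": math.log(0.2), "kyoto": math.log(0.1)})
```

The only per-section work is scoring the keywords; the comparison against the stored constant costs a single lookup.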

[0026] The invention according to claim 2 is the voice recognition device according to claim 1, in which the extraction means analyzes the uttered voice at each preset unit time to extract the uttered voice feature quantity, and the unnecessary word probability stored in advance in the storage means indicates the unnecessary word probability per unit time; the calculation means calculates each keyword probability on the basis of the uttered voice feature quantity extracted for each unit time; and the determination means determines the keyword to be recognized contained in the uttered voice on the basis of each calculated keyword probability and the unnecessary word probability per unit time.

[0027] With this configuration, the invention according to claim 2 calculates, for each unit time, the keyword probability indicating that the uttered voice feature quantity is the keyword represented by the keyword feature quantity data, and determines the keyword to be recognized contained in the uttered voice on the basis of the calculated keyword probabilities and the preset unnecessary word probability per unit time.

[0028] Therefore, the keyword to be recognized can be determined by calculating the similarity for each speech sound of the uttered voice expressed in subword units, which are units that make up a word, such as phoneme units or speech units. In addition, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly and erroneous recognition can be prevented.

[0029] The invention according to claim 3 is the voice recognition device according to claim 2, in which the determination means calculates, on the basis of each calculated keyword probability and the unnecessary word probability per unit time, a combination probability indicating the probability of each combination of the unnecessary word and each keyword represented by each item of keyword feature quantity data stored in the first storage means, and determines the keyword to be recognized contained in the uttered voice on the basis of the combination probabilities.

[0030] With this configuration, the invention according to claim 3 calculates, from the keyword probabilities and the unnecessary word probability calculated for each unit time according to each combination of an unnecessary word and a keyword represented by the keyword feature quantity data, the combination probability of each such combination, and determines the keyword to be recognized contained in the uttered voice on the basis of the combination probabilities.
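
One way to picture the combination probability is as the best score of an "unnecessary word, keyword, unnecessary word" layout over the frames of an utterance, where every frame outside the keyword segment contributes the preset per-unit-time unnecessary word value. The sketch below is a simplified illustration under assumptions: each keyword is given one log probability per frame (a real HMM decoder would align frames to states, typically with the Viterbi algorithm), the search over segments is brute force, and `best_combination` and all values are hypothetical.

```python
def best_combination(keyword_frame_log_probs, unnecessary_log_prob):
    """Find the keyword and segment with the highest combination score.

    keyword_frame_log_probs maps each keyword to a list of per-frame log
    probabilities. Frames inside [start, end) are scored by the keyword;
    all other frames are scored by the preset unnecessary word value.
    Returns (keyword, (score, start, end)).
    """
    results = {}
    for word, log_probs in keyword_frame_log_probs.items():
        n = len(log_probs)
        best = None
        for start in range(n):
            segment_score = 0.0
            for end in range(start + 1, n + 1):
                segment_score += log_probs[end - 1]
                outside_frames = n - (end - start)
                total = unnecessary_log_prob * outside_frames + segment_score
                if best is None or total > best[0]:
                    best = (total, start, end)
        results[word] = best
    winner = max(results, key=lambda w: results[w][0])
    return winner, results[winner]


# Four frames; the middle two match "tokyo" well. Illustrative values only.
word, (score, start, end) = best_combination(
    {
        "tokyo": [-3.0, -0.2, -0.1, -3.0],
        "kyoto": [-3.0, -2.0, -2.0, -3.0],
    },
    unnecessary_log_prob=-1.0,
)
```

Because the unnecessary word contribution is a constant per frame, the search never evaluates an unnecessary word model; only the keyword scores vary.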

[0031] Therefore, the keyword contained in the uttered voice can be determined while taking into account each combination of the unnecessary word similarity and each calculated keyword similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly and erroneous recognition can be prevented.

[0032] The invention according to claim 4 is a voice recognition method for recognizing a keyword contained in uttered voice, including: an extraction step of extracting an uttered voice feature quantity, which is the feature quantity of the speech component of the uttered voice, by analyzing the uttered voice; a first acquisition step of acquiring in advance keyword feature quantity data representing the feature quantities of the speech components of one or more keywords; a calculation step of calculating, on the basis of the uttered voice feature quantity extracted from at least a part of the voice section of the uttered voice and the keyword feature quantity data stored in the storage means, each keyword probability, which indicates the probability that the uttered voice feature quantity is the keyword; a second acquisition step of acquiring a preset value as an unnecessary word probability, which indicates the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute a keyword; and a determination step of determining the keyword to be recognized contained in the uttered voice on the basis of each calculated keyword probability and the unnecessary word probability.

[0033] With this configuration, the invention according to claim 4 calculates the keyword probability, which indicates that the uttered voice feature quantity is the keyword represented by each item of keyword feature quantity data, and determines the keyword to be recognized contained in the uttered voice on the basis of the calculated keyword probabilities and the preset unnecessary word probability.

[0034] Therefore, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the unnecessary word similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly.

[0035] The invention according to claim 5 is the voice recognition method according to claim 4, in which: in the extraction step, the uttered voice is analyzed at each preset unit time to extract the uttered voice feature quantity; in the calculation step, each keyword probability is calculated on the basis of the uttered voice feature quantity extracted for each unit time; in the second acquisition step, the unnecessary word probability per unit time is acquired in advance; and in the determination step, the keyword to be recognized contained in the uttered voice is determined on the basis of each calculated keyword probability and the unnecessary word probability per unit time.

[0036] With this configuration, the invention according to claim 5 calculates, for each unit time, the keyword probability indicating that the uttered voice feature quantity is the keyword represented by the keyword feature quantity data, and determines the keyword to be recognized contained in the uttered voice on the basis of the calculated keyword probabilities and the preset unnecessary word probability per unit time.

[0037] Therefore, the keyword to be recognized can be determined by calculating the similarity for each speech sound of the uttered voice expressed in subword units, which are units that make up a word, such as phoneme units or speech units. In addition, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly and erroneous recognition can be prevented.

[0038] The invention according to claim 6 is the voice recognition method according to claim 5, in which, in the determination step, a combination probability indicating the probability of each combination of the unnecessary word and each keyword represented by each item of acquired keyword feature quantity data is calculated on the basis of each calculated keyword probability and the unnecessary word probability per unit time, and the keyword to be recognized contained in the uttered voice is determined on the basis of the combination probabilities.

[0039] With this configuration, the invention according to claim 6 accumulates the keyword similarities and the unnecessary word similarity calculated for each unit time according to each combination of an unnecessary word and a keyword represented by the keyword feature quantity data to obtain a cumulative similarity, and determines the keyword to be recognized contained in the uttered voice on the basis of the cumulative similarity.

[0040] Therefore, the keyword contained in the uttered voice can be determined while taking into account each combination of the unnecessary word similarity and each calculated keyword similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly and erroneous recognition can be prevented.

The invention according to claim 7 is a voice recognition program that causes a computer to perform voice recognition processing for recognizing a keyword contained in uttered voice, the program causing the computer to function as: extraction means for extracting an uttered voice feature quantity, which is the feature quantity of the speech component of the uttered voice, by analyzing the uttered voice; first acquisition means for acquiring in advance keyword feature quantity data representing the feature quantities of the speech components of one or more keywords; calculation means for calculating, on the basis of the uttered voice feature quantity extracted from at least a part of the voice section of the uttered voice and the keyword feature quantity data stored in the storage means, each keyword probability, which indicates the probability that the uttered voice feature quantity is the keyword; second acquisition means for acquiring a preset value as an unnecessary word probability, which indicates the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute a keyword; and determination means for determining the keyword to be recognized contained in the uttered voice on the basis of each calculated keyword probability and the unnecessary word probability.

[0041] With this configuration, the invention according to claim 7 calculates, from the keyword probabilities and the unnecessary word probability calculated for each unit time according to each combination of an unnecessary word and a keyword represented by the keyword feature quantity data, the combination probability of each such combination, and determines the keyword to be recognized contained in the uttered voice on the basis of the combination probabilities.

[0042] Therefore, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the unnecessary word similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly.

[0043] The invention according to claim 8 is the voice recognition program according to claim 7, which causes the computer to function as: extraction means for analyzing the uttered voice at each preset unit time to extract the uttered voice feature quantity; calculation means for calculating each keyword probability on the basis of the uttered voice feature quantity extracted for each unit time; second acquisition means for acquiring in advance the unnecessary word probability per unit time; and determination means for determining the keyword to be recognized contained in the uttered voice on the basis of each calculated keyword probability and the unnecessary word probability per unit time.

[0044] With this configuration, the invention according to claim 8 calculates, for each unit time, the keyword probability indicating that the uttered voice feature quantity is the keyword represented by the keyword feature quantity data, and determines the keyword to be recognized contained in the uttered voice on the basis of the calculated keyword probabilities and the preset unnecessary word probability per unit time.

[0045] Therefore, the keyword to be recognized can be determined by calculating the similarity for each speech sound of the uttered voice expressed in subword units, which are units that make up a word, such as phoneme units or speech units. In addition, unnecessary words and keywords can be distinguished and the keyword to be recognized can be determined without calculating the feature characteristics between the uttered voice feature quantity and the feature quantity data of unnecessary words. This reduces the processing load of calculating the similarity, so that the keyword contained in the uttered voice can be recognized easily and quickly and erroneous recognition can be prevented.

[0046] The invention according to claim 9 is the speech recognition program according to claim 8, configured to cause the computer to function as determination means that calculates, based on each calculated keyword probability and the unnecessary-word probability for the unit time, a combination probability indicating the probability of each combination of a keyword represented by the previously acquired keyword feature amount data and the unnecessary word, and that determines the keyword to be recognized in the uttered speech based on the combination probabilities.

[0047] With this configuration, the invention according to claim 9 calculates the combination probability of each combination of a keyword represented by the keyword feature amount data and the unnecessary word, based on the keyword probabilities and unnecessary-word probability calculated for each unit time for those combinations, and determines the keyword to be recognized in the uttered speech based on the combination probabilities.

[0048] Therefore, the keyword contained in the uttered speech can be determined while taking into account each combination of the unnecessary-word similarity and the calculated keyword similarities, so that the keyword can be recognized easily and quickly and misrecognition can be prevented.

[0049] Embodiments of the Invention

Next, preferred embodiments of the present invention will be described with reference to the drawings.

[0050] The embodiment described below is one in which the voice recognition device according to the present invention is applied.

[0051] First, the spoken language model using HMMs in this embodiment will be described with reference to FIG. 1.

[0052] FIG. 1 is a diagram showing the spoken language model, which represents the HMM-based recognition network of this embodiment.

[0053] This embodiment assumes a model representing an HMM-based recognition network as shown in FIG. 1 (hereinafter, a spoken language model), that is, a spoken language model 10 that contains the keyword to be recognized.

[0054] The spoken language model 10 has a configuration in which models 12a and 12b, called garbage models, which represent the units constituting unnecessary words (hereinafter, unnecessary-word component models), are connected before and after a keyword model 11. A keyword in the uttered speech is matched against the keyword model 11, and unnecessary words are matched against the unnecessary-word component models 12a and 12b, so that keywords and unnecessary words are distinguished and the keyword contained in the uttered speech is recognized.

[0055] In practice, in this embodiment, the keyword model 11 is represented by an HMM, a statistical signal source model that represents a set of states transitioning in each arbitrary segment of the uttered speech and expresses a non-stationary signal source as a concatenation of stationary signals. The unnecessary-word component models 12a and 12b, by contrast, are represented by preset fixed values in order to reduce the amount of computation, as described later.

[0056] The HMM of the keyword model 11 (hereinafter, keyword HMM) has two parameters: a state transition probability, which indicates the probability of a transition from one state to another, and an output probability, which gives the probability of the vector (the feature vector of each frame) observed when the state transitions. Together these represent the feature pattern of each keyword.

[0057] In general, uttered speech exhibits acoustic variation arising from various causes even for the same word or syllable, so the speech sounds that make up an utterance change considerably from speaker to speaker. The same speech sound is nevertheless characterized mainly by its spectral envelope and the way that envelope changes over time, and an HMM can precisely express the stochastic properties of the time-series pattern of such variation.

[0058] Accordingly, in this embodiment, as described later, the similarity between the feature amount of the input uttered speech and each keyword HMM is calculated, and matching processing is performed while taking unnecessary words into account, so that the keyword contained in the uttered speech is recognized.

[0059] In this embodiment, the HMM represents a probability model that has, as the feature amounts of each keyword's feature pattern, either spectral envelope data indicating the power at each frequency at fixed time intervals, or cepstrum data obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform.

[0060] When such HMMs are used to recognize a keyword contained in speech such as uttered speech, the speech to be recognized is divided at predetermined fixed time intervals, and the probability of changing from each divided state to the next is calculated by matching against the keyword HMM data stored in advance. Matching is also performed against the value preset as the unnecessary-word similarity, described later, so that keywords and unnecessary words in the uttered speech are distinguished and the keyword to be recognized is determined.

[0061] Specifically, in this embodiment, the feature pattern is compared with the feature amount of each speech segment obtained by dividing the uttered speech, each segment representing an arbitrary state, at fixed time intervals. This yields a similarity (corresponding to the keyword probability of the present invention) indicating how well the feature pattern of the HMM matches the feature amount of each speech segment. The matching processing described later is then performed based on this calculated similarity and on the preset unnecessary-word similarity, which stands in for the match between the speech feature amount of each segment and an unnecessary word when that segment is assumed to be an unnecessary word. The matching processing calculates a cumulative similarity (corresponding to the combination probability of the present invention) indicating the probability of every sequence of HMMs, that is, every sequence of keyword and unnecessary words, and the sequence with the highest cumulative similarity is recognized as the language of the uttered speech.

[0062] Next, the configuration of the voice recognition device of this embodiment will be described with reference to FIG. 2.

[0063] FIG. 2 is a block diagram showing an outline of the configuration of one embodiment of the word-spotting voice recognition device according to the present invention.

[0064] As shown in FIG. 2, the voice recognition device 100 includes: a microphone 101 for inputting the uttered speech to be recognized; a low-pass filter (hereinafter, LPF) 102; an analog/digital conversion unit (hereinafter, A/D conversion unit) 103 that converts the audio signal output from the microphone 101 into a digital signal; an input processing unit 104 that cuts out the audio signal of the uttered-speech portion from the digitized audio signal and divides it into frames at preset time intervals; a voice analysis unit 105 that extracts the feature amount of the audio signal for each frame; a keyword model database 106 in which keyword HMMs representing the feature patterns of a plurality of keywords to be recognized are stored in advance; a similarity calculation unit 107 that calculates the similarity of each frame's feature amount based on the extracted per-frame feature amounts and the keyword HMMs; a matching processing unit 108 that performs the matching processing described later based on the calculated similarity of each frame to each keyword HMM and on the preset similarity of unnecessary words, which do not constitute the keywords; and a determination unit 109 that determines the keyword contained in the uttered speech based on the matching processing.

[0065] The input processing unit 104 and the voice analysis unit 105 constitute the extraction means according to the present invention, and the keyword model database 106 constitutes the first storage means according to the present invention.

[0066] The similarity calculation unit 107 constitutes the calculation means and the first acquisition means according to the present invention; the matching processing unit 108 constitutes the second storage means, the second acquisition means, and the determination means; and the determination unit 109 constitutes the determination means according to the present invention.

[0067] Uttered speech is input to the microphone 101, which generates an audio signal from the input uttered speech and outputs it to the LPF 102.

[0068] The audio signal generated by the microphone 101 is input to the LPF 102, which removes the high-frequency components from the input audio signal and outputs the resulting signal to the A/D conversion unit 103.

[0069] The audio signal from which the LPF 102 has removed the high-frequency components is input to the A/D conversion unit 103, which converts the input audio signal from an analog signal to a digital signal and outputs the digitized audio signal to the input processing unit 104.

[0070] The digitized audio signal is input to the input processing unit 104, which cuts out the audio signal corresponding to the speech section of the uttered-speech portion of the input digital signal, divides the audio signal of that speech section into frames at preset time intervals, and outputs the frames to the voice analysis unit 105.

[0071] For example, the input processing unit 104 divides the signal into frames of about 10 ms to 20 ms each.
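The frame division described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames`, the use of non-overlapping frames, and the handling of a short final frame are assumptions made for the example and are not specified in the patent.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split digitized audio samples into consecutive, non-overlapping frames.

    frame_ms is the frame length in milliseconds (the text mentions
    roughly 10 ms to 20 ms). The final frame may be shorter than the others.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# 400 samples at 16 kHz with 10 ms frames gives frames of 160 samples,
# with a shorter frame left over at the end.
frames = split_into_frames(list(range(400)), 16000, frame_ms=10)
```

A practical front end would often use overlapping, windowed frames; the non-overlapping split here simply mirrors the wording of the paragraph.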

[0072] The audio signal divided into frames is input to the voice analysis unit 105, which analyzes the audio signal of each input frame, extracts the feature amount of the audio signal for that frame, and outputs it to the similarity calculation unit 107.

[0073] Specifically, for each frame, the voice analysis unit 105 extracts as the feature amount either spectral envelope information indicating the power at each frequency at fixed time intervals, or cepstrum information obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform. It then converts the extracted feature amount into a vector and outputs it to the similarity calculation unit 107.
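The cepstrum computation named above (logarithm of the power spectrum followed by an inverse Fourier transform) can be sketched with a naive discrete Fourier transform. The function names, the small `floor` value that guards against taking the logarithm of zero, and the absence of windowing are assumptions for illustration; the patent does not specify these details.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def power_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log power spectrum of one frame."""
    n = len(frame)
    # Log power at each frequency bin; 'floor' avoids log(0).
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in dft(frame)]
    # Inverse DFT; the result is real for a real-valued input frame,
    # so only the real part is kept.
    return [
        sum(log_power[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)).real / n
        for k in range(n)
    ]
```

The resulting list plays the role of the per-frame feature vector passed on to the similarity calculation unit 107; a real implementation would normally use a fast Fourier transform and keep only the low-order coefficients.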

[0074] The keyword model database 106 stores in advance keyword HMMs representing the feature pattern data of the keywords to be recognized. The data of the stored keyword HMMs represent the feature pattern of each of the plurality of words to be recognized.

[0075] For example, for use in a vehicle-mounted navigation device, the keyword model database 106 stores HMMs representing the feature patterns of audio signals for names of destinations the vehicle may head to, names of current locations, and names of facilities such as restaurants.

[0076] In this embodiment, as described above, the HMM representing the feature pattern of each keyword is a probability model having either spectral envelope data indicating the power at each frequency at fixed time intervals, or cepstrum data obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform.

[0077] A keyword usually consists of a plurality of syllables or phonemes, as in "genzaichi" (current location) and "mokutekichi" (destination). In this embodiment, therefore, one keyword HMM is composed of a plurality of keyword component HMMs, and the similarity calculation unit 107 calculates, for each keyword component HMM, the similarity with the feature amount of each frame.

[0078] In this way, the keyword model database 106 stores the keyword HMM of each keyword to be recognized, that is, its keyword component HMMs.

[0079] The feature vector of each frame is input to the similarity calculation unit 107, which compares the feature amount of each input frame with the feature amounts of the keyword HMMs stored in the keyword model database 106, calculates the similarity between each input frame and each HMM, and outputs the calculated similarity to the matching processing unit 108.

[0080] In this embodiment, the similarity calculation unit 107 calculates, based on the feature amount of each frame and the feature amounts of the HMMs stored in the keyword model database 106, probabilities such as the probability that each frame corresponds to an HMM stored in the keyword model database 106.

[0081] Specifically, the similarity calculation unit 107 calculates the output probability that each frame corresponds to each keyword component HMM, and the state transition probability that the transition from a given frame to the next frame corresponds to a transition from each keyword component HMM to a keyword component HMM or to an unnecessary-word component. It outputs these probabilities to the matching processing unit 108 as the similarity.

[0082] The state transition probabilities include the probability of a transition from each keyword component HMM to itself.

[0083] The similarity calculation unit 107 outputs the output probabilities and state transition probabilities calculated for each frame to the matching processing unit 108 as the similarity of that frame.

[0084] The output probabilities and state transition probabilities calculated for each frame by the similarity calculation unit 107 are input to the matching processing unit 108. Based on these input probabilities and on the unnecessary-word similarity, the matching processing unit 108 performs matching processing that calculates a cumulative similarity indicating the similarity of each combination of a keyword model HMM and the unnecessary-word similarity, and outputs the calculated cumulative similarity to the determination unit 109.

[0085] Specifically, the matching processing unit 108 stores internally, in advance, an output probability and a state transition probability representing the unnecessary-word similarity, that is, the match between the feature amount of the speech component of a frame and the feature amount of the speech component of an unnecessary-word component when the frame is assumed to be an unnecessary-word component. It accumulates this unnecessary-word similarity and the similarity of each keyword calculated by the similarity calculation unit 107, frame by frame, to calculate the cumulative similarity of every combination of keyword and unnecessary words. As described later, it calculates one cumulative similarity for each keyword, as well as the cumulative similarity for the case where there is no keyword.

[0086] The details of the matching processing performed by the matching processing unit 108 will be described later.

[0087] As described above, in this embodiment, when recognizing an unnecessary word in the uttered speech that does not constitute a keyword, the similarity between the feature amount of the uttered speech and feature amount data for that unnecessary word is not calculated. Instead, a preset value is used as the unnecessary-word similarity, on the assumption that the two feature amounts are similar.

[0088] For example, the unnecessary-word similarity is a value determined experimentally: the similarity is set to a value k, speech recognition experiments are actually run while varying k, and the value of k that gives the best performance is adopted as the unnecessary-word similarity.
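The experimental selection of k described above amounts to a simple search over candidate values. The following is a minimal sketch; the function name `tune_garbage_similarity` and the `evaluate` callback (which is assumed to run a recognition experiment with a given k and return a performance score where higher is better) are hypothetical and not part of the patent.

```python
def tune_garbage_similarity(candidates, evaluate):
    """Return (best_k, best_score) over the candidate values of k.

    evaluate(k) is assumed to run a recognition experiment using k as the
    fixed unnecessary-word similarity and to return a score such as
    recognition accuracy on a held-out set (higher is better).
    """
    best_k, best_score = None, float("-inf")
    for k in candidates:
        score = evaluate(k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

In practice `evaluate` would wrap the whole recognition pipeline and a labeled test set; the search itself is independent of how the score is measured.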

[0089] The same unnecessary-word similarity may be used before and after the keyword HMM, or different similarities may be set before and after it.

[0090] The unnecessary-word similarity may be set in advance by operating an operation unit (not shown), or may be built in when the voice recognition device 100 is manufactured.

[0091] The cumulative similarity of each keyword calculated by the matching processing unit 108 is input to the determination unit 109. The determination unit 109 normalizes each input cumulative similarity by the word length of the keyword, that is, the temporal length of the keyword to which that cumulative similarity relates, determines the keyword with the highest normalized similarity to be the keyword contained in the uttered speech, and outputs that keyword to the outside as the recognition result.

[0092] At this time, the determination unit 109 also includes the cumulative similarity obtained from the unnecessary-word similarity alone among the candidates when determining the keyword. If this cumulative similarity is the highest of the input cumulative similarities, the determination unit 109 determines that the uttered speech contained no keyword and outputs this result to the outside.
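The decision rule of the determination unit 109 can be sketched as below. The function name `decide_keyword`, the use of log-domain cumulative similarities, lengths measured in frames, and normalizing the no-keyword score by its own length are all assumptions made for the example; the patent states only that cumulative similarities are normalized by keyword length and compared with the unnecessary-word-only case.

```python
def decide_keyword(candidates, garbage_only):
    """Pick the keyword with the highest length-normalized similarity.

    candidates:   {keyword: (cumulative_log_similarity, length_in_frames)}
    garbage_only: (cumulative_log_similarity, length_in_frames) for the
                  path made of unnecessary words alone.
    Returns the winning keyword, or None when the no-keyword path scores
    at least as high as every keyword.
    """
    best_word = None
    best_score = garbage_only[0] / garbage_only[1]
    for word, (score, length) in candidates.items():
        normalized = score / length
        if normalized > best_score:
            best_word, best_score = word, normalized
    return best_word


result = decide_keyword(
    {"genzaichi": (-30.0, 20), "mokutekichi": (-66.0, 22)},
    garbage_only=(-100.0, 40),
)
```

Here `result` is "genzaichi", because its normalized score (-1.5) beats both "mokutekichi" (-3.0) and the no-keyword path (-2.5).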

[0093] Next, the matching processing performed by the matching processing unit 108 of this embodiment will be described.

[0094] The matching processing of this embodiment uses the Viterbi algorithm, which calculates the cumulative similarity of each combination of a keyword model and the preset unnecessary-word similarity.

[0095] The Viterbi algorithm calculates a cumulative similarity based on the output probability of being in each state and the transition probability of moving from each state to another state, and, after calculating the cumulative similarity, outputs the combination for which it was calculated.

[0096] In general, the Euclidean distance between the state indicated by the feature amount of each frame and the state of the feature amount indicated by the HMM is calculated, and the cumulative similarity is obtained by accumulating this distance.

[0097] Specifically, the Viterbi algorithm calculates the cumulative probability along each path representing a transition from an arbitrary state i to the next state j, and by performing this calculation extracts each path along which state transitions are possible, that is, each sequence and combination of HMMs.
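For reference, the cumulative probability along the best path is usually written as the following recursion in the standard textbook formulation of the Viterbi algorithm. The symbols are assumptions introduced here and do not appear in the patent: $\pi_j$ is the initial probability of state $j$, $a_{ij}$ the state transition probability from state $i$ to state $j$, $b_j(o_t)$ the output probability of the feature vector $o_t$ of frame $t$ in state $j$, and $\delta_t(j)$ the highest cumulative probability of any path that ends in state $j$ at frame $t$.

```latex
\delta_1(j) = \pi_j \, b_j(o_1), \qquad
\delta_t(j) = \max_{i} \bigl[ \delta_{t-1}(i) \, a_{ij} \bigr] \, b_j(o_t),
\quad t = 2, \dots, T
```

The cumulative similarity of the best path over all $T$ frames is then $\max_j \delta_T(j)$, taken over the states in which a path is allowed to end.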

[0098] In this embodiment, based on the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 and on the preset unnecessary-word similarity, the output and state transition probabilities of the keyword models, and the output and state transition probabilities predetermined for unnecessary words, are applied to each frame in order from the first divided frame of the input uttered speech to the last. The cumulative probability from the first divided frame to the last is calculated for arbitrary combinations of a keyword model and unnecessary words, and for each keyword model the combination with the highest cumulative similarity is output, one per keyword model, to the determination unit 109.
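The matching described above can be sketched for a single keyword as a Viterbi pass over the sequence "leading unnecessary words, keyword states, trailing unnecessary words", where unnecessary-word frames receive a fixed score instead of a computed one. This is an illustration under stated assumptions, not the patent's implementation: scores are log probabilities, the keyword is a left-to-right chain with shared `log_stay` and `log_next` transition scores, the transition scores of unnecessary words are folded into the single fixed `garbage_log_score`, and all names are hypothetical.

```python
NEG_INF = float("-inf")


def keyword_path_score(frame_log_probs, garbage_log_score, log_stay=0.0, log_next=0.0):
    """Best cumulative log similarity for one keyword with optional garbage.

    frame_log_probs[t][s] is the log output probability of frame t under
    keyword state s (states are traversed left to right, none skipped).
    garbage_log_score is the fixed per-frame score of an unnecessary word.
    """
    n_states = len(frame_log_probs[0])
    g_pre, g_post = 0, n_states + 1  # indices of leading/trailing garbage

    # Initialisation: the first frame is either garbage or keyword state 1.
    delta = [NEG_INF] * (n_states + 2)
    delta[g_pre] = garbage_log_score
    delta[1] = frame_log_probs[0][0]

    for obs in frame_log_probs[1:]:
        new = [NEG_INF] * (n_states + 2)
        new[g_pre] = delta[g_pre] + garbage_log_score
        for s in range(1, n_states + 1):
            stay = delta[s] + log_stay
            if s == 1:
                enter = delta[g_pre]  # keyword starts after leading garbage
            else:
                enter = delta[s - 1] + log_next
            new[s] = max(stay, enter) + obs[s - 1]
        # Trailing garbage follows the last keyword state or continues itself.
        new[g_post] = max(delta[g_post], delta[n_states]) + garbage_log_score
        delta = new

    # A valid path ends in the last keyword state or in trailing garbage.
    return max(delta[n_states], delta[g_post])


def garbage_only_score(n_frames, garbage_log_score):
    """Cumulative log similarity of the path with no keyword at all."""
    return n_frames * garbage_log_score
```

Because only the best score per state is carried forward at each frame, weaker combinations are dropped along the way, and the function returns one cumulative similarity per keyword. No backtracking is kept in this sketch, so it reports the score of the best combination but not which frames were assigned to the keyword.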

[0099] For example, when the keywords to be recognized are "genzaichi" (current location) and "mokutekichi" (destination) and the input uttered speech is "eetto, genzaichi" ("um, current location"), the matching processing of this embodiment proceeds as follows.

[0100] Here, it is assumed that the unnecessary word is "eetto" ("um"), that the unnecessary-word similarity (output probability and state transition probability) has been set in advance, and that the keyword model database 106 stores an HMM for each syllable of "genzaichi" and "mokutekichi".

[0101] It is also assumed that the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 have already been input to the matching processing unit 108.

[0102] In this case, in this embodiment, the Viterbi algorithm calculates, for each of the keywords "genzaichi" and "mokutekichi", the cumulative similarity of every combination with the unnecessary-word similarity, based on the unnecessary-word similarity, the output probabilities, and the state transition probabilities.

[0103] Specifically, when an arbitrary utterance is input, the cumulative similarities of all the combination patterns "genzaichi", "○ genzaichi", "genzaichi ○", and "○ genzaichi ○" (where ○ denotes the fixed value of the unnecessary-word similarity) are calculated. Likewise, for the keyword "mokutekichi", the cumulative similarities of all the patterns "mokutekichi", "○ mokutekichi", "mokutekichi ○", and "○ mokutekichi ○" (where ○ again denotes the unnecessary-word similarity) are calculated. All of these are calculated based on the unnecessary-word similarity, the output probabilities, and the state transition probabilities.

[0104] For each keyword model (here, for each of "genzaichi" and "mokutekichi"), the Viterbi algorithm calculates the cumulative similarities of all combination patterns simultaneously, frame by frame, starting from the first frame of the uttered speech.

[0105] In the course of calculating the cumulative similarity of each combination for each keyword, the Viterbi algorithm judges, partway through the calculation, that the uttered speech does not match any combination pattern whose cumulative similarity is low, and stops calculating the cumulative similarity of that pattern.

[0106] Specifically, for the first divided frame, either the similarity for the case where the frame corresponds to the HMM of "ge", which is a keyword component HMM of the keyword "genzaichi", or the similarity for the case where it corresponds to the unnecessary-word component HMM is added, and the alternative with the higher cumulative similarity is carried forward to calculate the cumulative similarity for the next divided frame. In the example above, the similarity of the unnecessary-word component HMM is higher than that of the "ge" HMM, so calculation of the subsequent cumulative similarities that begin with "ge", namely those for "genzaichi ○" and "genzaichi", is terminated.

[0107] As a result, this matching processing yields one cumulative similarity for each of the keywords "genzaichi" and "mokutekichi".

[0108] Next, the keyword recognition processing of this embodiment will be described with reference to FIG. 3.

[0109] FIG. 3 is a flowchart showing the operation of the keyword recognition processing of this embodiment.

[0110] First, an operation unit or control unit (not shown) instructs each unit to start the keyword recognition processing. When an uttered voice is input to the microphone 101 (step S11), it is passed through the LPF 102 and the A/D conversion unit 103 to the input processing unit 104. The input processing unit 104 cuts out the voice signal of the uttered voice portion from the input voice signal (step S12), divides it into frames at preset time intervals, and outputs the voice signal to the voice analysis unit 105 frame by frame, starting from the first frame (step S13).
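As a rough illustration of the frame division in step S13, the sketch below splits a sampled signal into consecutive frames of fixed duration. The function name `split_into_frames`, the 10 ms default, and the choice of non-overlapping frames with the trailing partial frame dropped are assumptions made for the example; the text only states that frames are cut at preset time intervals.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames.

    frame_ms is a hypothetical preset interval; a trailing partial
    frame is discarded in this sketch.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```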

[0111] The following processing is then performed for each frame.

[0112] First, the control unit (not shown) determines whether the frame input to the voice analysis unit 105 is the final divided frame (step S14). If it is the final divided frame, the process proceeds to step S19; if it is not, the following operations are performed.

[0113] First, the voice analysis unit 105 extracts the feature amount of the voice signal of the input frame and outputs the extracted feature amount of this frame to the similarity calculation unit 107 (step S15).

[0114] Specifically, based on the voice signal of each frame, the voice analysis unit 105 extracts as the feature amount either spectral envelope information indicating the power at each frequency at fixed time intervals, or cepstrum information obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform. It then vectorizes the feature amount and outputs it to the similarity calculation unit 107.
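The cepstrum extraction described here (logarithm of the power spectrum followed by an inverse Fourier transform) might look like the following NumPy sketch. The Hamming window, the small constant added before the logarithm, the number of retained coefficients `n_coeffs`, and the function name `frame_cepstrum` are assumptions for illustration and are not specified in the text.

```python
import numpy as np


def frame_cepstrum(frame, n_coeffs=12):
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    frame = np.asarray(frame, dtype=float)
    # Assumed windowing step to reduce spectral leakage.
    windowed = frame * np.hamming(len(frame))
    # Power at each frequency.
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Logarithm of the power spectrum (epsilon avoids log(0)).
    log_power = np.log(power + 1e-10)
    # Inverse Fourier transform of the log power spectrum gives the cepstrum.
    cepstrum = np.fft.irfft(log_power)
    # The retained coefficients form the feature vector for this frame.
    return cepstrum[:n_coeffs]
```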

[0115] Next, the similarity calculation unit 107 compares the feature amount of the input frame with the feature amounts of the HMMs stored in the keyword model database 106, calculates, as described above, the output probability and the state transition probability of the frame for each HMM, and outputs these probabilities to the matching processing unit 108 (step S16).

[0116] Next, the matching processing unit 108 performs the matching processing described above, based on the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 and on the unnecessary word similarity that is preset and stored inside the matching processing unit 108, and calculates the cumulative similarity for each keyword (step S17).

[0117] Specifically, the matching processing unit 108 accumulates the similarity given by each input keyword HMM and the unnecessary word similarity onto the cumulative similarity obtained up to the previous frame, and calculates, for each keyword, only the highest cumulative similarity.

[0118] Next, input of the next frame is controlled in accordance with an instruction from the control unit (not shown) (step S18), and the process returns to step S14.

[0119] On the other hand, when the control unit (not shown) determines that the frame is the final divided frame, the highest cumulative similarity calculated for each keyword is output to the determination unit 109, and the determination unit 109 normalizes the cumulative similarity of each keyword by word length (step S19).

[0120] Finally, based on the normalized similarity of each keyword, the determination unit 109 judges the keyword with the highest similarity to be the keyword included in the uttered voice, outputs it to the outside (step S20), and ends this operation.

[0121] As described above, according to this embodiment, a similarity indicating how closely the characteristics of the uttered voice feature amount match those of the keyword feature amount data is calculated for each frame, and the keyword to be recognized in the uttered voice is determined based on the calculated similarity and a preset unnecessary word similarity. The keyword to be recognized can therefore be determined by using the preset unnecessary word similarity, without calculating the similarity between the uttered voice feature amount and feature amount data of unnecessary words.

[0122] Further, for each frame, the unnecessary word similarity and the calculated similarities are accumulated to obtain the cumulative similarity of each combination of the unnecessary word similarity and the calculated similarities, and the keyword to be recognized in the uttered voice is determined based on these cumulative similarities. The keyword included in the uttered voice can therefore be determined while taking every such combination into account.

[0123] As a result, keywords included in the uttered voice can be recognized easily, quickly, and accurately, and erroneous recognition can be prevented.

[0124] Further, in this embodiment, when a plurality of keywords are to be recognized in a single uttered voice, the keywords included in the uttered voice can be recognized even more easily and quickly, and erroneous recognition can be prevented.

[0125] For example, when two keywords are to be recognized, assume a spoken language model 20 that uses HMMs and contains the keywords to be recognized, as shown in FIG. 4. If word-length normalization is performed based on the word length of each keyword model to be recognized, the two keywords can be recognized simultaneously.

[0126] That is, instead of calculating the cumulative similarity for each keyword, the matching processing unit 108 calculates the cumulative similarity for every combination of the keywords stored in the keyword model database 106, and the determination unit 109 performs the normalization using the sum of the word lengths of the keywords in each combination. In this way, a plurality of keywords can be recognized simultaneously, the keywords included in the uttered voice can be recognized easily and quickly, and erroneous recognition can be prevented.
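The multi-keyword variant can be sketched in the same spirit. The example assumes that each keyword combination already has a cumulative (log-domain) score and that normalization divides it by the sum of the word lengths of the keywords in the combination; `pick_keyword_combination`, `combo_scores`, and `word_lengths` are hypothetical names.

```python
def pick_keyword_combination(combo_scores, word_lengths):
    """Return the keyword combination with the highest score after
    normalizing by the summed word lengths of its keywords."""
    def normalized(combo):
        total_length = sum(word_lengths[kw] for kw in combo)
        return combo_scores[combo] / total_length

    return max(combo_scores, key=normalized)
```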

[0127] In this embodiment, the keyword recognition processing is performed by the voice recognition device described above. Alternatively, the voice recognition device may be provided with a computer and a recording medium, a program for performing the keyword recognition processing may be stored on the recording medium, and the computer may read this keyword recognition processing program and perform keyword recognition processing similar to that described above.

[0128] In this case, the recording medium may be a medium such as a DVD or a CD, and the voice recognition device 100 may be provided with a reading device that reads the program from the recording medium.

[0129]

[Effect of the Invention] As described above, according to the invention of claim 1, the keyword to be recognized can be determined by distinguishing unnecessary words from keywords using a preset unnecessary word probability, without calculating the similarity between the uttered voice feature amount and feature amount data of unnecessary words. This reduces the processing load of calculating the unnecessary word similarity, so that keywords included in the uttered voice can be recognized easily and quickly.

[Brief description of drawings]

FIG. 1 is a diagram showing a spoken language model representing a recognition network using HMMs.

FIG. 2 is a block diagram showing the schematic configuration of an embodiment of a word-spotting voice recognition device according to the present invention.

FIG. 3 is a flowchart showing the operation of the keyword recognition processing in an embodiment of the word-spotting voice recognition device.

FIG. 4 is a diagram showing a spoken language model representing a recognition network using HMMs for recognizing two keywords.

FIG. 5 is a diagram showing a spoken language model representing a voice recognition network that uses filler models.

[Explanation of symbols]

100 ... Voice recognition device
101 ... Microphone
102 ... LPF
103 ... A/D conversion unit
104 ... Input processing unit (extraction means)
105 ... Voice analysis unit (extraction means)
106 ... Keyword model database (first storage means)
107 ... Similarity calculation unit (calculation means, first acquisition means)
108 ... Matching processing unit (determination means, second storage means, second acquisition means)
109 ... Determination unit (determination means)

Continuation of front page: (51) Int. Cl.⁷, identification code, FI, theme code (reference): G10L 15/28

Claims

[Claims]

1. A voice recognition device for recognizing a keyword included in an uttered voice, comprising: extraction means for extracting, by analyzing the uttered voice, an uttered voice feature amount that is a feature amount of a voice component of the uttered voice; first storage means for storing in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords; calculation means for calculating, based on the uttered voice feature amount extracted from at least a part of the voice section of the uttered voice and the keyword feature amount data stored in the storage means, each keyword probability indicating the probability that the uttered voice feature amount is the keyword; second storage means for storing a preset value as an unnecessary word probability indicating the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute the keyword; and determination means for determining the keyword to be recognized that is included in the uttered voice, based on the calculated keyword probabilities and the unnecessary word probability.

2. The voice recognition device according to claim 1, wherein the extraction means analyzes the uttered voice for each preset unit time to extract the uttered voice feature amount, and, when the unnecessary word probability stored in advance in the second storage means indicates the unnecessary word probability in the unit time, the calculation means calculates each of the keyword probabilities based on the uttered voice feature amount extracted for each unit time, and the determination means determines the keyword to be recognized that is included in the uttered voice based on the calculated keyword probabilities and the unnecessary word probability in the unit time.

3. The voice recognition device according to claim 2, wherein the determination means calculates, based on the calculated keyword probabilities and the unnecessary word probability in the unit time, a combination probability indicating the probability of each combination of the unnecessary word and each keyword indicated by the keyword feature amount data stored in the first storage means, and determines the keyword to be recognized that is included in the uttered voice based on the combination probability.

4. A voice recognition method for recognizing a keyword included in an uttered voice, comprising: an extraction processing step of extracting, by analyzing the uttered voice, an uttered voice feature amount that is a feature amount of a voice component of the uttered voice; a first acquisition processing step of acquiring in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords; a calculation processing step of calculating, based on the uttered voice feature amount extracted from at least a part of the voice section of the uttered voice and the keyword feature amount data stored in the storage means, each keyword probability indicating the probability that the uttered voice feature amount is the keyword; a second acquisition processing step of acquiring a preset value as an unnecessary word probability indicating the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute the keyword; and a determination processing step of determining the keyword to be recognized that is included in the uttered voice, based on the calculated keyword probabilities and the unnecessary word probability.

5. The voice recognition method according to claim 4, wherein, in the extraction processing step, the uttered voice is analyzed for each preset unit time to extract the uttered voice feature amount; in the calculation processing step, the keyword probabilities are calculated based on the uttered voice feature amount extracted for each unit time; in the second acquisition processing step, the unnecessary word probability in the unit time is acquired; and in the determination processing step, the keyword to be recognized that is included in the uttered voice is determined based on the calculated keyword probabilities and the unnecessary word probability in the unit time.

6. The voice recognition method according to claim 5, wherein, in the determination processing step, a combination probability indicating the probability of each combination of the unnecessary word and each keyword indicated by the acquired keyword feature amount data is calculated based on the calculated keyword probabilities and the unnecessary word probability in the unit time, and the keyword to be recognized that is included in the uttered voice is determined based on the combination probability.

7. A voice recognition program for causing a computer to perform voice recognition processing that recognizes a keyword included in an uttered voice, the program causing the computer to function as: extraction means for extracting, by analyzing the uttered voice, an uttered voice feature amount that is a feature amount of a voice component of the uttered voice; first acquisition means for acquiring in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords; calculation means for calculating, based on the uttered voice feature amount extracted from at least a part of the voice section of the uttered voice and the keyword feature amount data stored in the storage means, each keyword probability indicating the probability that the uttered voice feature amount is the keyword; second acquisition means for acquiring a preset value as an unnecessary word probability indicating the probability that at least a part of the voice section of the uttered voice is an unnecessary word that does not constitute the keyword; and determination means for determining the keyword to be recognized that is included in the uttered voice, based on the calculated keyword probabilities and the unnecessary word probability.

8. The voice recognition program according to claim 7, wherein the program causes the computer to function as: extraction means for analyzing the uttered voice for each preset unit time to extract the uttered voice feature amount; calculation means for calculating the keyword probabilities based on the uttered voice feature amount extracted for each unit time; second acquisition means for acquiring in advance the unnecessary word probability in the unit time; and determination means for determining the keyword to be recognized that is included in the uttered voice, based on the calculated keyword probabilities and the unnecessary word probability in the unit time.

9. The voice recognition program according to claim 8, wherein the program causes the computer to function as determination means that calculates, based on the calculated keyword probabilities and the unnecessary word probability in the unit time, a combination probability indicating the probability of each combination of the unnecessary word and each keyword indicated by the keyword feature amount data acquired in advance, and that determines the keyword to be recognized that is included in the uttered voice based on the combination probability.