JP6451171B2

JP6451171B2 - Speech recognition apparatus, speech recognition method, and program

Info

Publication number: JP6451171B2
Application number: JP2014192424A
Authority: JP
Inventors: 原田　将治
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-09-22
Filing date: 2014-09-22
Publication date: 2019-01-16
Anticipated expiration: 2034-09-22
Also published as: JP2016062059A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a program.

In recent years, apparatuses that recognize only specific utterance contents or specific commands have been developed and have begun to be used in information systems. A speech recognition apparatus is a known example of such an apparatus. A speech recognition apparatus recognizes the content of a user's utterance and accepts it as input to an information providing system.

Such speech recognition apparatuses are used, for example, in call center systems. In this case, the speech recognition apparatus is used, for example, to extract from a large volume of calls those that contain keywords (recognition words) registered in advance, and to assign an index to each call according to the keywords it contains so that the calls can be classified or analyzed.

To improve the accuracy of functions such as voice search and voice-based information provision, the accuracy of the response to the user's utterance must be raised. In other words, improving the accuracy of the speech recognition apparatus is essential. Specifically, a speech recognition apparatus needs a technique that reduces false detections and accurately detects only appropriate recognition results, that is, a technique that achieves both high recall and high precision.

Here, recall is a measure of completeness; in speech recognition, it is the number of correctly recognized words divided by the number of words that should have been recognized. Precision is a measure of correctness; in speech recognition, it is the number of correctly recognized words divided by the number of words that were recognized.
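The two definitions above can be written out directly. The following is a minimal sketch; the function name and the word counts are illustrative assumptions, not values from the patent.

```python
def recall_and_precision(correct: int, expected: int, recognized: int) -> tuple[float, float]:
    """Return (recall, precision) from word counts.

    correct    -- number of words recognized correctly
    expected   -- number of words that should have been recognized
    recognized -- number of words the recognizer actually output
    """
    # Guard against empty denominators; returning 0.0 here is a convention
    # chosen for this sketch, not something the source specifies.
    recall = correct / expected if expected else 0.0
    precision = correct / recognized if recognized else 0.0
    return recall, precision


# Hypothetical counts for illustration only.
recall, precision = recall_and_precision(correct=80, expected=100, recognized=90)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Lowering a detection threshold typically raises `recognized` faster than `correct`, which is why recall and precision tend to move in opposite directions in the prior-art discussion that follows.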

Techniques aimed at accurately detecting appropriate recognition words have been proposed in, for example, Patent Document 1 (hereinafter also referred to as Prior Art 1) and Patent Document 2 (hereinafter also referred to as Prior Art 2).

The speech recognition apparatus proposed in Patent Document 1 (hereinafter, the speech recognition apparatus of Prior Art 1) stores, separately from the recognition words to be recognized, a noise model adapted to the environment (hereinafter, the environment-adapted noise model). It compares the feature amount of the input speech with each recognition word and with the environment-adapted noise model, and calculates a similarity for each. The speech recognition apparatus of Prior Art 1 then outputs the recognition word with the highest similarity as the recognition result when that similarity is higher than the similarity of the environment-adapted noise model.

FIGS. 18A and 18B each show the range BD2 in which the recognition word “Okayama” is detected in Prior Art 1 (hereinafter, the detection range). FIG. 18 shows an example in which the environment-adapted noise model of Prior Art 1 is treated as reject words, with “Okayama” as the recognition word and “Wakayama” and “Toyama” as the reject words. With the method of Prior Art 1, as shown in FIG. 18, it is difficult to bring the detection range BD2 close to the ideal range BD1, in which both high recall and high precision are obtained (hereinafter, the ideal range), even if the threshold T1 (hereinafter, the first threshold) is adjusted. In the figure, SC1 denotes the similarity of the recognition word “Okayama” to the utterance, and SC2 and SC3 denote the similarities of the reject words “Wakayama” and “Toyama” to the utterance, respectively.

When the first threshold T1 is lowered, as shown in FIG. 18A, the range in which the similarity SC1 of the recognition word “Okayama” is higher than the similarities of the reject words (SC2 and SC3) widens accordingly, so high recall can be obtained, but precision deteriorates. Conversely, when the first threshold T1 is raised, as shown in FIG. 18B, precision improves, but the words that should be recognized can no longer be fully covered, and recall deteriorates.

The speech recognition apparatus proposed in Patent Document 2 (hereinafter, the speech recognition apparatus of Prior Art 2) stores a second threshold T2 for recognition words separately from the first threshold T1, and calculates both the similarity between the feature amount of the input speech and each recognition word and the similarity between the feature amount of the input speech and each reject word. The speech recognition apparatus of Prior Art 2 outputs the recognition word with the highest similarity as the recognition result in either of two cases: when that similarity is higher than the second threshold T2, or when that similarity is at or below the second threshold T2 but is higher than both the first threshold T1 and the similarity of the reject word with the highest similarity.
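The output rule attributed to Prior Art 2 can be sketched as a small predicate. This is only an illustration of the rule as summarized in this document; the function name, argument names, and the numeric values used below are assumptions.

```python
from typing import Optional


def prior_art_2_decision(rec_word: str, rec_sim: float, best_rej_sim: float,
                         t1: float, t2: float) -> Optional[str]:
    """Return the recognition word to output, or None if the input is rejected.

    rec_word / rec_sim -- recognition word with the highest similarity and its score
    best_rej_sim       -- highest similarity among the reject words
    t1, t2             -- first and second thresholds
    """
    # Case 1: the recognition word clears the second threshold outright.
    if rec_sim > t2:
        return rec_word
    # Case 2: at or below T2, but above both T1 and the best reject word.
    if rec_sim > t1 and rec_sim > best_rej_sim:
        return rec_word
    return None


# Hypothetical scores on a 0-100 scale.
print(prior_art_2_decision("Okayama", rec_sim=90, best_rej_sim=95, t1=50, t2=80))
print(prior_art_2_decision("Okayama", rec_sim=70, best_rej_sim=75, t1=50, t2=80))
```

Note that in case 1 the reject-word similarity is never consulted, which is the behavior the description goes on to criticize when T2 is lowered.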

FIGS. 19A and 19B each show the detection range BD2 of the recognition word “Okayama” in Prior Art 2. With the method of Prior Art 2, as shown in FIG. 19, adjusting the second threshold T2 brings the detection range BD2 closer to the ideal range BD1 than in Prior Art 1, but still not close enough.

When the second threshold T2 is lowered, as shown in FIG. 19A, the range in which the similarity SC1 of the recognition word “Okayama” is higher than the second threshold T2 widens accordingly, so high recall can be obtained, but precision deteriorates. On the other hand, even when the second threshold T2 is raised, as shown in FIG. 19B, the recognition word “Okayama” is still output as the recognition result whenever its similarity SC1, although at or below the second threshold T2, is higher than both the first threshold T1 and the similarity of the reject word with the highest similarity (SC2 or SC3). Precision can therefore be improved only to a limited extent.

As described above, with Prior Art 1 and 2 it can be very difficult to adjust the detection range BD2 so that it approaches the ideal range BD1, in which both high recall and high precision are obtained.

Patent Document 1: JP 2003-202887 A
Patent Document 2: JP 2008-129263 A

In one aspect, an object of the present invention is to provide a speech recognition apparatus, a speech recognition method, and a program that make it possible to realize speech recognition with both high recall and high precision.

A speech recognition apparatus according to one aspect includes a recognition word dictionary in which recognition words are registered and a reject word dictionary in which reject words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word with the highest similarity is equal to or less than a preset first threshold. The apparatus includes: first specifying means for calculating the similarity between the input speech and the reject words and specifying the reject word with the highest similarity; second specifying means for calculating the similarity between the input speech and the recognition words and specifying the recognition word with the highest similarity; and collation means for rejecting the input speech, even if the similarity of the specified recognition word exceeds the first threshold, when the similarity of the specified reject word exceeds a preset third threshold that is greater than the first threshold. When the similarity of the specified recognition word exceeds a second threshold that is greater than the first threshold, the collation means outputs the specified recognition word regardless of whether the similarity of the specified reject word exceeds the third threshold.

A speech recognition method according to one aspect is a method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a reject word dictionary in which reject words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word with the highest similarity is equal to or less than a preset first threshold. The method includes: calculating the similarity between the input speech and the reject words and specifying the reject word with the highest similarity; calculating the similarity between the input speech and the recognition words and specifying the recognition word with the highest similarity; rejecting the input speech, even if the similarity of the specified recognition word exceeds the first threshold, when the similarity of the specified reject word exceeds a preset third threshold that is greater than the first threshold; and, when the similarity of the specified recognition word exceeds a second threshold that is greater than the first threshold, outputting the specified recognition word regardless of whether the similarity of the specified reject word exceeds the third threshold.

A program according to one aspect causes a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a reject word dictionary in which reject words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word with the highest similarity is equal to or less than a preset first threshold, to execute processing that includes: calculating the similarity between the input speech and the reject words and specifying the reject word with the highest similarity; calculating the similarity between the input speech and the recognition words and specifying the recognition word with the highest similarity; rejecting the input speech, even if the similarity of the specified recognition word exceeds the first threshold, when the similarity of the specified reject word exceeds a preset third threshold that is greater than the first threshold; and, when the similarity of the specified recognition word exceeds a second threshold that is greater than the first threshold, outputting the specified recognition word regardless of whether the similarity of the specified reject word exceeds the third threshold.
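The three aspects above share one decision rule involving the first, second, and third thresholds. The sketch below illustrates that rule under stated assumptions: the function and argument names are hypothetical, and the final branch (recognition similarity between T1 and T2, reject similarity at or below T3) is not spelled out in this summary, so the sketch borrows the direct comparison used in Embodiment 1.

```python
from typing import Optional


def claimed_decision(rec_word: str, rec_sim: float, rej_sim: float,
                     t1: float, t2: float, t3: float) -> Optional[str]:
    """Return the recognition word to output, or None if the input is rejected.

    rec_sim -- similarity of the most similar recognition word
    rej_sim -- similarity of the most similar reject word
    t1 < t2 and t1 < t3 are assumed, as stated in the summary.
    """
    # At or below the first threshold: always rejected.
    if rec_sim <= t1:
        return None
    # Above the second threshold: output regardless of the reject word.
    if rec_sim > t2:
        return rec_word
    # Reject word clearly uttered: rejected even though rec_sim > t1.
    if rej_sim > t3:
        return None
    # ASSUMPTION: remaining case decided by direct comparison, as in Embodiment 1.
    return rec_word if rec_sim > rej_sim else None


# Hypothetical scores on a 0-100 scale.
print(claimed_decision("Okayama", rec_sim=95, rej_sim=90, t1=50, t2=92, t3=80))
print(claimed_decision("Okayama", rec_sim=88, rej_sim=85, t1=50, t2=92, t3=80))
```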

In one aspect, speech recognition with both high recall and high precision can be realized.

FIG. 1: A is a functional block diagram showing a configuration example of the speech recognition apparatus in Embodiment 1, and B is a functional block diagram showing a configuration example of the storage unit in Embodiment 1.
FIG. 2: A diagram showing a configuration example of the recognition word dictionary in Embodiment 1.
FIG. 3: A diagram showing a configuration example of the reject word dictionary in Embodiment 1.
FIG. 4: A diagram showing the range in which a recognition word is detected in Embodiment 1.
FIG. 5: Part of an example flowchart for explaining the flow of the speech recognition process in Embodiment 1.
FIG. 6: Another part of the example flowchart for explaining the flow of the speech recognition process in Embodiment 1.
FIG. 7: A is a diagram showing the recall and precision for each first threshold in Prior Art 1, and B is a diagram showing the recall and precision for each third threshold in Embodiment 1.
FIG. 8: A functional block diagram showing a configuration example of the speech recognition apparatus in Embodiment 2.
FIG. 9: A diagram showing a configuration example of the reject word dictionary in Embodiment 2.
FIG. 10: A diagram for explaining the third threshold in Embodiment 2.
FIG. 11: A diagram showing the range in which a recognition word is detected in Embodiment 2.
FIG. 12: Part of an example flowchart for explaining the flow of the speech recognition process in Embodiment 2.
FIG. 13: A diagram showing the range in which a recognition word is detected in Embodiment 3.
FIG. 14: Part of an example flowchart for explaining the flow of the speech recognition process in Embodiment 3.
FIG. 15: A is a diagram showing the recall and precision for each second threshold in Prior Art 2, and B is a diagram showing the recall and precision for each third threshold in Embodiment 3.
FIG. 16: Part of an example flowchart for explaining the flow of the speech recognition process in Embodiment 4.
FIG. 17: A diagram showing an example of the hardware configuration of the speech recognition apparatus in the embodiments.
FIG. 18: A and B are both diagrams showing the range in which a recognition word is detected in Prior Art 1.
FIG. 19: A and B are both diagrams showing the range in which a recognition word is detected in Prior Art 2.

Embodiments of the present invention are described in detail below with reference to the drawings.

(Embodiment 1)
FIG. 1A is a functional block diagram showing a configuration example of the speech recognition apparatus 1 in Embodiment 1, and FIG. 1B is a functional block diagram showing a configuration example of the storage unit 20 in Embodiment 1. The speech recognition apparatus 1 in Embodiment 1 compares the feature amount of the input speech with recognition words and reject words registered in advance and, depending on the relative magnitudes of the calculated similarities, either rejects the input speech or outputs the recognition word with the highest similarity. As shown in FIG. 1, the speech recognition apparatus 1 in Embodiment 1 includes an input unit 10, a storage unit 20, an output unit 30, and a control unit 40.

The input unit 10 includes, for example, an input/output interface, and receives a signal containing a speech section (hereinafter, the input signal) from a connected sound acquisition device (for example, a microphone). The input unit 10 then outputs the received input signal to the control unit 40. In doing so, the input unit 10 may temporarily store the received input signal in a buffer memory (not shown), and the control unit 40 may sequentially acquire the input signal from the buffer memory frame by frame in step with its processing timing. Hereinafter, the portion of the input signal within the speech section is referred to as the speech signal.

The storage unit 20 includes, for example, a random access memory (RAM), a read-only memory (ROM), and a hard disk drive (HDD). The storage unit 20 functions as a work area for, for example, the central processing unit (CPU) included in the control unit 40, as a program area that stores various programs such as the operation program for controlling the entire speech recognition apparatus 1, and as a data area that stores various data such as the conventionally used first threshold T1. As in conventional practice, the first threshold T1 is used as one criterion for rejecting input speech.

As shown in FIG. 1B, the storage unit 20 also functions as a recognition word dictionary 21, a reject word dictionary 22, and an acoustic model storage unit 23.

The recognition word dictionary 21 stores a plurality of recognition words and information about those recognition words. FIG. 2 shows a configuration example of the recognition word dictionary 21 in Embodiment 1. As shown in FIG. 2, for example, the recognition word dictionary 21 stores the recognition words and the information about them in table form. In Embodiment 1, as shown in FIG. 2, the recognition word dictionary 21 associates a “word reading” and a “phoneme sequence” with each “word notation” of a recognition word. Note that the recognition words may include a recognition vocabulary.

“Word notation” is the written form of the corresponding recognition word. “Word reading” is the corresponding recognition word written in hiragana. “Phoneme sequence” is the corresponding recognition word expressed as phonemes, and is used when calculating the similarity to the feature amount of the input speech signal.

The reject word dictionary 22 stores a plurality of reject words and information about those reject words. FIG. 3 shows a configuration example of the reject word dictionary 22 in Embodiment 1. As shown in FIG. 3, for example, the reject word dictionary 22 stores the reject words and the information about them in table form. In Embodiment 1, as shown in FIG. 3, the reject word dictionary 22 associates a “word reading”, a “phoneme sequence”, and a “third threshold” with each “word notation” of a reject word. Note that the reject words may include a reject vocabulary.

“Word notation” is the written form of the corresponding reject word. “Word reading” is the corresponding reject word written in hiragana. “Phoneme sequence” is the corresponding reject word expressed as phonemes, and is used when calculating the similarity to the feature amount of the input speech signal. “Third threshold” is the threshold for the corresponding reject word; it is called the third threshold T3 to distinguish it from the first threshold T1 and the second threshold T2 described above. The third threshold T3 is larger than the first threshold T1. It is determined experimentally and set in advance so that, when the similarity exceeds the third threshold T3, it is highly likely that the corresponding reject word was uttered.

Note that the third threshold T3 need not be set for each reject word; it may instead be a single threshold common to all reject words.
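The two dictionaries described above can be modeled as simple records. The following is a sketch only: FIGS. 2 and 3 are not reproduced here, so the class names, field names, phoneme spellings, and threshold values are all hypothetical. Only the words “Okayama”, “Wakayama”, and “Toyama” come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecognitionEntry:
    notation: str              # "word notation"
    reading: str               # "word reading" in hiragana
    phonemes: tuple[str, ...]  # "phoneme sequence"


@dataclass(frozen=True)
class RejectEntry:
    notation: str
    reading: str
    phonemes: tuple[str, ...]
    # Per-word third threshold T3; None means "use the common T3".
    third_threshold: Optional[float] = None


COMMON_T3 = 80.0  # hypothetical common third threshold (0-100 scale)


def third_threshold_for(entry: RejectEntry, common_t3: float = COMMON_T3) -> float:
    """Return the per-word T3 if one is set, otherwise the common T3."""
    return common_t3 if entry.third_threshold is None else entry.third_threshold


# Hypothetical dictionary contents.
recognition_dictionary = [
    RecognitionEntry("岡山", "おかやま", tuple("okayama")),
]
reject_dictionary = [
    RejectEntry("和歌山", "わかやま", tuple("wakayama"), third_threshold=85.0),
    RejectEntry("富山", "とやま", tuple("toyama")),
]

for entry in reject_dictionary:
    print(entry.notation, third_threshold_for(entry))
```

Keeping the per-word threshold optional lets the same structure cover both variants mentioned in the text: a threshold per reject word, or one value shared by all of them.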

The acoustic model storage unit 23 stores, for each phoneme, data that statistically models the tendency of the feature amounts. Examples of usable acoustic models include, but are not limited to, a hidden Markov model (HMM) and dynamic programming (DP) matching.

The output unit 30 includes, for example, an input/output interface, and outputs the recognition result produced by the collation unit 42 (described in detail later). For example, when an information providing apparatus (not shown) is connected, information based on the recognition result is provided to the user. The recognition result may also be passed to a higher-level program or output to the CPU as a command.

The control unit 40 includes, for example, a CPU, and executes the operation program stored in the program area of the storage unit 20 to realize the functions of an analysis unit 41 and a collation unit 42, as shown in FIG. 1A. The control unit 40 also executes the operation program to perform control processing for the entire speech recognition apparatus 1 and processing such as the speech recognition process described in detail later.

The analysis unit 41 analyzes the input signal to detect the speech section, then performs acoustic analysis on the speech signal, which is the signal in the speech section, to calculate the feature amount of the speech signal. The analysis unit 41 then outputs the calculated feature amount of the speech signal to the collation unit 42.

The feature amount may be, but is not limited to, Mel-frequency cepstral coefficients (MFCC), a linear predictive coding (LPC) cepstrum, or a power spectrum.

Based on the input feature amount, the collation unit 42 calculates the similarity between the input speech and each recognition word stored in the recognition word dictionary 21. Likewise, based on the input feature amount, the collation unit 42 calculates the similarity between the input speech and each reject word stored in the reject word dictionary 22.

More specifically, the collation unit 42 compares the acoustic models stored in the acoustic model storage unit 23 with the input feature amount and extracts the phoneme string corresponding to the input speech. The collation unit 42 then calculates a similarity by comparing the extracted phoneme string with the phoneme sequence of each recognition word, and likewise by comparing the extracted phoneme string with the phoneme sequence of each reject word. A conventionally used method can be used to calculate the similarity. In Embodiment 1, the similarity is normalized to a value between 0 and 100.
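The description leaves the similarity calculation to conventionally used methods and states only that the result is normalized to 0 to 100. As a stand-in for illustration, the sketch below scores two phoneme sequences with a normalized Levenshtein edit distance. This choice of measure, the function names, and the phoneme spellings are assumptions, not the method of the patent.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 100]; 100 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


# Hypothetical phoneme strings.
extracted = list("okayama")
print(similarity(extracted, list("okayama")))
print(similarity(extracted, list("wakayama")))
print(similarity(extracted, list("toyama")))
```

A real system would more likely derive the score from acoustic likelihoods (for example, HMM scores), but any measure mapped onto a common 0 to 100 scale lets the thresholds T1 and T3 be compared against recognition-word and reject-word scores alike.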

The collation unit 42 then specifies the reject word with the highest similarity and determines whether the similarity of the specified reject word exceeds the corresponding third threshold T3. When it determines that the similarity of the specified reject word exceeds the third threshold T3, the collation unit 42 rejects the speech signal being collated. When it determines that the similarity of the specified reject word is at or below the third threshold T3, the collation unit 42 specifies the recognition word with the highest similarity.

The collation unit 42 then determines whether the similarity of the specified recognition word exceeds the first threshold T1. When it determines that the similarity of the specified recognition word is at or below the first threshold T1, the collation unit 42 rejects the speech signal being collated. When it determines that the similarity of the specified recognition word exceeds the first threshold T1, the collation unit 42 further determines whether the similarity of the specified recognition word exceeds the similarity of the specified reject word.

When it determines that the similarity of the specified recognition word exceeds the similarity of the specified reject word, the collation unit 42 outputs the specified recognition word as the recognition result via the output unit 30. When it determines that the similarity of the specified recognition word is at or below the similarity of the specified reject word, the collation unit 42 rejects the speech signal being collated.
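The three checks performed by the collation unit 42 in Embodiment 1 can be sketched as follows. The function name, the use of plain dictionaries for the similarity scores, and the example values are assumptions made for illustration; both score tables are assumed to be non-empty.

```python
from typing import Mapping, Optional


def collate(rec_sims: Mapping[str, float], rej_sims: Mapping[str, float],
            t1: float, t3: Mapping[str, float]) -> Optional[str]:
    """Return the recognized word, or None if the input speech is rejected.

    rec_sims -- similarity of the input speech to each recognition word
    rej_sims -- similarity of the input speech to each reject word
    t1       -- first threshold
    t3       -- third threshold for each reject word
    """
    # Check 1: the most similar reject word must not exceed its T3.
    rej_word = max(rej_sims, key=rej_sims.get)
    rej_sim = rej_sims[rej_word]
    if rej_sim > t3[rej_word]:
        return None

    # Check 2: the most similar recognition word must exceed T1.
    rec_word = max(rec_sims, key=rec_sims.get)
    rec_sim = rec_sims[rec_word]
    if rec_sim <= t1:
        return None

    # Check 3: the recognition word must beat the reject word.
    return rec_word if rec_sim > rej_sim else None


# Hypothetical scores on a 0-100 scale.
t3 = {"Wakayama": 80.0, "Toyama": 80.0}
print(collate({"Okayama": 85.0}, {"Wakayama": 40.0, "Toyama": 60.0}, t1=50.0, t3=t3))
print(collate({"Okayama": 90.0}, {"Wakayama": 85.0, "Toyama": 30.0}, t1=50.0, t3=t3))
```

Checking the reject word first means an utterance that closely matches a reject word is discarded before any recognition word is considered, even if a recognition word scores higher.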

Next, the processing of the collation unit 42 in Embodiment 1 is described further using a specific example, with reference to FIG. 4. FIG. 4 shows the range in which the recognition word “Okayama” is detected in Embodiment 1. The example of FIG. 4 is one in which the input speech is compared with the recognition word “Okayama” and the third threshold T3 is common to all reject words.

Referring to FIG. 4, suppose the user utters, for example, “△▲yama” (including the case where ambient noise makes the utterance sound that way). The similarity SC1 between “△▲yama” and the recognition word “Okayama” is smaller than the similarity SC2 between “△▲yama” and the reject word “Wakayama” (in this example, the similarity SC2 of the reject word “Wakayama” is the highest among the reject words). That is, “△▲yama” lies to the left of the dotted line L1, in the range where the similarity SC1 of the recognition word “Okayama” is at or below the similarity SC2 of the reject word “Wakayama”. In this case, therefore, the collation unit 42 rejects “△▲yama”.

Suppose instead that the user utters, for example, “×kayama”. The similarity SC1 between “×kayama” and the recognition word “Okayama” is larger than the similarity SC2 between “×kayama” and the rejection word “Wakayama” (in this example, SC2 of “Wakayama” is the highest among the rejection words). However, as shown in FIG. 4, SC2 of “Wakayama” exceeds the third threshold T3. In other words, “×kayama” lies between dotted lines L1 and L2, in the range where SC1 of “Okayama” is larger than SC2 of “Wakayama” but SC2 exceeds the third threshold T3. In this case, therefore, the collation unit 42 rejects “×kayama”.

Suppose the user utters, for example, “○yama”. The similarity SC1 between “○yama” and the recognition word “Okayama” is smaller than the similarity SC3 between “○yama” and the rejection word “Toyama” (in this example, SC3 of “Toyama” is the highest among the rejection words). In other words, “○yama” lies to the right of dotted line L3, in the range where SC1 of “Okayama” is equal to or less than SC3 of “Toyama”. In this case, therefore, the collation unit 42 rejects “○yama”.

Suppose the user utters, for example, “□kayama”. As shown in FIG. 4, the similarity SC3 between “□kayama” and the rejection word “Toyama” (in this example, SC3 of “Toyama” is the highest among the rejection words) is equal to or less than the third threshold T3. In addition, as shown in FIG. 4, the similarity SC1 between “□kayama” and the recognition word “Okayama” is larger than both SC3 of “Toyama” and the first threshold T1. In other words, “□kayama” lies between dotted lines L2 and L3, in the range where the similarity of the most similar rejection word (SC2 or SC3) does not exceed the third threshold T3 and SC1 of “Okayama” is larger than both that rejection-word similarity and the first threshold T1. In this specific example, the range between dotted lines L2 and L3 is the detection range BD2 of the recognition word “Okayama”. In this case, therefore, the collation unit 42 outputs the recognition word “Okayama” as the recognition result.

Next, the flow of the speech recognition processing in Embodiment 1 is described with reference to FIGS. 5 and 6. FIGS. 5 and 6 show, respectively, one part and the remaining part of an example flowchart of the speech recognition processing in Embodiment 1. The processing is triggered when the input unit 10 outputs a received input signal to the control unit 40.

The analysis unit 41 analyzes the input signal and detects a speech signal (step S001). The analysis unit 41 then calculates the feature amount of the detected speech signal and outputs it to the collation unit 42 (step S002).

Based on the input feature amount, the collation unit 42 calculates the similarity between the speech signal to be collated and each rejection word (step S003) and identifies the rejection word with the highest similarity (step S004). The collation unit 42 then determines whether the similarity of the identified rejection word exceeds the corresponding third threshold T3 (step S005).

When it determines that the similarity of the identified rejection word exceeds the third threshold T3 (step S005; YES), the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends.

On the other hand, when it determines that the similarity of the identified rejection word is equal to or less than the third threshold T3 (step S005; NO), the collation unit 42 calculates, based on the input feature amount, the similarity between the speech signal to be collated and each recognition word (step S007) and identifies the recognition word with the highest similarity (step S008). The collation unit 42 then determines whether the similarity of the identified recognition word exceeds the first threshold T1 (step S009).

When it determines that the similarity of the identified recognition word is equal to or less than the first threshold T1 (step S009; NO), the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends. On the other hand, when it determines that the similarity of the identified recognition word exceeds the first threshold T1 (step S009; YES), the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the similarity of the identified rejection word (step S010).

When it determines that the similarity of the identified recognition word is equal to or less than the similarity of the identified rejection word (step S010; NO), the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends. On the other hand, when it determines that the similarity of the identified recognition word exceeds the similarity of the identified rejection word (step S010; YES), the collation unit 42 outputs the identified recognition word as the recognition result via the output unit 30 (step S011). The processing then ends.
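The decision flow of steps S003 to S011 can be sketched as follows. This is a minimal illustration, not the implementation of the apparatus: the function name `recognize`, the dictionary-based inputs, and all score values are assumptions introduced here, and the similarities are taken as already computed (in the flowchart, the recognition-word similarities are calculated only after step S005 results in NO).

```python
from typing import Dict, Optional


def recognize(rec_scores: Dict[str, float],
              rej_scores: Dict[str, float],
              t1: float,
              t3: Dict[str, float]) -> Optional[str]:
    """Return the recognition word to output, or None when the input is rejected.

    rec_scores / rej_scores map each recognition / rejection word to its
    similarity with the input speech (hypothetical representation).
    t3 maps each rejection word to its third threshold T3.
    """
    # S003-S004: identify the rejection word with the highest similarity.
    rej_word = max(rej_scores, key=rej_scores.get)
    # S005 -> S006: reject when that similarity exceeds its third threshold.
    if rej_scores[rej_word] > t3[rej_word]:
        return None
    # S007-S008: identify the recognition word with the highest similarity.
    rec_word = max(rec_scores, key=rec_scores.get)
    # S009 -> S006: reject when it does not exceed the first threshold T1.
    if rec_scores[rec_word] <= t1:
        return None
    # S010 -> S006: reject when it does not exceed the rejection-word similarity.
    if rec_scores[rec_word] <= rej_scores[rej_word]:
        return None
    # S011: output the recognition word.
    return rec_word


# Hypothetical scores, with a common third threshold of 88 and T1 = 80.
T3 = {"Wakayama": 88.0, "Toyama": 88.0}
accepted = recognize({"Okayama": 90.0}, {"Wakayama": 70.0, "Toyama": 85.0}, 80.0, T3)
rejected = recognize({"Okayama": 95.0}, {"Wakayama": 92.0, "Toyama": 60.0}, 80.0, T3)
```

Under these assumed scores, the first call corresponds to the “□kayama” case (output) and the second to the “×kayama” case (rejected because the rejection-word similarity exceeds T3).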

Next, the effects of Embodiment 1 are described with reference to FIG. 7. FIG. 7A shows the recall and precision for each value of the first threshold T1 in prior art 1, and FIG. 7B shows the recall and precision for each value of the third threshold T3 in Embodiment 1. The recall and precision values are based on evaluation results for speech data of 300 words and about one hour in length.
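For reference, recall and precision are assumed here to follow their standard definitions: recall is the fraction of utterances that should be detected and actually are, and precision is the fraction of detections that are correct. The sketch below uses hypothetical counts that are not taken from the evaluation described in this document.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the items that should be detected that were detected."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the detected items that were correct detections."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts: 84 correct detections, 16 misses, 17 false detections.
r = recall(84, 16)
p = precision(84, 17)
```

Raising a rejection threshold typically removes false detections (raising precision) at the risk of also removing correct detections (lowering recall), which is the trade-off examined in FIG. 7.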

In prior art 1, when the first threshold T1 is raised from the reference value (T1 = 80), as shown in FIG. 7A, the recall is 84% and the precision is 83% at T1 = 88, and the recall is 81% and the precision is 86% at T1 = 90. That is, in prior art 1, raising the first threshold T1 from the reference value (T1 = 80) improves precision but degrades recall.

In contrast, with the method of Embodiment 1, the recall is 85% and the precision is 83% at a third threshold of T3 = 95, and the recall is 84% and the precision is 87% at T3 = 94. That is, adjusting the third threshold T3 makes it possible to improve precision while largely maintaining recall. In evaluating the method of Embodiment 1, the first threshold T1 was fixed at the reference value (T1 = 80).

According to Embodiment 1 above, the speech recognition apparatus 1 calculates the feature amount of the input speech, calculates the similarity between the input speech and each rejection word based on that feature amount, and identifies the rejection word with the highest similarity. The speech recognition apparatus 1 then rejects the input speech when the similarity of the identified rejection word exceeds the corresponding third threshold T3. Consequently, when many false detections occur in the range where the similarity of the most similar recognition word exceeds the first threshold T1 because the similarity of the most similar rejection word also exceeds the first threshold T1, adjusting the third threshold T3 can keep falsely detected rejection words from being detected. In other words, when false detections arise from a high correlation between the recognition word with the highest similarity and the rejection word with the highest similarity, adjusting the third threshold T3 can prevent them. It is therefore possible to improve precision while maintaining recall.

Further, according to Embodiment 1 above, the speech recognition apparatus 1 stores a third threshold T3 for each rejection word. This makes it possible to determine experimentally, for each rejection word, the value above which that rejection word is likely to have been uttered, and to set that value as the third threshold T3. Fine adjustment per rejection word thus becomes possible, and detection accuracy can be improved further.

(Embodiment 2)
In Embodiment 1, the third threshold T3 is set in advance. Embodiment 2 describes an example in which the third threshold T3 is calculated.

FIG. 8 is a functional block diagram showing a configuration example of the speech recognition apparatus 1 in Embodiment 2. The basic configuration of the speech recognition apparatus 1 in Embodiment 2 is the same as in Embodiment 1. However, the configuration of the reject word dictionary 22 differs slightly from that in Embodiment 1. In addition, as shown in FIG. 8, Embodiment 2 differs from Embodiment 1 in that the control unit 40 further includes a threshold calculation unit 43.

FIG. 9 shows a configuration example of the reject word dictionary 22 in Embodiment 2. As shown in FIG. 9, the reject word dictionary 22 in Embodiment 2 differs from that in Embodiment 1 in that it does not store a “third threshold”. This is because, as described above, the threshold calculation unit 43 calculates the third threshold T3 in Embodiment 2.

Returning to FIG. 8, the control unit 40 includes, for example, a CPU, and executes an operation program stored in the program area of the storage unit 20 to realize the functions of the analysis unit 41, the collation unit 42, and the threshold calculation unit 43, as shown in FIG. 8. The control unit 40 also executes the operation program to perform control processing for the speech recognition apparatus 1 as a whole and processing such as the speech recognition processing described in detail later.

The threshold calculation unit 43 calculates the third threshold T3 of the rejection word with the highest similarity. More specifically, the threshold calculation unit 43 calculates the inter-word similarity between the rejection word with the highest similarity and the recognition word with the highest similarity. The inter-word similarity can be determined based on, for example, the edit distance between the “word readings” or the “phoneme sequences” of the two words, although it is not limited to these. When the inter-word similarity is determined based on edit distance, it is set so that a shorter edit distance gives a higher inter-word similarity.

For example, assume that the inter-word similarity is determined based on the edit distance between “word readings”, and that the recognition word and the rejection word with the highest similarity are “Okayama” and “Toyama”, respectively. The reading of the recognition word “Okayama” is “おかやま” (o-ka-ya-ma) and the reading of the rejection word “Toyama” is “とやま” (to-ya-ma), so editing “とやま” into “おかやま” requires adding “お” and then replacing “と” with “か”. The edit distance in this case is therefore 2. On the other hand, if the rejection word with the highest similarity is “Wakayama”, its reading is “わかやま” (wa-ka-ya-ma), so editing “わかやま” into “おかやま” only requires replacing “わ” with “お”. The edit distance in this case is therefore 1.

Therefore, the inter-word similarity between the recognition word “Okayama” and the rejection word “Toyama” is smaller than the inter-word similarity between the recognition word “Okayama” and the rejection word “Wakayama”.
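The edit distances in the example above can be checked with a standard Levenshtein distance over the kana readings. This is a sketch under the assumption that each kana character counts as one unit and that insertion, deletion, and substitution each cost 1; the document does not specify the exact edit-distance variant, and the function name `edit_distance` is introduced here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


d_toyama = edit_distance("とやま", "おかやま")      # Toyama -> Okayama
d_wakayama = edit_distance("わかやま", "おかやま")  # Wakayama -> Okayama
```

Here `d_toyama` is 2 and `d_wakayama` is 1, matching the values stated in the text, so “Wakayama” has the higher inter-word similarity to “Okayama”.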

Then, as shown in FIG. 10, the threshold calculation unit 43 calculates the third threshold T3 based on a function F defined so that the higher the inter-word similarity, the higher the value of the third threshold T3. The reason is as follows. When the inter-word similarity is high, an utterance that yields a high similarity for the recognition word also tends to yield a high similarity for that rejection word, so a low third threshold T3 would suppress detection too much. Conversely, when the inter-word similarity is low, the similarity of the rejection word is unlikely to become high for an utterance that yields a high similarity for the recognition word, so setting the third threshold T3 somewhat low does not cause excessive suppression. As shown in FIG. 10, the third threshold T3 is kept at or above a predetermined value because making it too low would make the detection range BD2 too narrow, that is, would suppress detection too much. FIG. 10 is a diagram for explaining the third threshold T3 (function F) in Embodiment 2.

For example, the function F may be the one given by Expression 1 below. T3_MIN in Expression 1 is the minimum value of the third threshold T3, set so that suppression does not become excessive.
Third threshold T3 = MAX(100 - edit distance × 2, T3_MIN) ... (Expression 1)
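Expression 1 can be written directly as a small function. The function name `third_threshold` and the value T3_MIN = 90 used below are assumptions chosen only for illustration; the document does not give a concrete T3_MIN.

```python
def third_threshold(edit_distance: int, t3_min: float) -> float:
    """Expression 1: T3 = MAX(100 - edit distance x 2, T3_MIN)."""
    return max(100 - edit_distance * 2, t3_min)


T3_MIN = 90.0  # hypothetical floor, not a value from the document

t3_wakayama = third_threshold(1, T3_MIN)  # edit distance 1 (Wakayama vs. Okayama)
t3_toyama = third_threshold(2, T3_MIN)    # edit distance 2 (Toyama vs. Okayama)
t3_distant = third_threshold(10, T3_MIN)  # a distant word is clamped to T3_MIN
```

With these assumed inputs the thresholds are 98, 96, and 90 respectively: a shorter edit distance (higher inter-word similarity) gives a higher T3, and the floor keeps T3 from dropping below T3_MIN.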

T3_MIN may be defined separately for each combination of recognition word and rejection word for which the inter-word similarity is calculated, or it may be common to all combinations. For example, T3_MIN may be set lower as the inter-word similarity becomes lower.

Next, the processing of the collation unit 42 and the threshold calculation unit 43 in Embodiment 2 is described with reference to FIG. 11, using a specific example. FIG. 11 shows the range in which the recognition word “Okayama” is detected in Embodiment 2. The example in FIG. 11 compares the input speech with the recognition word “Okayama”. In FIG. 11, T3-W1 is the third threshold when the combination of recognition word and rejection word for which the inter-word similarity is calculated is “Okayama” and “Wakayama”, and T3-W2 is the third threshold when the combination is “Okayama” and “Toyama”.

In this specific example, as described above, the combination of the recognition word “Okayama” and the rejection word “Wakayama” has a higher inter-word similarity than the combination of the recognition word “Okayama” and the rejection word “Toyama”. Therefore, as shown in FIG. 11, among the third thresholds T3 calculated by the threshold calculation unit 43, the third threshold T3-W1 for the rejection word “Wakayama” is higher than the third threshold T3-W2 for the rejection word “Toyama”.

Referring to FIG. 11, suppose the user utters, for example, “△▲yama” (including the case where ambient noise makes the utterance sound that way). The similarity SC1 between “△▲yama” and the recognition word “Okayama” is then smaller than the similarity SC2 between “△▲yama” and the rejection word “Wakayama” (in this example, SC2 of “Wakayama” is the highest among the rejection words). In other words, “△▲yama” lies to the left of dotted line L1, in the range where SC1 of “Okayama” is equal to or less than SC2 of “Wakayama”. In this case, therefore, the collation unit 42 rejects “△▲yama”.

Suppose the user utters, for example, “×kayama”. The similarity SC1 between “×kayama” and the recognition word “Okayama” is larger than the similarity SC2 between “×kayama” and the rejection word “Wakayama” (in this example, SC2 of “Wakayama” is the highest among the rejection words). However, as shown in FIG. 11, SC2 of “Wakayama” exceeds the third threshold T3-W1. In other words, “×kayama” lies between dotted lines L1 and L2, in the range where SC1 of “Okayama” is larger than SC2 of “Wakayama” but SC2 exceeds the third threshold T3-W1. In this case, therefore, the collation unit 42 rejects “×kayama”.

Suppose the user utters, for example, “□kayama”. The similarity SC1 between “□kayama” and the recognition word “Okayama” is larger than the similarity SC3 between “□kayama” and the rejection word “Toyama” (in this example, SC3 of “Toyama” is the highest among the rejection words). However, as shown in FIG. 11, SC3 of the rejection word “Toyama” exceeds the third threshold T3-W2. In other words, “□kayama” lies between dotted lines L4 and L3, in the range where SC1 of “Okayama” is larger than SC3 of “Toyama” but SC3 exceeds the third threshold T3-W2. In this case, therefore, the collation unit 42 rejects “□kayama”.

Suppose the user utters, for example, “■kayama”. As shown in FIG. 11, the similarity SC3 between “■kayama” and the rejection word “Toyama” (in this example, SC3 of “Toyama” is the highest among the rejection words) is equal to or less than the third threshold T3-W2. In addition, as shown in FIG. 11, the similarity SC1 between “■kayama” and the recognition word “Okayama” is larger than both SC3 of “Toyama” and the first threshold T1. In other words, “■kayama” lies between dotted lines L2 and L4, in the range where the similarity of the most similar rejection word (SC2 or SC3) does not exceed its third threshold (T3-W1 or T3-W2) and SC1 of “Okayama” is larger than both that rejection-word similarity and the first threshold T1. In this specific example, the range between dotted lines L2 and L4 is the detection range BD2 of the recognition word “Okayama”. In this case, therefore, the collation unit 42 outputs the recognition word “Okayama” as the recognition result.

Setting the third threshold T3 of a rejection word according to the inter-word similarity in this way makes it possible to bring the detection range BD2 closer to the ideal range BD1.

Next, the flow of the speech recognition processing in Embodiment 2 is described with reference to FIG. 12. FIG. 12 is a part of an example flowchart of the speech recognition processing in Embodiment 2. The processing is triggered when the input unit 10 outputs a received input signal to the control unit 40. The processing after step S109 is the same as in Embodiment 1.

The analysis unit 41 analyzes the input signal and detects a speech signal (step S101). The analysis unit 41 then calculates the feature amount of the detected speech signal and outputs it to the collation unit 42 (step S102).

Based on the input feature amount, the collation unit 42 calculates the similarity between the speech signal to be collated and each rejection word (step S103) and identifies the rejection word with the highest similarity (step S104). The collation unit 42 then calculates, based on the input feature amount, the similarity between the speech signal to be collated and each recognition word (step S105) and identifies the recognition word with the highest similarity (step S106).

The threshold calculation unit 43 then calculates the inter-word similarity between the identified rejection word and the identified recognition word (step S107) and, based on the calculated inter-word similarity, calculates the third threshold T3 of the identified rejection word (step S108).

The collation unit 42 then determines whether the similarity of the identified rejection word exceeds the calculated third threshold T3 (step S109). When it determines that the similarity of the identified rejection word exceeds the calculated third threshold T3 (step S109; YES), the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends.

On the other hand, when it determines that the similarity of the identified rejection word is equal to or less than the calculated third threshold T3 (step S109; NO), the processing proceeds to step S009 and the subsequent steps described in Embodiment 1, and the collation unit 42 determines whether the similarity of the identified recognition word exceeds the first threshold T1 (step S009).
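The Embodiment 2 flow (steps S103 to S109, followed by steps S009 to S011 of Embodiment 1) can be sketched as below. The function name, the callable `t3_for_pair` standing in for the threshold calculation unit 43, and all numeric values are assumptions for illustration; similarities are again taken as already computed.

```python
from typing import Callable, Dict, Optional


def recognize_with_computed_t3(
        rec_scores: Dict[str, float],
        rej_scores: Dict[str, float],
        t1: float,
        t3_for_pair: Callable[[str, str], float]) -> Optional[str]:
    """Return the recognition word to output, or None when the input is rejected."""
    # S103-S104: rejection word with the highest similarity.
    rej_word = max(rej_scores, key=rej_scores.get)
    # S105-S106: recognition word with the highest similarity.
    rec_word = max(rec_scores, key=rec_scores.get)
    # S107-S108: third threshold computed for this word pair.
    t3 = t3_for_pair(rec_word, rej_word)
    # S109 -> S006: reject when the rejection-word similarity exceeds T3.
    if rej_scores[rej_word] > t3:
        return None
    # S009 -> S006: reject when the first threshold T1 is not exceeded.
    if rec_scores[rec_word] <= t1:
        return None
    # S010 -> S006: reject when the rejection-word similarity is not exceeded.
    if rec_scores[rec_word] <= rej_scores[rej_word]:
        return None
    # S011: output the recognition word.
    return rec_word


# Hypothetical pair-specific thresholds (higher for the closer word pair).
PAIR_T3 = {("Okayama", "Wakayama"): 98.0, ("Okayama", "Toyama"): 96.0}


def lookup_t3(rec_word: str, rej_word: str) -> float:
    return PAIR_T3[(rec_word, rej_word)]


near_wakayama = recognize_with_computed_t3(
    {"Okayama": 99.0}, {"Wakayama": 97.0, "Toyama": 50.0}, 80.0, lookup_t3)
near_toyama = recognize_with_computed_t3(
    {"Okayama": 99.0}, {"Wakayama": 50.0, "Toyama": 97.0}, 80.0, lookup_t3)
```

With these assumed values, the same rejection-word similarity of 97 is tolerated when the closest rejection word is “Wakayama” (T3 = 98) but causes rejection when it is “Toyama” (T3 = 96), which mirrors the asymmetric boundaries L2 and L4 in FIG. 11.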

According to Embodiment 2 above, the speech recognition apparatus 1 calculates the inter-word similarity (the similarity between the words themselves) between the recognition word with the highest similarity and the rejection word with the highest similarity, and sets the third threshold T3 of that rejection word higher as the calculated inter-word similarity becomes higher. Because the third threshold T3 can thus be adjusted according to the correlation between the most similar rejection word and the most similar recognition word, excessive suppression can be prevented. Accordingly, recall can be kept from degrading too much when precision is improved.

Further, according to Embodiment 2 above, the speech recognition apparatus 1 sets the third threshold T3 to a predetermined value or higher. Even when the inter-word similarity between the recognition word with the highest similarity and the rejection word with the highest similarity is very low, keeping the third threshold T3 at or above the predetermined value makes it possible to prevent excessive suppression.

(Embodiment 3)
Embodiment 3 describes an example in which the method described in Embodiment 1 is applied to prior art 2. The method described in Embodiment 2 can, of course, also be applied to prior art 2.

The basic configuration of the speech recognition apparatus 1 in Embodiment 3 is the same as in Embodiment 1. However, Embodiment 3 differs from Embodiment 1 in that the second threshold T2 corresponding to each recognition word, described above, is additionally stored in the data area of the storage unit 20. The function performed by the collation unit 42 also differs slightly from that in Embodiment 1. The second threshold T2 stored in the data area may be a separate value for each recognition word or a single value common to all recognition words.

The collation unit 42 of Embodiment 3 compares the acoustic model stored in the acoustic model storage unit 23 with the input feature amount and extracts the phoneme string corresponding to the input speech. The collation unit 42 then calculates similarities by comparing the extracted phoneme string with the phoneme sequence of each recognition word, and likewise by comparing the extracted phoneme string with the phoneme sequence of each rejection word. A conventionally used method can be used to calculate the similarity. In Embodiment 3, the similarity is normalized to a value between 0 and 100.

The collation unit 42 then determines whether the similarity of the identified recognition word exceeds the corresponding second threshold T2. If it determines that the similarity exceeds the corresponding second threshold T2, the collation unit 42 outputs the identified recognition word as the recognition result via the output unit 30. If it instead determines that the similarity is equal to or less than the corresponding second threshold T2, the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the first threshold T1. If it determines that the similarity is equal to or less than the first threshold T1, the collation unit 42 rejects the speech signal to be collated. If it determines that the similarity exceeds the first threshold T1, the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the similarity of the identified rejection word.

Next, the processing of the collation unit 42 in the third embodiment is described further with reference to FIG. 13, using a specific example. FIG. 13 is a diagram showing the range in which the recognition word “Okayama” is detected in the third embodiment. The example of FIG. 13 is a case in which the input speech is compared with the recognition word “Okayama” and the third threshold T3 is common to all rejection words.

Referring to FIG. 13, suppose the user utters, for example, “△▲yama” (including the case where ambient noise causes the utterance to be heard that way). The similarity SC1 between “△▲yama” and the recognition word “Okayama” is then smaller than the similarity SC2 between “△▲yama” and the rejection word “Wakayama” (in this example, the similarity SC2 of the rejection word “Wakayama” is the highest among the rejection words). In other words, “△▲yama” lies to the left of the dotted line L1, in the range where the similarity SC1 of the recognition word “Okayama” is equal to or less than the similarity SC2 of the rejection word “Wakayama”. In this case, therefore, the collation unit 42 rejects “△▲yama”. In related art 2, by contrast, the similarity of the recognition word with the highest similarity (the similarity SC1 of the recognition word “Okayama”) exceeds the second threshold T2 in the range between the dotted line L5 and the dotted line L1, to which “△▲yama” belongs, so the recognition word “Okayama” is output as the recognition result.

Suppose instead that the user utters, for example, “×kayama”. The similarity SC1 between “×kayama” and the recognition word “Okayama” is greater than the similarity SC2 between “×kayama” and the rejection word “Wakayama” (in this example, the similarity SC2 of the rejection word “Wakayama” is the highest among the rejection words). As shown in FIG. 13, however, the similarity SC2 of the rejection word “Wakayama” exceeds the third threshold T3. In other words, “×kayama” lies in the range between the dotted line L1 and the dotted line L2, where the similarity SC1 of the recognition word “Okayama” is greater than the similarity SC2 of the rejection word “Wakayama” but the similarity SC2 exceeds the third threshold T3. In this case, therefore, the collation unit 42 rejects “×kayama”. In related art 2, by contrast, the similarity of the recognition word with the highest similarity (the similarity SC1 of the recognition word “Okayama”) exceeds the second threshold T2 in the range between the dotted line L1 and the dotted line L2, to which “×kayama” belongs, so the recognition word “Okayama” is output as the recognition result.

Suppose further that the user utters, for example, “□kayama”. As shown in FIG. 13, the similarity SC3 between “□kayama” and the rejection word “Toyama” (in this example, the similarity SC3 of the rejection word “Toyama” is the highest among the rejection words) is equal to or less than the third threshold T3. The similarity SC1 between “□kayama” and the recognition word “Okayama” is, as shown in FIG. 13, equal to or less than the second threshold T2. However, the similarity SC1 of the recognition word “Okayama” is greater than both the similarity SC3 of the rejection word “Toyama” and the first threshold T1, as shown in FIG. 13. In other words, “□kayama” lies in the range between the dotted line L2 and the dotted line L3. This is the range in which the similarity of the rejection word with the highest similarity (SC2 or SC3) does not exceed the third threshold T3 and either the similarity SC1 of the recognition word “Okayama”, which has the highest similarity, exceeds the second threshold T2, or the similarity SC1 is equal to or less than the second threshold T2 but is greater than both the similarity of the rejection word with the highest similarity (SC2 or SC3) and the first threshold T1. In this specific example, the range between the dotted line L2 and the dotted line L3 is the detection range BD2 of the recognition word “Okayama”. In this case, therefore, the collation unit 42 outputs the recognition word “Okayama” as the recognition result.

By applying the method described in the first embodiment to related art 2 in this way, the range between the dotted line L5 and the dotted line L2 can be removed from the detection range BD2, which brings the detection range BD2 closer to the ideal range BD1 than in related art 2.

Next, the flow of speech recognition processing in the third embodiment is described with reference to FIG. 14. FIG. 14 is part of an example flowchart illustrating the flow of speech recognition processing in the third embodiment. This speech recognition processing is triggered when the input unit 10 outputs a received input signal to the control unit 40. The processing that follows a “NO” determination in step S205 and the processing that follows a “NO” determination in step S208 are the same as the processing described in the first embodiment.

The analysis unit 41 analyzes the input signal and detects a speech signal (step S201). The analysis unit 41 then calculates the feature amount of the detected speech signal and outputs the calculated feature amount to the collation unit 42 (step S202).

Based on the input feature amount, the collation unit 42 calculates the similarity between the speech signal to be collated and each rejection word (step S203), and identifies the rejection word with the highest similarity (step S204). The collation unit 42 then determines whether the similarity of the identified rejection word exceeds the corresponding third threshold T3 (step S205).

If it is determined that the similarity of the identified rejection word exceeds the third threshold T3 (step S205; YES), the processing proceeds to step S006 described in the first embodiment: the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends.

If it is instead determined that the similarity of the identified rejection word is equal to or less than the third threshold T3 (step S205; NO), the collation unit 42 calculates, based on the input feature amount, the similarity between the speech signal to be collated and each recognition word (step S206), and identifies the recognition word with the highest similarity (step S207). The collation unit 42 then determines whether the similarity of the identified recognition word exceeds the corresponding second threshold T2 (step S208).

If it is determined that the similarity of the identified recognition word exceeds the corresponding second threshold T2 (step S208; YES), the collation unit 42 outputs the identified recognition word as the recognition result via the output unit 30 (step S209), and the processing ends. If it is instead determined that the similarity of the identified recognition word is equal to or less than the corresponding second threshold T2 (step S208; NO), the processing proceeds to step S009 described in the first embodiment, and the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the first threshold T1 (step S009).
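The decision order of steps S203 to S209 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary inputs (word to similarity, both non-empty), the use of a single common T2 and T3, and all numeric values are hypothetical. The steps from S009 onward follow the behavior described for the collation unit 42, in which the recognition word is output only if its similarity exceeds both the first threshold T1 and the similarity of the identified rejection word.

```python
from typing import Dict, Optional


def decide_embodiment3(rec_scores: Dict[str, float],
                       rej_scores: Dict[str, float],
                       t1: float, t2: float, t3: float) -> Optional[str]:
    """Return the recognized word, or None when the input is rejected."""
    # S203-S204: rejection word with the highest similarity.
    _, rej_sim = max(rej_scores.items(), key=lambda kv: kv[1])
    # S205: a strong match to a rejection word rejects the input first.
    if rej_sim > t3:
        return None  # S006
    # S206-S207: recognition word with the highest similarity.
    rec_word, rec_sim = max(rec_scores.items(), key=lambda kv: kv[1])
    # S208: a sufficiently strong recognition word is output directly.
    if rec_sim > t2:
        return rec_word  # S209
    # S009 onward (first embodiment): compare with T1, then with the rejection word.
    if rec_sim <= t1:
        return None
    return rec_word if rec_sim > rej_sim else None


# Hypothetical scores: the rejection word exceeds T3, so the input is rejected.
print(decide_embodiment3({"Okayama": 90.0}, {"Wakayama": 96.0},
                         t1=60.0, t2=80.0, t3=95.0))
```

Because the T3 check comes first in this ordering, an input that closely matches a rejection word is rejected even when the best recognition word would otherwise exceed T2.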

Next, the effects of the third embodiment are described with reference to FIG. 15. FIG. 15A shows the recall and precision for each value of the second threshold T2 in related art 2, and FIG. 15B shows the recall and precision for each value of the third threshold T3 in the third embodiment. The recall and precision values are based on evaluation results for approximately one hour of speech data with 300 words.
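For reference, recall and precision are conventionally defined as below. The symbols TP (keywords correctly detected), FN (keywords that were uttered but not detected), and FP (detections that were not actually the keyword) are introduced here only for illustration and do not appear in this description.

```latex
\[
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
\]
```

Under these definitions, rejecting more inputs tends to reduce FP and so raise precision, while any correct utterances rejected along the way raise FN and so lower recall.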

In related art 2, when the second threshold T2 is increased from the reference value (T2 = 80), the results are as shown in FIG. 15A: at T2 = 91, the recall is 89% and the precision is 77%; at T2 = 93, the recall is 87% and the precision is 79%; and at T2 = 95, the recall is 85% and the precision is 80%. That is, in related art 2, increasing the second threshold T2 from the reference value (T2 = 80) raises the precision, but the recall degrades.

In contrast, when the method of the first embodiment is applied to related art 2, the recall is 91% and the precision is 79% at a third threshold T3 = 95; the recall is 90% and the precision is 84% at T3 = 94; and the recall is 89% and the precision is 84% at T3 = 93. That is, by adjusting the third threshold T3, the precision can be improved while the recall is largely maintained. When the case in which the method of the first embodiment is applied to related art 2 (the third embodiment) was evaluated, the second threshold T2 was fixed at the reference value (T2 = 80).
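The reported figures can be tabulated for side-by-side comparison as in the sketch below. The recall and precision values are those stated above; the F1 score (the harmonic mean of recall and precision) is an added summary metric that this description does not use, and the variable and function names are illustrative assumptions.

```python
from typing import Dict, Tuple

# threshold value -> (recall, precision), as reported for FIG. 15A and FIG. 15B
RELATED_ART_2: Dict[int, Tuple[float, float]] = {   # keyed by T2
    91: (0.89, 0.77), 93: (0.87, 0.79), 95: (0.85, 0.80),
}
EMBODIMENT_3: Dict[int, Tuple[float, float]] = {    # keyed by T3, with T2 fixed at 80
    95: (0.91, 0.79), 94: (0.90, 0.84), 93: (0.89, 0.84),
}


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (not a metric from the source)."""
    return 2.0 * recall * precision / (recall + precision)


def f1_by_threshold(results: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    return {th: f1(r, p) for th, (r, p) in results.items()}


for name, results in (("related art 2", RELATED_ART_2), ("embodiment 3", EMBODIMENT_3)):
    for th, score in sorted(f1_by_threshold(results).items()):
        print(f"{name}: threshold={th} F1={score:.3f}")
```

A single combined score is only one way to read the trade-off; which operating point is preferable depends on whether the application weights recall or precision more heavily.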

(Embodiment 4)
The fourth embodiment describes another example in which the method described in the first embodiment is applied to related art 2. In the speech recognition processing of the third embodiment, the determination of whether the similarity of the rejection word with the highest similarity exceeds the corresponding third threshold T3 was performed before the determination of whether the similarity of the recognition word with the highest similarity exceeds the second threshold T2.

In the fourth embodiment, the determination of whether the similarity of the rejection word with the highest similarity exceeds the corresponding third threshold T3 is performed after the determination of whether the similarity of the recognition word with the highest similarity exceeds the second threshold T2. This makes it possible to choose settings that place more emphasis on recall.

The basic configuration of the speech recognition apparatus 1 in the fourth embodiment is the same as in the third embodiment. However, the function performed by the collation unit 42 differs slightly from that in the third embodiment.

The collation unit 42 of the fourth embodiment compares the acoustic model stored in the acoustic model storage unit 23 with the input feature amount, and extracts a phoneme string corresponding to the input speech. The collation unit 42 then calculates a similarity by comparing the extracted phoneme string with the phoneme sequence of each recognition word, and likewise calculates a similarity by comparing the extracted phoneme string with the phoneme sequence of each rejection word. A conventionally used method can be used to calculate the similarity. In the fourth embodiment, the similarity is normalized to a value between 0 and 100.

The collation unit 42 then identifies the recognition word with the highest similarity and determines whether the similarity of the identified recognition word exceeds the corresponding second threshold T2. If it determines that the similarity exceeds the corresponding second threshold T2, the collation unit 42 outputs the identified recognition word as the recognition result via the output unit 30. If it instead determines that the similarity is equal to or less than the corresponding second threshold T2, the collation unit 42 identifies the rejection word with the highest similarity and determines whether the similarity of the identified rejection word exceeds the corresponding third threshold T3.

If it determines that the similarity of the identified rejection word exceeds the corresponding third threshold T3, the collation unit 42 rejects the speech signal to be collated. If it instead determines that the similarity of the identified rejection word is equal to or less than the corresponding third threshold T3, the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the first threshold T1. If it determines that the similarity of the identified recognition word is equal to or less than the first threshold T1, the collation unit 42 rejects the speech signal to be collated. If it determines that the similarity of the identified recognition word exceeds the first threshold T1, the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the similarity of the identified rejection word.

Next, the flow of speech recognition processing in the fourth embodiment is described with reference to FIG. 16. FIG. 16 is part of an example flowchart illustrating the flow of speech recognition processing in the fourth embodiment. This speech recognition processing is triggered when the input unit 10 outputs a received input signal to the control unit 40. The processing after step S309 is the same as the processing described in the first embodiment.

The analysis unit 41 analyzes the input signal and detects a speech signal (step S301). The analysis unit 41 then calculates the feature amount of the detected speech signal and outputs the calculated feature amount to the collation unit 42 (step S302).

Based on the input feature amount, the collation unit 42 calculates the similarity between the speech signal to be collated and each recognition word (step S303), and identifies the recognition word with the highest similarity (step S304). The collation unit 42 then determines whether the similarity of the identified recognition word exceeds the corresponding second threshold T2 (step S305).

If it is determined that the similarity of the identified recognition word exceeds the corresponding second threshold T2 (step S305; YES), the collation unit 42 outputs the identified recognition word as the recognition result via the output unit 30 (step S306), and the processing ends. If it is instead determined that the similarity of the identified recognition word is equal to or less than the corresponding second threshold T2 (step S305; NO), the collation unit 42 calculates, based on the input feature amount, the similarity between the speech signal to be collated and each rejection word (step S307), and identifies the rejection word with the highest similarity (step S308).

The collation unit 42 then determines whether the similarity of the identified rejection word exceeds the corresponding third threshold T3 (step S309). If it is determined that the similarity of the identified rejection word exceeds the third threshold T3 (step S309; YES), the processing proceeds to step S006 described in the first embodiment: the collation unit 42 rejects the speech signal to be collated (step S006), and the processing ends.

If it is instead determined that the similarity of the identified rejection word is equal to or less than the third threshold T3 (step S309; NO), the processing proceeds to step S009 described in the first embodiment, and the collation unit 42 further determines whether the similarity of the identified recognition word exceeds the first threshold T1 (step S009).
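The decision order of steps S303 to S309 can be sketched in the same hedged way. The function name, the dictionary inputs (word to similarity, both non-empty), the common T2 and T3, and the numeric values are hypothetical; the steps from S009 onward follow the behavior described for the collation unit 42.

```python
from typing import Dict, Optional


def decide_embodiment4(rec_scores: Dict[str, float],
                       rej_scores: Dict[str, float],
                       t1: float, t2: float, t3: float) -> Optional[str]:
    """Return the recognized word, or None when the input is rejected."""
    # S303-S304: recognition word with the highest similarity.
    rec_word, rec_sim = max(rec_scores.items(), key=lambda kv: kv[1])
    # S305: the T2 check comes first, before any rejection word is scored.
    if rec_sim > t2:
        return rec_word  # S306
    # S307-S308: rejection word with the highest similarity.
    _, rej_sim = max(rej_scores.items(), key=lambda kv: kv[1])
    # S309: a strong match to a rejection word rejects the input.
    if rej_sim > t3:
        return None  # S006
    # S009 onward (first embodiment): compare with T1, then with the rejection word.
    if rec_sim <= t1:
        return None
    return rec_word if rec_sim > rej_sim else None


# Hypothetical scores: the recognition word exceeds T2, so it is output
# even though the rejection word exceeds T3.
print(decide_embodiment4({"Okayama": 97.0}, {"Wakayama": 96.0},
                         t1=60.0, t2=80.0, t3=95.0))
```

With the ordering of the third embodiment, the same hypothetical input would be rejected at step S205 because the rejection word's similarity exceeds T3. Accepting it here is what shifts the fourth embodiment toward recall.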

FIG. 17 is a diagram showing an example of the hardware configuration of the speech recognition apparatus 1 in the first to fourth embodiments. The speech recognition apparatus 1 shown in FIG. 1 and elsewhere may be realized by, for example, the hardware shown in FIG. 17. In the example of FIG. 17, the speech recognition apparatus 1 includes a CPU 201, a RAM 202, a ROM 203, an HDD 204, an input/output interface 205, and a communication module 206, which are connected via a bus 207.

The CPU 201, for example, loads an operation program stored in the HDD 204 into the RAM 202 and executes various kinds of processing while using the RAM 202 as working memory. By executing the operation program, the CPU 201 can realize each functional unit of the control unit 40 shown in FIG. 1 and elsewhere.

The operation program for executing the above operations may be stored and distributed on a computer-readable recording medium (not shown) such as a flexible disk, a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), or a magneto-optical disk (MO), and the above-described processing may be executed by reading the program with a reading device (not shown) of the speech recognition apparatus 1 and installing it on the computer. Alternatively, the operation program may be stored in a disk device or the like of a server device on the Internet and downloaded to the computer of the speech recognition apparatus 1 via the communication module 206.

Depending on the embodiment, types of storage device other than the RAM 202, the ROM 203, and the HDD 204 may be used. For example, the speech recognition apparatus 1 may include a storage device such as a Content Addressable Memory (CAM), a Static Random Access Memory (SRAM), or a Synchronous Dynamic Random Access Memory (SDRAM).

Depending on the embodiment, the hardware configuration of the speech recognition apparatus 1 may differ from that in FIG. 17, and hardware of standards and types other than those illustrated in FIG. 17 may also be applied to the speech recognition apparatus 1.

For example, each functional unit of the control unit 40 of the speech recognition apparatus 1 shown in FIG. 1 and elsewhere may be realized by a hardware circuit. Specifically, instead of the CPU 201, a reconfigurable circuit such as a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC), may realize each functional unit of the control unit 40 shown in FIG. 1 and elsewhere. Of course, these functional units may also be realized by both the CPU 201 and a hardware circuit.

Several embodiments have been described above. However, the embodiments are not limited to those described above, and should be understood to encompass various modifications and alternatives of the above embodiments. For example, it will be understood that the various embodiments can be embodied by modifying the components without departing from their spirit and scope. It will also be understood that various embodiments can be formed by appropriately combining the components disclosed in the above embodiments. Furthermore, those skilled in the art will understand that various embodiments may be implemented by deleting or replacing some of the components shown in the embodiments, or by adding components to those shown in the embodiments.

The following additional notes are further disclosed with respect to the embodiments, including the first to fourth embodiments.
(Appendix 1)
A speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word with the highest similarity is equal to or less than a preset first threshold, the speech recognition apparatus comprising:
first specifying means for calculating the similarity between the input speech and the rejection words and identifying the rejection word with the highest similarity;
second specifying means for calculating the similarity between the input speech and the recognition words and identifying the recognition word with the highest similarity; and
collating means for rejecting the input speech, even if the similarity of the identified recognition word exceeds the first threshold, when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold.
(Appendix 2)
The speech recognition apparatus according to Appendix 1, wherein the collating means outputs the identified recognition word when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word satisfies a predetermined condition, and rejects the input speech when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word does not satisfy the predetermined condition.
(Appendix 3)
The speech recognition apparatus according to appendix 2, wherein the predetermined condition is that the similarity of the identified recognition word exceeds the similarity of the identified rejection word and exceeds the first threshold.
(Appendix 4)
The speech recognition apparatus according to any one of appendices 1 to 3, wherein, when the similarity of the identified recognition word exceeds a second threshold that is greater than the first threshold, the collating means outputs the identified recognition word regardless of whether the similarity of the identified rejection word exceeds the third threshold.
(Appendix 5)
The speech recognition apparatus according to appendix 2, wherein the predetermined condition is that the similarity of the identified recognition word exceeds the similarity of the identified rejection word and exceeds the first threshold, or that the similarity of the identified recognition word exceeds a second threshold that is greater than the first threshold.
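The decision order implied by Appendices 2 to 5 can be sketched as below. This is an illustrative reading only: the function name `decide`, the argument layout, and the numeric values are assumptions, and the sketch applies the second-threshold rule of Appendix 4 before the rejection-word check.

```python
from typing import Optional


def decide(
    rec_word: str,
    rec_sim: float,
    rej_sim: float,
    t1: float,
    t2: float,
    t3: float,
) -> Optional[str]:
    """Return rec_word when it is output, or None when the input is rejected.

    rec_sim: similarity of the best recognition word
    rej_sim: similarity of the best rejection word
    t1 < t2 are the first and second thresholds; t3 > t1 is the third.
    """
    # Second threshold: a very confident recognition is output regardless
    # of whether the rejection word exceeds the third threshold.
    if rec_sim > t2:
        return rec_word
    # Third threshold: a confident rejection word forces rejection,
    # even if rec_sim exceeds the first threshold.
    if rej_sim > t3:
        return None
    # Predetermined condition: the recognition word must beat both the
    # rejection word and the first threshold.
    if rec_sim > rej_sim and rec_sim > t1:
        return rec_word
    return None
```

With the assumed thresholds t1 = 0.5, t2 = 0.9, and t3 = 0.8, a recognition similarity of 0.7 is output against a rejection similarity of 0.6 but rejected against 0.75 or 0.85.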
(Appendix 6)
The speech recognition apparatus according to any one of appendices 1 to 5, wherein a plurality of third thresholds having different values are preset, and each rejection word is associated with one of the plurality of third thresholds, and
the collating means rejects the input speech when the similarity of the identified rejection word exceeds the corresponding third threshold.
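A per-word third threshold, as in Appendix 6, can be sketched as a lookup keyed by the rejection word. The function name `rejected_by_rejection_word`, the mapping layout, and the threshold values are assumptions for illustration; the rejection words "Wakayama" and "Toyama" are taken from the description of symbols.

```python
from typing import Dict


def rejected_by_rejection_word(
    rejection_scores: Dict[str, float],
    third_thresholds: Dict[str, float],
) -> bool:
    """Return True when the best rejection word exceeds its own third threshold.

    rejection_scores maps each rejection word to its similarity to the input
    speech; third_thresholds maps each rejection word to its third threshold.
    """
    # Identify the rejection word having the highest similarity.
    word, similarity = max(rejection_scores.items(), key=lambda item: item[1])
    # Compare against the threshold associated with that specific word.
    return similarity > third_thresholds[word]
```

Only the threshold of the best-scoring rejection word is consulted, so a lower-ranked word that happens to exceed its own threshold does not trigger rejection in this sketch.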
(Appendix 7)
The speech recognition apparatus according to any one of appendices 1 to 6, wherein the third threshold is set to a value such that, when it is exceeded, the identified rejection word is highly likely to have been uttered.
(Appendix 8)
The speech recognition apparatus according to any one of appendices 1 to 5, further comprising setting means that sets the third threshold for the identified rejection word higher as the correlation between the identified rejection word and the identified recognition word is higher,
wherein the collating means rejects the input speech when the similarity of the identified rejection word exceeds the corresponding third threshold.
(Appendix 9)
The speech recognition apparatus according to appendix 8, wherein the setting means calculates an inter-word similarity between the identified rejection word and the identified recognition word, and sets the third threshold for the identified rejection word higher as the calculated inter-word similarity is higher.
(Appendix 10)
The speech recognition apparatus according to appendix 8 or 9, wherein the setting means sets the third threshold to a value equal to or greater than a predetermined value.
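The setting means of Appendices 8 to 10 can be sketched as a function that raises the third threshold with the inter-word similarity and never lets it fall below a floor. The text does not specify how the inter-word similarity is measured or how it maps to a threshold, so everything here is an assumption: the name `set_third_threshold`, the use of Python's `difflib.SequenceMatcher` on romanized spellings as a stand-in similarity measure, the linear mapping, and the constants.

```python
from difflib import SequenceMatcher


def inter_word_similarity(rejection_word: str, recognition_word: str) -> float:
    """Stand-in similarity in [0, 1] based on matching character runs.

    Assumption: string similarity of romanized spellings is used only as a
    placeholder for whatever inter-word measure the apparatus employs.
    """
    return SequenceMatcher(None, rejection_word, recognition_word).ratio()


def set_third_threshold(
    rejection_word: str,
    recognition_word: str,
    t3_low: float = 0.6,   # hypothetical threshold for dissimilar words
    t3_high: float = 0.9,  # hypothetical threshold for identical words
    t3_min: float = 0.7,   # hypothetical predetermined lower bound
) -> float:
    """Return a third threshold that grows with the inter-word similarity."""
    similarity = inter_word_similarity(rejection_word, recognition_word)
    # Linear mapping: the more alike the two words, the higher the threshold.
    threshold = t3_low + (t3_high - t3_low) * similarity
    # Keep the threshold at or above the predetermined value.
    return max(t3_min, threshold)
```

Under these assumptions, a rejection word that closely resembles the recognition word (such as "wakayama" against "okayama") receives a higher third threshold than a less similar one, so it must match the input more strongly before it causes rejection.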
(Appendix 11)
A speech recognition method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the method comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity; and
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold.
(Appendix 12)
A program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity; and
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold.
(Appendix 13)
A recording medium storing a program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity; and
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold.

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
10 Input unit
20 Storage unit
21 Recognition word dictionary
22 Rejection word dictionary
23 Acoustic model storage unit
30 Output unit
40 Control unit
41 Analysis unit
42 Collation unit
43 Threshold calculation unit
T1 First threshold
T2 Second threshold
T3 Third threshold
T3-W1 Third threshold of rejection word "Wakayama"
T3-W2 Third threshold of rejection word "Toyama"
BD1 Ideal range
BD2 Detection range
SC1 Similarity of recognition word "Okayama"
SC2 Similarity of rejection word "Wakayama"
SC3 Similarity of rejection word "Toyama"
L1 to L5 Dotted lines
F Function
201 CPU
202 RAM
203 ROM
204 HDD
205 Input/output interface
206 Communication module
207 Bus

Claims

A speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the apparatus comprising:
collating means that calculates the similarity between the input speech and the rejection words and identifies the rejection word having the highest similarity,
calculates the similarity between the input speech and the recognition words and identifies the recognition word having the highest similarity, and,
when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, rejects the input speech even if the similarity of the identified recognition word exceeds the first threshold,
wherein, when the similarity of the identified recognition word exceeds a second threshold that is greater than the first threshold, the collating means outputs the identified recognition word regardless of whether the similarity of the identified rejection word exceeds the third threshold.

The speech recognition apparatus according to claim 1, wherein, when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word satisfies a predetermined condition, the collating means outputs the identified recognition word, and
when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word does not satisfy the predetermined condition, the collating means rejects the input speech.

The speech recognition apparatus according to claim 2, wherein the predetermined condition is that the similarity of the identified recognition word exceeds the similarity of the identified rejection word and exceeds the first threshold.

A speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the apparatus comprising:
collating means that calculates the similarity between the input speech and the rejection words and identifies the rejection word having the highest similarity,
calculates the similarity between the input speech and the recognition words and identifies the recognition word having the highest similarity, and,
when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, rejects the input speech even if the similarity of the identified recognition word exceeds the first threshold,
wherein the collating means
outputs the identified recognition word when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word satisfies a predetermined condition, and
rejects the input speech when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word does not satisfy the predetermined condition, and
the predetermined condition is either:
a first condition that the similarity of the identified recognition word is equal to or less than a second threshold that is greater than the first threshold, exceeds the similarity of the identified rejection word, and exceeds the first threshold; or
a second condition that the similarity of the identified recognition word exceeds the second threshold.

A speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the apparatus comprising:
collating means that calculates the similarity between the input speech and the rejection words and identifies the rejection word having the highest similarity,
calculates the similarity between the input speech and the recognition words and identifies the recognition word having the highest similarity, and,
when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, rejects the input speech even if the similarity of the identified recognition word exceeds the first threshold,
wherein a plurality of third thresholds having different values are preset, and each rejection word is associated with one of the plurality of third thresholds, and
the collating means rejects the input speech when the similarity of the identified rejection word exceeds the corresponding third threshold.

A speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the apparatus comprising:
collating means that calculates the similarity between the input speech and the rejection words and identifies the rejection word having the highest similarity,
calculates the similarity between the input speech and the recognition words and identifies the recognition word having the highest similarity, and,
when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, rejects the input speech even if the similarity of the identified recognition word exceeds the first threshold; and
setting means that calculates an inter-word similarity between the identified rejection word and the identified recognition word, and sets the third threshold for the identified rejection word higher as the calculated inter-word similarity is higher,
wherein the collating means rejects the input speech when the similarity of the identified rejection word exceeds the corresponding third threshold.

The speech recognition apparatus according to claim 6, wherein the setting means sets the third threshold to a value equal to or greater than a predetermined value.

A speech recognition method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the method comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold; and
outputting the identified recognition word when the similarity of the identified recognition word exceeds a second threshold that is greater than the first threshold, regardless of whether the similarity of the identified rejection word exceeds the third threshold.

A speech recognition method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the method comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold;
outputting the identified recognition word when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word satisfies a predetermined condition; and
rejecting the input speech when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word does not satisfy the predetermined condition,
wherein the predetermined condition is either:
a first condition that the similarity of the identified recognition word is equal to or less than a second threshold that is greater than the first threshold, exceeds the similarity of the identified rejection word, and exceeds the first threshold; or
a second condition that the similarity of the identified recognition word exceeds the second threshold.

A speech recognition method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the method comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity; and
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold,
wherein a plurality of third thresholds having different values are preset, and each rejection word is associated with one of the plurality of third thresholds, and
in the rejecting, the input speech is rejected when the similarity of the identified rejection word exceeds the corresponding third threshold.

A speech recognition method for a speech recognition apparatus that includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, the method comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold; and
calculating an inter-word similarity between the identified rejection word and the identified recognition word, and setting the third threshold for the identified rejection word higher as the calculated inter-word similarity is higher,
wherein, in the rejecting, the input speech is rejected when the similarity of the identified rejection word exceeds the corresponding third threshold.

A program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold; and
outputting the identified recognition word when the similarity of the identified recognition word exceeds a second threshold that is greater than the first threshold, regardless of whether the similarity of the identified rejection word exceeds the third threshold.

A program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold;
outputting the identified recognition word when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word satisfies a predetermined condition; and
rejecting the input speech when the similarity of the identified rejection word is equal to or less than the third threshold and the similarity of the identified recognition word does not satisfy the predetermined condition,
wherein the predetermined condition is either:
a first condition that the similarity of the identified recognition word is equal to or less than a second threshold that is greater than the first threshold, exceeds the similarity of the identified rejection word, and exceeds the first threshold; or
a second condition that the similarity of the identified recognition word exceeds the second threshold.

A program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity; and
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold,
wherein a plurality of third thresholds having different values are preset, and each rejection word is associated with one of the plurality of third thresholds, and
in the rejecting, the input speech is rejected when the similarity of the identified rejection word exceeds the corresponding third threshold.

A program for causing a computer of a speech recognition apparatus, which includes a recognition word dictionary in which recognition words are registered and a rejection word dictionary in which rejection words are registered, calculates the similarity between input speech and the recognition words, and rejects the input speech when the similarity of the recognition word having the highest similarity is equal to or less than a preset first threshold, to execute a process comprising:
calculating the similarity between the input speech and the rejection words and identifying the rejection word having the highest similarity;
calculating the similarity between the input speech and the recognition words and identifying the recognition word having the highest similarity;
rejecting the input speech when the similarity of the identified rejection word exceeds a preset third threshold that is greater than the first threshold, even if the similarity of the identified recognition word exceeds the first threshold; and
calculating an inter-word similarity between the identified rejection word and the identified recognition word, and setting the third threshold for the identified rejection word higher as the calculated inter-word similarity is higher,
wherein, in the rejecting, the input speech is rejected when the similarity of the identified rejection word exceeds the corresponding third threshold.