
JP2011022476A - Threshold management program for voice recognition, method of the same, and voice recognition device

Info

Publication number: JP2011022476A
Application number: JP2009169063A
Authority: JP
Inventors: Masaharu Harada (原田 将治)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-07-17
Filing date: 2009-07-17
Publication date: 2011-02-03
Anticipated expiration: 2029-07-17
Also published as: JP5293478B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that simplifies the work of setting thresholds for voice recognition and reduces the working time.
SOLUTION: A threshold management program for voice recognition causes a computer to execute a determination value calculation step and a threshold setting step. The determination value calculation step calculates, from reading information indicating the pronunciation of each word, a determination value used to determine a threshold, where the threshold is used to decide, under a predetermined criterion, whether an input voice is recognized. The threshold setting step determines the threshold from the determination value and sets it for each word.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to setting a threshold used when recognizing input speech.

Speech recognition devices have been developed to recognize speech, for example to convert a speaker's utterances into character strings or to translate what the speaker said. Such a device performs speech recognition as follows, for example. First, the device accepts speech input and compares a feature pattern extracted from the input speech with standard patterns, registered in advance, that represent the speech features of each word. Next, when the degree of coincidence between the feature pattern and a standard pattern is greater than a predetermined threshold, the device recognizes the input speech as the word corresponding to that standard pattern. Because the threshold is the criterion for deciding whether speech is recognized, the accuracy of speech recognition depends on how the threshold is set.

A speech recognition device sets the threshold as follows, for example. First, the device accepts speech for a plurality of words from a speaker and creates a standard pattern for each word from the accepted speech. It then analyzes the similarity between the standard pattern of a word of interest and the standard patterns of the other words, and sets the threshold of the word of interest based on the analysis result. In other words, the threshold is set according to where the standard pattern of the word of interest should be positioned relative to the standard patterns of the other words.

When a new word is registered in the speech recognition device, the device likewise accepts speech for the new word from the speaker and sets a threshold in the same way as described above.

Japanese Patent No. 2553173

However, setting thresholds by the above method requires the speech of every word to be input one by one, which is laborious. Moreover, speech for a new word must be input every time a new word is registered in the speech recognition device, which adds to the burden. Because a threshold cannot be set until speech has been input for each word, the stage preceding threshold setting takes time, and as a result all of the work involved in setting thresholds takes longer.

An object is therefore to provide a technique capable of reducing the working time involved in setting thresholds.

To solve the above problem, a threshold management program including the following steps is provided.
・A determination value calculation step of calculating, based on reading information indicating the pronunciation of each word, a determination value used to determine a threshold, the threshold being used to decide, under a predetermined criterion, whether input speech is recognized.
・A threshold setting step of obtaining the threshold based on the determination value and setting it for each word.

A technique capable of reducing the working time involved in setting thresholds for speech recognition can be provided.

FIG. 1 is a block diagram showing an example of the hardware configuration of the speech recognition device according to the first embodiment.
FIG. 2 is a block diagram showing an example of the functional configuration of the speech recognition device according to the first embodiment.
FIG. 3 is an example of the word dictionary DB 30 according to the first embodiment.
FIG. 4 is a flowchart showing an example of the flow of the threshold management program according to the first embodiment.
FIG. 5 is a block diagram showing an example of the functional configuration of the speech recognition device according to Modification 1.
FIG. 6 is a table showing an example of the complexity table according to Modification 1.
FIG. 7 is a block diagram showing an example of the functional configuration of the speech recognition device according to the second embodiment.
FIG. 8 is an example of the word dictionary DB 30 according to the second embodiment.
FIG. 9 is a flowchart showing an example of the flow of the threshold management program according to the second embodiment.
FIG. 10 is a block diagram showing an example of the functional configuration of the speech recognition device according to Modification 1.
FIG. 11 is a table showing an example of the similarity table according to Modification 1.
FIG. 12 is a block diagram showing an example of the functional configuration of the speech recognition device according to the third embodiment.
FIG. 13 is an example (1) of the word dictionary DB 30 according to the third embodiment.
FIG. 14 is an example (2) of the word dictionary DB 30 according to the third embodiment.
FIG. 15 is a flowchart showing an example of the flow of the threshold management program according to the third embodiment.
FIG. 16 is a block diagram showing the relationship between the speech recognition device and other storage devices, recording media, and the like.

Speech recognition is performed based on whether the degree of coincidence between the acoustic features of the input speech and the acoustic features of a word exceeds a predetermined threshold. That is, the threshold is a reference value for deciding whether the input speech is recognized as a given word. Here, the threshold is set based on a determination value calculated from the reading information of the word. The reading information of a word is information indicating its pronunciation, for example information representing its syllables or phonemes as a character string. For example, the reading information of the word “取引先” (customer) is “とりひきさき” (torihikisaki).

With this method, the threshold is set based on the reading information, so the speech of each word does not need to be input to the computer before the threshold is set. Eliminating the speech input work for threshold setting reduces the working time required in the stage preceding threshold setting, and thus reduces the total working time involved in setting thresholds.

The determination values used to set the threshold are broadly classified into the following three patterns according to how they are calculated.

Pattern 1: The determination value is calculated based on the reading complexity of the reading information.

Pattern 2: The determination value is calculated based on the similarity between pieces of reading information.

Pattern 3: The determination value is calculated based on the complexity and/or the similarity, depending on the reading information.

Patterns 1 to 3 are described below using the first to third embodiments, respectively.

<First embodiment>
In the first embodiment, as described for Pattern 1 above, the determination value is calculated based on the reading complexity of the reading information, and the threshold is set based on that complexity. The first embodiment is described below.

(1) Hardware configuration
FIG. 1 is a block diagram showing an example of the hardware configuration of the speech recognition device according to the first embodiment. In the speech recognition device 1 shown in FIG. 1, a CPU (Central Processing Unit) 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, an HDD (Hard Disk Drive) 14, an input I/F (interface) 15, an output I/F 17, and a communication I/F 19 are connected via a bus 10.

The CPU 11 is a central processing unit responsible for various kinds of control and arithmetic processing. It realizes functions such as speech recognition and threshold setting based on the programs stored in the RAM 12 and the ROM 13 described below.

The ROM 13 is a read-only storage device. It stores a speech recognition program for performing speech recognition, a threshold management program for setting the thresholds described later, and the like.

The RAM 12 is a main storage device. It temporarily stores the speech recognition program, the threshold management program, and the like held in the ROM 13. It also stores the word dictionary, the speech recognition model, and the like used for speech recognition.

The HDD 14 is an auxiliary storage device that stores various control programs, data, and the like.

The input I/F 15 is connected to an input device 16 such as a microphone, a keyboard, or a mouse. For example, the input I/F 15 supplies the speaker's speech received from the input device 16 to the CPU 11, the RAM 12, and so on via the bus 10.

The output I/F 17 is connected to an output device 18 such as a display or a speaker, and passes output from the CPU 11, the RAM 12, and so on to the output device 18 via the bus 10.

The communication I/F 19 transmits and receives data between an external network and the speech recognition device 1. For example, it receives a new word dictionary, a new speech recognition model, and the like from resources on the external network, and transmits data processed by the speech recognition device 1 to resources on the network.

The bus 10 is, for example, a PCI (Peripheral Component Interconnect) bus or an ISA (Industry Standard Architecture) bus, and connects the above components to one another.

(2) Functional configuration
FIG. 2 is a block diagram showing an example of the functional configuration of the speech recognition device according to the first embodiment. FIG. 2 shows only the functional configuration needed for functions such as speech recognition and threshold setting.

(2-1) RAM
The RAM 12 includes a word dictionary DB (database) 30 and a speech recognition model DB 31.

(a) Word dictionary DB
The word dictionary DB 30 stores words in association with the reading information of each word. FIG. 3 is an example of the word dictionary DB 30 according to the first embodiment. The word dictionary DB 30 stores at least each word and its reading information in association with each other. Once the complexity and the threshold have been calculated by executing the threshold management program described later, they are stored in association with the word and its reading information.

Here, the reading information of a word is information indicating its pronunciation, for example information representing its syllables or phonemes as a character string. As shown in FIG. 3, the reading information of the word “取引先” (customer), for example, is “とりひきさき” (torihikisaki). The complexity is an index indicating how complex the acoustic features of a word's pronunciation are.

The words and reading information in the word dictionary DB 30 can be updated at any time by manual entry through the input device 16, such as a keyboard. Alternatively, words and reading information can be downloaded from resources on an external network via the communication I/F 19 or the like. The reading information may also be generated by the speech recognition device 1 when a word is added to the word dictionary DB 30.

(b) Speech recognition model DB
The speech recognition model DB 31 stores a speech recognition model. The speech recognition model is, for example, a set of syllable or phoneme models, each consisting of the acoustic feature pattern of one syllable or phoneme. Speech recognition is performed based on the speech recognition model containing these syllable or phoneme models.

(2-2) CPU
The CPU 11 includes a voice input unit 20, a speech recognition unit 21, a complexity calculation unit 22, and a threshold setting unit 23.
(a) Voice input unit
The voice input unit 20 generates an input speech signal, for example by A/D converting the speech input through the input device 16 such as a microphone.
(b) Speech recognition unit
The speech recognition unit 21 receives the input speech signal from the voice input unit 20 and performs speech recognition based on the word dictionary DB 30 and the speech recognition model DB 31. For example, the speech recognition unit 21 extracts the acoustic features of the input speech signal and generates a feature pattern. It also concatenates syllable or phoneme models from the speech recognition model DB 31 to generate a sequence of syllable or phoneme models (hereinafter, a speech recognition model sequence). It then determines whether the degree of coincidence between the feature pattern extracted from the input speech signal and the generated speech recognition model sequence exceeds the threshold stored in the word dictionary DB 30. If the degree of coincidence exceeds the threshold, the speech recognition unit 21 recognizes the input speech as the reading information associated with that threshold, that is, as the corresponding word.

Consider, for example, a speaker uttering “とりひきさき” (torihikisaki) into the input device 16. If the degree of coincidence between the feature pattern of the input speech signal for the uttered “とりひきさき” and the speech recognition model sequence exceeds 94, the threshold for “とりひきさき” shown in FIG. 3, the speech recognition unit 21 recognizes the content of the utterance as “とりひきさき”.
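The decision described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `recognize`, the dictionary layout, and the match scores are assumptions made for this example, and the computation of the degree of coincidence itself is not shown.

```python
def recognize(match_scores, thresholds):
    """Return the words whose degree of coincidence exceeds their own threshold.

    match_scores: hypothetical mapping of reading -> degree of coincidence (0-100)
    thresholds:   mapping of reading -> per-word threshold, as in FIG. 3
    """
    return [reading for reading, score in match_scores.items()
            if score > thresholds[reading]]


# Thresholds as given for FIG. 3; the scores below are made-up sample values.
thresholds = {"とりひきさき": 94, "さいしん": 96}
scores = {"とりひきさき": 95.2, "さいしん": 90.0}
print(recognize(scores, thresholds))
```

A score exactly equal to the threshold is not accepted here, since the text requires the degree of coincidence to exceed the threshold.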

The speech recognition unit 21 outputs the recognition result, for example, to the output device 18 via the bus 10.
(c) Complexity calculation unit
The complexity calculation unit 22 calculates, for each word, the reading complexity of its reading information.

The more phonemes, syllables, and so on a word has, the harder it is to utter and the more complex its acoustic features tend to be. For example, “か” (ka), which consists of two syllables, a plosive and a vowel, is more complex than “あ” (a), which is a single syllable consisting only of a vowel, and “ちゃ” (cha), which consists of three syllables, a plosive, a fricative, and a vowel, is more complex still. When such units are combined, “あか” (aka) consists of three syllables (vowel, plosive, vowel) and “あかちゃ” (akacha) consists of six syllables (vowel, plosive, vowel, plosive, fricative, vowel), so the acoustic features become even more complex. In this embodiment, therefore, the complexity is calculated based on the number of elements that make up the pronunciation of the reading information, such as the number of syllables or phonemes obtained from the reading information. The elements may also be units of the speech recognition model, such as the syllable models or phoneme models used for speech recognition.

Specifically, the complexity calculation unit 22 retrieves the reading information of each word from the word dictionary DB 30, decomposes it into syllables or phonemes, and takes the number of syllables or phonemes as the complexity. For example, “とりひきさき” (torihikisaki) shown in FIG. 3 consists of the six syllables “と”, “り”, “ひ”, “き”, “さ”, and “き”, so its complexity is calculated as 6. Likewise, “さいしん” (saishin) consists of the four syllables “さ”, “い”, “し”, and “ん”, so its complexity is calculated as 4.
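The syllable-count calculation can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `split_reading` and `complexity` are hypothetical, the reading is assumed to be written in hiragana, and small kana such as “ゃ” are assumed to attach to the preceding character. How other cases (for example “っ” or a long-vowel mark) should be counted is not stated in the text; this sketch simply counts them as separate units.

```python
# Assumption: these small kana combine with the preceding kana into one unit.
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def split_reading(reading):
    """Split a hiragana reading into syllable-like units."""
    units = []
    for ch in reading:
        if ch in SMALL_KANA and units:
            units[-1] += ch      # e.g. "ち" + "ゃ" -> "ちゃ"
        else:
            units.append(ch)
    return units


def complexity(reading):
    """Complexity defined as the number of units in the reading."""
    return len(split_reading(reading))


print(split_reading("とりひきさき"), complexity("とりひきさき"))
print(split_reading("さいしん"), complexity("さいしん"))
```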

The complexity need not be the syllable or phoneme count itself. It may instead be defined as a function of that count, for example one in which the complexity grows exponentially as the number of syllables or phonemes increases.
(d) Threshold setting unit
The threshold setting unit 23 sets the threshold for the reading information based on the complexity calculated by the complexity calculation unit 22. The threshold is preferably set smaller as the complexity becomes larger. The reason is explained next.

The greater the reading complexity of a word, the more complex its acoustic features tend to be. For example, when the complexity is defined as the number of syllables, the six-syllable “とりひきさき” (torihikisaki) has more syllables than the two-syllable “やま” (yama), so its acoustic features are more complex and its complexity is greater.

When speech is recognized, the degree of coincidence between the uttered speech and, for example, the speech recognition model sequence representing the acoustic features of each word is compared against the threshold to recognize the uttered word. When the complexity of a word is high, its acoustic features are complex, as noted above, so the uttered speech is less likely to match the speech recognition model sequence of that word, but it is also less likely to match the model sequence of a different word. Therefore, when the complexity is high, the threshold is set low: low enough that the uttered word can be recognized, yet not so low that misrecognition occurs. This lowers the level of coincidence required between the uttered speech and the speech recognition model sequence, making the uttered word easier to recognize while suppressing misrecognition.

Conversely, when the complexity is low, the acoustic features of the word are simple, so the uttered speech is more likely to match the speech recognition model sequence of that word, but it is also more likely to match the model sequence of a different word. Therefore, when the complexity is low, the threshold is set high. This raises the level of coincidence required between the uttered speech and the speech recognition model sequence, so that the uttered word is recognized accurately while misrecognition is suppressed.

Specifically, the threshold setting unit 23 sets the threshold so that it decreases as the complexity increases, for example based on the following equation (1).

Threshold = predetermined value - complexity ... (1)
For example, when the degree of coincidence between the speech recognition model sequence and the speaker's speech is calculated in the range 0 to 100, the predetermined value is, as one example, 100. The relationship between the threshold and the complexity is not limited to equation (1); for example, the complexity may be multiplied by a coefficient.

According to equation (1), the complexity of “とりひきさき” (torihikisaki) is 6, so its threshold is 94, and the complexity of “さいしん” (saishin) is 4, so its threshold is 96.
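Equation (1) and the two worked values can be checked with a short Python sketch. The function name and the parameter names `base` and `coefficient` are assumptions made for this example; the coefficient illustrates the variant mentioned in the text and defaults to 1 so that the function reduces to equation (1).

```python
def threshold_from_complexity(complexity, base=100, coefficient=1):
    """Equation (1): threshold = predetermined value - complexity.

    base:        the predetermined value (100 when coincidence is scored 0-100)
    coefficient: optional multiplier on the complexity (assumed default 1)
    """
    return base - coefficient * complexity


print(threshold_from_complexity(6))  # complexity of とりひきさき
print(threshold_from_complexity(4))  # complexity of さいしん
```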

The threshold setting unit 23 outputs the threshold set in this way to the word dictionary DB 30, where it is stored as shown in FIG. 3.

(3) Process flow
Next, the flow of processing for setting the threshold is described. FIG. 4 is a flowchart showing an example of the flow of the threshold management program according to the first embodiment. The threshold management program is executed by the complexity calculation unit 22 and the threshold setting unit 23.

Step S1: The complexity calculation unit 22 determines whether the word dictionary DB 30 contains a word for which no threshold has been set. If there is such a word, the process proceeds to step S2. If thresholds have been set for all words, the threshold management program ends.

Step S2: The complexity calculation unit 22 selects a word for which no threshold has been set from the word dictionary DB 30. That is, it selects one record in the word dictionary DB 30 of FIG. 3 whose threshold is not yet set.

Steps S3 and S4: The complexity calculation unit 22 decomposes the reading information of the selected word into syllables or phonemes (S3) and takes the number of syllables or phonemes as the complexity (S4).

Step S5: The threshold setting unit 23 receives the complexity from the complexity calculation unit 22 and sets the threshold based on it. For example, the threshold is set based on equation (1) above so that it decreases as the complexity increases.

Step S6: The threshold setting unit 23 outputs the threshold to the word dictionary DB 30 and updates the word dictionary DB 30.
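Steps S1 to S6 can be sketched as a single loop over the dictionary records. This is a minimal sketch, not the program of FIG. 4: the record layout (a list of dicts), the function names, and the small-kana handling are assumptions, and updating a record in place stands in for the write to the word dictionary DB 30.

```python
# Assumption: these small kana combine with the preceding kana into one unit.
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def split_reading(reading):
    """S3: split a hiragana reading into syllable-like units."""
    units = []
    for ch in reading:
        if ch in SMALL_KANA and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def run_threshold_management(word_dict, base=100):
    """Set complexity and threshold for every record that has no threshold yet."""
    for record in word_dict:                                # S1, S2: pick unset records
        if record.get("threshold") is not None:
            continue
        units = split_reading(record["reading"])            # S3: decompose the reading
        record["complexity"] = len(units)                   # S4: count the units
        record["threshold"] = base - record["complexity"]   # S5: equation (1)
        # S6: the in-place update stands in for writing back to the DB
    return word_dict


word_dict = [
    {"word": "取引先", "reading": "とりひきさき"},
    {"word": "(not given in the text)", "reading": "さいしん",
     "complexity": 4, "threshold": 96},
]
run_threshold_management(word_dict)
print(word_dict[0]["complexity"], word_dict[0]["threshold"])
```

The second record already has a threshold, so the loop leaves it untouched, which mirrors the check in step S1.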

In the above processing, the complexity is calculated based on, for example, the number of syllables or phonemes obtained from the reading information. Because the threshold is set based on this complexity, the speech of each word does not need to be input to the computer before the threshold is set. Eliminating the speech input work for threshold setting reduces the working time required in the stage preceding threshold setting, and thus reduces the total working time involved in setting thresholds.

(4) Experimental results
(a) Experimental example
An experiment was conducted to verify the speech recognition accuracy of the speech recognition device 1 with thresholds set by the above method. The word dictionary DB 30 stored 358 words together with their reading information, including common nouns such as “確認” (confirmation), “受付” (reception), and “設定” (setting), and proper nouns such as “○○株式会社” (XX Co., Ltd.). The threshold for each word was set by the above method. Seventy minutes of speech data (44 call-center calls) were input to the voice input unit 20 and speech recognition was performed. As shown in Table 1 below, the number of speech recognition failures was 224. Here, the number of failures is the sum of the number of times a word that should have been recognized was not recognized and the number of misrecognitions.
(b) Comparative examples 1 to 3
In the speech recognition device 1 used in the experimental example, the optimum value was 92 when the threshold was set to a single constant value. In the comparative examples, therefore, the threshold was varied around the optimum value and set to 90, 92, and 94 for Comparative Examples 1 to 3, respectively, and speech recognition was performed in each case. The speech data to be recognized were the same as in the experimental example. Apart from the threshold being constant, the 358 words stored in the word dictionary DB 30 were also the same. The number of speech recognition failures at each threshold is shown in Table 1 below.

With the speech recognition device 1 used, the number of failures in the experimental example was smaller than in Comparative Example 2, in which the optimum value of 92 was set as the threshold. It was also smaller than in Comparative Examples 1 and 3. This shows that appropriate threshold values are obtained even when the thresholds are set based on the reading information of the words.

(5) Modifications
(5-1) Modification 1
Next, a modification of the first embodiment is described. Above, the complexity was calculated based on the number of syllables or phonemes in a word. In this modification, the complexity of a word is calculated based on a complexity defined for each syllable.

FIG. 5 is a block diagram showing an example of the functional configuration of the speech recognition device according to Modification 1. It differs from FIG. 2 in that the RAM 12 has a complexity table 32. The rest of the configuration, the process flow, and so on are the same as in the embodiment above, so their description is omitted.

FIG. 6 is a table showing an example of the complexity table according to Modification 1. A complexity is defined for each syllable, such as “a” (あ) and “i” (い). Defining the complexity per syllable allows the complexity of a word to be calculated in accordance with the acoustic features of each syllable.

The complexity calculation unit 22 selects a word for which no threshold has been set from the word dictionary DB 30 and decomposes it into syllables. It then uses the complexity table 32 to sum the complexities of the syllables, which gives the complexity of the word.

Specifically, take the reading information “torihikisaki” (とりひきさき) and “saishin” (さいしん) as examples. Decomposed into syllables, “torihikisaki” is “to”, “ri”, “hi”, “ki”, “sa”, “ki” (と, り, ひ, き, さ, き). Each of these syllables has a complexity of 2, so the complexity of “torihikisaki” is their sum, 12. Likewise, “saishin” decomposes into “sa”, “i”, “shi”, “n” (さ, い, し, ん). The complexity of “sa” and “shi” is 2 and the complexity of “i” and “n” is 1, so the complexity of “saishin” is their sum, 6.
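The table lookup described above can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the names `COMPLEXITY_TABLE`, `split_syllables`, and `word_complexity` are hypothetical, the table holds only the values quoted in the text, and splitting one kana character per syllable is an assumption that happens to hold for these two readings.

```python
# Hypothetical excerpt of a per-syllable complexity table (cf. complexity table 32).
# Only the values quoted in the text are listed; a real table would cover every syllable.
COMPLEXITY_TABLE = {
    "と": 2, "り": 2, "ひ": 2, "き": 2, "さ": 2, "し": 2,
    "い": 1, "ん": 1,
}


def split_syllables(reading: str) -> list[str]:
    """Naive split: one kana character per syllable.

    This holds for the examples here, but readings containing small kana
    (for example きょ) would need a proper syllabifier.
    """
    return list(reading)


def word_complexity(reading: str, table: dict[str, int] = COMPLEXITY_TABLE) -> int:
    """Sum the per-syllable complexities of a reading."""
    return sum(table[s] for s in split_syllables(reading))


print(word_complexity("とりひきさき"))  # 12
print(word_complexity("さいしん"))      # 6
```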

The complexity table 32 above defines a complexity for each syllable, but a complexity may instead be defined for each element that makes up the pronunciation of the reading information, such as each phoneme. These elements also include the units of a speech recognition model, such as syllable models and phoneme models. The complexity calculation unit 22 decomposes the reading information of a word according to the elements used in the complexity table 32. That is, when the complexity table 32 defines a complexity for each phoneme, the reading information is decomposed into phonemes to calculate the complexity of the word.

(5-2) Modification 2
In the word dictionary DB 30 of the above embodiment, only one piece of reading information is associated with each word, but a plurality of pieces of reading information may be associated with one word. For example, the word “ten” (十) may be associated with the readings “jū” (じゅう) and “tō” (とう). As described later for the second embodiment, one word may also be associated with first reading information indicating the original pronunciation of the word and second reading information derived by modifying the first reading information.

When a plurality of pieces of reading information are associated with one word, the complexity calculation unit 22 calculates a complexity for each piece of reading information. The threshold setting unit 23 then calculates a threshold for each piece of reading information from its complexity.

For example, consider a case where the word “Okinawa” (沖縄) has first reading information “okinawa” (おきなわ) and second reading information “kinaa” (きなあ). When the complexity is calculated from the number of syllables, the complexity calculation unit 22 calculates the first complexity as 4, because the first reading information “okinawa” consists of four syllables. It calculates the second complexity as 3, because the second reading information “kinaa” consists of three syllables. Here, the first complexity is the complexity of the first reading information, and the second complexity is the complexity of the second reading information.

The threshold setting unit 23 sets a first threshold for the first reading information based on the first complexity, and a second threshold for the second reading information based on the second complexity. The thresholds can thus be set finely, reflecting the difference in complexity between the pieces of reading information.

(5-3) Modification 3
The method of calculating the complexity is not limited to those of the first embodiment and Modifications 1 and 2. For example, the complexity of a given piece of reading information may be calculated from the degree of similarity between its speech recognition model sequence and an average model over all speech recognition model sequences. Also, since a speech recognition model sequence is made up of various parameters, templates, and the like, the complexity may be calculated from the number of parameters or templates in the speech recognition model sequence for the reading information.

<Second Embodiment>
In the second embodiment, a plurality of pieces of reading information are associated with one word, and the determination value is calculated from the similarity between pieces of reading information, as in Pattern 2 above. The threshold is set based on this similarity. The second embodiment is described below.

The hardware configuration is the same as in the first embodiment, so its description is omitted.

(1) Functional Configuration
FIG. 7 is a block diagram showing an example of the functional configuration of the speech recognition apparatus according to the second embodiment.

(1-1) RAM
The RAM 12 includes a word dictionary DB 30 and a speech recognition model DB 31. The speech recognition model DB 31 is the same as in the first embodiment, so its description is omitted.

The word dictionary DB 30 stores words in association with the reading information of each word. FIG. 8 shows an example of the word dictionary DB 30 according to the second embodiment. In this word dictionary DB 30, a plurality of pieces of reading information are associated with one word. Specifically, first reading information and second reading information are associated with one word. Here, the first reading information indicates the original pronunciation of the word, and the second reading information indicates a pronunciation modified from the first reading information. For the word “Okinawa” (沖縄), as described later, the first reading information is “okinawa” (おきなわ), and the second reading information is, for example, “okinaa” (おきなあ) and “kinaa” (きなあ).

Because the first reading information is the standard pronunciation of a word, it can be regarded as the pronunciation a speaker generally intends when uttering that word. However, even when the speaker intends to pronounce the word according to the first reading information, the actual utterance may deviate from it because of the speaker's tone of voice or the influence of the preceding and following speech. For this reason, one word is associated not only with the first reading information, which is its original pronunciation, but also with second reading information.

For example, suppose a speaker intends to utter the word “Okinawa” (沖縄). Even if the speaker tries to say “okinawa” (おきなわ), the original pronunciation of the word, the utterance may come out as “okinaa” (おきなあ) or “kinaa” (きなあ) depending on the speaker's tone of voice and the like. Therefore, in addition to the first reading information “okinawa”, the word “Okinawa” is associated with “okinaa”, “kinaa”, and the like as second reading information.

The word dictionary DB 30 of FIG. 8 shows only words that each have a plurality of pieces of reading information, but words with a plurality of pieces of reading information and words with only one piece of reading information may coexist in the dictionary.

The words and reading information in the word dictionary DB 30 can be updated at any time by manual entry via the input device 16 or by download from an external resource via the communication I/F 19 or the like. The reading information may also be generated automatically by the speech recognition apparatus 1 when a word is added to the word dictionary DB 30.

(1-2) CPU
The CPU 11 includes a speech input unit 20, a speech recognition unit 21, a threshold setting unit 23, and a similarity calculation unit 24. The speech input unit 20 and the speech recognition unit 21 are the same as in the first embodiment, so their description is omitted.
(a) Similarity calculation unit
The similarity calculation unit 24 calculates, for each piece of reading information, its similarity to the first reading information, which indicates the original, standard pronunciation of the word. For example, the similarity is expressed in terms of a distance indicating how far each piece of reading information departs from the first reading information. This distance is calculated, for example, as the total number of syllable substitutions, deletions, and insertions needed to go from the first reading information to the second reading information. The relationship between similarity and distance is assumed to be given, for example, by the following equation (2).

Similarity = 0 - distance ... (2)
The method of calculating the similarity is explained using “Okinawa” (沖縄) in the word dictionary DB 30 of FIG. 8 as an example. The similarity calculation unit 24 compares the first reading information “okinawa” (おきなわ) with one of the pieces of second reading information. To calculate the similarity, the similarity calculation unit 24 first decomposes each piece of reading information into syllables. If “wa” (わ) in the first reading information “okinawa” is replaced with “a” (あ), the result is the second reading information “okinaa” (おきなあ). The distance between “okinawa” and “okinaa” is therefore the number of substitutions, 1, and the similarity is -1. Similarly, if “o” (お) is dropped from “okinawa” and “wa” is replaced with “a”, the result is the second reading information “kinaa” (きなあ). The distance is therefore the sum of one deletion and one substitution, 2, and the similarity is -2. For the first reading information “okinawa” itself, the numbers of substitutions, deletions, and insertions are all 0, so the similarity is 0.
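One way to obtain such counts is a standard edit-distance computation over syllable sequences, sketched below. The specification states only that the distance is the total number of substitutions, deletions, and insertions; taking the minimum such total by dynamic programming is an assumption made here, as are the function names `syllable_distance` and `similarity` and the one-kana-per-syllable split.

```python
def syllable_distance(first: list[str], second: list[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning first into second."""
    # prev[j] is the distance between the syllables of `first` processed so far and second[:j].
    prev = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        curr = [i]
        for j, b in enumerate(second, start=1):
            curr.append(min(
                prev[j] + 1,             # delete syllable a
                curr[j - 1] + 1,         # insert syllable b
                prev[j - 1] + (a != b),  # substitute a with b (free when equal)
            ))
        prev = curr
    return prev[-1]


def similarity(first_reading: str, other_reading: str) -> int:
    """Equation (2): similarity = 0 - distance (one kana per syllable assumed)."""
    return 0 - syllable_distance(list(first_reading), list(other_reading))


print(similarity("おきなわ", "おきなあ"))  # -1
print(similarity("おきなわ", "きなあ"))    # -2
print(similarity("おきなわ", "おきなわ"))  # 0
```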

In the above, the similarity is calculated from the numbers of syllable substitutions, deletions, and insertions, but the method of calculating the similarity is not limited to this. For example, the first reading information and the second reading information may be compared, and the similarity calculated from the numbers of substitutions, deletions, and insertions of phonemes, syllable models, phoneme models, or the like.
(b) Threshold setting unit
The threshold setting unit 23 sets the threshold based on the similarity. Preferably, the threshold is set so that it increases as the similarity decreases, that is, as the distance increases.

As described above, the first reading information indicates the original, standard pronunciation of the word, and the second reading information indicates a pronunciation modified from the first reading information. Second reading information with a smaller similarity to the first reading information, that is, with a larger distance from it, is therefore further from the standard pronunciation of the word. Accordingly, the smaller the similarity of a piece of second reading information to the first reading information, the larger its threshold is set, in order to suppress misrecognition. The reason for setting the threshold in this way is explained next.

When speech recognition is performed using second reading information, then depending on the magnitude of the threshold, not only speech uttered with the second reading but also speech that merely resembles it is recognized as the second reading information. However, speech that resembles the second reading information is not necessarily an utterance of the word denoted by the first reading information. The word the speaker meant and the word recognized by the speech recognition apparatus 1 may therefore differ, resulting in misrecognition.

For example, suppose the first reading information of “Tokyo” (東京) is “toukyou” (とうきょう) and its second reading information is “tookyo” (とおきょ). When the speaker utters “tookyo”, the speech recognition apparatus 1 recognizes from the second reading information “tookyo” that “Tokyo” was uttered. Now suppose the speaker utters something similar to the second reading information, for example “tokkyo” (とっきょ), where the reading “tokkyo” denotes the word “patent” (特許). Depending on the magnitude of the threshold, the speech recognition apparatus 1 recognizes the uttered “tokkyo” as the second reading information “tookyo” and wrongly concludes that “Tokyo” was uttered. Such misrecognition tends to become more pronounced as the similarity between the first reading information and the second reading information decreases. Therefore, taking this similarity into account, the threshold is set larger for second reading information with a smaller similarity to the first reading information, which suppresses misrecognition and improves recognition accuracy.

The threshold setting unit 23 sets the threshold, for example according to the following equation (3), so that the threshold increases as the similarity decreases, that is, as the distance increases.

Threshold = predetermined value - similarity ... (3)
For example, when the degree of coincidence between a speech recognition model sequence and the speaker's speech is calculated in the range of 0 to 100, the predetermined value is, as an example, 80.

According to equation (3), the similarity of the second reading information “okinaa” (おきなあ) is -1 as described above, so its threshold is 81. The similarity of the second reading information “kinaa” (きなあ) is -2, so its threshold is 82. The first reading information “okinawa” (おきなわ) has a similarity of 0, so its threshold is 80. In other words, the threshold of the first reading information is set to the default value, that is, the predetermined value of equation (3). The threshold setting unit 23 outputs the thresholds set in this way to the word dictionary DB 30, where they are stored as shown in FIG. 8.
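The arithmetic of equation (3) can be checked with a short sketch. The function name `threshold_from_similarity` and the constant `PREDETERMINED_VALUE` are hypothetical; the value 80 and the similarities are those given in the text.

```python
PREDETERMINED_VALUE = 80  # example value from the text, for coincidence scores in 0 to 100


def threshold_from_similarity(similarity: int, base: int = PREDETERMINED_VALUE) -> int:
    """Equation (3): threshold = predetermined value - similarity."""
    return base - similarity


for reading, sim in [("おきなわ", 0), ("おきなあ", -1), ("きなあ", -2)]:
    print(reading, threshold_from_similarity(sim))
# おきなわ 80
# おきなあ 81
# きなあ 82
```

Because the similarity is zero or negative, subtracting it raises the threshold for readings that depart further from the first reading information.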

(2) Processing Flow
Next, the flow of processing for setting the thresholds is described. FIG. 9 is a flowchart showing an example of the flow of the threshold management program according to the second embodiment. The threshold management program is executed by the similarity calculation unit 24 and the threshold setting unit 23.

Step S11: The similarity calculation unit 24 determines whether the word dictionary DB 30 contains reading information for which no threshold has been set. If it does, the process proceeds to step S12. If thresholds have been set for all reading information, the threshold management program ends.

Step S12: The similarity calculation unit 24 selects a piece of reading information for which no threshold has been set from the word dictionary DB 30 and determines whether it is first reading information. If it is second reading information, the process proceeds to step S14; if it is first reading information, the process proceeds to step S13.

Step S13: When the selected reading information is first reading information, its similarity to the first reading information is 0. The threshold setting unit 23 therefore sets the threshold of the first reading information to the default value, that is, the predetermined value of equation (3).

Step S14: When the selected reading information is second reading information, the similarity calculation unit 24 decomposes each piece of reading information into syllables or the like and calculates the similarity between the first reading information and the second reading information.

Step S15: The threshold setting unit 23 receives the similarity from the similarity calculation unit 24 and sets the threshold based on the similarity.

Step S16: The threshold setting unit 23 outputs the threshold to the word dictionary DB 30 and updates the word dictionary DB 30.
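Steps S11 to S16 can be sketched as a single pass over the dictionary. This is only an illustration under stated assumptions: the `Reading` class, the `set_thresholds` function, and the in-memory dictionary layout are hypothetical stand-ins for the word dictionary DB 30, and the similarity function is passed in rather than fixed, since the specification allows several ways of computing it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reading:
    text: str
    is_first: bool                   # True for the first (standard) reading information
    threshold: Optional[int] = None  # None means no threshold has been set yet


def set_thresholds(dictionary: dict[str, list[Reading]],
                   similarity_fn: Callable[[str, str], int],
                   base: int = 80) -> None:
    for readings in dictionary.values():
        first = next(r for r in readings if r.is_first)
        for r in readings:
            if r.threshold is not None:   # S11: only readings without a threshold are handled
                continue
            if r.is_first:                # S12 -> S13: first reading gets the default value
                r.threshold = base
            else:                         # S12 -> S14: similarity to the first reading
                sim = similarity_fn(first.text, r.text)
                r.threshold = base - sim  # S15: equation (3)
            # S16: storing the value on the entry stands in for updating the dictionary DB


# Similarities quoted in the text, used here in place of a real similarity calculation.
known = {("おきなわ", "おきなあ"): -1, ("おきなわ", "きなあ"): -2}
db = {"沖縄": [Reading("おきなわ", True), Reading("おきなあ", False), Reading("きなあ", False)]}
set_thresholds(db, lambda a, b: known[(a, b)])
print([(r.text, r.threshold) for r in db["沖縄"]])
```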

In the above processing, the threshold is set from a similarity calculated from the reading information, so there is no need to input speech for each word into the computer before setting the threshold. Eliminating the speech input work for threshold setting reduces the time required for the preliminary stage and, in turn, the total work time involved in setting the thresholds.

(3) Modifications
(3-1) Modification 1
Next, a modification of the second embodiment is described. In the above description, the similarity is calculated from the numbers of syllable substitutions, deletions, and insertions needed to go from the first reading information to the second reading information. In this modification, the similarity between syllables is defined in advance in a similarity table, and the similarity between pieces of reading information is calculated from this table.

FIG. 10 is a block diagram showing an example of the functional configuration of the speech recognition apparatus according to Modification 1. It differs from FIG. 7 in that the RAM 12 holds a similarity table 33. The other components and the processing flow are the same as in the above embodiment, so their description is omitted.

FIG. 11 is a table showing an example of the similarity table according to Modification 1. For each of substitution, insertion, and deletion, a distance serving as an index of similarity is defined per syllable. Defining the distance per syllable allows the similarity between pieces of reading information to be calculated finely, in accordance with the acoustic features of each syllable.

The similarity calculation unit 24 uses the similarity table 33 to calculate the similarity between the first reading information and the second reading information. For example, the distance for substituting “a” (あ) for “wa” (わ) is 4, so the similarity between the first reading information “okinawa” (おきなわ) and the second reading information “okinaa” (おきなあ) is calculated as -4 from equation (2). Likewise, the distance for dropping “o” (お) is 5 and the distance for substituting “a” for “wa” is 4, for a total of 9, so the similarity between “okinawa” and the second reading information “kinaa” (きなあ) is calculated as -9 from equation (2).
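A table-driven distance of this kind can be sketched as a weighted edit distance. The two costs 4 and 5 come from the text; everything else is an assumption for illustration, including the dictionary names, the fallback cost `DEFAULT_COST` for entries the text does not give, and the choice of the minimum-cost alignment.

```python
# Hypothetical excerpt of a similarity table (cf. similarity table 33).
# Only the two distances quoted in the text are listed.
SUBSTITUTION: dict[tuple[str, str], int] = {("わ", "あ"): 4}
DELETION: dict[str, int] = {"お": 5}
INSERTION: dict[str, int] = {}
DEFAULT_COST = 5  # assumed fallback for operations the excerpt does not define


def weighted_distance(first: str, second: str) -> int:
    """Minimum total table distance turning `first` into `second` (one kana per syllable)."""
    # First row: build second[:j] from nothing using insertions only.
    prev = [0]
    for b in second:
        prev.append(prev[-1] + INSERTION.get(b, DEFAULT_COST))
    for a in first:
        delete_a = DELETION.get(a, DEFAULT_COST)
        curr = [prev[0] + delete_a]
        for j, b in enumerate(second, start=1):
            substitute = 0 if a == b else SUBSTITUTION.get((a, b), DEFAULT_COST)
            curr.append(min(
                prev[j] + delete_a,                               # delete a
                curr[j - 1] + INSERTION.get(b, DEFAULT_COST),     # insert b
                prev[j - 1] + substitute,                         # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Equation (2): similarity = 0 - distance.
print(0 - weighted_distance("おきなわ", "おきなあ"))  # -4
print(0 - weighted_distance("おきなわ", "きなあ"))    # -9
```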

The similarity table 33 above defines a distance for each syllable, but a distance may instead be defined for each element that makes up the pronunciation of the reading information, such as each phoneme. These elements also include the units of a speech recognition model, such as syllable models and phoneme models. The similarity calculation unit 24 decomposes the reading information of a word according to the elements used in the similarity table 33. That is, when the similarity table 33 defines a distance for each phoneme, the reading information is decomposed into phonemes to calculate the similarity.

(3-2) Modification 2
The threshold may also take into account the number of pieces of reading information for one word, as in the following equation (4).

Threshold = predetermined value - similarity + number of pieces of reading information ... (4)
The more pieces of reading information a word has, the wider the range of pronunciations that resemble one of them, and the higher the possibility of misrecognition. The threshold is therefore set higher as the number of pieces of reading information increases.

For example, the word “Okinawa” (沖縄) shown in FIG. 8 has three pieces of reading information in total: “okinawa” (おきなわ), “okinaa” (おきなあ), and “kinaa” (きなあ). For the second reading information “okinaa”, the threshold is therefore calculated from equation (4) as 84: the predetermined value 80, minus the similarity -1, plus the number of pieces of reading information 3.
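As a quick check of equation (4), the calculation for “okinaa” can be written out as below. The function name `threshold_with_reading_count` is hypothetical; the inputs are the values given in the text.

```python
def threshold_with_reading_count(similarity: int, n_readings: int, base: int = 80) -> int:
    """Equation (4): threshold = predetermined value - similarity + number of readings."""
    return base - similarity + n_readings


# "okinaa" (おきなあ): similarity -1, and the word 沖縄 has 3 readings in FIG. 8.
print(threshold_with_reading_count(similarity=-1, n_readings=3))  # 84
```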

<Third Embodiment>
In the third embodiment, as in the second embodiment, first reading information and second reading information are associated with one word. The determination value is calculated from the complexity and/or the similarity depending on the reading information, as in Pattern 3 above. The third embodiment is described below.

(1) Functional Configuration
FIG. 12 is a block diagram showing an example of the functional configuration of the speech recognition apparatus according to the third embodiment. FIGS. 13 and 14 show examples of the word dictionary DB 30 according to the third embodiment.

The CPU 11 includes a speech input unit 20, a speech recognition unit 21, a complexity calculation unit 22, a threshold setting unit 23, and a similarity calculation unit 24. The RAM 12 includes a word dictionary DB 30 and a speech recognition model DB 31. The speech input unit 20, the speech recognition unit 21, and the speech recognition model DB 31 are the same as in the first embodiment, so their description is omitted.

(1-1) Calculation Method A
First, the case is described in which the first threshold of the first reading information is calculated from the first complexity, and the second threshold of the second reading information is calculated from the second complexity and the similarity (Calculation Method A). Here, the first threshold is the threshold for determining whether the input was uttered with the first reading information, and the second threshold is the threshold for determining whether it was uttered with the second reading information. The first complexity is the complexity of the first reading information, and the second complexity is the complexity of the second reading information. The similarity expresses, in terms of distance, how far the second reading information departs from the first reading information.

The complexity calculation unit 22 calculates the first complexity for the first reading information and the second complexity for the second reading information. The method of calculating the complexity is the same as in the first embodiment, so its description is omitted. For example, when the complexity is expressed as the number of syllables, the complexity of each piece of reading information is calculated as shown in FIG. 13, which is an example of the word dictionary DB 30.

The similarity calculation unit 24 calculates, for each piece of reading information, its similarity to the first reading information. The method of calculating the similarity is the same as in the second embodiment, so its description is omitted. For example, when the similarity is expressed through the numbers of syllable substitutions, deletions, and insertions, the similarity of each piece of reading information is calculated as shown in FIG. 13.

The threshold setting unit 23 sets the threshold, for example according to the following equation (5), so that the threshold decreases as the complexity increases and increases as the similarity decreases.

Threshold = predetermined value - complexity - similarity ... (5)
For example, when the degree of coincidence between a speech recognition model sequence and the speaker's speech is calculated in the range of 0 to 100, the predetermined value is, as an example, 90.

According to equation (5), the first reading information “okinawa” (おきなわ) has a first complexity of 4 and a similarity of 0, so the first threshold is 86. The second reading information “okinaa” (おきなあ) has a second complexity of 4 and a similarity of -1, so the second threshold is 87. The first threshold, the second threshold, and the related values are stored in the word dictionary DB 30 as shown in FIG. 13.
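The two results for Calculation Method A follow directly from equation (5), as the following sketch shows. The function name `threshold_method_a` is hypothetical; the predetermined value 90, the complexities, and the similarities are those given in the text.

```python
def threshold_method_a(complexity: int, similarity: int, base: int = 90) -> int:
    """Equation (5): threshold = predetermined value - complexity - similarity."""
    return base - complexity - similarity


print(threshold_method_a(complexity=4, similarity=0))   # first reading "okinawa": 86
print(threshold_method_a(complexity=4, similarity=-1))  # second reading "okinaa": 87
```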

Because the similarity of the first reading information is 0, the first threshold is determined by the first complexity alone. For the second reading information, on the other hand, the second threshold takes both the second complexity and the similarity into account. The second threshold can therefore be set more finely, in accordance with the acoustic features of the second reading information.

（１−２）算出方法Ｂ (1-2) Calculation method B
次に、第１読み情報の第１閾値が第１複雑度に基づいて算出され、第２読み情報の第２閾値が類似度のみに基づいて算出される場合（算出方法Ｂ）を説明する。 Next, a case where the first threshold of the first reading information is calculated based on the first complexity and the second threshold of the second reading information is calculated based only on the similarity (calculation method B) will be described.

複雑度算出部２２は、第１読み情報について第１複雑度を算出する。類似度算出部２４は、各読み情報について第１読み情報との類似度を算出する。 The complexity calculation unit 22 calculates the first complexity for the first reading information. The similarity calculation unit 24 calculates, for each piece of reading information, its similarity to the first reading information.

閾値設定部２３は、第１読み情報の第１閾値を第１複雑度に基づいて設定する。例えば、上記式（５）において類似度を「０」として第１閾値を設定する。第１読み情報「おきなわ」の第１複雑度は「４」であるため、第１閾値は「８６」となる。 The threshold setting unit 23 sets the first threshold of the first reading information based on the first complexity. For example, the first threshold is set by setting the similarity to “0” in the above equation (5). Since the first complexity of the first reading information “Okinawa” is “4”, the first threshold value is “86”.

一方、閾値設定部２３は、第２読み情報の第２閾値を類似度のみに基づいて設定する。例えば、上記式（５）において複雑度を「０」として第２閾値を設定する。第２読み情報「おきなあ」の類似度は「−１」であるため第２閾値は「９１」となる。第１閾値及び第２閾値等は図１４のように単語辞書ＤＢ３０に格納される。 On the other hand, the threshold setting unit 23 sets the second threshold of the second reading information based only on the similarity. For example, the second threshold is set by taking the complexity in equation (5) above to be “0”. Since the similarity of the second reading information “Okinaa” is “−1”, the second threshold is 90 − 0 − (−1) = “91”. The first threshold, the second threshold, and so on are stored in the word dictionary DB 30 as shown in FIG. 14.
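Calculation method B can be sketched in the same way: the first threshold uses equation (5) with the similarity fixed at 0, and the second threshold uses it with the complexity fixed at 0. The function names below are hypothetical and the predetermined value of 90 is the example value from the text.

```python
PREDETERMINED_VALUE = 90  # example predetermined value


def first_threshold_b(first_complexity: int) -> int:
    # Equation (5) with similarity = 0: only the first complexity matters.
    return PREDETERMINED_VALUE - first_complexity - 0


def second_threshold_b(similarity: int) -> int:
    # Equation (5) with complexity = 0: only the similarity matters.
    return PREDETERMINED_VALUE - 0 - similarity


print(first_threshold_b(4))    # first reading "おきなわ"  -> 86
print(second_threshold_b(-1))  # second reading "おきなあ" -> 91
```

Compared with method A, the second complexity never has to be computed, at the cost of a second threshold that ignores how long or complex the variant reading is.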

（２）処理の流れ (2) Process flow
次に、閾値を設定するための処理の流れについて説明する。図１５は第３実施形態例に係る閾値管理プログラムの流れの一例を示すフローチャートである。閾値管理プログラムは、複雑度算出部２２、閾値設定部２３及び類似度算出部２４により実行される。 Next, the flow of processing for setting a threshold will be described. FIG. 15 is a flowchart showing an example of the flow of the threshold management program according to the third embodiment. The threshold management program is executed by the complexity calculation unit 22, the threshold setting unit 23, and the similarity calculation unit 24.

ステップＳ２１：複雑度算出部２２は、単語辞書ＤＢ３０において閾値の設定されていない単語があるかを判断する。 Step S21: The complexity calculation unit 22 determines whether there is a word for which no threshold is set in the word dictionary DB 30.

ステップＳ２２：複雑度算出部２２は、閾値の設定されていない単語を単語辞書ＤＢ３０から選択する。 Step S22: The complexity calculation unit 22 selects, from the word dictionary DB 30, a word for which no threshold is set.

ステップＳ２３：複雑度算出部２２は読み情報の複雑度を算出する。 Step S23: The complexity calculation unit 22 calculates the complexity of the reading information.

ここで、算出方法Ａの場合には、複雑度算出部２２は、第１読み情報及び第２読み情報のすべての複雑度を算出する。一方、算出方法Ｂの場合には、複雑度算出部２２は、第１読み情報の第１複雑度のみを算出する。 Here, in the case of calculation method A, the complexity calculation unit 22 calculates the complexity of every piece of first and second reading information. In the case of calculation method B, on the other hand, the complexity calculation unit 22 calculates only the first complexity of the first reading information.

ステップＳ２４：類似度算出部２４は、第１読み情報と第２読み情報との類似度を算出する。 Step S24: The similarity calculation unit 24 calculates the similarity between the first reading information and the second reading information.

ステップＳ２５：閾値設定部２３は、複雑度及び類似度に基づいて第１閾値及び第２閾値を設定する。 Step S25: The threshold setting unit 23 sets the first threshold and the second threshold based on the complexity and the similarity.

ステップＳ２６：閾値設定部２３は、単語辞書ＤＢ３０に第１閾値及び第２閾値を出力し単語辞書ＤＢ３０を更新する。 Step S26: The threshold value setting unit 23 outputs the first threshold value and the second threshold value to the word dictionary DB 30, and updates the word dictionary DB 30.

なお、ステップＳ２３及びステップＳ２４は順不同である。また、ステップＳ２１、Ｓ２２の処理を行う主体は、複雑度算出部２２である必要はなく、例えば類似度算出部２４であっても良い。 Note that steps S23 and S24 may be performed in either order. Further, the processing of steps S21 and S22 need not be performed by the complexity calculation unit 22 and may instead be performed by, for example, the similarity calculation unit 24.
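Steps S21 to S26 can be sketched as a loop over an in-memory dictionary. Everything here is an illustrative assumption rather than the patented implementation: the `Word` and `Reading` classes stand in for the word dictionary DB 30, the complexity is taken to be the number of kana characters, and the similarity is a crude negated mismatch count standing in for the substitution, omission, and insertion counts.

```python
from dataclasses import dataclass
from typing import List, Optional

BASE = 90  # predetermined value of equation (5)


@dataclass
class Reading:
    kana: str
    threshold: Optional[int] = None  # None means "no threshold set yet"


@dataclass
class Word:
    surface: str
    readings: List[Reading]  # readings[0] is the first (original) reading


def complexity(kana: str) -> int:
    # Assumption: one kana character counts as one unit of complexity.
    return len(kana)


def similarity(first: str, other: str) -> int:
    # Assumption: negated count of mismatched positions plus length difference.
    mismatches = sum(a != b for a, b in zip(first, other))
    return -(mismatches + abs(len(first) - len(other)))


def update_thresholds(dictionary: List[Word], method: str = "A") -> None:
    for word in dictionary:
        # S21/S22: process only words that still lack a threshold.
        if all(r.threshold is not None for r in word.readings):
            continue
        first = word.readings[0]
        for index, reading in enumerate(word.readings):
            # S23: method A uses every reading's complexity, method B only the first's.
            c = complexity(reading.kana) if (method == "A" or index == 0) else 0
            # S24: similarity of this reading to the first reading.
            s = similarity(first.kana, reading.kana)
            # S25/S26: apply equation (5) and write the result back.
            reading.threshold = BASE - c - s


dictionary = [Word("Okinawa", [Reading("おきなわ"), Reading("おきなあ")])]
update_thresholds(dictionary, method="A")
print([r.threshold for r in dictionary[0].readings])  # [86, 87]
```

Calling `update_thresholds` with `method="B"` on a fresh dictionary yields 86 and 91 for the same two readings, matching the values given for calculation method B.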

（３）変形例 (3) Modifications
（３−１）変形例１ (3-1) Modification 1
上記実施形態例の式（５）では所定値は一定の「９０」である。しかし、第２閾値を設定する場合、所定値を第１閾値に設定しても良い。例えば、第１読み情報の第１閾値が図１３に示すように「８６」である場合、第２閾値は次式（６）に基づいて設定される。 In equation (5) of the above embodiment, the predetermined value is a constant “90”. When setting the second threshold, however, the predetermined value may be set to the first threshold. For example, when the first threshold of the first reading information is “86” as shown in FIG. 13, the second threshold is set based on the following equation (6).

第２閾値＝８６−複雑度−類似度・・・（６） Second threshold = 86 − complexity − similarity ... (6)
第２読み情報の第２閾値を複雑度及び類似度の両方を考慮して設定する場合、第２読み情報「おきなあ」の第２閾値は、８６−複雑度「４」−類似度「−１」＝８３と設定される。これにより、第２読み情報の第２閾値を、第１読み情報の第１閾値を考慮して算出することができる。 When the second threshold of the second reading information is set in consideration of both the complexity and the similarity, the second threshold of the second reading information “Okinaa” is set to 86 − complexity “4” − similarity “−1” = 83. In this way, the second threshold of the second reading information can be calculated with the first threshold of the first reading information taken into account.
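A brief sketch of Modification 1, assuming the same example values: the first threshold is computed from equation (5) with the constant 90, and then serves as the predetermined value in equation (6). The function names are hypothetical.

```python
def threshold(predetermined_value: int, complexity: int, similarity: int) -> int:
    """Shared form of equations (5) and (6)."""
    return predetermined_value - complexity - similarity


# Equation (5) for the first reading "おきなわ" (complexity 4, similarity 0).
first_threshold = threshold(90, complexity=4, similarity=0)

# Equation (6): the first threshold replaces the constant predetermined value
# when setting the second threshold for "おきなあ" (complexity 4, similarity -1).
second_threshold = threshold(first_threshold, complexity=4, similarity=-1)

print(first_threshold, second_threshold)  # 86 83
```

With this variant the second threshold (83) ends up below the first (86), whereas equation (5) with the constant 90 gave 87, so the choice of predetermined value changes which reading is accepted more easily.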

（３−２）変形例２ (3-2) Modification 2
第２実施形態例の変形例２と同様に、読み情報の数を考慮し、次式（７）のように閾値を設定しても良い。 As in Modification 2 of the second embodiment, the threshold may be set according to the following equation (7), taking the number of pieces of reading information into account.

閾値＝所定値−複雑度−類似度＋読み情報の数・・・（７） Threshold = predetermined value − complexity − similarity + number of pieces of reading information ... (7)
＜その他の実施形態例＞ <Other embodiments>
（Ａ）上記では、日本語の単語を例に挙げて閾値を設定したが、上記閾値の設定方法はあらゆる言語に適用可能である。 (A) In the above description, thresholds are set using Japanese words as examples, but the threshold setting method is applicable to any language.

（Ｂ） (B)
上記実施形態例では、音声認識プログラム及び閾値管理プログラムなど各種プログラムがＲＯＭ１３に記憶されている。前記各種プログラムは、その他、ＲＡＭ１２及びＨＤＤ１４等に記憶されていても良い。 In the above embodiments, various programs such as the speech recognition program and the threshold management program are stored in the ROM 13. These programs may alternatively be stored in the RAM 12, the HDD 14, or the like.

また、前記各種プログラムは、音声認識装置１内に記憶されている必要はなく、図１６に示すように音声認識装置１と接続される外部の記憶装置６４に記憶されても良い。図１６は音声認識装置とその他の記憶装置及び記録媒体等との関係を示すブロック図である。 The various programs need not be stored in the speech recognition device 1 and may be stored in an external storage device 64 connected to the speech recognition device 1, as shown in FIG. 16. FIG. 16 is a block diagram showing the relationship between the speech recognition device and other storage devices, recording media, and the like.

また、前記プログラムを記録したコンピュータ読み取り可能な記録媒体６２は、本発明の範囲に含まれる。ここで、コンピュータ読み取り可能な記録媒体としては、例えば、フレキシブルディスク、ハードディスク、ＣＤ−ＲＯＭ、ＭＯ、ＤＶＤ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＢＤ（Ｂｌｕ−ｒａｙＤｉｓｃ）、ＵＳＢメモリ、半導体メモリ等を挙げることができる。 A computer-readable recording medium 62 on which the program is recorded is also included in the scope of the present invention. Examples of the computer-readable recording medium include a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a BD (Blu-ray Disc), a USB memory, and a semiconductor memory.

また、前記プログラムは、電気通信回線、無線又は有線通信回線、インターネットを代表とするネットワーク等を経由して記憶装置６１から伝送されるものであってもよい。 The program may be transmitted from the storage device 61 via an electric communication line, a wireless or wired communication line, a network represented by the Internet, or the like.

以上の実施形態及びその他の実施形態に関し、更に以下の付記を開示する。 With regard to the above embodiments and the other embodiments, the following appendices are further disclosed.

＜付記＞
（付記１）
所定の判断基準に基づいて、入力された音声を認識するか否かを判定するための閾値を決定する判断値を、単語毎の発音を示す読み情報に基づいて算出する判断値算出ステップと、
前記判断値に基づいて、前記閾値を求めて、前記単語毎に設定する閾値設定ステップと、
をコンピュータに実行させるための音声認識用の閾値管理プログラム。 <Appendix>
(Appendix 1)
A threshold management program for speech recognition that causes a computer to execute:
a determination value calculating step of calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech; and
a threshold setting step of obtaining the threshold based on the determination value and setting the threshold for each word.

（付記２）
前記判断値算出ステップは、前記判断値として、前記読み情報の読みの複雑度を算出する複雑度算出ステップを含み、
前記閾値設定ステップでは、前記複雑度に基づいて、前記複雑度が大きいほど前記閾値が小さくなるように前記閾値を求めて、前記単語毎に設定する、付記１に記載の音声認識用の閾値管理プログラム。 (Appendix 2)
The threshold management program for speech recognition according to appendix 1, wherein
the determination value calculating step includes a complexity calculation step of calculating, as the determination value, a reading complexity of the reading information, and
in the threshold setting step, the threshold is obtained based on the complexity so that the threshold decreases as the complexity increases, and is set for each word.

（付記３）
前記コンピュータは、前記単語と前記読み情報とを対応付けた単語辞書を備えており、
前記閾値設定ステップでは、前記単語辞書の全ての単語毎に閾値を設定する、付記２に記載の音声認識用の閾値管理プログラム。 (Appendix 3)
The threshold management program for speech recognition according to appendix 2, wherein
the computer includes a word dictionary that associates the word with the reading information, and
in the threshold setting step, a threshold is set for every word in the word dictionary.

（付記４）
前記読み情報は、単語本来の発音を示す第１読み情報と、前記第１読み情報とは異なる、前記単語の発音である第２読み情報と、を含み、
前記複雑度算出ステップでは、前記第１読み情報に基づいて第１複雑度を算出し、前記第２読み情報に基づいて第２複雑度を算出し、
前記閾値設定ステップでは、前記単語が前記第１読み情報により発声されたか否かを判定するための第１閾値を前記第１複雑度に基づいて設定し、前記単語が前記第２読み情報により発声されたか否かを判定するための第２閾値を前記第２複雑度に基づいて設定する、
付記２又は３に記載の音声認識用の閾値管理プログラム。 (Appendix 4)
The threshold management program for speech recognition according to appendix 2 or 3, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
in the complexity calculation step, a first complexity is calculated based on the first reading information and a second complexity is calculated based on the second reading information, and
in the threshold setting step, a first threshold for determining whether or not the word has been uttered with the first reading information is set based on the first complexity, and a second threshold for determining whether or not the word has been uttered with the second reading information is set based on the second complexity.

（付記５）
前記判断値算出ステップは、前記第１読み情報と前記第２読み情報との類似度を算出する類似度算出ステップをさらに含み、
前記閾値設定ステップでは、前記第２複雑度に加えて前記類似度に基づいて、前記類似度が小さいほど前記第２閾値が大きくなるように、前記第２閾値を求めて、前記単語毎に設定する、付記４に記載の音声認識用の閾値管理プログラム。 (Appendix 5)
The threshold management program for speech recognition according to appendix 4, wherein
the determination value calculating step further includes a similarity calculation step of calculating a similarity between the first reading information and the second reading information, and
in the threshold setting step, the second threshold is obtained based on the similarity in addition to the second complexity so that the second threshold increases as the similarity decreases, and is set for each word.

（付記６）
前記読み情報は、単語本来の発音を示す第１読み情報と、前記第１読み情報とは異なる、前記単語の発音である第２読み情報と、を含み、
前記複雑度算出ステップでは、前記第１読み情報に基づいて第１複雑度を算出し、
前記判断値算出ステップは、前記第１読み情報と前記第２読み情報との類似度を算出する類似度算出ステップをさらに含み、
前記閾値設定ステップでは、前記単語が前記第１読み情報により発声されたか否かを判定するための第１閾値を前記第１複雑度に基づいて設定し、前記単語が前記第２読み情報により発声されたか否かを判定するための第２閾値を、前記類似度に基づいて、前記類似度が小さいほど前記第２閾値が大きくなるように求めて、前記単語毎に設定する、付記２に記載の音声認識用の閾値管理プログラム。 (Appendix 6)
The threshold management program for speech recognition according to appendix 2, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
in the complexity calculation step, a first complexity is calculated based on the first reading information,
the determination value calculating step further includes a similarity calculation step of calculating a similarity between the first reading information and the second reading information, and
in the threshold setting step, a first threshold for determining whether or not the word has been uttered with the first reading information is set based on the first complexity, and a second threshold for determining whether or not the word has been uttered with the second reading information is obtained based on the similarity so that the second threshold increases as the similarity decreases, and is set for each word.

（付記７）
前記複雑度算出ステップでは、前記単語の発音を構成する要素の要素数に基づいて前記複雑度を算出する、付記２乃至６のいずれかに記載の音声認識用の閾値管理プログラム。 (Appendix 7)
The threshold management program for speech recognition according to any one of appendices 2 to 6, wherein in the complexity calculation step, the complexity is calculated based on the number of elements constituting the pronunciation of the word.

（付記８）
前記複雑度算出ステップでは、前記単語の発音を構成する要素毎に定義された要素複雑度の総和に基づいて前記複雑度を算出する、付記２乃至６のいずれかに記載の音声認識用の閾値管理プログラム。 (Appendix 8)
The threshold management program for speech recognition according to any one of appendices 2 to 6, wherein in the complexity calculation step, the complexity is calculated based on the sum of element complexities defined for each element constituting the pronunciation of the word.

（付記９）
前記読み情報は、単語本来の発音を示す第１読み情報と、前記第１読み情報とは異なる、前記単語の発音である第２読み情報と、を含み、
前記判断値算出ステップは、前記判断値として、前記第１読み情報と前記第２読み情報との類似度を算出する類似度算出ステップを含み、
前記閾値設定ステップでは、前記単語が前記第２読み情報により発声されたか否かを判定するための第２閾値を、前記類似度に基づいて、前記類似度が小さいほど前記第２閾値が大きくなるように前記第２閾値を求めて、前記単語毎に設定する、付記１に記載の音声認識用の閾値管理プログラム。 (Appendix 9)
The threshold management program for speech recognition according to appendix 1, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
the determination value calculating step includes a similarity calculation step of calculating, as the determination value, a similarity between the first reading information and the second reading information, and
in the threshold setting step, a second threshold for determining whether or not the word has been uttered with the second reading information is obtained based on the similarity so that the second threshold increases as the similarity decreases, and is set for each word.

（付記１０）
コンピュータが実行する音声認識用の閾値管理方法であって、
所定の判断基準に基づいて、入力された音声を認識するか否かを判定するための閾値を決定する判断値を、単語毎の発音を示す読み情報に基づいて算出する判断値算出ステップと、
前記判断値に基づいて、前記閾値を求めて、前記単語毎に設定する閾値設定ステップと、
を含む音声認識用の閾値管理方法。 (Appendix 10)
A threshold management method for speech recognition executed by a computer, the method including:
a determination value calculating step of calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech; and
a threshold setting step of obtaining the threshold based on the determination value and setting the threshold for each word.

（付記１１）
所定の判断基準に基づいて、入力された音声を認識するか否かを判定するための閾値を決定する判断値を、単語毎の発音を示す読み情報に基づいて算出する判断値算出手段と、
前記判断値に基づいて、前記閾値を求めて、前記単語毎に設定する閾値設定手段と、
入力された音声を前記単語として認識するか否かを、前記閾値を用いて判定する音声認識手段と、
を含む音声認識装置。 (Appendix 11)
A speech recognition device including:
determination value calculating means for calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech;
threshold setting means for obtaining the threshold based on the determination value and setting the threshold for each word; and
speech recognition means for determining, using the threshold, whether or not to recognize the input speech as the word.

１：音声認識装置 1: Speech recognition device
２０：音声入力部 20: Speech input unit
２１：音声認識部 21: Speech recognition unit
２２：複雑度算出部 22: Complexity calculation unit
２３：閾値設定部 23: Threshold setting unit
２４：類似度算出部 24: Similarity calculation unit
３０：単語辞書ＤＢ 30: Word dictionary DB
３１：音声認識モデルＤＢ 31: Speech recognition model DB
３２：複雑度テーブル 32: Complexity table
３３：類似度テーブル 33: Similarity table

Claims

A threshold management program for speech recognition that causes a computer to execute:
a determination value calculating step of calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech; and
a threshold setting step of obtaining the threshold based on the determination value and setting the threshold for each word.

The threshold management program for speech recognition according to claim 1, wherein
the determination value calculating step includes a complexity calculation step of calculating, as the determination value, a reading complexity of the reading information, and
in the threshold setting step, the threshold is obtained based on the complexity so that the threshold decreases as the complexity increases, and is set for each word.

The threshold management program for speech recognition according to claim 2, wherein
the computer includes a word dictionary that associates the word with the reading information, and
in the threshold setting step, a threshold is set for every word in the word dictionary.

The threshold management program for speech recognition according to claim 2 or 3, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
in the complexity calculation step, a first complexity is calculated based on the first reading information and a second complexity is calculated based on the second reading information, and
in the threshold setting step, a first threshold for determining whether or not the word has been uttered with the first reading information is set based on the first complexity, and a second threshold for determining whether or not the word has been uttered with the second reading information is set based on the second complexity.

The threshold management program for speech recognition according to claim 4, wherein
the determination value calculating step further includes a similarity calculation step of calculating a similarity between the first reading information and the second reading information, and
in the threshold setting step, the second threshold is obtained based on the similarity in addition to the second complexity so that the second threshold increases as the similarity decreases, and is set for each word.

The threshold management program for speech recognition as described, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
in the complexity calculation step, a first complexity is calculated based on the first reading information,
the determination value calculating step further includes a similarity calculation step of calculating a similarity between the first reading information and the second reading information, and
in the threshold setting step, a first threshold for determining whether or not the word has been uttered with the first reading information is set based on the first complexity, and a second threshold for determining whether or not the word has been uttered with the second reading information is obtained based on the similarity so that the second threshold increases as the similarity decreases, and is set for each word.

The threshold management program for speech recognition according to claim 1, wherein
the reading information includes first reading information indicating the original pronunciation of a word, and second reading information that is a pronunciation of the word different from the first reading information,
the determination value calculating step includes a similarity calculation step of calculating, as the determination value, a similarity between the first reading information and the second reading information, and
in the threshold setting step, a second threshold for determining whether or not the word has been uttered with the second reading information is obtained based on the similarity so that the second threshold increases as the similarity decreases, and is set for each word.

A threshold management method for speech recognition executed by a computer, the method including:
a determination value calculating step of calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech; and
a threshold setting step of obtaining the threshold based on the determination value and setting the threshold for each word.

A speech recognition device including:
determination value calculating means for calculating, based on reading information indicating the pronunciation of each word, a determination value for determining a threshold used to determine, based on a predetermined determination criterion, whether or not to recognize input speech;
threshold setting means for obtaining the threshold based on the determination value and setting the threshold for each word; and
speech recognition means for determining, using the threshold, whether or not to recognize the input speech as the word.