JP2011215291A

JP2011215291A - Speech recognition device and program

Info

Publication number: JP2011215291A
Application number: JP2010082109A
Authority: JP
Inventors: Yoriko Sasaki; ヨリ子佐々木; Yasushi Kamisawa; 泰上澤; Ryuji Mizutani; 龍治水谷; Naoyuki Kurauchi; 直行倉内
Original assignee: Aisin AW Co Ltd
Current assignee: Aisin AW Co Ltd
Priority date: 2010-03-31
Filing date: 2010-03-31
Publication date: 2011-10-27

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device and program that appropriately switch between forward and reverse recognition processing according to the speech uttered by a user. SOLUTION: Using a forward direction dictionary, recognition processing is performed in the forward direction by reproducing the speech information in the forward direction (S401), and the position of a fixed word in the speech information is estimated from the result of the forward recognition processing (S403 and S405). When the fixed word is estimated to be in the latter half of the speech information (S405), recognition processing is performed in the reverse direction by reproducing the speech information in the reverse direction using a reverse direction dictionary (S406). Because the recognition direction is switched according to the position of the fixed word in the user's utterance, speech recognition accuracy improves without the user having to be conscious of any operation concerning the recognition direction.

Description

The present invention relates to a speech recognition apparatus and a program.

Speech recognition apparatuses that recognize speech uttered by a user are conventionally known. Speech recognition generally processes the user's utterance in order from the beginning. However, for addresses such as those used in Europe and the United States, which run from small administrative units to large ones, a technique has been disclosed in which a switch operation or the like by the user is used to detect that an address is being spoken, and only in that case backward recognition processing is performed using a backward tree structure dictionary for address recognition (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent No. 4104313

However, Patent Document 1 does not disclose performing backward recognition processing for speech other than addresses. Moreover, a switch operation or the like by the user had to be detected before backward recognition processing could start. In other words, Patent Document 1 could not appropriately switch between forward and reverse recognition processing according to the speech uttered by the user.
The present invention has been made in view of the above problem, and its object is to provide a speech recognition apparatus and a program that can appropriately switch between performing recognition processing in the forward direction and in the reverse direction according to the speech uttered by the user.

The speech recognition apparatus according to claim 1 has forward tree structure dictionary data, in which speech data of words reproduced in the forward direction is stored in association with the words, and backward tree structure dictionary data, in which speech data of words reproduced in the reverse direction is stored in association with the words. The apparatus comprises: speech information acquisition means for acquiring speech information consisting of a fixed word, which forms the root of the tree structure, and a variable word, which corresponds to the fixed word and forms a branch or leaf of the tree structure; forward recognition processing means for performing recognition processing in the forward direction by reproducing the speech information in the forward direction, based on the forward tree structure dictionary data; fixed word position estimation means for estimating the position of the fixed word in the speech information based on the result of the forward recognition processing; and reverse recognition processing means for performing recognition processing in the reverse direction by reproducing the speech information in the reverse direction, based on the backward tree structure dictionary data, when the fixed word position estimation means estimates that the fixed word is in the latter half of the speech information.
With this configuration, whether the speech information is recognized in the forward or the reverse direction can be switched appropriately based on the position of the fixed word in the user's utterance, so speech recognition accuracy improves without the user having to be conscious of any operation concerning the recognition direction.

In the invention according to claim 2, the speech recognition apparatus comprises divergence determination means for determining, based on the recognition result of the forward recognition processing means, whether the first half of the speech information has diverged. When the divergence determination means determines that the first half of the speech information has diverged, the fixed word position estimation means estimates that the fixed word is in the latter half of the speech information. This reduces the computational load of estimating the position of the fixed word.

In the invention according to claim 3, the divergence determination means performs recognition processing on partial speech information, which is the speech information for a predetermined period after the speech information acquisition means starts acquiring speech information, and determines that the first half of the speech information has diverged when no recognition result is obtained for the partial speech information. Because divergence of the first half can be determined without recognizing the entire speech information, the computational load is reduced further. In addition, when the first half is determined to have diverged, processing can switch quickly to reverse processing. Compared with switching to reverse processing only after forward processing of all the speech information, this shortens the time from acquisition of the speech information until a recognition result is obtained.

Specifically, whether the first half of the speech information has diverged can be determined as follows.
In the invention according to claim 4, among a plurality of recognition target candidates each having a score used to identify the recognition result of the partial speech information, when the difference in score between the candidate with the highest score and the candidate with the lowest score is within a predetermined threshold, the divergence determination means determines that no recognition result is obtained for the partial speech information and that the first half of the speech information has diverged.
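The claim 4 criterion can be illustrated with a minimal Python sketch. The function name `diverged_by_score_range`, the use of plain numeric scores, and the example threshold are assumptions made for illustration only; the specification does not prescribe an implementation.

```python
def diverged_by_score_range(scores, threshold):
    """Return True when the spread between the best and worst candidate
    scores is within `threshold`, i.e. no candidate stands out.

    `scores` holds one score per recognition target candidate
    (hypothetical representation; higher means a better match).
    """
    if not scores:
        raise ValueError("at least one candidate score is required")
    return max(scores) - min(scores) <= threshold


# All candidates score about the same: treated as diverged.
flat = diverged_by_score_range([52, 50, 49], threshold=5)
# One candidate clearly stands out: not diverged.
peaked = diverged_by_score_range([90, 30, 20], threshold=5)
```

In this sketch a small spread means that the partial speech information matches no candidate distinctly, which is the situation the text treats as "no recognition result obtained".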

In the invention according to claim 5, among a plurality of recognition target candidates each having a score used to identify the recognition result of the partial speech information, when at least a predetermined number of candidates have a score whose difference from that of the highest-scoring candidate is within a predetermined threshold, the divergence determination means determines that no recognition result is obtained for the partial speech information and that the first half of the speech information has diverged.
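The claim 5 criterion can be sketched in the same way. The function name `diverged_by_close_candidates` is hypothetical, and excluding the top candidate itself from the count is an assumption of this sketch; the text does not say whether the top candidate is counted.

```python
def diverged_by_close_candidates(scores, threshold, min_count):
    """Return True when at least `min_count` candidates, other than the
    top candidate, score within `threshold` of the top score."""
    if not scores:
        raise ValueError("at least one candidate score is required")
    best = max(scores)
    # Count every candidate close to the best, then drop the best itself.
    close = sum(1 for s in scores if best - s <= threshold) - 1
    return close >= min_count


scores = [10, 9, 9, 8, 2]
two_rivals = diverged_by_close_candidates(scores, threshold=1, min_count=2)
three_rivals = diverged_by_close_candidates(scores, threshold=1, min_count=3)
```

Here two candidates (both scoring 9) lie within 1 of the top score, so the first call reports divergence and the second does not.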

As in the inventions according to claims 4 and 5, the first half of the speech information is determined to have diverged when the scores used to identify the recognition result vary little among the recognition candidates. This makes it possible to determine appropriately whether the first half of the speech information has diverged.

In the invention according to claim 6, the divergence determination means calculates, from the recognition result of the forward recognition processing means, a reliability for each word identified as the recognition result. When the reliability of the words in the first half of the speech information is at or below a predetermined value and the reliability of the words in the latter half is at or above a predetermined value, it determines that the first half of the speech information has diverged. This makes it possible to determine appropriately, based on the reliability of the recognized words, whether the first half has diverged.
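A minimal Python sketch of the claim 6 criterion follows. It assumes that each recognized word carries a start time and a reliability value, that words are assigned to the first or latter half by start time, that a single threshold serves as both predetermined values, and that every word in each half must satisfy the condition. None of these details is fixed by the text, and the function name is hypothetical.

```python
def diverged_by_word_reliability(words, boundary_s, threshold):
    """`words` is a list of (start_time_s, reliability) pairs.

    Returns True when every first-half word has reliability <= threshold
    and every latter-half word has reliability >= threshold.
    """
    first = [r for start, r in words if start < boundary_s]
    latter = [r for start, r in words if start >= boundary_s]
    if not first or not latter:
        # Without words on both sides the criterion cannot be evaluated.
        return False
    return all(r <= threshold for r in first) and all(r >= threshold for r in latter)


# Unreliable start, reliable ending: the fixed word is probably at the end.
late_fixed = diverged_by_word_reliability([(0.0, 0.2), (0.6, 0.9)], 0.5, 0.5)
# Reliable start: not diverged.
early_fixed = diverged_by_word_reliability([(0.0, 0.9), (0.6, 0.2)], 0.5, 0.5)
```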

The invention has been described above as a speech recognition apparatus, but it can also be realized as a program, as follows.
That is, the program causes a computer to function as: speech information acquisition means for acquiring speech information consisting of a fixed word, which forms the root of the tree structure, and a variable word, which corresponds to the fixed word and forms a branch or leaf of the tree structure; forward recognition processing means for performing recognition processing in the forward direction by reproducing the speech information in the forward direction, based on the forward tree structure dictionary data; fixed word position estimation means for estimating the position of the fixed word in the speech information based on the result of the forward recognition processing; and reverse recognition processing means for performing recognition processing in the reverse direction by reproducing the speech information in the reverse direction, based on the backward tree structure dictionary data, when the fixed word position estimation means estimates that the fixed word is in the latter half of the speech information. Executing such a program provides the same effects as the navigation device described above.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to one embodiment of the present invention.
FIG. 2 is a diagram explaining the recognition dictionary according to one embodiment of the present invention, in which (a) explains the forward tree structure dictionary data and (b) explains the backward tree structure dictionary data.
FIG. 3 is a flowchart explaining speech recognition processing according to one embodiment of the present invention.
FIG. 4 is a flowchart explaining speech recognition processing according to one embodiment of the present invention.
FIG. 5 is an explanatory diagram of recognition processing based on forward tree structure data according to one embodiment of the present invention.
FIG. 6 is an explanatory diagram of recognition processing based on backward tree structure data according to one embodiment of the present invention.
FIG. 7 is an explanatory diagram of divergence determination processing according to a modification of the present invention.

The speech recognition apparatus according to the present invention is described below with reference to the drawings.
(One embodiment)
FIG. 1 is a block diagram illustrating the overall configuration of a speech recognition apparatus 1 according to one embodiment of the present invention. The speech recognition apparatus 1 is applied to, for example, a navigation device mounted on a vehicle. The speech recognition apparatus 1 is built around a control unit 10 and includes a speech recognition information storage unit 20, an operation switch group 30, a speech input unit 40, a speech output unit 50, a drawing unit 60, an information storage unit 70, and the like, each connected to the control unit 10.

The control unit 10 is configured as an ordinary computer. It contains a CPU, ROM, I/O, and bus lines connecting these components.
The speech recognition information storage unit 20 is a storage device realized as, for example, a hard disk drive (HDD). Although an HDD is used in this embodiment, other media such as a DVD-ROM or a memory card may be used instead.

The speech recognition information storage unit 20 stores a speech recognition database 21. The speech recognition database 21 includes an acoustic model 22, a recognition dictionary 23, and a language model 24. The acoustic model 22 is data that associates speech feature quantities with phonemes. In this embodiment, the acoustic model 22 has a forward model 221, an acoustic model for speech reproduced in the forward direction, and a reverse model 222, an acoustic model for speech reproduced in the reverse direction. The recognition dictionary 23 stores words associated with phoneme strings. The language model 24 is data that models the probability of a word appearing at the beginning or end of a sentence, the connection probability between consecutive words, dependency relationships, and the like.

The recognition dictionary 23 includes a forward dictionary 231, which is forward tree structure dictionary data associating words with their phoneme strings as reproduced in the forward direction, and a reverse dictionary 232, which is backward tree structure dictionary data associating words with their phoneme strings as reproduced in the reverse direction. Taking the word 「再生」 ("saisei", meaning "playback") as an example, the forward dictionary 231 stores "saisei", the speech data when reproduced in the forward direction, in association with the word, and the reverse dictionary 232 stores "iesias", the speech data when reproduced in the reverse direction, in association with the same word. The notations "saisei" and "iesias" are used for convenience of explanation; the actual data is stored in units of phonemes.
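The relationship between the two dictionaries can be sketched in Python as follows. As a simplifying assumption, each phoneme is approximated by a single letter of the romanized word, whereas the text states that the actual data is stored in units of phonemes. The function name `build_dictionaries` and the dictionary layout are hypothetical.

```python
def build_dictionaries(words):
    """Build forward and reverse lookup tables from {word: phoneme list}.

    The forward table is keyed by the phoneme sequence in spoken order,
    the reverse table by the same sequence played backwards.
    """
    forward = {}
    reverse = {}
    for word, phonemes in words.items():
        forward[tuple(phonemes)] = word
        reverse[tuple(reversed(phonemes))] = word
    return forward, reverse


# "playback" stands for the word written 再生 in the original.
forward_dict, reverse_dict = build_dictionaries({"playback": list("saisei")})
```

Looking up `tuple("saisei")` in `forward_dict` and `tuple("iesias")` in `reverse_dict` both yield the same word, mirroring the example in the text.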

The data structures of the forward dictionary 231 and the reverse dictionary 232 are now described with reference to FIG. 2. FIG. 2 uses as an example the variable words "ABC" and "DEF" associated with the fixed word "saisei" ("playback"), which instructs a predetermined process. The fixed word "saisei" is stored in the "fixed word dictionary" described later, and the variable words "ABC" and "DEF" are stored in the "variable word dictionary" described later in association with "saisei".

The forward dictionary 231 is a dictionary for recognizing utterances reproduced in the forward direction. As shown in FIG. 2(a), the variable words "ABC", "DEF", and so on are stored in association with the fixed word "saisei", which precedes the variable word in time. In other words, as FIG. 2(a) shows, the forward dictionary 231 has a "forward tree structure", in which the fixed word forming the root of the tree lies on the side of the start position, where acquisition of the speech information began. For example, to recognize the utterance "saisei ABC" using the forward dictionary 231, the waveform data obtained by reproducing the acquired speech information in the forward direction is compared with "saisei" stored in the forward dictionary 231 and with "ABC" stored in association with "saisei", and the utterance is thereby identified as "saisei ABC".

The reverse dictionary 232, on the other hand, is a dictionary for recognizing utterances reproduced in the reverse direction. As shown in FIG. 2(b), the variable words "ABC", "DEF", and so on are stored in association with the fixed word "saisei", which follows the variable word in time. In other words, as FIG. 2(b) shows, the reverse dictionary 232 has a "backward tree structure", in which the fixed word forming the root of the tree lies on the side of the end position, where acquisition of the speech information finished. For example, to recognize the utterance "ABC saisei" using the reverse dictionary 232, the waveform data obtained by reproducing the acquired speech information in the reverse direction is compared with "iesias" stored in the reverse dictionary 232 and with "CBA" stored in association with "iesias", and the utterance is thereby identified as "ABC saisei".
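The two tree structures can be sketched in Python with strings standing in for phoneme sequences. The names `build_reverse_tree` and `match_tree` and the dictionary-of-lists layout are assumptions for illustration only; a real recognizer would match acoustic features rather than exact strings.

```python
def build_reverse_tree(forward_tree):
    """Derive the backward tree by reversing every fixed and variable word."""
    return {
        fixed[::-1]: [variable[::-1] for variable in variables]
        for fixed, variables in forward_tree.items()
    }


def match_tree(tree, sequence):
    """Return (fixed, variable) if `sequence` is a fixed word (the root)
    followed by one of its variable words (a leaf); otherwise None."""
    for fixed, variables in tree.items():
        if sequence.startswith(fixed):
            rest = sequence[len(fixed):]
            if rest in variables:
                return fixed, rest
    return None


forward_tree = {"saisei": ["ABC", "DEF"]}
reverse_tree = build_reverse_tree(forward_tree)

# "saisei ABC" is matched from its beginning with the forward tree.
forward_hit = match_tree(forward_tree, "saiseiABC")
# "ABC saisei" is played backwards and matched with the backward tree.
reverse_hit = match_tree(reverse_tree, "ABCsaisei"[::-1])
```

In both cases matching starts at the root of the tree, which is why an utterance ending in the fixed word is easier to match after it has been reversed.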

The forward dictionary 231 and the reverse dictionary 232 each include a fixed word dictionary for fixed words, which are words that instruct a predetermined process and form the root of the tree structure. Each fixed word stored in the fixed word dictionary is associated with a variable word dictionary for variable words, which are words that can serve as the object of that fixed word and form the branches and leaves of the tree structure. A "fixed word" in this embodiment is a verb that instructs a predetermined process, for example music playback. A "fixed word" can also be described as a command that instructs a predetermined process.

For example, the verb "saisei suru" ("to play") stored in the fixed word dictionary is associated with variable word dictionaries such as an "artist name dictionary" and a "song title dictionary". Although the word "saisei" by itself is a noun, the noun part of a "noun + suru verb" word such as "saisei suru" is treated as a word that instructs a predetermined process, and in this embodiment it is stored in the fixed word dictionary as a "fixed word". Forms that include a particle, such as "o saisei suru", are also registered in the fixed word dictionary as fixed words.
The recognition dictionary 23 of this embodiment stores words in a plurality of languages, such as Japanese and English.

The operation switch group 30 consists of touch switches integrated with a display 61, mechanical switches, a remote control device, and the like, and is used for various inputs. The operation switch group 30 includes a talk switch 31, which is operated for speech input.

A microphone 41 for inputting speech is connected to the speech input unit 40. When the talk switch 31 is turned on, the speech information uttered by the user is acquired through the microphone 41.
A speaker 51 for outputting sound is connected to the speech output unit 50.
The display 61 is connected to the drawing unit 60. The display 61 is a color display using a liquid crystal panel or a CRT, and information is displayed on it.

The information storage unit 70 stores the acquired speech information and is implemented on the same HDD as the speech recognition information storage unit 20. Other media such as a memory card may of course be used.

This embodiment is characterized in that it appropriately determines whether the utterance has a forward tree structure or a backward tree structure, and appropriately switches between performing recognition processing in the forward direction by reproducing the speech information in the forward direction and performing it in the reverse direction by reproducing the speech information in the reverse direction. The speech recognition processing is described below based on the flowcharts shown in FIGS. 3 and 4.

In the following, performing recognition processing in the forward direction by reproducing the speech information in the forward direction is called "forward processing", and performing recognition processing in the reverse direction by reproducing the speech information in the reverse direction is called "reverse processing". In this embodiment, forward processing first performs recognition based on the fixed word dictionary included in the forward dictionary 231 and, if no recognition result is obtained, then performs recognition based on the dictionaries other than the fixed word dictionary included in the forward dictionary 231. Likewise, reverse processing first performs recognition based on the fixed word dictionary included in the reverse dictionary 232 and, if no recognition result is obtained, then performs recognition based on the dictionaries other than the fixed word dictionary included in the reverse dictionary 232.
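The two-stage lookup described above can be sketched in Python. Modelling each dictionary as a plain mapping from a phoneme string to a word is an assumption of this sketch, as are the function name `lookup_two_stage` and the sample entries.

```python
def lookup_two_stage(segment, fixed_word_dict, other_dicts):
    """Try the fixed word dictionary first; fall back to the other
    dictionaries only when it yields no result."""
    word = fixed_word_dict.get(segment)
    if word is not None:
        return word
    for dictionary in other_dicts:
        word = dictionary.get(segment)
        if word is not None:
            return word
    return None


# Hypothetical entries for the forward direction.
fixed_words = {"saisei": "playback"}
others = [{"ABC": "ABC"}, {"DEF": "DEF"}]

hit_fixed = lookup_two_stage("saisei", fixed_words, others)
hit_other = lookup_two_stage("DEF", fixed_words, others)
miss = lookup_two_stage("xyz", fixed_words, others)
```

The same function would serve reverse processing if it were given reversed keys such as "iesias".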

The main flow of the speech recognition processing is described first with reference to FIG. 3. The speech recognition processing shown in FIG. 3 is performed when the talk switch 31 is turned on.
In the first step S100 (hereinafter "step" is omitted and indicated simply by the symbol "S"), it is detected that the talk switch 31 has been turned on.

In S200, the recognition dictionary 23 is set.
In S300, the speech uttered by the user and input through the microphone 41 is A/D converted into waveform data suitable for data processing, and the waveform data is acquired as speech information. The acquired speech information is stored sequentially in the information storage unit 70.
In S400, it is determined whether the speech information acquired in S300 is to undergo forward processing or reverse processing, and recognition processing is performed.

The recognition processing performed here is described based on the sub-flow shown in FIG. 4. The recognition processing shown in FIG. 4 may start at the same time as acquisition of the speech information starts in S300, or may start after acquisition of the speech information has finished.
In S401 in FIG. 4, the speech information acquired in S300 undergoes forward processing based on the forward dictionary 231. The forward processing in S401 may be limited to recognition based on the fixed word dictionary included in the forward dictionary 231, without recognition based on the dictionaries other than the fixed word dictionary.

In S402, it is determined whether the first half of the acquired speech information has diverged. The method of making this determination is described in detail later. In this embodiment, the partial speech information, which is the speech information for a predetermined period (for example, 0.5 seconds) after acquisition of the speech information starts, is taken as the "first half", and the remainder of the speech information is taken as the "latter half". If the first half of the acquired speech information is determined to have diverged (S402: YES), the process proceeds to S405. If it is determined not to have diverged (S402: NO), the process proceeds to S403.
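The split into a "first half" and a "latter half" can be illustrated with a short Python sketch. It assumes that the speech information is a sequence of samples with a known sampling rate; the function name `split_partial` and the 16 kHz rate in the usage example are illustrative assumptions, and only the 0.5 second period comes from the text.

```python
def split_partial(samples, sample_rate_hz, period_s=0.5):
    """Split samples into the partial speech information (the first
    `period_s` seconds) and the remaining latter half."""
    boundary = int(sample_rate_hz * period_s)
    return samples[:boundary], samples[boundary:]


# Two seconds of dummy samples at an assumed 16 kHz sampling rate.
samples = [0] * 32000
first_half, latter_half = split_partial(samples, sample_rate_hz=16000)
```

Note that the "first half" in this sense is a fixed-length prefix, not half of the utterance, so the latter half is usually the longer part.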

In S403, the fixed word is estimated to be in the first half of the speech information.
In S404, the speech information undergoes forward processing based on the forward dictionary 231, and the result obtained by the forward processing is taken as the recognition result. The process then proceeds to S500 in FIG. 3.

In S405, which is reached when the first half of the acquired speech information is determined to have diverged (S402: YES), the fixed word is estimated to be in the latter half.
In S406, the speech information undergoes reverse processing based on the reverse dictionary 232. To reproduce the speech information in the reverse direction, the end position of the speech information is identified, from the speech information stored in the information storage unit 70, as the position at which the amplitude of the waveform data has remained at or below a predetermined value for at least a predetermined time. The speech information is then reproduced in the reverse direction, with the identified end position as the starting point, and recognition processing is performed.
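The end-position detection and reverse playback in S406 can be sketched in Python. The sketch makes several assumptions: the waveform is a list of numeric samples, the predetermined time is expressed as a sample count, and quiet samples before the first loud sample are ignored so that leading silence is not mistaken for the end of the utterance. The function names are hypothetical.

```python
def find_end_position(samples, amp_threshold, min_quiet_samples):
    """Return the index where a quiet run of at least `min_quiet_samples`
    begins after speech has started, or len(samples) if there is none."""
    started = False
    run_start = None
    for i, sample in enumerate(samples):
        if abs(sample) > amp_threshold:
            started = True
            run_start = None      # a loud sample ends any quiet run
        elif started:
            if run_start is None:
                run_start = i     # a new quiet run begins here
            if i - run_start + 1 >= min_quiet_samples:
                return run_start
    return len(samples)


def reverse_playback(samples, amp_threshold, min_quiet_samples):
    """Return the samples up to the end position, in reverse order."""
    end = find_end_position(samples, amp_threshold, min_quiet_samples)
    return samples[:end][::-1]


wave = [0, 5, 6, 0, 7, 0, 0, 0, 4]
end_index = find_end_position(wave, amp_threshold=1, min_quiet_samples=3)
reversed_wave = reverse_playback(wave, amp_threshold=1, min_quiet_samples=3)
```

In the example, the single quiet sample at index 3 is too short to count as the end, while the run starting at index 5 is long enough, so playback is reversed from that point.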

In S407, it is determined whether a recognition result could be identified by reverse processing of the speech information. If it is determined that no recognition result could be identified by the reverse processing (S407: NO), the process proceeds to S409. If it is determined that a recognition result could be identified (S407: YES), the process proceeds to S408.

In S408, the result obtained by the reverse processing is taken as the recognition result. The process then proceeds to S500 in FIG. 3.
In S409, which is reached when it is determined that no recognition result could be identified by the reverse processing (S407: NO), it is determined that there is no recognition result, and the process proceeds to S500 in FIG. 3. Cases in which there is no recognition result include, for example, cases where the volume of the acquired speech is low and cases where the utterance content cannot be identified.

In S500 in FIG. 3, which is reached after the recognition processing shown in FIG. 4 ends, it is determined whether a recognition result has been obtained. If it is determined that no recognition result was obtained (S500: NO), that is, if a negative determination was made in S407 in FIG. 4, the processing of S600 is not performed. Instead, a voice message such as "Could not recognize" is output via the speaker 51 to notify the user that no recognition result was obtained, and this process ends. If it is determined that a recognition result was obtained (S500: YES), the process proceeds to S600.

In S600, the recognition result is output and this process ends. That is, in S600 reached after S404 in FIG. 4, the recognition result obtained by the forward processing is output. In S600 reached after S408 in FIG. 4, the recognition result obtained by the reverse processing is output.
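The branching from S402 through S600 can be summarized in a short sketch. The function names and the use of callables for forward and reverse processing are hypothetical and are introduced only to make the control flow explicit; the actual recognition steps are outside the scope of this illustration.

```python
from typing import Callable, Optional


def run_recognition(first_half_diverging: bool,
                    forward_process: Callable[[], Optional[str]],
                    reverse_process: Callable[[], Optional[str]]) -> Optional[str]:
    """Mirror S402 to S409: pick a direction and return a result, or None."""
    if not first_half_diverging:
        # S402: NO -> S403 (fixed word estimated in first half) -> S404
        return forward_process()
    # S402: YES -> S405 (fixed word estimated in second half) -> S406
    result = reverse_process()
    if result is None:
        # S407: NO -> S409 (no recognition result)
        return None
    # S407: YES -> S408
    return result


def report(result: Optional[str]) -> str:
    """Mirror S500 and S600: output the result or the failure message."""
    if result is None:
        # S500: NO, the user is notified via the speaker
        return "Could not recognize"
    # S500: YES -> S600
    return result
```

For example, `report(run_recognition(True, lambda: None, lambda: None))` yields the failure message, corresponding to the path S402: YES, S407: NO, S409, S500: NO.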

In general speech recognition, a beam search that limits the search range is performed. In a beam search, a predetermined beam width is set in advance, and hypotheses that fall outside the beam width, which is set relative to the maximum likelihood hypothesis, are rejected. This is so-called "hypothesis pruning". For each hypothesis, a score indicating the degree of match with the speech information is calculated. This score is calculated in order to identify the recognition result, and the higher the score, the higher the probability that the hypothesis is the recognition result for the speech information. A hypothesis corresponds to a "recognition target candidate".
The beam width can be set in two ways: by the number of hypotheses to retain, or by score.
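The two ways of setting the beam width can be illustrated with a minimal sketch. The function names and the dictionary representation of hypotheses are assumptions made for this example; scores are treated as "higher is better", as in the text.

```python
from typing import Dict


def prune_by_count(scores: Dict[str, float], n: int) -> Dict[str, float]:
    """Beam width set by hypothesis count: keep the n best-scoring hypotheses."""
    kept = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(kept)


def prune_by_score(scores: Dict[str, float], width: float) -> Dict[str, float]:
    """Beam width set by score: keep hypotheses within `width` of the best score."""
    best = max(scores.values())
    return {h: s for h, s in scores.items() if best - s <= width}
```

With scores `{"a": -1.0, "b": -2.5, "c": -7.0}`, both `prune_by_count(scores, 2)` and `prune_by_score(scores, 2.0)` retain hypotheses "a" and "b" and reject "c".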

Two methods for determining whether the search result for the speech information is diverging are described here.
<Divergence determination method 1>
Divergence determination method 1 applies when the beam width is set by fixing the number of retained hypotheses at n. When recognition processing is performed by reproducing the partial speech information in the forward direction, if, at a certain point in the first half, the difference in score between the highest-scoring and the lowest-scoring of the n retained hypotheses is within a predetermined threshold, that is, if the spread of scores among the hypotheses is small, the search processing for the first half of the speech information is determined to be diverging (S402: YES).
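Divergence determination method 1 reduces to a comparison of the best and worst retained scores. The sketch below is illustrative only; the function name is hypothetical, and returning False when fewer than two hypotheses are retained is an added assumption that the source does not address.

```python
from typing import Sequence


def is_diverging_method1(retained_scores: Sequence[float], threshold: float) -> bool:
    """Return True if the n retained hypotheses have a small score spread.

    retained_scores holds the scores of the hypotheses kept by a
    count-based beam at one point in the first half of the utterance.
    """
    if len(retained_scores) < 2:
        # Assumption: a single hypothesis cannot be called diverging.
        return False
    spread = max(retained_scores) - min(retained_scores)
    return spread <= threshold
```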

<Divergence determination method 2>
Divergence determination method 2 applies when the beam width is set so as to retain hypotheses whose score differs from that of the maximum likelihood hypothesis, the highest-scoring hypothesis, by no more than a predetermined threshold T. When recognition processing is performed by reproducing the partial speech information in the forward direction, if, at a certain point in the first half, at least a predetermined number of hypotheses have a score within the threshold T of the maximum likelihood hypothesis, that is, if many hypotheses lie within the threshold T and the spread of scores among them is small, the search processing for the first half of the speech information is determined to be diverging (S402: YES).
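Divergence determination method 2 counts how many hypotheses lie close to the maximum likelihood hypothesis. The sketch below is a hedged illustration: the function name and parameter names are hypothetical, and excluding the maximum likelihood hypothesis itself from the count is an assumption, since the source does not say whether it is included.

```python
from typing import Sequence


def is_diverging_method2(scores: Sequence[float], threshold_t: float,
                         min_count: int) -> bool:
    """Return True if at least min_count hypotheses score within threshold_t
    of the best hypothesis (the best hypothesis itself is not counted)."""
    if not scores:
        return False
    best = max(scores)
    within = sum(1 for s in scores if best - s <= threshold_t)
    close_competitors = within - 1  # remove the best hypothesis itself
    return close_competitors >= min_count
```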

A specific example of the speech recognition processing according to the present embodiment is now described, using utterances for playing music by an artist named "BBB".
An utterance for playing music consists of a fixed word that instructs predetermined processing (in this example, "再生" ("play") or "Play") and a variable word, such as an artist name or song title, that is the object of the fixed word (in this example, "BBB").

(1) Specific example 1
Specific example 1 is described with reference to FIG. 5, taking as an example the case where the user utters "Play BBB" in order to play music by the artist "BBB". In this case, as shown in FIG. 5(a), the fixed word "Play" is uttered earlier in the time series than the variable word "BBB" and is assumed to be included in the partial speech information. FIG. 5(a) and FIG. 6(a), described later, are both diagrams for explaining the structure of the content uttered by the user.

When it is detected that the talk switch 31 has been turned on (S100), the recognition dictionary 23 is set (S200), and speech information for the uttered "Play BBB" is acquired (S300). Once the speech information is acquired, it is first forward-processed. In specific example 1, the partial speech information contains the fixed word "Play", so "Play" is identified based on the fixed word dictionary included in the forward dictionary 231. The first half of the speech information therefore does not diverge (S402: NO), and the fixed word "Play" is estimated to be in the first half of the speech information (S403).

Accordingly, as shown in FIG. 5(b), forward processing is performed by matching the waveform data of the speech information against the forward dictionary 231 in the forward direction along the time axis, that is, from left to right in FIG. 5(b). In specific example 1, the fixed word "Play" has been identified, so the variable word "BBB" can be identified relatively easily by forward processing based on the variable word dictionaries associated with "Play", such as the "artist name dictionary" and the "song title dictionary". "Play BBB", the result identified by forward-processing the speech information acquired in S300, is therefore taken as the recognition result (S404, S500: YES), and the recognition result is output (S600). In specific example 1, the processing for playing songs by the artist "BBB" based on the recognition result is performed separately from this process.

(2) Specific example 2
Specific example 2 is described with reference to FIG. 6, taking as an example the case where the user utters "BBBを再生" ("BBB wo saisei", meaning "play BBB") in order to play music by the artist "BBB". In this case, as shown in FIG. 6(a), the fixed word "再生" ("play") is uttered later in the time series than the variable word "BBB" and is assumed not to be included in the partial speech information. In specific example 2, "を再生", which includes the particle "を", is treated as the fixed word.

When it is detected that the talk switch 31 has been turned on (S100), the recognition dictionary 23 is set (S200), and speech information for the uttered "BBBを再生" is acquired (S300). Once the speech information is acquired, it is first forward-processed. In specific example 2, the partial speech information does not contain the fixed word "を再生", so the first half of the speech information diverges (S402: YES). If speech recognition were performed at this point with reference to a dictionary other than the fixed word dictionary, "BBB" might be misrecognized as a similar word, such as "DDD", that is associated with a different fixed word, for example "に電話をかける" ("call"). If "BBB" is misrecognized as "DDD" and hypothesis pruning is performed, the "を再生" that follows "BBB" can no longer be recognized.

In the present embodiment, when the first half of the speech information diverges as in specific example 2 (S402: YES), the fixed word is estimated to be in the second half of the speech information (S405). Accordingly, as shown in FIG. 6(b), reverse processing is performed by matching the waveform data of the speech information against the reverse dictionary 232 in the reverse direction along the time axis, that is, from right to left in FIG. 6(b). In specific example 2, the fixed word "を再生" is identified (S406), so the variable word "BBB" can be identified relatively easily by reverse processing based on the variable word dictionaries associated with "を再生", such as the "artist name dictionary" and the "song title dictionary" (S407: YES). "BBBを再生", the result identified by reverse-processing the speech information acquired in S300, is therefore taken as the recognition result (S408, S500: YES), and the recognition result is output (S600). In specific example 2, as in specific example 1, the processing for playing songs by the artist "BBB" based on the recognition result is performed separately from this process.

As described in detail above, the speech recognition device 1 has a recognition dictionary 23 that includes the forward dictionary 231, in which speech data obtained when a word is reproduced in the forward direction is stored in association with the word, and the reverse dictionary 232, in which speech data obtained when a word is reproduced in the reverse direction is stored in association with the word. In the present embodiment, speech information consisting of a fixed word that forms the root of a tree structure and a variable word that corresponds to the fixed word and forms the branches and leaves of the tree structure is acquired (S300), and recognition processing is performed. First, recognition processing is performed in the forward direction by reproducing the speech information in the forward direction based on the forward dictionary 231 (S401), and the position of the fixed word in the speech information is estimated based on the result of the forward recognition processing (S403, S405). If the fixed word is estimated to be in the second half of the speech information (S405), recognition processing is performed in the reverse direction by reproducing the speech information in the reverse direction based on the reverse dictionary 232 (S406). Because the recognition processing can thus be switched appropriately between the forward and reverse directions according to the position of the fixed word in the user's utterance, speech recognition accuracy improves without the user having to be aware of any operation relating to the recognition direction.

In the present embodiment, whether the first half of the speech information is diverging is determined based on the result of the forward recognition processing, and if the first half is determined to be diverging (S402: YES), the fixed word is estimated to be in the second half of the speech information (S405). This reduces the computational load of estimating the position of the fixed word.

Further, in the present embodiment, the partial speech information, which is the speech information for a predetermined period from the start of recognition processing, is recognized in the forward direction, and if no recognition result is obtained for the partial speech information, the first half of the speech information is determined to be diverging (S402: YES). Whether the first half is diverging can thus be determined without recognizing the entire speech information, which further reduces the computational load. In addition, when the first half is determined to be diverging based on the recognition result for the partial speech information, processing can switch promptly to reverse processing, in which the speech information is reproduced in the reverse direction and recognized. Compared with switching to reverse processing only after forward-processing all of the speech information, this shortens the time from acquisition of the speech information until a recognition result is obtained.

Whether the first half of the speech information is diverging can be determined as follows.
Among a plurality of recognition target candidates, each having a score for identifying the recognition result of the partial speech information, if the difference between the score of the highest-scoring candidate and that of the lowest-scoring candidate is within a predetermined threshold, it is determined that no recognition result is obtained for the partial speech information and that the first half of the speech information is diverging (S402: YES).

Alternatively, among a plurality of recognition target candidates, each having a score for identifying the recognition result of the partial speech information, if at least a predetermined number of candidates have a score within a predetermined threshold of the highest-scoring candidate, it may be determined that no recognition result is obtained for the partial speech information and that the first half of the speech information is diverging (S402: YES).

As described for divergence determination methods 1 and 2, the recognition result for the speech information is determined to be diverging when, based on the scores used to identify the recognition result among a plurality of recognition candidates, the variation in score among the candidates is small. This makes it possible to determine appropriately whether the recognition result for the speech information is diverging.

In the present embodiment, the control unit 10 constitutes the "speech information acquiring means", "forward recognition processing means", "fixed word position estimating means", "reverse recognition processing means", and "divergence determining means". S300 in FIG. 3 corresponds to processing as the function of the "speech information acquiring means", S401 in FIG. 4 corresponds to processing as the function of the "forward recognition processing means", S403 and S405 correspond to processing as the function of the "fixed word position estimating means", S406 corresponds to processing as the function of the "reverse recognition processing means", and S402 corresponds to processing as the function of the "divergence determining means".

The present invention is in no way limited to the above embodiment and can be implemented in various forms without departing from the spirit of the invention.
(A) Fixed word position estimating means
In the above embodiment, when no recognition result is obtained for the first half of the speech information and the first half is determined to be diverging (S402: YES), the fixed word is estimated to be in the second half of the speech information.

Some speech recognition devices allow the user to select the operating language used to operate the device. For example, when Japanese is selected as the operating language, the various information provided via the speaker 51 and the display 61 is output in Japanese. Likewise, when English is selected as the operating language, the various information provided via the speaker 51 and the display 61 is output in English.

The relationship between the operating language and the relative positions of the fixed word and the variable word is now described with reference to FIGS. 5 and 6, which illustrate specific examples 1 and 2 above.
The relationship between a fixed word that instructs predetermined processing and a variable word that is the object of the fixed word can be regarded as a tree structure with the fixed word as the root and the variable word as the branches and leaves. In a language such as English, in which the word order is fixed word followed by variable word, the structure is a forward tree structure in which the fixed word, the root of the tree, comes first, as shown in FIG. 5(a). In a language such as Japanese, in which the word order is variable word followed by fixed word, the structure is a backward tree structure in which the fixed word, the root of the tree, comes last, as shown in FIG. 6(a).

When an operating language in which the fixed word is uttered before the variable word, for example English, is selected, the user is assumed to speak in English, and the fixed word is estimated to be in the first half of the speech information. English is given here only as one example of an operating language whose word order is "fixed word" then "variable word"; any operating language with that word order is treated in the same way.

When an operating language in which the fixed word is uttered after the variable word, for example Japanese, is selected, the user is assumed to speak in Japanese, and the fixed word is estimated to be in the second half of the speech information. Japanese is given here only as one example of an operating language whose word order is "variable word" then "fixed word"; any operating language with that word order is treated in the same way.

The relative positions of the fixed word and the variable word are thus determined to some extent by the characteristics of the language. A modified example may therefore be configured to estimate the position of the fixed word based on the selected operating language. That is, if the user is assumed to speak in the language selected as the operating language, the user's spoken language can be said to be estimated from the selected operating language. Switching between forward and reverse recognition processing based on the set operating language therefore amounts, indirectly, to switching between forward and reverse recognition processing based on the user's spoken language.

With this configuration as well, the language the user speaks is estimated from the set operating language and the position of the fixed word can be estimated appropriately. The recognition processing can therefore be switched appropriately between the forward and reverse directions according to the user's utterance, and speech recognition accuracy improves without the user having to be aware of any operation relating to the recognition direction.
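The operating-language variant can be sketched as a simple lookup. The language codes, set names, and function name below are hypothetical; only English and Japanese are listed because those are the two languages named in the text, and returning None for other languages (leaving the decision to another method, such as the divergence determination) is an assumption.

```python
from typing import Optional

# Assumed language codes; the source names only English and Japanese.
FIXED_WORD_FIRST = {"en"}   # word order: fixed word, then variable word
FIXED_WORD_LAST = {"ja"}    # word order: variable word, then fixed word


def direction_for_language(operating_language: str) -> Optional[str]:
    """Return 'forward', 'reverse', or None if the language is not listed."""
    if operating_language in FIXED_WORD_FIRST:
        return "forward"   # fixed word estimated in the first half
    if operating_language in FIXED_WORD_LAST:
        return "reverse"   # fixed word estimated in the second half
    return None
```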

(B) Divergence determining means
Whether the first half of the speech information is diverging may also be determined as follows.
When speech information is acquired, the entire acquired speech information is reproduced in the forward direction and recognized based on the forward dictionary to obtain a provisional recognition result. A confidence value is then calculated for each word identified in the provisional recognition result. This "confidence" is a measure of how far an identified word can be trusted as a recognition result and takes a value in the range 0 to 1. A confidence close to 1 indicates that there were almost no other hypotheses with a score comparable to that of the word, so the probability that the word can be trusted as a recognition result is high. A confidence close to 0 indicates that there were many other hypotheses with a score comparable to that of the word, so the probability that the word can be trusted as a recognition result is low.

A specific example of estimating the position of the fixed word based on confidence is now described with reference to FIG. 7, taking as an example the case where "PPPに電話をかける" ("PPP ni denwa wo kakeru", meaning "call PPP") is uttered.
When "PPPに電話をかける" is uttered, the acquired speech information is recognized in the forward direction, and the confidence of each word in the resulting provisional recognition result is calculated. As shown in FIG. 7(a), the confidence of "PPP" in the first half of the speech information is low, at 0.13. The confidence of "に電話をかける" in the second half of the speech information is 0.82, which is high compared with that of "PPP" in the first half. If a word with a confidence of 0.5 or more is regarded as sufficiently trustworthy as a recognition result, the confidence of the word "PPP" in the first half is at or below the predetermined value of 0.5 and the confidence of the word "に電話をかける" in the second half is at or above the predetermined value of 0.5, so the first half of the speech information can be determined to be diverging (S402: YES). Because the confidence of "PPP" in the first half is low, "PPP" may be misrecognized. If "PPP" is misrecognized and hypothesis pruning is performed, the "に電話をかける" that follows "PPP" can no longer be recognized.

Accordingly, as shown in FIG. 7(b), the fixed word "に電話をかける" can be identified by reproducing the speech information in the reverse direction and performing recognition processing based on the reverse dictionary. The accuracy of the recognition processing is then improved by setting the dictionary stored in association with "に電話をかける" and recognizing "PPP".

In this way, the confidence of each word identified as a recognition result may be calculated based on the result of the forward recognition processing, and if the confidence of the word in the first half of the speech information is at or below a predetermined value and the confidence of the word in the second half is at or above the predetermined value, the first half may be determined to be diverging and the fixed word may be estimated to be in the second half of the speech information. Whether the first half is diverging can thus be determined appropriately based on the confidence of the words identified as the recognition result.
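The confidence-based determination can be sketched as below. The function name and the list-based inputs are hypothetical, and requiring every word in each half to satisfy the condition is an assumption, since the text describes only one word per half. The default of 0.5 matches the example value in the text.

```python
from typing import Sequence


def first_half_diverging_by_confidence(first_half_conf: Sequence[float],
                                       second_half_conf: Sequence[float],
                                       threshold: float = 0.5) -> bool:
    """Return True if first-half words are at or below the threshold and
    second-half words are at or above it."""
    if not first_half_conf or not second_half_conf:
        # Assumption: without words in both halves, no determination is made.
        return False
    low_first = all(c <= threshold for c in first_half_conf)
    high_second = all(c >= threshold for c in second_half_conf)
    return low_first and high_second
```

With the values from the FIG. 7 example, `first_half_diverging_by_confidence([0.13], [0.82])` evaluates to True, so the fixed word would be estimated to be in the second half.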

If, in the forward recognition processing, the confidence of the fixed word "に電話をかける" in the second half is at or above the predetermined value, the fixed word "に電話をかける" may be estimated to be in the second half of the speech information. In other words, "the fixed word position estimating means calculates the confidence of each word identified as a recognition result based on the result of the forward recognition processing, and if the confidence of the word in the second half of the speech information is at or above a predetermined value, estimates that the fixed word is in the second half of the speech information". In this case, reverse processing, based on the fixed word dictionary included in the reverse dictionary 232, of the portion of the speech information corresponding to "に電話をかける" may be omitted. The dictionary corresponding to "に電話をかける" included in the reverse dictionary 232 may then be set, and recognition processing may be performed by reproducing the speech information in the reverse direction from the forward-direction start point of the waveform data corresponding to "に電話をかける".

(C) Speech information
In the above embodiment, the partial speech information, which is the speech information for a predetermined period (for example, 0.5 seconds) from the start of acquisition, is treated as the "first half", and the remainder of the speech information excluding the first half is treated as the "second half". In a modified example, the start and end positions of the speech information may be identified, a predetermined proportion (for example, 1/2) at the beginning of the total recognition time of the speech information may be treated as the "first half", and the remainder as the "second half". Alternatively, a predetermined proportion (for example, 1/2) at the end of the total recognition time of the speech information may be treated as the "second half", and the remainder as the "first half".
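The two ways of dividing the speech information can be sketched as follows. The function names, the sample-list representation, and the `sample_rate` parameter are assumptions for illustration; the 0.5 second and 1/2 defaults are the example values given in the text.

```python
from typing import List, Sequence, Tuple


def split_by_duration(samples: Sequence[float], sample_rate: int,
                      seconds: float = 0.5) -> Tuple[List[float], List[float]]:
    """Embodiment: the first `seconds` of speech form the first half."""
    cut = min(len(samples), int(sample_rate * seconds))
    return list(samples[:cut]), list(samples[cut:])


def split_by_ratio(samples: Sequence[float],
                   first_ratio: float = 0.5) -> Tuple[List[float], List[float]]:
    """Modified example: a fixed proportion of the total length forms the first half."""
    cut = int(len(samples) * first_ratio)
    return list(samples[:cut]), list(samples[cut:])
```

The ratio-based split requires the end position to be known, so it applies only after the whole utterance has been acquired, whereas the duration-based split can be applied as soon as the first 0.5 seconds are available.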

Further, in the above embodiment, the speech information contains one fixed word, which forms the trunk of the tree structure, and one variable word, which corresponds to the fixed word and forms the branches and leaves of the tree structure. However, there may be a plurality of fixed words and a plurality of variable words. When the acquired speech information has an n-layer tree structure, the layers from the root up to the (n-1)th layer may be fixed words. Similarly, when the acquired speech information has an n-layer tree structure, the layers from the leaf end up to the (n-1)th layer may be variable words.

DESCRIPTION OF SYMBOLS
1 ... Speech recognition device
10 ... Control unit (speech information acquisition means, forward direction recognition processing means, fixed word position estimating means, reverse direction recognition processing means, divergence determination means)
20 ... Speech information storage unit
21 ... Speech information database
22 ... Acoustic model
221 ... Forward direction model
222 ... Reverse direction model
23 ... Recognition dictionary
231 ... Forward direction dictionary (dictionary data of forward tree structure)
232 ... Reverse direction dictionary (dictionary data of backward tree structure)
24 ... Language model
30 ... Operation switch group
31 ... Talk switch
40 ... Speech input unit
41 ... Microphone
50 ... Speech output unit
51 ... Speaker
60 ... Drawing unit
61 ... Display
70 ... Information storage unit

Claims

A speech recognition device having dictionary data of a forward tree structure, in which speech data obtained when a word is reproduced in the forward direction is stored in association with the word, and dictionary data of a backward tree structure, in which speech data obtained when the word is reproduced in the reverse direction is stored in association with the word, the speech recognition device comprising:
speech information acquisition means for acquiring speech information consisting of a fixed word that forms the trunk of a tree structure and a variable word that corresponds to the fixed word and forms the branches and leaves of the tree structure;
forward direction recognition processing means for performing recognition processing in the forward direction by reproducing the speech information in the forward direction, based on the dictionary data of the forward tree structure;
fixed word position estimating means for estimating the position of the fixed word in the speech information, based on the recognition result of the forward direction recognition processing means; and
reverse direction recognition processing means for performing recognition processing in the reverse direction by reproducing the speech information in the reverse direction, based on the dictionary data of the backward tree structure, when the fixed word position estimating means estimates that the fixed word is in the latter half of the speech information.

The speech recognition device according to claim 1, further comprising divergence determination means for determining, based on the recognition result of the forward direction recognition processing means, whether or not the first half of the speech information diverges,
wherein the fixed word position estimating means estimates that the fixed word is in the latter half of the speech information when the divergence determination means determines that the first half of the speech information diverges.

The speech recognition device according to claim 2, wherein the divergence determination means performs recognition on partial speech information, which is the speech information for a predetermined period after the speech information acquisition means starts acquiring the speech information, and determines that the first half of the speech information diverges when no recognition result is obtained for the partial speech information.

The speech recognition device according to claim 3, wherein, among a plurality of recognition target candidates each having a score used to specify the recognition result of the partial speech information, when the difference between the score of the recognition target candidate with the highest score and the score of the recognition target candidate with the lowest score is within a predetermined threshold, the divergence determination means treats the recognition result of the partial speech information as not obtained and determines that the first half of the speech information diverges.
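The score-spread criterion above can be illustrated with a minimal Python sketch. The function name `diverges_by_spread`, the representation of candidates as a plain list of scores, and the handling of fewer than two candidates are assumptions made for illustration.

```python
from typing import Sequence


def diverges_by_spread(scores: Sequence[float], threshold: float) -> bool:
    """Return True when the highest and lowest candidate scores differ by
    no more than the threshold, i.e. no candidate stands out."""
    if len(scores) < 2:
        # Assumption: the criterion needs a plurality of candidates.
        return False
    return max(scores) - min(scores) <= threshold
```

For example, scores of 0.52, 0.50, and 0.49 with a threshold of 0.05 would be judged as diverging, whereas scores of 0.9, 0.3, and 0.2 would not.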

The speech recognition device according to claim 3, wherein, among a plurality of recognition target candidates each having a score used to specify the recognition result of the partial speech information, when there are a plurality of recognition target candidates whose score differs from that of the recognition target candidate with the highest score by no more than a predetermined threshold, the divergence determination means treats the recognition result of the partial speech information as not obtained and determines that the first half of the speech information diverges.
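The near-tie criterion above can be sketched in Python as follows. The function name `diverges_by_near_ties` and the parameter `min_near` are hypothetical. In particular, reading "a plurality" as at least two candidates other than the top candidate itself is an assumption of this sketch, not something the text specifies.

```python
from typing import Sequence


def diverges_by_near_ties(
    scores: Sequence[float], threshold: float, min_near: int = 2
) -> bool:
    """Return True when at least `min_near` candidates, not counting the
    top candidate itself, score within `threshold` of the top score."""
    if len(scores) < 2:
        return False
    top = max(scores)
    # Count every candidate close to the top, then exclude the top itself.
    near = sum(1 for s in scores if top - s <= threshold) - 1
    return near >= min_near
```

Unlike a test on the spread between the highest and lowest scores, this test ignores clearly poor candidates and looks only at how crowded the top of the ranking is.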

The speech recognition device according to claim 2, wherein the divergence determination means calculates, based on the recognition result of the forward direction recognition processing means, a reliability for each word identified as the recognition result, and determines that the first half of the speech information diverges when the reliability of a word in the first half of the speech information is less than a predetermined value and the reliability of a word in the latter half of the speech information is equal to or greater than the predetermined value.
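The reliability-based criterion above can be illustrated with a minimal Python sketch. The function name `first_half_diverges` is hypothetical. Requiring every first-half word to fall below the predetermined value, and at least one latter-half word to reach it, is an assumption; the text does not say how multiple words in each half are combined.

```python
from typing import Sequence


def first_half_diverges(
    first_half_reliability: Sequence[float],
    latter_half_reliability: Sequence[float],
    threshold: float,
) -> bool:
    """Return True when the first-half words are all unreliable while at
    least one latter-half word is reliable."""
    if not first_half_reliability or not latter_half_reliability:
        # Assumption: both halves must contain at least one word.
        return False
    first_low = all(r < threshold for r in first_half_reliability)
    latter_high = any(r >= threshold for r in latter_half_reliability)
    return first_low and latter_high
```

A result of True corresponds to the case in which the fixed word is then estimated to be in the latter half of the speech information.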

A program that causes a computer to function as:
speech information acquisition means for acquiring speech information consisting of a fixed word that forms the trunk of a tree structure and a variable word that corresponds to the fixed word and forms the branches and leaves of the tree structure;
forward direction recognition processing means for performing recognition processing in the forward direction by reproducing the speech information in the forward direction, based on dictionary data of a forward tree structure;
fixed word position estimating means for estimating the position of the fixed word in the speech information, based on the recognition result of the forward direction recognition processing means; and
reverse direction recognition processing means for performing recognition processing in the reverse direction by reproducing the speech information in the reverse direction, based on dictionary data of a backward tree structure, when the fixed word position estimating means estimates that the fixed word is in the latter half of the speech information.