JP2010204442A - Speech recognition device, speech recognition method, speech recognition program and program recording medium

Info

Publication number: JP2010204442A
Application number: JP2009050613A
Authority: JP
Inventors: Fumihiro Adachi; 史博安達; Ryosuke Isotani; 亮輔磯谷; Toru Iwazawa; 透岩沢; Takeshi Hanazawa; 健花沢; Seiya Osada; 誠也長田; Takenori Tsujikawa; 剛範辻川; Takayuki Arakawa; 隆行荒川; Koji Okabe; 浩司岡部
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-03-04
Filing date: 2009-03-04
Publication date: 2010-09-16

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that can prompt a user to notice that an error in a speech recognition result may be caused by an incorrectly designated recognition mode. SOLUTION: When the user selects at least one recognition mode for performing speech recognition on input speech from among a plurality of recognition modes provided in advance, and designates it as a first recognition mode via an input means 20, a recognition mode setting means 11 sets the conditions defined by the first recognition mode, and a speech recognition means 12 performs recognition processing on the input speech under those conditions and outputs the speech recognition result to the user via an output means 14. A feedback generation means 13 then generates, at a predetermined timing, feedback information asking the user whether to change to a second recognition mode different from the first recognition mode designated by the user and perform speech recognition again on the same speech data as the input speech, and outputs that information to the user via the output means 14. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech recognition device, a speech recognition method, a speech recognition program, and a program recording medium. In particular, it relates to a speech recognition device, method, program, and program recording medium that provide a function for designating a recognition mode for speech recognition and that have a mechanism for giving feedback when the user has designated the recognition mode incorrectly.

Conventionally, there have been speech recognition systems that are first set to a recognition mode designated by the user and then recognize the speech the user inputs. For example, where the recognition modes “male” and “female” are available and the user designates “male”, the system performs speech recognition on the input speech using a gender-dependent acoustic model for male speakers. Likewise, where the recognition modes “Japanese” and “English” are available and the user designates “Japanese”, the system performs speech recognition on the input speech using a Japanese language model and word dictionary.

On the other hand, there are also speech recognition systems that execute recognition processing using several acoustic models simultaneously, such as a gender-dependent acoustic model for male speakers and one for female speakers, as in Patent Document 1 (JP 07-104780 A, “Unspecified Speaker Continuous Speech Recognition Method”). A system that recognizes with several models at once spares the user the effort of selecting and setting a recognition mode, but its output is not necessarily the result that the appropriate model would have produced, and recognition accuracy can suffer. Moreover, in environments where memory, CPU, and other resources are tightly constrained, such as on a terminal, simultaneous recognition cannot be used because it requires large amounts of memory and processing power. In such environments a speech recognition system that is used by designating a recognition mode, as described above, is preferable.

Patent Document 1: JP 07-104780 A (pages 3-5)

However, with a speech recognition system that is used by designating a recognition mode as described above, users sometimes designate the wrong recognition mode. When the system is used with the wrong designation, it often outputs a recognition result completely different from the intended one.

Furthermore, depending on the kind of recognition mode (gender, for example), a user who has made the wrong designation is unlikely to notice that the designation error is what caused the incorrect speech recognition result. As a result, users often assume that there was a problem with their own utterance and speak again to retry recognition. As long as the recognition mode is not set correctly, an incorrect recognition result will be output again, and the user ends up repeating futile operations.

(Object of the present invention)
The present invention has been made in view of these circumstances. Its object is to provide a speech recognition device, a speech recognition method, a speech recognition program, and a program recording medium that, by returning appropriate feedback to the user, give the user an opportunity to notice that an error in a speech recognition result may be caused by an error in the designated recognition mode.

To solve the problem described above, the speech recognition device, speech recognition method, speech recognition program, and program recording medium according to the present invention adopt the following characteristic configurations. The numbers (1) and (12) below correspond to the claim numbers.

(1) A speech recognition device comprising speech recognition means with which a user selects at least one recognition mode for performing speech recognition on input speech from among a plurality of recognition modes provided in advance and designates it as a first recognition mode, and which performs speech recognition on the input speech under the conditions defined by the first recognition mode and outputs the speech recognition result to the user, the device further comprising feedback generation means that generates, at a predetermined timing, feedback information asking the user whether to change to a second recognition mode different from the first recognition mode designated by the user and perform speech recognition again on the same speech data as the input speech, and outputs the feedback information to the user.
(12) A speech recognition method in which a user selects at least one recognition mode for performing speech recognition on input speech from among a plurality of recognition modes provided in advance and designates it as a first recognition mode, speech recognition is performed on the input speech under the conditions defined by the first recognition mode, and the speech recognition result is output to the user, the method further comprising generating, at a predetermined timing, feedback information asking the user whether to change to a second recognition mode different from the first recognition mode designated by the user and perform speech recognition again on the same speech data as the input speech, and outputting the feedback information to the user.

The speech recognition device, speech recognition method, speech recognition program, and program recording medium of the present invention provide the following effect.

Because the user receives not only the speech recognition result but also feedback information asking whether to reset the recognition mode and perform speech recognition again, the user can be made aware that an error in the speech recognition result may be caused by an error in the designated recognition mode.

FIG. 1 is a block diagram showing an example of the internal configuration of the speech recognition device according to the present invention.
FIG. 2 is a flowchart showing an example of the operation of the speech recognition device shown in FIG. 1.
FIG. 3 is a block diagram showing an example of the overall configuration of the speech recognition device according to the present invention.
FIG. 4 is a block diagram showing another example of the overall configuration of the speech recognition device according to the present invention.
FIG. 5 is a block diagram showing yet another example of the overall configuration of the speech recognition device according to the present invention.
FIG. 6 is a block diagram showing yet another example of the overall configuration of the speech recognition device according to the present invention.

Preferred embodiments of the speech recognition device, speech recognition method, speech recognition program, and program recording medium according to the present invention are described below with reference to the accompanying drawings. The description below covers the speech recognition device and the speech recognition method, but the method may of course be implemented as a speech recognition program executable by a computer, and that program may be recorded on a computer-readable recording medium.

(Features of the present invention)
Before the embodiments are described, the features of the present invention are outlined. The speech recognition device of the present invention comprises speech recognition means that performs speech recognition on input speech using a first recognition mode designated by the user, and feedback generation means that generates feedback information asking whether to perform speech recognition again using a second recognition mode different from the first. Its distinguishing feature is that the user thereby receives not only the speech recognition result but also feedback that prompts the user to check the recognition mode.

(Configuration example of the embodiment)
Next, an example of the configuration of the speech recognition device according to the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the internal configuration of the speech recognition device according to the present invention; only the blocks relevant to the invention are shown.

Referring to FIG. 1, a speech recognition device 10 given as an example of the present invention comprises at least a recognition mode setting means 11, a speech recognition means 12, a feedback generation means 13, and an output means 14. The recognition mode setting means 11 sets the device to use the recognition mode selected by an external user operation (the first recognition mode). The speech recognition means 12 performs speech recognition on the input speech using the speech recognition models and other resources that correspond to the recognition mode set by the recognition mode setting means 11 (the first recognition mode), and supplies the resulting word string to the output means 14.

The feedback generation means 13 generates feedback information asking whether to perform speech recognition again on the same input speech using the speech recognition models and other resources that correspond to a recognition mode (the second recognition mode) different from the one set by the recognition mode setting means 11 (the first recognition mode), and supplies it to the output means 14. The output means 14 outputs to the user the word string supplied by the speech recognition means 12 as the recognition result. It also outputs to the user the feedback information supplied by the feedback generation means 13.

Next, the operation of the speech recognition device 10 shown in FIG. 1 is described in detail with reference to the flowchart of FIG. 2. FIG. 2 is a flowchart showing an example of the operation of the speech recognition device 10 shown in FIG. 1.

In the flowchart of FIG. 2, the recognition mode setting means 11 first sets the recognition mode according to the user operation (step S101). The user then inputs speech (step S102).

When speech input is detected, the speech recognition means 12 performs speech recognition on the input speech using the models and other resources that correspond to the recognition mode set by the recognition mode setting means 11, and generates a word string as the speech recognition result (step S103). Once the word string has been generated, the output means 14 outputs it to the user (step S104).

Meanwhile, the feedback generation means 13 generates feedback information asking whether to perform speech recognition again on the same input speech using the models and other resources that correspond to a recognition mode (the second recognition mode) different from the one that was set (the first recognition mode), and the information is output to the user (step S105).

Next, the device checks whether the user, having received the word string from the speech recognition means 12 and the feedback information from the feedback generation means 13, has instructed it to perform speech recognition again (step S106). If the user responds that the recognition mode should be changed and recognition performed again (Yes in step S106), the recognition mode setting means 11 changes the setting to a new recognition mode (the second recognition mode) (step S107). The user then inputs speech again (step S108).

When speech input is detected again, the speech recognition means 12 performs speech recognition on the newly input speech using the models and other resources that correspond to the recognition mode newly set by the recognition mode setting means 11 (the second recognition mode), and generates a word string as the speech recognition result (step S109). Once the word string has been generated, the output means 14 again outputs it to the user, together with the feedback information generated again by the feedback generation means 13, and this operation is repeated.
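The flow of steps S101 to S109 can be sketched as a simple loop. This is only an illustrative sketch, not the implementation described in the publication: the function `run_session`, the callback parameters (`get_audio`, `ask_user`, `pick_second_mode`), and the representation of a recognizer as a plain callable are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical recognizer type: maps raw audio to a recognized word string.
Recognizer = Callable[[bytes], str]

FEEDBACK = "Do you want to change the recognition mode and try again?"


def run_session(
    recognizers: Dict[str, Recognizer],
    first_mode: str,
    get_audio: Callable[[], bytes],
    ask_user: Callable[[str, str], bool],
    pick_second_mode: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run the S101-S109 loop and return every (mode, result) pair shown."""
    mode = first_mode                       # S101: set the designated mode
    shown: List[Tuple[str, str]] = []
    while True:
        audio = get_audio()                 # S102 / S108: the user speaks
        words = recognizers[mode](audio)    # S103 / S109: recognize
        shown.append((mode, words))         # S104: output the word string
        # S105 + S106: output the feedback question and read the reply
        if not ask_user(words, FEEDBACK):
            return shown
        mode = pick_second_mode(mode)       # S107: switch to the second mode
```

In this sketch the user speaks again after every mode change, as in the basic embodiment; reusing stored speech data is a separate variation.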

As described above, the speech recognition device according to the present invention outputs to the user feedback information asking whether to perform recognition using a recognition mode (the second recognition mode) different from the one the user designated (the first recognition mode). By asking the user whether to change the recognition mode and perform recognition again, it makes the user aware that an incorrectly designated recognition mode may be the cause of the recognition error.

Next, a more detailed configuration of the speech recognition device of FIG. 1 is described with reference to FIG. 3. FIG. 3 is a block diagram showing an example of the overall configuration of the speech recognition device according to the present invention. It shows the circuit blocks inside the speech recognition device of FIG. 1 (with the same reference numerals as in FIG. 1) together with an input means 20 through which the recognition mode and speech are input to the speech recognition device 10. In some cases the input means 20 may be placed inside the speech recognition device 10.

In the overall configuration shown in FIG. 3, the input means 20 supplies information on the recognition mode designated by the user (the first recognition mode) to the recognition mode setting means 11, and supplies the speech uttered by the user to the speech recognition means 12 and other blocks. It comprises, for example, a display screen, buttons, a keyboard, and a microphone.

As in FIG. 1, the recognition mode setting means 11 in the speech recognition device 10 shown in FIG. 3 sets the device to perform speech recognition using the recognition mode (the first recognition mode) that corresponds to the user operation input from outside via the input means 20. That is, it sets the acoustic model, language model, word dictionary, and other resources used in the speech recognition processing of the speech recognition means 12 to those that correspond to the recognition mode designated by the user (the first recognition mode).

The input means 20 is provided with, for example, mode designation buttons that correspond to the recognition modes, and the user designates a recognition mode (the first recognition mode) by pressing one of them. The setting is not limited to a single recognition mode: several recognition modes, such as one relating to gender and one relating to language, may be set at once.

Recognition modes include, for example, the gender-related modes “male” and “female”. When the user designates the “male” mode, the recognition mode setting means 11 selects a gender-dependent acoustic model for male speakers, built so that speech input by men is recognized more accurately. There are also the language-related modes “Japanese” and “English”. When the user designates the “Japanese” mode, the recognition mode setting means 11 selects a language model and word dictionary built so that Japanese input speech is recognized more accurately. In this way, each recognition mode corresponds to models used in speech recognition processing, such as an acoustic model, a language model, and a word dictionary.
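The correspondence between recognition modes and models can be pictured as a lookup table. The sketch below assumes that gender modes choose the acoustic model and language modes choose the language model and word dictionary, as in the examples above. The names `ModelSet`, `MODE_TABLE`, `resolve_models`, and the model identifiers such as `am_male` are hypothetical placeholders and do not come from the publication.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelSet:
    """Models selected for one recognition pass (identifiers are placeholders)."""
    acoustic_model: Optional[str] = None
    language_model: Optional[str] = None
    word_dictionary: Optional[str] = None


# Hypothetical table: mode category -> mode name -> models that the mode selects.
MODE_TABLE: Dict[str, Dict[str, ModelSet]] = {
    "gender": {
        "male": ModelSet(acoustic_model="am_male"),
        "female": ModelSet(acoustic_model="am_female"),
    },
    "language": {
        "japanese": ModelSet(language_model="lm_ja", word_dictionary="dict_ja"),
        "english": ModelSet(language_model="lm_en", word_dictionary="dict_en"),
    },
}


def resolve_models(selection: Dict[str, str]) -> ModelSet:
    """Merge the models of every designated mode into one model set."""
    merged: Dict[str, str] = {}
    for category, mode in selection.items():
        for field, value in asdict(MODE_TABLE[category][mode]).items():
            if value is not None:
                merged[field] = value
    return ModelSet(**merged)
```

Designating several modes at once, for example `resolve_models({"gender": "male", "language": "japanese"})`, yields one combined model set for the recognizer.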

The speech recognition means 12 performs speech recognition on the input speech using the models set by the recognition mode setting means 11 and generates the recognition result as a word string. Speech recognition processing consists of, for example, acoustic analysis that converts the input speech into speech features such as MFCCs (Mel-Frequency Cepstral Coefficients), speech detection that separates speech segments from silent segments in the input, and a search that uses models such as the acoustic model, language model, and word dictionary to generate the word string that best matches the speech features. These techniques are well known to those skilled in the art, so a detailed description is omitted.

As described above, the feedback generation means 13 generates feedback information asking whether to change to a recognition mode (the second recognition mode) different from the one set by the recognition mode setting means 11 (the first recognition mode) and perform speech recognition again, and supplies it to the output means 14. For example, it may generate feedback information that asks the user whether to change the recognition mode and recognize again, such as “Do you want to change the recognition mode and try again?”

As described above, the output means 14 outputs to the user the word string supplied by the speech recognition means 12 as the speech recognition result. It also outputs to the user the feedback information supplied by the feedback generation means 13.

If, after the feedback information has been output, the user responds that speech recognition should be performed again, the recognition mode setting means 11 resets the device to a new recognition mode (the second recognition mode), and the speech recognition means 12 again performs speech recognition on the speech that is input. The user can give this response by, for example, answering “Yes” by voice to the feedback information, or pressing the mode designation button of a recognition mode (the second recognition mode) different from the one first designated (the first recognition mode).

When the user responds that speech recognition should be performed again, the mode may be changed, for example, by having the user designate a recognition mode once more and switching to that mode. Alternatively, where the recognition mode is a choice between two alternatives, the mode (the second recognition mode) other than the one first set (the first recognition mode) may be selected and set automatically. Where the device has a function for learning the error rate of recognition results, a recognition mode chosen according to the learning result may be selected and set as the second recognition mode.
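The three ways of choosing the second recognition mode can be combined into one selection function. The sketch below is illustrative only. The function `choose_second_mode` and its parameters are hypothetical, and interpreting "a mode according to the learning result" as "the candidate with the lowest learned error rate" is an assumption; the publication does not specify the rule.

```python
from typing import Dict, Optional, Sequence


def choose_second_mode(
    first_mode: str,
    available: Sequence[str],
    user_choice: Optional[str] = None,
    error_rates: Optional[Dict[str, float]] = None,
) -> str:
    """Pick a second recognition mode that differs from the first one."""
    candidates = [m for m in available if m != first_mode]
    if not candidates:
        raise ValueError("no alternative recognition mode is available")
    # 1) The user explicitly re-designated a mode.
    if user_choice is not None:
        if user_choice not in candidates:
            raise ValueError("re-designated mode must differ from the first mode")
        return user_choice
    # 2) Two-way choice: the other mode is selected automatically.
    if len(candidates) == 1:
        return candidates[0]
    # 3) Assumed rule: prefer the mode with the lowest learned error rate.
    if error_rates:
        return min(candidates, key=lambda m: error_rates.get(m, float("inf")))
    raise ValueError("several candidates remain; ask the user to designate one")
```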

As described above, outputting feedback information that asks the user whether to change the recognition mode and perform speech recognition again gives the user an opportunity to notice that, if the word string just output contains errors, the cause may be an error in the designated recognition mode.

(Other embodiments)
Next, embodiments that differ from the one described above are described.

The speech recognition device 10 may be provided with a speech storage means that stores speech data, and the speech recognition means 12 may store all or part of the input speech to be recognized in that speech storage means as speech data. When the user responds that speech recognition should be performed again, recognition can then be performed on the stored speech data. This spares the user the effort of inputting the speech again after the recognition mode has been changed.

Where recognition can be performed on speech data held in the speech storage means, the feedback generation means 13 generates feedback information asking whether to change to a recognition mode (the second recognition mode) different from the one first set by the recognition mode setting means 11 (the first recognition mode) and perform speech recognition again on the stored speech data. For example, it may generate feedback information such as “Do you want to change the recognition mode and redo recognition of the previous speech?”

The speech data may be stored in the speech storage means as it was input, for example in waveform form, or in a converted form such as the speech features obtained after acoustic analysis. When it is stored as speech features, acoustic analysis can be skipped when recognition is performed again on the speech data, which reduces the processing required for the second recognition pass.
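The saving from storing features rather than waveforms can be illustrated with a small store that remembers the last utterance in either form. This is a sketch under stated assumptions: `StoredUtterance`, `UtteranceStore`, and `recognize_stored` are hypothetical names, and `analyze` and `search` stand in for the acoustic analysis and search stages, which the publication does not detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Features = List[List[float]]  # one feature vector per frame (e.g. MFCCs)


@dataclass
class StoredUtterance:
    """The last utterance, kept as a raw waveform and/or as speech features."""
    waveform: Optional[Sequence[float]] = None
    features: Optional[Features] = None


class UtteranceStore:
    """Minimal stand-in for the speech storage means: keeps one utterance."""

    def __init__(self) -> None:
        self._last: Optional[StoredUtterance] = None

    def save(self, utterance: StoredUtterance) -> None:
        self._last = utterance

    def load(self) -> StoredUtterance:
        if self._last is None:
            raise LookupError("no utterance has been stored")
        return self._last


def recognize_stored(
    store: UtteranceStore,
    analyze: Callable[[Sequence[float]], Features],
    search: Callable[[Features], str],
) -> str:
    """Re-recognize the stored utterance, skipping analysis when possible."""
    utterance = store.load()
    if utterance.features is not None:
        features = utterance.features            # analysis already done
    elif utterance.waveform is not None:
        features = analyze(utterance.waveform)   # must analyze the waveform
    else:
        raise ValueError("stored utterance is empty")
    return search(features)
```

Passing the second mode's search function as `search` re-runs recognition on the same speech data without asking the user to speak again.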

The entire input speech may be stored in the speech storage means, or only the segments judged to be speech by the speech detection processing may be stored. Storing only the speech segments reduces the size of the stored speech data. It also allows speech detection to be skipped when recognition is performed again on the speech data, which reduces the processing required for the second recognition pass.
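Keeping only the speech segments might look like the following sketch. The publication does not say how speech detection is done, so a fixed frame-energy threshold is used here purely as a stand-in; the function name `speech_frames` and the default values of `frame_len` and `threshold` are assumptions.

```python
from typing import List, Sequence


def speech_frames(
    samples: Sequence[float],
    frame_len: int = 160,
    threshold: float = 0.01,
) -> List[List[float]]:
    """Return only the frames whose mean energy reaches the threshold.

    A fixed energy threshold is a deliberately simple stand-in for the
    speech detection processing; real detectors are more elaborate.
    """
    kept: List[List[float]] = []
    # Walk over complete, non-overlapping frames; a trailing partial frame
    # is ignored in this sketch.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = list(samples[start:start + frame_len])
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            kept.append(frame)
    return kept
```

Only the returned frames would then be written to the speech storage means, so a later recognition pass can start directly from speech-only data.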

In addition, after the feedback information asking whether to perform speech recognition again has been output to the user, the device may change the recognition mode and perform recognition on the speech data held in the speech storage means before the user's response to redo recognition is obtained. This shortens the time between receiving the user's response and obtaining the recognition result.
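Starting the second recognition pass before the user answers can be sketched with a background task. This is an illustrative sketch only: `feedback_with_prefetch`, `second_recognizer`, and `ask_user` are hypothetical names, and a thread pool is just one way to run the speculative pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

QUESTION = "Do you want to change the recognition mode and redo recognition of the previous speech?"


def feedback_with_prefetch(
    stored_audio: bytes,
    second_recognizer: Callable[[bytes], str],
    ask_user: Callable[[str], bool],
) -> Optional[str]:
    """Ask the feedback question while the second-mode pass is already running."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start recognizing the stored speech with the second mode right away.
        pending = pool.submit(second_recognizer, stored_audio)
        # Meanwhile, output the feedback question and wait for the reply.
        if ask_user(QUESTION):
            return pending.result()  # often finished by the time the user answers
        # The user declined: discard the speculative result. cancel() only
        # succeeds if the task has not started; otherwise leaving the
        # with-block simply waits for it to finish.
        pending.cancel()
        return None
```

The trade-off is that the speculative pass is wasted work whenever the user declines, which matters on resource-constrained terminals.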

The feedback information generated by feedback generation means 13, that is, the feedback information asking whether to perform speech recognition again on the previously input speech in a recognition mode (second recognition mode) different from the previous recognition mode (first recognition mode), need not be output to the user at roughly the same time as the word string of the speech recognition result. It may instead be output when it is detected that the user has performed some subsequent operation with input means 20. The user may perform the next operation with input means 20 because, having seen the output word string, the user has noticed the recognition mode setting error and is designating the recognition mode again in order to redo the speech recognition. Alternatively, the user may be changing the recognition mode in preparation for a different utterance.

Accordingly, if the feedback information is output when it is detected that the user has performed some subsequent operation with input means 20, then whether the user is designating the recognition mode again after noticing a setting error or is changing the recognition mode for a different utterance, the recognition mode can be changed without the effort of generating and outputting feedback information. In addition, even when the user notices the recognition mode designation error unaided and redoes the speech recognition, the effort of repeating the speech input from the beginning after receiving the feedback information is reduced.

The feedback information generated by feedback generation means 13 may include, together with the inquiry as to whether to perform speech recognition again on the previously input speech, a candidate for the recognition mode to change to. For example, feedback information such as "Change to the female recognition mode and recognize again?" is output to the user. The candidate recognition mode may also be presented clearly to the user by, for example, blinking the mode designation button of that candidate.

The candidate recognition mode may be determined, for example, as the mode the user did not initially designate when there are only two recognition modes, or it may be determined from error patterns of recognition mode mis-designation that have been learned in advance. As noted above, when a user who has seen the candidate newly designates a recognition mode, the recognition mode designated by the user is used. This makes clear how the recognition mode is to be designated and enables a smoother change.
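The two ways of choosing a candidate can be sketched as below. The representation of the learned error patterns as a table of counts keyed by (designated mode, intended mode) is an assumption for illustration; the patent does not say how the patterns are learned or stored.

```python
from typing import Dict, Optional, Sequence, Tuple


def suggest_mode(
    first_mode: str,
    all_modes: Sequence[str],
    confusion_counts: Optional[Dict[Tuple[str, str], int]] = None,
) -> str:
    """Pick the candidate second recognition mode."""
    others = [m for m in all_modes if m != first_mode]
    if not others:
        raise ValueError("no alternative recognition mode available")
    if len(others) == 1:
        # Two-choice case: the mode the user did not designate.
        return others[0]
    if confusion_counts:
        # Assumed learned pattern: how often first_mode was chosen
        # when another mode was actually intended.
        return max(others, key=lambda m: confusion_counts.get((first_mode, m), 0))
    return others[0]


def feedback_message(candidate: str) -> str:
    """Feedback text that names the candidate mode."""
    return f"Change to the {candidate} recognition mode and recognize again?"
```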

Furthermore, rather than outputting feedback information every time the word string of a speech recognition result is output, whether feedback generation means 13 generates feedback information may be controlled. That is, the degree of error in the word string of the speech recognition result may be determined and used to control whether feedback information is generated.

For example, as shown in FIG. 4, speech recognition device 10A may include misrecognition determination means 15, which determines the degree of error in the word string of the speech recognition result supplied from speech recognition means 12 and, according to the determined degree of error, outputs to feedback generation means 13 control information indicating whether to generate feedback information. FIG. 4 is a block diagram showing another example of the overall configuration of the speech recognition device according to the present invention; it shows a configuration example of speech recognition device 10A, in which misrecognition determination means 15 is added to the overall configuration of FIG. 3.

In misrecognition determination means 15, for example, when the degree of error in the word string of the speech recognition result is greater than a predetermined threshold, it is determined that a recognition mode designation error is likely, and control information instructing generation of feedback information is output to feedback generation means 13. Conversely, when the degree of error is equal to or less than the threshold, it is determined that there are few errors, and control information indicating that feedback information need not be generated is output to feedback generation means 13.
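The threshold decision can be sketched as follows. Defining the degree of error as one minus the mean per-word confidence is an assumption made here for illustration; the patent only requires some measure that is compared against a predetermined threshold.

```python
from typing import Sequence


def error_degree(word_confidences: Sequence[float]) -> float:
    """Assumed measure: 1 minus the mean per-word confidence (each in [0, 1])."""
    if not word_confidences:
        return 1.0  # no recognized words is treated as maximally unreliable
    return 1.0 - sum(word_confidences) / len(word_confidences)


def should_generate_feedback(word_confidences: Sequence[float], threshold: float) -> bool:
    """True when the degree of error is greater than the threshold."""
    return error_degree(word_confidences) > threshold
```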

Here, the degree of error in the word string of the speech recognition result can be obtained using, for example, a measure such as the confidence or reliability computed during speech recognition processing. Alternatively, without using misrecognition determination means 15, the feedback generation means may ask the user whether the word string of the speech recognition result contains an error, and generate feedback information only when the user responds that the word string is wrong.

That is, as shown in FIG. 5, depending on whether the user who has received the inquiry presses a misrecognition button (or a mode designation button), the feedback generation means receives misrecognition occurrence information indicating whether the speech recognition result contains an error, and controls whether feedback information is generated. FIG. 5 is a block diagram showing yet another example of the overall configuration of the speech recognition device according to the present invention; it shows a configuration example of speech recognition device 10B, which includes feedback generation means 13A having a function of asking the user whether the speech recognition result contains an error and deciding whether to generate feedback information according to the misrecognition occurrence information returned by the user.

In this way, by asking the user, feedback information can be output only when a recognition mode designation error is likely. When the feedback information asking whether to perform speech recognition again is not generated, speech recognition is very rarely performed again on the same input speech with a changed recognition mode, so the input speech need not be stored as speech data in the speech storage means.

Furthermore, whether the input speech is in a state suitable for speech recognition may be determined based on whether its signal-to-noise ratio (S/N) is equal to or greater than a predetermined threshold, and this determination may be used to control whether to generate the feedback information asking whether to change the recognition mode and perform recognition again. That is, when it is judged that a correct recognition result is likely to be obtained if the recognition mode is changed and the input speech is recognized again, feedback information is generated; when it is judged that a correct recognition result is unlikely, feedback information is not generated.

For example, as shown in FIG. 6, speech recognition device 10C may include situation detection means 16, which determines whether the situation is one in which speech recognition errors are likely. Situation detection means 16 receives from speech recognition means 12 information about the input speech, such as whether the background noise is loud, whether burst noise is mixed in, and whether the user's volume is low, determines whether a speech recognition error is likely, and based on the result outputs to feedback generation means 13 control information indicating whether to generate feedback information. In this way, situation detection means 16 determines whether the input speech is easy to recognize, based on information representing the ease of speech recognition for that input speech. Such information includes the signal-to-noise ratio (S/N) of the input speech, the level of background noise, the presence or absence of burst noise, and the volume. FIG. 6 is a block diagram showing yet another example of the overall configuration of the speech recognition device according to the present invention; it shows a configuration example of speech recognition device 10C, which includes situation detection means 16 for determining whether the input speech is easy to recognize.
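A minimal sketch of such a determination is given below. The field names, the way the individual indicators are combined, and the default thresholds (15 dB S/N, -30 dB volume) are all assumptions for illustration; the patent lists the indicators but gives no values or combination rule. The noise power is assumed to be positive.

```python
import math
from dataclasses import dataclass


@dataclass
class InputCondition:
    """Hypothetical summary of the input speech passed to the situation detector."""
    signal_power: float
    noise_power: float      # assumed > 0
    has_burst_noise: bool
    volume_db: float


def snr_db(c: InputCondition) -> float:
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(c.signal_power / c.noise_power)


def should_offer_retry(
    c: InputCondition,
    min_snr_db: float = 15.0,      # assumed threshold
    min_volume_db: float = -30.0,  # assumed threshold
) -> bool:
    """True when the speech looks easy to recognize, so a mode change may help."""
    if c.has_burst_noise:
        return False
    if c.volume_db < min_volume_db:
        return False
    return snr_db(c) >= min_snr_db
```

When this returns False, a retry in another mode is unlikely to fix the result, so no feedback information would be generated.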

When the feedback information asking whether to perform recognition again is not generated, speech recognition is very rarely performed again on the same input speech with a changed recognition mode, so the input speech need not be stored as speech data in the speech storage means.

Also, before the feedback information is output to the user, the recognition mode may be changed to a recognition mode (second recognition mode) different from the one used for the speech recognition (first recognition mode), and speech recognition may be performed again on the speech data stored in the speech storage means. In this case, the word string resulting from the repeated recognition in the second recognition mode is output to the user, together with feedback information such as "Changed to the female recognition mode and performed recognition again."
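This variant, in which the repeated recognition runs before any feedback is shown, might look like the following sketch. The `recognize` callback and the function name are hypothetical.

```python
from typing import Callable, Tuple


def rerecognize_and_notify(
    stored_speech: str,
    second_mode: str,
    recognize: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Re-run recognition in the second mode and build the accompanying notice."""
    word_string = recognize(stored_speech, second_mode)
    message = (
        f"Changed to the {second_mode} recognition mode "
        "and performed recognition again."
    )
    # Both the new word string and the notice are output to the user.
    return word_string, message
```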

The configuration of the preferred embodiments of the present invention has been described above. It should be noted, however, that these embodiments are merely illustrative of the invention and do not limit it in any way. Those skilled in the art will readily understand that various modifications and changes can be made according to the specific application without departing from the gist of the invention. For example, embodiments of the present invention can be expressed as the following configurations, in addition to configurations (1) and (12) described in the means for solving the problems. The numbers (2) to (11) and (13) to (24) below correspond to the claim numbers.
(2) The speech recognition device according to (1) above, comprising speech storage means for storing speech data, wherein the speech recognition processing means stores all or part of the input speech subjected to speech recognition in the speech storage means as speech data, and the feedback information generated by the feedback generation means includes information asking the user whether to perform speech recognition processing on the speech data relating to the input speech stored in the speech storage means.
(3) The speech recognition device according to (2) above, wherein, when a response requesting that speech recognition processing be performed again is detected from the user who has received the feedback information, the speech recognition means performs speech recognition processing on the speech data relating to the input speech stored in the speech storage means, based on the conditions defined by the second recognition mode.
(4) The speech recognition device according to (2) above, wherein, before a response requesting that speech recognition processing be performed again is detected from the user who has received the feedback information, the speech recognition means performs speech recognition processing on the speech data relating to the input speech stored in the speech storage means, based on the conditions defined by the second recognition mode.
(5) The speech recognition device according to any one of (2) to (4) above, wherein the speech data relating to the input speech stored in the speech storage means is the speech data of a section of the input speech determined to be a speech section.
(6) The speech recognition device according to any one of (1) to (5) above, wherein, when the user who has received the feedback information gives an instruction to change the first recognition mode to another recognition mode as the second recognition mode, the feedback generation means generates feedback information asking the user whether to change from the second recognition mode to yet another recognition mode and perform speech recognition again on the same speech data as the input speech.
(7) The speech recognition device according to any one of (1) to (6) above, wherein the feedback information generated by the feedback generation means includes a candidate for the recognition mode to be used as the second recognition mode.
(8) The speech recognition device according to any one of (1) to (7) above, wherein the feedback generation means performs control so as to generate the feedback information when it is determined that the degree of error in the speech recognition result produced by the speech recognition means is greater than a predetermined threshold.
(9) The speech recognition device according to any one of (1) to (8) above, wherein the feedback generation means asks the user whether the speech recognition result produced by the speech recognition means contains an error, and performs control so as to generate the feedback information when the user responds that it does.
(10) The speech recognition device according to any one of (1) to (9) above, comprising means for detecting information on the ease of speech recognition by the speech recognition means, wherein the feedback generation means generates the feedback information when the detection means determines, based on that information, that the possibility of a speech recognition error is lower than a predetermined value.
(11) The speech recognition device according to (10) above, wherein the information on the ease of speech recognition is at least one of the background noise level of the input speech, the presence or absence of burst noise in the input speech, the signal-to-noise ratio of the input speech, and the volume of the input speech.
(12) A speech recognition method in which a user selects at least one recognition mode for performing speech recognition on input speech from among a plurality of recognition modes provided in advance and designates it as a first recognition mode, speech recognition is performed on the input speech based on the conditions defined by the first recognition mode, and the speech recognition result is output to the user, wherein feedback information asking the user whether to change to a second recognition mode different from the first recognition mode designated by the user and perform speech recognition again on the same speech data as the input speech is generated at a predetermined timing and output to the user.
(13) The speech recognition method according to (12) above, wherein all or part of the input speech subjected to speech recognition is stored as speech data, and the feedback information includes information asking the user whether to perform speech recognition processing on the stored speech data relating to the input speech.
(14) The speech recognition method according to (13) above, wherein, when a response requesting that speech recognition processing be performed again is detected from the user who has received the feedback information, speech recognition processing is performed on the stored speech data relating to the input speech, based on the conditions defined by the second recognition mode.
(15) The speech recognition method according to (13) above, wherein, before a response requesting that speech recognition processing be performed again is detected from the user who has received the feedback information, speech recognition processing is performed on the stored speech data relating to the input speech, based on the conditions defined by the second recognition mode.
(16) The speech recognition method according to any one of (13) to (15) above, wherein the stored speech data relating to the input speech is the speech data of a section of the input speech determined to be a speech section.
(17) The speech recognition method according to any one of (12) to (16) above, wherein, when the user who has received the feedback information gives an instruction to change the first recognition mode to another recognition mode as the second recognition mode, feedback information is generated asking the user whether to change from the second recognition mode to yet another recognition mode and perform speech recognition again on the same speech data as the input speech.
(18) The speech recognition method according to any one of (12) to (17) above, wherein the feedback information includes a candidate for the recognition mode to be used as the second recognition mode.
(19) The speech recognition method according to any one of (12) to (18) above, wherein control is performed so as to generate the feedback information when it is determined that the degree of error in the speech recognition result is greater than a predetermined threshold.
(20) The speech recognition method according to any one of (12) to (19) above, wherein the user is asked whether the speech recognition result contains an error, and control is performed so as to generate the feedback information when the user responds that it does.
(21) The speech recognition method according to any one of (12) to (20) above, wherein information on the ease of speech recognition by the speech recognition means is detected, and the feedback information is generated when it is determined, based on that information, that the possibility of a speech recognition error is lower than a predetermined value.
(22) The speech recognition method according to (21) above, wherein the information on the ease of speech recognition is at least one of the background noise level of the input speech, the presence or absence of burst noise in the input speech, the signal-to-noise ratio of the input speech, and the volume of the input speech.
(23) A speech recognition program that implements the speech recognition method according to any one of (12) to (22) above as a program executable by a computer.
(24) A program recording medium in which the speech recognition program according to (23) above is recorded on a computer-readable recording medium.

10 Speech recognition device
10A Speech recognition device
10B Speech recognition device
10C Speech recognition device
11 Recognition mode setting means
12 Speech recognition means
13 Feedback generation means
13A Feedback generation means
14 Output means
15 Misrecognition determination means
16 Situation detection means
20 Input means

Claims

The user selects at least one or more recognition modes for performing speech recognition on the input speech from a plurality of recognition modes provided in advance and designates them as a first recognition mode, and the first recognition mode is In the speech recognition apparatus provided with speech recognition means for performing speech recognition on the input speech based on specified conditions and outputting the speech recognition result to the user, the first recognition mode designated by the user is By changing to a different second recognition mode, feedback information for inquiring the user whether or not to perform voice recognition again on the same voice data as the input voice is generated at a predetermined timing to the user. A speech recognition apparatus comprising feedback generation means for outputting.

Voice storage means for storing voice data, wherein the voice recognition processing means stores all or part of the input voice for voice recognition as voice data in the voice storage means, and the feedback generation means generates the voice data; The voice recognition according to claim 1, wherein the feedback information includes information for inquiring a user whether or not to perform voice recognition processing on voice data related to the input voice stored in the voice storage means. apparatus.

When a response indicating that voice recognition processing is to be performed again from the user who has received the feedback information is detected, the voice recognition unit stores the response in the voice storage unit based on a condition defined by the second recognition mode. The voice recognition apparatus according to claim 2, wherein voice recognition processing is performed on voice data related to the input voice.

Prior to detecting a response to perform voice recognition processing again from the user who has received the feedback information, the voice recognition means performs the voice storage means based on a condition defined by the second recognition mode. The voice recognition apparatus according to claim 2, wherein voice recognition processing is performed on voice data related to the input voice stored in the voice data.

The voice recognition according to any one of claims 2 to 4, wherein the voice data relating to the input voice stored in the voice storage means is voice data of a section determined as a voice section of the input voice. apparatus.

When there is an instruction from the user who has received the feedback information to change the first recognition mode to another recognition mode as the second recognition mode, the feedback generation unit starts from the second recognition mode. 6. The feedback information for changing to a different recognition mode and inquiring the user whether or not to perform voice recognition again on the same voice data as the input voice is generated. The speech recognition apparatus described in 1.

The speech recognition according to any one of claims 1 to 6, wherein the feedback information generated by the feedback generation means includes a recognition mode candidate to be a target of the second recognition mode to be changed. apparatus.

2. The control unit according to claim 1, wherein the feedback generation unit performs control so as to generate the feedback information when it is determined that the degree of error in the speech recognition result by the speech recognition unit is greater than a predetermined threshold. 8. The speech recognition device according to any one of items 7.

The feedback generation means inquires of the user whether there is an error in the voice recognition result by the voice recognition means, and controls to generate the feedback information when there is a response from the user that there is an error. The speech recognition apparatus according to claim 1, wherein

Means for detecting information relating to the ease of speech recognition by the speech recognition means, and when the detection means determines that the possibility of an error in speech recognition is lower than a predetermined value based on the information, the feedback The speech recognition apparatus according to claim 1, wherein the generation unit generates the feedback information.

The information relating to the ease of speech recognition is at least one of a background noise level or presence / absence of sudden noise in the input speech, or a signal-to-noise ratio or volume of the input speech. The speech recognition apparatus according to 10.

The user selects at least one or more recognition modes for performing speech recognition on the input speech from a plurality of recognition modes provided in advance and designates them as a first recognition mode, and the first recognition mode is A speech recognition method for performing speech recognition on the input speech based on a prescribed condition and outputting a speech recognition result to the user, wherein the second recognition mode is different from the first recognition mode designated by the user. Changing to the recognition mode, generating feedback information for inquiring the user whether or not to perform voice recognition again on the same voice data as the input voice, and outputting the feedback information to the user A feature of speech recognition.

All or part of the input speech for speech recognition is stored as speech data, and the feedback information includes information for inquiring the user whether speech recognition processing is to be performed on the stored speech data related to the input speech. The speech recognition method according to claim 12, wherein:

When a response indicating that voice recognition processing is to be performed again is detected from the user who has received the feedback information, voice recognition is performed on the stored voice data related to the input voice based on the condition defined by the second recognition mode. The speech recognition method according to claim 13, wherein processing is performed.

Prior to detecting a response to perform voice recognition processing again from the user who has received the feedback information, the stored voice data related to the input voice is determined based on the conditions defined by the second recognition mode. The voice recognition method according to claim 13, wherein voice recognition processing is performed.

The voice recognition method according to claim 13, wherein the voice data relating to the input voice to be stored is voice data of a section determined as a voice section of the input voice.

The voice recognition method according to claim 12, wherein, when the user who has received the feedback information instructs that yet another recognition mode be used as the second recognition mode to which the first recognition mode is changed, the second recognition mode is changed to that different recognition mode, and feedback information for inquiring of the user whether to perform voice recognition again on the same voice data as the input voice is generated.

18. The speech recognition method according to claim 12, wherein the feedback information includes a recognition mode candidate to be a target of the second recognition mode to be changed.

19. The speech recognition method according to claim 12, wherein control is performed so as to generate the feedback information when it is determined that an error degree of the speech recognition result is greater than a predetermined threshold value.

20. The speech recognition method according to any one of claims 12 to 19, wherein the user is asked whether there is an error in the speech recognition result, and control is performed so as to generate the feedback information when the user responds that there is an error.

The speech recognition method according to any one of claims 12 to 20, wherein information related to the ease of speech recognition by the speech recognition means is detected, and the feedback information is generated when it is determined, based on the information, that the possibility of an error in speech recognition is lower than a predetermined value.

The speech recognition method according to claim 21, wherein the information relating to the ease of speech recognition is at least one of a background noise level, the presence or absence of sudden noise in the input speech, a signal-to-noise ratio, or a volume of the input speech.

23. A speech recognition program, wherein the speech recognition method according to claim 12 is implemented as a program executable by a computer.

24. A program recording medium, wherein the voice recognition program according to claim 23 is recorded on a computer-readable recording medium.
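
The flow recited in the method claims above can be illustrated with a minimal Python sketch. It is not the applicant's implementation. Every name in it (ModeFeedbackSession, Result, Feedback, name_mode, address_mode), the error_degree score, the threshold values, and the use of a signal-to-noise ratio as the only ease-of-recognition measure are assumptions made for illustration. The two recognizers are toy stand-ins, not real speech recognizers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Result:
    text: str
    error_degree: float  # hypothetical score in [0, 1]; higher = more likely wrong


@dataclass
class Feedback:
    message: str
    candidate_modes: List[str]  # candidates for the second recognition mode


Recognizer = Callable[[bytes], Result]


class ModeFeedbackSession:
    """Illustrative session: recognize, offer a mode change, re-recognize."""

    def __init__(self, recognizers: Dict[str, Recognizer],
                 error_threshold: float = 0.5, min_snr_db: float = 15.0):
        self.recognizers = recognizers
        self.error_threshold = error_threshold  # assumed value
        self.min_snr_db = min_snr_db            # assumed value
        self.stored_speech: Optional[bytes] = None
        self.first_mode: Optional[str] = None
        self.result: Optional[Result] = None

    def recognize(self, speech: bytes, first_mode: str) -> Result:
        # Store the input speech so it can be recognized again later.
        self.stored_speech = speech
        self.first_mode = first_mode
        self.result = self.recognizers[first_mode](speech)
        return self.result

    def make_feedback(self, snr_db: float) -> Optional[Feedback]:
        if self.result is None:
            raise RuntimeError("recognize() must be called first")
        # Only ask when the result looks wrong enough.
        if self.result.error_degree <= self.error_threshold:
            return None
        # Only ask when acoustic conditions are good, so that an error is
        # unlikely to be explained by noise rather than by the mode.
        if snr_db < self.min_snr_db:
            return None
        candidates = [m for m in self.recognizers if m != self.first_mode]
        if not candidates:
            return None
        return Feedback(
            "The recognition mode may be wrong. "
            "Recognize the same speech again in another mode?",
            candidates,
        )

    def rerun(self, second_mode: str) -> Result:
        # Re-recognize the stored speech under the second mode's condition.
        if self.stored_speech is None:
            raise RuntimeError("no stored speech")
        if second_mode == self.first_mode:
            raise ValueError("second mode must differ from the first mode")
        return self.recognizers[second_mode](self.stored_speech)


# Toy stand-in recognizers: each "understands" only its own kind of utterance.
def name_mode(speech: bytes) -> Result:
    ok = speech == b"name-utterance"
    return Result("John Smith", 0.1) if ok else Result("?", 0.9)


def address_mode(speech: bytes) -> Result:
    ok = speech == b"address-utterance"
    return Result("1 Main Street", 0.1) if ok else Result("?", 0.9)
```

In this sketch, a caller would construct a session with both toy modes, call recognize() with the mode the user designated, call make_feedback() with a measured signal-to-noise ratio, and call rerun() with one of the offered candidate modes only if the user agrees.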