JP2009015148A

JP2009015148A - Speech recognition device, speech recognition method and speech recognition program

Info

Publication number: JP2009015148A
Application number: JP2007178671A
Authority: JP
Inventors: Takuya Hirai; 卓哉平井; Atsushi Yamashita; 敦士山下; Tomohiro Terada; 智裕寺田
Original assignee: Panasonic Corp
Current assignee: Panasonic Corp
Priority date: 2007-07-06
Filing date: 2007-07-06
Publication date: 2009-01-22

Abstract

PROBLEM TO BE SOLVED: To provide a user-friendly speech recognition device that, when a correct word is input again by speech, does not spoil the ease of input inherent in speech input itself.

SOLUTION: The speech recognition device includes: a storage section for storing word data; a speech recognition section that compares the word data expressed by input speech with the word data stored in the storage section, calculates a recognition score expressing the degree of matching, and outputs the recognized word on the basis of the calculated recognition score; a receiving period determination section for determining, on the basis of the recognition score calculated by the speech recognition section, a receiving period during which re-input by speech can be received; and a speech re-recognition section for re-recognizing the word input again within the receiving period determined by the receiving period determination section and outputting the re-recognized word.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program, and more specifically to a speech recognition device, method, and program that correct misrecognized utterance content to the utterance content that should have been recognized.

When a user utters a word, a conventional speech recognition device compares the uttered word (hereinafter, the "uttered word") with the words stored in a dictionary held in advance (hereinafter, "in-dictionary words") using probabilistic and statistical processing or the like, and reports the most probable word to the user as the recognition result. The reported word sometimes differs from the word the user uttered (hereinafter, this is called "misrecognition"). There are various possible causes, and recognizing every utterance 100% correctly is very difficult.

On the other hand, speech recognition based on probabilistic and statistical processing can, when comparing the uttered word with an in-dictionary word, calculate a confidence level (also called word reliability, likelihood, and so on) that expresses how acoustically similar the uttered word is to the in-dictionary word.

Using such a confidence level, if the confidence is 90% the word is accepted as correctly recognized, whereas if the confidence is around 50% the device asks the user about the recognized word and accepts speech input of the correct word again (see, for example, Patent Document 1).
Patent Document 1: JP 2001-175276 A

However, in conventional speech recognition, when the confidence level is low, the device asks the user about the recognized word, and the user must then perform a further operation, such as pressing a re-input button, before speaking the correct word again.

Consequently, the correct word cannot be spoken again without an extra one-step operation such as pressing a re-input button. This undermines the ease of input that speech input inherently offers and makes the device less convenient for the user.

The present invention has been made in view of the above problem. Its object is to provide a speech recognition device, a speech recognition method, and a speech recognition program that are convenient for the user and that do not impair the ease of input inherent in speech input when the correct word is spoken again.

A first aspect of the present invention is directed to a speech recognition device. The device comprises: a storage unit that stores word data; a speech recognition unit that compares the word data represented by input speech with the word data stored in the storage unit, calculates a recognition score representing the degree of matching in the comparison, and outputs the recognized word based on the calculated recognition score; an acceptance time determination unit that determines, based on the recognition score calculated by the speech recognition unit, an acceptance time during which re-input by speech can be accepted; and a speech re-recognition unit that re-recognizes the word represented by speech re-input within the acceptance time determined by the acceptance time determination unit and outputs the re-recognized word.

This configuration provides a speech recognition device that is convenient for the user and does not impair the ease of input inherent in speech input when the correct word is spoken again.

Preferably, the device further comprises an acceptance time notification unit that notifies the user of the acceptance time determined by the acceptance time determination unit by means of a display object that changes as the acceptance time changes.

With this configuration, the user is notified of the time during which the recognition result can be corrected, which depends on the device's confidence in its recognition, so the user can correct a misrecognition without being rushed.

Preferably, the acceptance time notification unit stops the change of the display object when speech is re-input within the acceptance time determined by the acceptance time determination unit.

With this configuration, the user can intuitively understand that the device is processing (recognizing) what the user has just said.

Preferably, the acceptance time notification unit changes the position, within the display area, of the display object representing the acceptance time determined by the acceptance time determination unit as that acceptance time changes.

With this configuration, the user can intuitively grasp the time remaining for re-input simply by looking at the distance between the position where the object starts moving and the position where it stops.

Preferably, the acceptance time notification unit changes the number of scale marks of the display object representing the acceptance time determined by the acceptance time determination unit as that acceptance time changes.

With this configuration, the user can intuitively grasp how urgently a correction must be made.

Preferably, the acceptance time notification unit changes the size of the display object representing the acceptance time determined by the acceptance time determination unit as that acceptance time changes.

With this configuration, the user can intuitively infer the device's confidence level from how small the characters are at the start.

Preferably, the acceptance time notification unit changes the transparency of the display object representing the acceptance time determined by the acceptance time determination unit as that acceptance time changes.

With this configuration, the user can intuitively infer the device's confidence level from how transparent the characters are at the start.

Preferably, during the time determined by the acceptance time determination unit, the speech re-recognition unit performs re-recognition with the word recognized by the speech recognition unit excluded from the storage unit.

With this configuration, when speech is input within the time in order to correct the current result, the same word as before is not recognized again, so the recognition rate improves.

Preferably, during the time determined by the acceptance time determination unit, the speech re-recognition unit re-recognizes a word formed by adding a predetermined word to the word recognized by the speech recognition unit.

With this configuration, even if the user, flustered at seeing an incorrect result, utters a filler word together with the intended word when trying to correct the current result within the time, the word the user wants can still be output as the recognition result.

A second aspect of the present invention is directed to a speech recognition method. The method comprises: a speech recognition step of comparing the word data represented by input speech with the word data stored in a storage unit, calculating a recognition score representing the degree of matching in the comparison, and outputting the recognized word based on the calculated recognition score; an acceptance time determination step of determining, based on the recognition score calculated in the speech recognition step, an acceptance time during which re-input by speech can be accepted; and a speech re-recognition step of re-recognizing the word represented by speech re-input within the acceptance time determined in the acceptance time determination step and outputting the re-recognized word.

This configuration provides a speech recognition method that is convenient for the user and does not impair the ease of input inherent in speech input when the correct word is spoken again.

A third aspect of the present invention is directed to a speech recognition program executed by a computer of a speech recognition device. The program causes the computer to execute: a speech recognition step of comparing the word data represented by input speech with the word data stored in a storage unit, calculating a recognition score representing the degree of matching in the comparison, and outputting the recognized word based on the calculated recognition score; an acceptance time determination step of determining, based on the recognition score calculated in the speech recognition step, an acceptance time during which re-input by speech can be accepted; and a speech re-recognition step of re-recognizing the word represented by speech re-input within the acceptance time determined in the acceptance time determination step and outputting the re-recognized word.

This configuration provides a speech recognition program that is convenient for the user and does not impair the ease of input inherent in speech input when the correct word is spoken again.

As described above, each aspect of the present invention provides a speech recognition device, a speech recognition method, and a speech recognition program that are convenient for the user and do not impair the ease of input inherent in speech input when the correct word is spoken again.

Hereinafter, a speech recognition device according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the overall configuration of the speech recognition device according to the embodiment of the present invention. In FIG. 1, the speech recognition device includes a speech input unit 100, a recognition unit 200, a recognition target word storage unit 300, a correction timing control unit 400, a notification content control unit 500, and a notification unit 600. The speech input unit 100 further includes a recognition start/end unit 110, and the recognition unit 200 further includes a recognition score calculation unit 210.

The speech input unit 100 is, for example, a microphone that captures speech uttered by the user. Before speech is captured, the user operates the recognition start/end unit 110, which notifies the recognition unit 200 in advance that speech data is about to be input. When operated by the user, the recognition start/end unit 110 is triggered by, for example, a button press or the utterance of a special word.

On receiving the notification from the recognition start/end unit 110, the recognition unit 200 starts capturing the input speech data, compares it with the words stored in the recognition target word storage unit 300 (hereinafter, "in-dictionary words"), and extracts the in-dictionary word that is acoustically closest. This comparison can be implemented with a probabilistic and statistical technique such as the HMM (hidden Markov model) used in many speech recognition systems. When the acoustically close word is extracted, the recognition score calculation unit 210 calculates a score indicating how close (similar) the words are. A common likelihood or word reliability measure may be used as this score. In this embodiment, the score is assumed to range from 0 to 10, with 0 meaning acoustically closest and 10 meaning farthest.
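The scoring convention above (0 is closest, 10 is farthest) can be illustrated with a toy sketch. It does not use an HMM: as a loudly labeled assumption, it substitutes a normalized edit distance between romanized strings for the acoustic comparison, and the names `edit_distance`, `recognition_score`, and `recognize` are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here only as a stand-in for acoustic distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def recognition_score(heard: str, candidate: str) -> float:
    """Map the distance onto the 0 (closest) to 10 (farthest) scale."""
    longest = max(len(heard), len(candidate), 1)
    return 10.0 * edit_distance(heard, candidate) / longest


def recognize(heard: str, dictionary: List[str]) -> Tuple[str, float]:
    """Return the in-dictionary word with the lowest score, plus that score."""
    best = min(dictionary, key=lambda word: recognition_score(heard, word))
    return best, recognition_score(heard, best)
```

A real recognizer would compare acoustic features rather than spellings; the sketch only shows how a lowest-score selection and a 0 to 10 scale fit together.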

Each process is described below, taking as an example the case where the user utters the word "see" (見る).

The recognition unit 200 receives the speech data for "see" and extracts, from the in-dictionary words stored in the recognition target word storage unit 300, the word closest to the uttered "see". Suppose that the words "see" (見る) and "map" (地図) have been registered in the recognition target word storage unit 300 in advance, and that the recognition unit 200 extracts "map" as the acoustically closest word and takes it as the recognition result. Suppose further that the score calculated by the recognition score calculation unit 210 for the recognized word "map" is 6.

The flow of processing is now described with reference to FIG. 2. The recognition unit 200 sends the recognized word ("map") and the calculated score (6) to the correction timing control unit 400 (step S200).

The correction timing control unit 400 uses the notified score to determine the time during which the recognized word can be corrected (step S201). In this example the score is 6, so the time is set to 6 seconds. More generally, the time should be set so that the more confident the recognition result, the shorter the correctable time, and the less confident the result, the longer the correctable time. The time may take either discrete or continuous values with respect to the score calculated by the recognition unit 200.
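The mapping from score to correctable time can be sketched as follows. The text only fixes the example (score 6 gives 6 seconds) and the direction of the relationship; the linear form, the `seconds_per_point` factor, and the optional rounding `step` are assumptions made for illustration.

```python
import math
from typing import Optional


def correctable_time_s(score: float,
                       seconds_per_point: float = 1.0,
                       step: Optional[float] = None) -> float:
    """Return the correction window for a score on the 0 (confident) to 10 scale.

    With step=None the result is continuous; with a step it is rounded up
    to the next multiple of that step, giving discrete values.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be within 0..10")
    seconds = score * seconds_per_point
    if step is not None:
        seconds = math.ceil(seconds / step) * step
    return seconds
```

A higher score (less confident recognition) gives a longer window, which matches the rule stated above.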

Next, the correction timing control unit 400 notifies the recognition start/end unit 110 of a trigger to start recognition (step S203). On receiving this notification, the recognition start/end unit 110 notifies the recognition unit 200 of a start trigger so that recognition processing can be performed (step S204).

The processing then branches depending on whether the user speaks within the correctable time; in this example, whether the user utters a word within 6 seconds.

If the user speaks within the time, the recognition unit 200 performs speech recognition with reference to the recognition target word storage unit 300 and extracts a recognized word (steps S205 and S206). The correction timing control unit 400, notified of the recognized word and its score, then replaces the previously recognized word with the newly notified word.

For example, if the user utters "see" again and the recognition unit 200 recognizes "see", the previously recognized word ("map") is replaced with the newly uttered word ("see"), so the previous result is corrected. The user therefore does not need to operate the recognition start/end unit 110 or perform any separate operation just to cancel the previous result; simply uttering the word to be recognized is enough to make the correction. When recognition confidence is low (a score toward 10 on the scale of this embodiment), the correctable time is long, giving the user ample opportunity to make a correction. Conversely, when recognition confidence is high (a score toward 0), the correctable time is short, so the user waits little and the next operation is not held up.
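The branch described above can be reduced to a small decision function. This is a minimal sketch, assuming the re-input is represented as a hypothetical `(elapsed_seconds, word)` pair; the function name `final_word` is likewise an illustration, not part of the described device.

```python
from typing import Optional, Tuple


def final_word(first_word: str,
               window_s: float,
               reinput: Optional[Tuple[float, str]]) -> str:
    """Pick the word to confirm after the correction window.

    reinput is None when the user stayed silent; otherwise it holds the
    time of the new utterance (seconds after the first result) and the
    word recognized from it.
    """
    if reinput is not None:
        elapsed_s, new_word = reinput
        if elapsed_s <= window_s:
            return new_word  # re-input arrived in time: replace the result
    return first_word        # silence or late speech: keep the first result
```

With the example from the text, `final_word("map", 6.0, (2.5, "see"))` selects the corrected word because the re-input arrives inside the 6-second window.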

In step S202, the previously recognized word (in this example, "map") is temporarily deleted from the recognition target word storage unit 300 that the recognition unit 200 refers to. As a result, when the user speaks again, that acoustically similar word is no longer a recognition candidate, so misrecognition is less likely and the recognition rate improves.

In step S213, words formed by prefixing each word stored in the recognition target word storage unit 300 with a filler such as "えーっと" ("um") or "あのー" ("er") are temporarily added to the recognition target word storage unit 300. This prevents the recognition unit 200 from misrecognizing the fillers that a user involuntarily utters on being shown an unexpected recognition result.

Whether to perform steps S202 and S213 may be decided according to the score. In particular, when the score calculated by the recognition unit 200 indicates very high reliability, these steps can be skipped, because operations such as registering words carry a high processing load; skipping them reduces the load.

Although not shown in FIG. 2, after steps S202 and S213 have been performed, the changes they made to the recognition target word storage unit 300 are undone when the recognition unit 200 finishes recognition or when the correctable time has elapsed.
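The temporary dictionary changes of steps S202 and S213, together with the restoration just described, map naturally onto a context manager. The sketch below is an assumption about one possible structure: the vocabulary is modeled as a hypothetical dict from spoken form to output word, and the romanized fillers in `FILLERS` are illustrative.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Tuple

FILLERS: Tuple[str, ...] = ("eetto", "anoo")  # illustrative romanized fillers


@contextmanager
def correction_vocabulary(vocabulary: Dict[str, str],
                          previous_word: str,
                          fillers: Tuple[str, ...] = FILLERS) -> Iterator[Dict[str, str]]:
    """Temporarily edit the vocabulary for the correction window."""
    original = dict(vocabulary)
    try:
        # S202: drop the word that was just recognized.
        vocabulary.pop(previous_word, None)
        # S213: add filler-prefixed forms that still map to the plain word.
        for spoken, word in list(vocabulary.items()):
            for filler in fillers:
                vocabulary[f"{filler} {spoken}"] = word
        yield vocabulary
    finally:
        # Restore the dictionary when recognition ends or the window expires.
        vocabulary.clear()
        vocabulary.update(original)
```

Using `try`/`finally` means the original contents come back even if recognition raises an error inside the `with` block.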

Next, the case where the user does not speak before the correctable time elapses is described. In this example, this is the case where no speech data is input from the speech input unit 100 within 6 seconds. In this case, immediately after the correctable time elapses, the correction timing control unit 400 notifies the recognition start/end unit 110 of a trigger to end recognition. On receiving this notification, the recognition start/end unit 110 notifies the recognition unit 200 of an end trigger, and the recognition unit 200 then performs no recognition processing until a start trigger from the next user operation is input.
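Waiting for a re-input with a deadline can be sketched with the standard `asyncio` library. Modeling incoming utterances as an `asyncio.Queue` and the function name `await_reinput` are assumptions made for this illustration; the text does not describe how the device implements the timer.

```python
import asyncio
from typing import Optional


async def await_reinput(utterances: "asyncio.Queue[str]",
                        window_s: float) -> Optional[str]:
    """Wait up to window_s seconds for a new utterance.

    Returns the utterance, or None when the correctable time elapses,
    which corresponds to sending the end trigger in the text.
    """
    try:
        return await asyncio.wait_for(utterances.get(), timeout=window_s)
    except asyncio.TimeoutError:
        return None
```

A caller would treat `None` as the signal to stop recognition until the next user-initiated start trigger.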

With this processing, even when the recognition confidence for the uttered word is low, if the word was in fact recognized correctly the user does not have to do anything, so no extra operation is required.

Meanwhile, the correction timing control unit 400 notifies the notification content control unit 500 of the calculated correctable time and the recognized word.

The notification content control unit 500 controls how the user is notified that the recognized word can be corrected, and passes this to the notification unit 600. The notification unit 600 may be, for example, a device that displays a GUI, such as a liquid crystal or organic EL display; a device that can emit sound (speech or beeps), such as a speaker; or a device that can give tactile feedback, such as a motor. Possible notification methods are shown in FIGS. 3A to 6. These examples assume that the recognition unit 200 has recognized "map" and that the correctable time is 6 seconds.

FIGS. 3A and 3B show examples in which an amount of movement changes.

FIGS. 3A and 3B show, on the display D300 that presents the recognition result, how the display changes during the correctable time, from state 1 (start) to state 3 (end). Three states are shown in this example, but the display may change continuously between them.

FIG. 3A shows the word W300 recognized by the recognition unit 200 moving across the display D300 at a constant speed (or constant acceleration) during the correctable time. In this example, when the word W300 reaches the position indicated by region D301, the correction time ends and W300 is confirmed as the recognition result. Here the correction time is 6 seconds (the time taken to go from state 1 to state 3); if the correctable time were 3 seconds, the initial position of the word W300 could be placed at the state 2 position. The user can thus sense how much correction time remains from the distance between the word W300 and region D301.
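The constant-speed case of FIG. 3A amounts to making the distance to the end position proportional to the remaining time. The sketch below assumes a hypothetical track of 300 pixels that takes 6 seconds to traverse; both numbers and the name `word_position_px` are illustrative.

```python
def word_position_px(remaining_s: float,
                     full_track_s: float = 6.0,
                     track_length_px: float = 300.0) -> float:
    """Position of the word along the track, 0 = start, track_length_px = end.

    A shorter correction window simply starts further along the track,
    so the distance to the end always reflects the remaining time.
    """
    remaining = min(max(remaining_s, 0.0), full_track_s)
    return track_length_px * (1.0 - remaining / full_track_s)
```

With these assumed values, a 3-second window starts at the midpoint of the track, which corresponds to the state 2 position mentioned above.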

FIG. 3B shows scale marks D303 increasing as time passes and gradually filling region D302. Displaying such a scale lets the user grasp the remaining correction time intuitively. The width of the scale marks D303 or of region D302 may be changed according to the correction time, which allows the correction time to vary without changing the time represented by one scale mark. In FIG. 3B some scale marks D303 are already shown in state 1 (start), but the display may instead start with no scale marks.
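The scale of FIG. 3B can be sketched as a pair of counts: how many marks the region holds and how many are filled. The fixed `seconds_per_tick` and the function name `gauge` are assumptions; the sketch starts from zero filled marks, which the text allows.

```python
import math
from typing import Tuple


def gauge(window_s: float,
          elapsed_s: float,
          seconds_per_tick: float = 1.0) -> Tuple[int, int]:
    """Return (filled_ticks, total_ticks) for the correction window.

    The total grows with the window, so each tick always stands for the
    same amount of time regardless of how long the window is.
    """
    total = math.ceil(window_s / seconds_per_tick)
    filled = int(max(elapsed_s, 0.0) // seconds_per_tick)
    return min(filled, total), total
```

For a 6-second window, `gauge(6.0, 2.5)` reports two of six marks filled.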

Next, an example in which the size changes is described with reference to FIG. 4.

In FIG. 4 as well, the correctable time is described using state 1 (start) through state 3 (end). When correction becomes possible (state 1), the word W300 recognized by the recognition unit 200 is displayed in small characters; the characters are gradually enlarged, and the correction time ends when they reach their largest size. This lets the user intuitively infer the device's confidence level from how small the characters are at the start, and also understand that a long time is available for correction. Because the change happens in place, the user's gaze moves little, which is considered suitable for equipment such as automobiles.
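The growing text of FIG. 4 can be sketched as an interpolation between a smallest and a largest size. The point sizes, the 6-second reference window, and the name `font_size_pt` are assumptions chosen for illustration.

```python
def font_size_pt(remaining_s: float,
                 full_window_s: float = 6.0,
                 min_pt: float = 12.0,
                 max_pt: float = 36.0) -> float:
    """Character size for the recognized word.

    The size grows as the remaining time shrinks; a long window therefore
    starts small, which hints at low recognition confidence.
    """
    remaining = min(max(remaining_s, 0.0), full_window_s)
    progress = 1.0 - remaining / full_window_s
    return min_pt + (max_pt - min_pt) * progress
```

The same shape of function could drive other in-place properties, but only the size case is shown here.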

In a form similar to FIG. 4, the speed of an effect such as blinking of the characters of the word W300 may be varied instead of the character size (not illustrated).

In that case, the blinking is made fast (short period) in state 1, gradually slowed (longer period), and stopped altogether in state 3. The user can then estimate the remaining time just from the blinking speed, and because the content the user should attend to most (the recognition result) is what blinks, the device can signal that a misrecognition is possible. The quantity whose rate is varied could also be, for example, the size of the word or the speed at which the word moves.
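The blinking variant can be sketched as a period that lengthens over the window and disappears at the end. The period bounds and the name `blink_period_s` are assumptions; the text only states that blinking starts fast, slows down, and stops in state 3.

```python
from typing import Optional


def blink_period_s(remaining_s: float,
                   window_s: float,
                   fastest_s: float = 0.2,
                   slowest_s: float = 1.0) -> Optional[float]:
    """Blink period for the recognized word, or None for "do not blink".

    The period starts short (fast blinking) and lengthens linearly as the
    correction window runs out.
    """
    if window_s <= 0 or remaining_s <= 0:
        return None  # state 3: the result is confirmed, so blinking stops
    progress = 1.0 - min(remaining_s, window_s) / window_s
    return fastest_s + (slowest_s - fastest_s) * progress
```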

Further, as shown in FIG. 5, the transparency with which the recognized word is displayed may be varied.

In this case too, the confidence level of the recognized word W300 can be presented to the user intuitively, and the user can also get a feel for the time remaining for correction.

Alternatively, as shown in FIG. 6, the remaining correctable time may be displayed in a time display area D304.

In this case the correctable time is shown to the user directly, so the remaining time is expressed very plainly and is easy to understand.
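The direct display of FIG. 6 needs only the remaining time formatted as text. Rounding up to whole seconds and the `"N s"` label format are assumptions; the text does not specify how the time is rendered.

```python
import math


def remaining_label(window_s: float, elapsed_s: float) -> str:
    """Text for the time display area: whole seconds left, rounded up."""
    remaining = max(0.0, window_s - elapsed_s)
    return f"{math.ceil(remaining)} s"
```

Rounding up keeps the label from showing zero while some correction time is still left.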

The examples of FIGS. 3A to 6 described above may be combined with one another as appropriate.

While these changes are in progress, the user can correct the word simply by uttering the word they want. For example, when the user utters "see", the recognition unit 200 starts recognition without the recognition start/end unit 110 being operated. When the recognition unit 200 begins the recognition process, the notification content control unit 500 stops the changes shown in FIGS. 3A to 6.

Because the changes stop, the user can intuitively understand that the device is processing (recognizing) what the user has just said. Other display methods are also possible, such as showing or changing an icon, changing the screen color, or providing text guidance that indicates the state.
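Stopping the display change when re-input begins can be sketched as a small state holder that ignores time updates once frozen. The class name `CorrectionIndicator` and its methods are hypothetical; any of the display variants above could read its `progress` value.

```python
class CorrectionIndicator:
    """Tracks progress through the correction window for the display."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self.elapsed_s = 0.0
        self.frozen = False

    def tick(self, dt_s: float) -> None:
        """Advance the animation unless re-recognition is in progress."""
        if not self.frozen:
            self.elapsed_s = min(self.window_s, self.elapsed_s + dt_s)

    def on_reinput_started(self) -> None:
        """Called when the recognizer starts processing a new utterance."""
        self.frozen = True

    @property
    def progress(self) -> float:
        """Fraction of the window that has elapsed, from 0.0 to 1.0."""
        return self.elapsed_s / self.window_s
```

Once `on_reinput_started` has been called, further `tick` calls leave `progress` unchanged, which is the visible cue that the device is processing the new utterance.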

The present invention also covers the case where a software program that implements the embodiment described above (in the embodiment, a program corresponding to the flowcharts shown in the figures) is supplied to a device, and the computer of that device reads and executes the supplied program. Accordingly, the program itself, installed on a computer in order to realize the functional processing of the present invention on that computer, also implements the present invention. That is, the present invention also includes a speech recognition program for realizing the functional processing of the present invention.

In this way, it is possible to provide a speech recognition device, a speech recognition method, and a speech recognition program that offer high user convenience and do not impair the ease of input inherent in voice input when the correct word is input again by voice.

The configuration described in the above embodiment is merely a specific example and does not limit the technical scope of the present invention. Any configuration may be adopted within the range in which the effects of the present application are obtained.

As described above, the speech recognition device according to the present invention allows the user to correct a misrecognition without being rushed by the correction operation and shortens the steps needed to correct the misrecognition. It is therefore useful as a speech recognition device or the like that corrects misrecognized utterance content to the utterance content that should have been recognized.

Overall block diagram of the speech recognition device according to the embodiment of the present invention
Sequence diagram explaining the operation of the speech recognition device according to the embodiment of the present invention
Example of a display showing how the speech recognition device according to the embodiment of the present invention changes its display over time
Example of a display showing how the speech recognition device according to the embodiment of the present invention changes its display over time
Example of a display showing how the speech recognition device according to the embodiment of the present invention changes its display over time
Example of a display showing how the speech recognition device according to the embodiment of the present invention changes its display over time
Example of a display showing how the speech recognition device according to the embodiment of the present invention changes its display over time

Explanation of symbols

100 Voice input unit
110 Recognition start/end unit
200 Recognition unit
210 Recognition score calculation unit
300 Recognition target word storage unit
400 Correction timing control unit
500 Notification content control unit
600 Notification unit
D300 Display
D301 Display area
D302 Display area
D303 Scale
D304 Time display area
W300 Recognized word

Claims

A speech recognition device comprising:
a storage unit that stores word data;
a speech recognition unit that compares word data represented by input speech with the word data stored in the storage unit, calculates a recognition score representing the degree of matching in the comparison, and outputs a recognized word based on the calculated recognition score;
a reception time determination unit that determines, based on the recognition score calculated by the speech recognition unit, a reception time during which re-input by voice can be accepted; and
a speech re-recognition unit that re-recognizes a word represented by speech re-input within the reception time determined by the reception time determination unit, and outputs the re-recognized word.

The speech recognition device according to claim 1, further comprising a reception time notification unit that notifies a user of the reception time determined by the reception time determination unit by means of a display object that changes as the reception time changes.

The speech recognition device according to claim 2, wherein the reception time notification unit stops the change of the display object when speech is re-input within the reception time determined by the reception time determination unit.

The speech recognition device according to claim 3, wherein the reception time notification unit changes the position, on the display area, of the display object representing the reception time determined by the reception time determination unit as the reception time changes.

The speech recognition device according to claim 3, wherein the reception time notification unit changes the scale amount of the display object representing the reception time determined by the reception time determination unit as the reception time changes.

The speech recognition device according to claim 3, wherein the reception time notification unit changes the size of the display object representing the reception time determined by the reception time determination unit as the reception time changes.

The speech recognition device according to claim 3, wherein the reception time notification unit changes the transparency of the display object representing the reception time determined by the reception time determination unit as the reception time changes.

The speech recognition device according to any one of the above, wherein, during the time determined by the reception time notification unit, the speech re-recognition unit removes the word recognized by the speech recognition unit from the storage unit and performs re-recognition.

The speech recognition device according to any one of the above, wherein, during the time determined by the reception time determination unit, the speech re-recognition unit re-recognizes a word formed by adding a predetermined word to the word recognized by the speech recognition unit.

A speech recognition method comprising:
a speech recognition step of comparing word data represented by input speech with word data stored in a storage unit, calculating a recognition score representing the degree of matching in the comparison, and outputting a recognized word based on the calculated recognition score;
a reception time determination step of determining, based on the recognition score calculated in the speech recognition step, a reception time during which re-input by voice can be accepted; and
a speech re-recognition step of re-recognizing a word represented by speech re-input within the reception time determined in the reception time determination step, and outputting the re-recognized word.

A speech recognition program executed by a computer of a speech recognition device, the program causing the computer to execute:
a speech recognition step of comparing word data represented by input speech with word data stored in a storage unit, calculating a recognition score representing the degree of matching in the comparison, and outputting a recognized word based on the calculated recognition score;
a reception time determination step of determining, based on the recognition score calculated in the speech recognition step, a reception time during which re-input by voice can be accepted; and
a speech re-recognition step of re-recognizing a word represented by speech re-input within the reception time determined in the reception time determination step, and outputting the re-recognized word.