JPWO2011043380A1

JPWO2011043380A1 - Speech recognition apparatus and speech recognition method

Info

Publication number: JPWO2011043380A1
Application number: JP2011535425A
Authority: JP
Inventors: 健花沢; 長田　誠也; 誠也長田; 隆行荒川; 岡部　浩司; 浩司岡部; 田中　大介; 大介田中
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-10-09
Filing date: 2010-10-06
Publication date: 2013-03-04
Also published as: WO2011043380A1

Abstract

The speech recognition apparatus includes an information processing unit (10) that performs processing for deriving words corresponding to an external speech signal, an output unit (20) that sequentially outputs the words, and a usage rate acquisition unit (30) that acquires the usage rate of resources of the information processing unit. The information processing unit (10) includes an information generation unit (12) that generates predetermined information based on the speech signal, a temporary storage unit (14) that temporarily stores the predetermined information, a word derivation unit (16) that performs speech recognition by deriving words corresponding to the speech signal based on the predetermined information and outputs the words to the output unit (20), and a control unit (18) that suspends and resumes the processing performed by the word derivation unit (16) based on the usage rate acquired by the usage rate acquisition unit (30). Because the processing performed by the word derivation unit (16) is suspended and resumed based on the usage rate, problems caused by an excessive resource usage rate can be suppressed.

Description

The present invention relates to a speech recognition apparatus and a speech recognition method, and more particularly to a speech recognition apparatus and a speech recognition method that sequentially output recognition results while speech is being input continuously.

In recent years, techniques have become widely known that use a speech recognition system, which recognizes spoken utterances and converts them into a symbol string such as text, to generate subtitles for television programs, create meeting minutes, perform speech translation, and the like.

However, an ordinary speech recognition system cannot output the most probable result until the utterance has finished, so a time lag arises between the actual utterance and the output of the recognition result. Such systems have therefore been difficult to use where real-time speech recognition is required, for example when displaying subtitles during a live television broadcast or when combined with a translation system for conversation between speakers of different languages.

To solve this problem, a continuous speech recognition apparatus has been proposed that performs speech recognition at regular intervals even during an utterance and outputs the results sequentially (see, for example, Patent Document 1). At regular intervals, this apparatus selects the most probable words corresponding to the continuous speech input, then extracts and outputs those words that can be output stably, thereby enabling real-time and highly stable speech recognition.

Patent Document 1: Japanese Patent No. 3834169

Non-Patent Document 1: L. Mangu, et al., “Finding consensus in speech recognition: word error minimization and other applications of confusion network,” Computer Speech and Language, vol. 14, no. 4, pp. 373-400, 2000.

However, speech recognition processing generally consumes large amounts of resources such as CPU and memory, so a resource shortage may occur. If speech recognition is performed sequentially, as in the continuous speech recognition apparatus described in Patent Document 1, while resources are short, the individual stages of the speech recognition processing stall, which may cause problems such as dropped speech input.
Such resource shortages are likely to occur when speech recognition is performed on a small device such as a mobile phone, in which it is difficult to install relatively high-performance resources. Even on a device equipped with relatively high-performance resources, they are also likely to occur when speech recognition is combined with a processing-intensive system such as a speech synthesis system or an automatic translation system.

The present invention has been made to solve the above problem, and an object thereof is to provide a speech recognition apparatus and a speech recognition method capable of suppressing the loss of an external speech signal even when the resource usage rate becomes relatively high.

A speech recognition apparatus according to the present invention comprises an information processing unit that performs processing for deriving words corresponding to an external speech signal, an output unit that sequentially outputs the words, and a usage rate acquisition unit that acquires the usage rate of resources of the information processing unit. The information processing unit comprises an information generation unit that generates predetermined information based on the speech signal, a temporary storage unit that temporarily stores the predetermined information from the information generation unit, a word derivation unit that performs speech recognition by deriving words corresponding to the speech signal based on the predetermined information temporarily stored in the temporary storage unit and outputs the words to the output unit, and a control unit that suspends and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.

A speech recognition method according to the present invention comprises the steps of: generating predetermined information based on an external speech signal; temporarily storing the predetermined information; performing speech recognition by deriving words corresponding to the speech signal based on the temporarily stored predetermined information, and outputting the words; acquiring the usage rate of the resources used to derive the words corresponding to the speech signal and perform speech recognition; and suspending and resuming the speech recognition processing based on the acquired usage rate.

According to the speech recognition apparatus and the speech recognition method of the present invention, the processing for deriving words corresponding to an external speech signal is suspended and resumed based on the usage rate of the resources of the information processing unit. This suppresses the problem caused by an excessive resource usage rate, namely that the processing for generating the predetermined information from the external speech signal stalls.
Moreover, because the generated predetermined information can be stored temporarily, it is prevented from being lost unprocessed even while the processing for deriving words corresponding to the external speech signal is suspended.

FIG. 1 is a block diagram showing a configuration example of a speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing a configuration example of a speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing an example of the processing for setting the flag F1 executed by the control unit of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a block diagram showing a configuration example of a speech recognition apparatus according to a modification of the present invention.

[Embodiment 1]
Hereinafter, the speech recognition apparatus according to Embodiment 1 of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to Embodiment 1 of the present invention derives words corresponding to an external speech signal and outputs them sequentially.

First, the configuration of the speech recognition apparatus according to Embodiment 1 of the present invention will be described with reference to FIG. 1.
As shown in FIG. 1, the speech recognition apparatus according to Embodiment 1 of the present invention comprises an information processing unit 10 that performs predetermined information processing based on an external speech signal, an output unit 20 that outputs the information processed by the information processing unit 10, and a usage rate acquisition unit 30 that acquires the usage rate of the resources constituting the information processing unit 10.
The information processing unit 10 includes an information generation unit 12 that generates predetermined information based on the external speech signal, a temporary storage unit 14 that temporarily stores the predetermined information generated by the information generation unit 12, a word derivation unit 16 that retrieves the predetermined information from the temporary storage unit 14 and derives words corresponding to the external speech signal, and a control unit 18 that suspends and resumes the processing performed by the word derivation unit 16 based on a signal from the usage rate acquisition unit 30.
The output unit 20 sequentially outputs the words derived by the word derivation unit 16 as sound or video.
The usage rate acquisition unit 30 acquires, at predetermined intervals (for example, every several tens of msec or every several hundreds of msec), the usage rate of resources of the information processing unit 10 such as a CPU and a memory (not shown), and transmits it to the control unit 18.
Each function of the speech recognition apparatus according to Embodiment 1 described above can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, so that the hardware resources of the computer and the software operate in cooperation.

Next, the operation of the speech recognition apparatus according to Embodiment 1 of the present invention will be described with reference to FIG. 2.

First, the information generation unit 12 receives a speech signal from outside and the resource usage rate U from the usage rate acquisition unit 30 (step S100).
Next, it generates predetermined information based on the input speech signal and stores this predetermined information in the temporary storage unit 14 (step S110).
Subsequently, the control unit 18 determines whether the resource usage rate U acquired by the usage rate acquisition unit 30 is normal (step S120).

When the usage rate U is smaller than an arbitrary predetermined value and is therefore judged to be within the normal range (step S120: YES), the word derivation unit 16 derives the word corresponding to the speech signal based on the predetermined information stored in the temporary storage unit 14 (step S130), and this word is output by the output unit 20 (step S140).
Subsequently, it is determined whether the processing needs to be continued (step S150). While the speech signal is still being input or an unprocessed speech signal remains, the process returns to step S100 and repeats steps S100 to S140 (step S150: NO). Once it is confirmed that input of the speech signal has ended and no unprocessed speech signal remains (step S150: YES), the series of processes ends.

On the other hand, when the usage rate U is equal to or greater than the arbitrary predetermined value and is therefore judged to be abnormal (step S120: NO), the process returns to step S100 without the word derivation unit 16 executing the word derivation of step S130. Steps S100 and S110, in which a speech signal is again input from outside and predetermined information is generated and temporarily stored in the temporary storage unit 14, are thus repeated. Only after it is confirmed that the usage rate U has fallen below the predetermined value and is within the normal range (step S120: YES) is the word corresponding to the external speech signal derived from the predetermined information temporarily stored in the temporary storage unit 14 and output from the output unit 20 (steps S130 and S140).

When the usage rate U is equal to or greater than the predetermined value and is judged to be abnormal, steps S130 and S140 are skipped and steps S100 to S120 are repeated, in particular because executing step S130 in the word derivation unit 16 could raise the resource usage rate U still further. Suspending these processes prevents the problem that a further increase in the usage rate U would cause, namely that steps S100 and S110, in which a speech signal is input from outside and predetermined information is generated and temporarily stored in the temporary storage unit 14, would stall. Moreover, because the predetermined information generated in step S110 is temporarily stored in the temporary storage unit 14, it is not lost unprocessed, so steps S130 and S140 can still be executed after waiting for the usage rate U to return to normal.
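The control flow of steps S100 to S140 can be illustrated with a minimal Python sketch. This is not the implementation described in the specification: the function name `run_recognizer`, the 0.8 threshold, the use of `str.upper` as a stand-in for generating the "predetermined information", and the choice to drain the whole buffer whenever derivation runs are all assumptions made for illustration.

```python
from collections import deque


def run_recognizer(frames, usage_rates, threshold=0.8):
    """Simulate steps S100-S140 for paired (speech frame, usage rate U) inputs.

    Returns (emitted_per_step, leftover): the items output at each step and
    whatever is still buffered when the input ends.
    All names and the threshold value are hypothetical.
    """
    buffer = deque()          # plays the role of temporary storage unit 14
    emitted_per_step = []
    for frame, usage in zip(frames, usage_rates):
        # S100/S110: information is always generated and buffered,
        # regardless of the current usage rate.
        buffer.append(frame.upper())  # stand-in for the generated information
        emitted = []
        # S120: word derivation runs only while U is below the threshold.
        if usage < threshold:
            # S130/S140: consume everything buffered so far and output it.
            while buffer:
                emitted.append(buffer.popleft())
        emitted_per_step.append(emitted)
    return emitted_per_step, list(buffer)


if __name__ == "__main__":
    steps, leftover = run_recognizer(["a", "b", "c", "d"], [0.5, 0.9, 0.95, 0.6])
    print(steps)     # derivation pauses at steps 2 and 3, then catches up
    print(leftover)
```

In this sketch, buffering continues while derivation is suspended, so the items received during the high-usage steps are output together once the usage rate drops, rather than being lost.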

As described above, in the speech recognition apparatus according to Embodiment 1 of the present invention, the processing for deriving words corresponding to an external speech signal is suspended and resumed based on the usage rate U of the resources of the information processing unit 10. This suppresses the problem caused by an excessive resource usage rate U, namely that the processing for receiving the external speech signal and generating the predetermined information stalls.
Moreover, because the generated predetermined information can be stored temporarily, it is prevented from being lost unprocessed even while the processing for deriving words corresponding to the external speech signal is suspended.

[Embodiment 2]
Next, a speech recognition apparatus according to Embodiment 2 of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to Embodiment 2 of the present invention likewise derives words corresponding to an external speech signal and outputs them sequentially.

First, the configuration of the speech recognition apparatus according to Embodiment 2 of the present invention will be described with reference to FIG. 3.
As shown in FIG. 3, the speech recognition apparatus according to Embodiment 2 of the present invention comprises an information processing unit 110 that performs predetermined information processing based on an external speech signal, an output unit 120 that outputs the information processed by the information processing unit 110, a usage rate acquisition unit 130 that acquires the usage rate of a CPU (not shown) included in the information processing unit 110, an acoustic model 140 that stores data on phonemes, syllables, and the like, and a recognition dictionary 142 that stores data on vocabulary. The external speech signal may be a speech signal generated by a microphone (not shown), or a speech signal generated at a remote location and received via a network.
The information processing unit 110 includes a feature generation unit 112 that analyzes the external speech signal and generates a time series of feature values, a temporary storage unit 114 that temporarily stores the feature values generated by the feature generation unit 112, a word derivation unit 116 that retrieves the feature values from the temporary storage unit 114 and derives words corresponding to the external speech signal by matching the feature values against the data stored in the acoustic model 140 and the recognition dictionary 142, and a control unit 118 that suspends and resumes the processing performed by the word derivation unit 116 based on a signal from the usage rate acquisition unit 130.
The word derivation unit 116 includes a word hypothesis acquisition unit 116a that acquires a plurality of word hypotheses, each associated with its likelihood, and a word selection unit 116b that selects one word hypothesis as the word corresponding to the external speech signal based on the likelihoods associated with the word hypotheses acquired by the word hypothesis acquisition unit 116a.
The output unit 120 sequentially outputs the words selected by the word selection unit 116b on a display.
The usage rate acquisition unit 130 acquires the usage rate U of the CPU of the information processing unit 110 at predetermined intervals (for example, every several tens of msec or every several hundreds of msec) and transmits it to the control unit 118.
Each function of the speech recognition apparatus according to Embodiment 2 described above can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, so that the hardware resources of the computer and the software operate in cooperation.

Next, the operation of the speech recognition apparatus according to Embodiment 2 of the present invention will be described with reference to FIG. 4.

First, a speech signal is input from outside and the CPU usage rate U is input from the usage rate acquisition unit 130 (step S200).
Next, the input speech signal is analyzed to generate a time-series feature sequence, which is stored in the temporary storage unit 114 (step S210). Specifically, the input speech signal is divided into frames of several tens of msec each, and a feature value such as a cepstrum is generated for each frame and stored in the temporary storage unit 114.
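As a rough illustration of the framing and per-frame feature generation of step S210, the following Python sketch splits a sample sequence into fixed-length frames and computes a real cepstrum for each frame. The specification only says "a feature value such as a cepstrum"; the non-overlapping framing, the naive DFT, the log floor, and the function names are assumptions chosen to keep the example self-contained, and a practical front end would normally use windowing and a fast transform.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames; a short tail is dropped."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of the frame."""
    spectrum = dft(frame)
    # The floor avoids log(0) for spectral bins with zero magnitude.
    log_magnitude = [math.log(max(abs(v), floor)) for v in spectrum]
    return [v.real for v in dft(log_magnitude, inverse=True)]


if __name__ == "__main__":
    # At a hypothetical 16 kHz sampling rate, 320 samples would be 20 msec;
    # a tiny frame length is used here so the output stays readable.
    signal = [math.sin(0.3 * n) for n in range(32)]
    features = [real_cepstrum(f) for f in split_frames(signal, 8)]
    print(len(features), len(features[0]))
```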

Subsequently, the control unit 118 sets, based on the input usage rate U, a flag F1 indicating whether the usage rate U is within the normal range (step S220). The flag F1 has the value 0 in the initial state and is set to the value 1 when the usage rate U is judged to be excessive.
The processing for setting the flag F1 will be described with reference to FIG. 5. The usage rate U is compared with a predetermined threshold Ulow and a threshold Uhi that is larger than Ulow (step S221). When the usage rate U is equal to or greater than Ulow and less than Uhi, the flag F1 keeps its current value. When the usage rate U is equal to or greater than Uhi, the usage rate U is judged to be excessive and the flag F1 is set to 1 (step S222). Conversely, when the usage rate U is less than Ulow, the usage rate U is judged to have headroom and the flag F1 is set to 0 (step S223). The threshold Uhi may be, for example, 80% or 90%, and the threshold Ulow may be, for example, 50% or 60%. The threshold Uhi used for setting the flag F1 to 1 and the threshold Ulow used for setting it to 0 are given different values so that the relationship between the usage rate U and the flag F1 has hysteresis, which prevents the flag F1 from switching frequently between 0 and 1.
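The hysteresis of steps S221 to S223 can be sketched in a few lines of Python. The function name `update_flag_f1` is hypothetical, and the default thresholds of 50% and 80% are simply two of the example values mentioned above, expressed as fractions.

```python
def update_flag_f1(f1, usage, u_low=0.5, u_hi=0.8):
    """Return the new value of flag F1 for usage rate `usage` (steps S221-S223)."""
    if usage >= u_hi:
        return 1      # S222: usage rate is excessive
    if usage < u_low:
        return 0      # S223: usage rate has headroom
    return f1         # between the thresholds: keep the current value


if __name__ == "__main__":
    f1 = 0  # initial state
    history = []
    for usage in [0.3, 0.7, 0.85, 0.7, 0.55, 0.4, 0.7]:
        f1 = update_flag_f1(f1, usage)
        history.append(f1)
    print(history)
```

Note that a usage rate of 0.7 leaves the flag at 0 on the way up but at 1 on the way down, which is the behavior that keeps the flag from toggling when U hovers near a single threshold.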

Next, the control unit 118 determines whether the flag F1 has the value 0 (step S230).
When the flag F1 has the value 0 (step S230: YES), the processing for deriving the word corresponding to the input external speech is executed.

First, the two word hypotheses with the highest likelihood P, which represents how probable a hypothesis is, are acquired, and a flag F2 indicating whether the processing for deriving the word corresponding to the external speech is in progress is set to 1 (step S240).
Specifically, the word hypothesis acquisition unit 116a and the likelihood setting unit 116b retrieve the time-series feature sequence stored in the temporary storage unit 114 and calculate the distance between this feature sequence and the phoneme and syllable data stored in the acoustic model 140. Based on the result, they obtain from the vocabulary stored in the recognition dictionary 142 the two word hypotheses with the highest likelihood P, using a well-known technique such as a hidden Markov model (HMM), and temporarily store them in a memory (not shown). The technique of calculating distances and acquiring word hypotheses with high likelihood P in speech recognition is well known, so a detailed description is omitted (see, for example, Non-Patent Document 1).

Next, since a likelihood P has been set for each of the two word hypotheses acquired in step S240, the likelihood difference ΔP is calculated by subtracting the likelihood P2 of the word hypothesis with the second-highest likelihood from the likelihood P1 of the word hypothesis with the highest likelihood (step S250).

The likelihood difference ΔP is then compared with a threshold ΔPthre (step S260). Because the threshold ΔPthre is initialized to a relatively large value ΔP0, the likelihood difference ΔP is normally less than the threshold ΔPthre at first (step S260: NO). In that case, the threshold ΔPthre is reset to the value obtained by subtracting Δ from it (Δ being, for example, a value several hundred or several thousand times smaller than the initial value ΔP0) (step S262), and the process returns to step S200. The threshold ΔPthre is lowered by Δ in this way to relax, as time passes, the condition for deriving the word corresponding to the external speech, so that unprocessed feature values do not accumulate needlessly while no word is decided.

As steps S200 to S262 are repeated in this way, more feature values are retrieved from the temporary storage unit 114 and the likelihood P1 of the most likely word hypothesis increases, while the threshold ΔPthre is gradually lowered in step S262, so the probability that the likelihood difference ΔP reaches or exceeds the threshold ΔPthre gradually rises. When the likelihood difference ΔP becomes equal to or greater than the threshold ΔPthre (step S260: YES), the word hypothesis with the highest likelihood P is regarded as the word corresponding to the input speech signal and is output by the output unit 120, the flag F2 that was set to 1 in step S240 is reset to 0, and the threshold ΔPthre that was changed in step S262 is reset to the initial value ΔP0 (step S270). At this point, information such as the feature values generated and the word hypotheses acquired in steps S200 to S270 is deleted from the memory so that the memory is used efficiently.
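The interplay of the likelihood difference and the gradually relaxed threshold in steps S250 to S270 can be sketched as follows. The function name `decide_word` and all numeric values are hypothetical; in particular, the decrement used in the usage example is far larger relative to the initial threshold than the text suggests, purely so that the example finishes in a few passes.

```python
def decide_word(p1, p2, threshold, initial_threshold, delta):
    """One pass through steps S250-S270.

    Returns (confirmed, next_threshold): whether the top hypothesis is
    accepted, and the threshold to use on the next pass.
    """
    diff = p1 - p2                       # S250: likelihood difference
    if diff >= threshold:                # S260: YES
        return True, initial_threshold   # S270: accept and reset the threshold
    return False, threshold - delta      # S262: relax the threshold


if __name__ == "__main__":
    threshold = 10.0  # hypothetical initial value
    # Hypothetical (P1, P2) pairs: the top hypothesis pulls ahead over time.
    for p1, p2 in [(5.0, 4.0), (7.0, 4.0), (9.0, 4.0), (10.5, 4.0)]:
        confirmed, threshold = decide_word(p1, p2, threshold, 10.0, 2.0)
        print(p1 - p2, confirmed, threshold)
```

In this run the word is accepted on the fourth pass, when a difference of 6.5 meets a threshold that has been lowered from 10.0 to 4.0, and the threshold then returns to its initial value for the next word.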

Next, whether processing needs to continue is determined based on whether the speech signal is still being input and whether any unprocessed speech signal remains (step S280). While the speech signal is still being input or an unprocessed speech signal remains, steps S200 to S270 are repeated (step S280: NO). Once input of the speech signal has ended and no unprocessed speech signal remains (step S280: YES), the series of processes ends.

On the other hand, when the flag F1, which indicates whether the usage rate U is within the normal range, has the value 1, meaning that the usage rate U is excessive (step S230: NO), the flag F2, which indicates whether the process of deriving a word corresponding to the external speech is in progress, is checked (step S232).
When the flag F2 is 1, that is, when word derivation is in progress (step S232: YES), the processing from step S240 onward, which derives the word corresponding to the external speech signal, continues to run.
Once the flag F2 is confirmed to be 0, that is, once the word corresponding to the external speech signal has been output in step S270 (step S232: NO), the process returns to step S200 without executing step S240 and the subsequent steps.
As a result, steps S200 to S220, which input the external speech signal, generate feature values, store them temporarily in the temporary storage unit 114, and set the flag F1 based on the usage rate U, are repeated. The processing from step S240 onward is executed only after the usage rate U falls below the predetermined value Ulow and the flag F1 is set to 0 (step S230: YES).

The processing from step S240 onward is skipped when the flag F1 is 1 and the flag F2 is 0 for two reasons. First, the usage rate U would rise even further, particularly if the word hypothesis acquisition unit 116a executed step S240. Second, while word derivation is in progress, the information on the two acquired word hypotheses cannot be erased from memory; to use memory efficiently, it is preferable to suspend processing at the point where derivation of a word has just finished.
Suspending this processing prevents the problem caused by an excessive usage rate U, namely a stall in steps S200 and S210, which input the external speech signal, generate feature values, and store them temporarily in the temporary storage unit 114. Moreover, because the feature values generated in step S210 are held in the temporary storage unit 114 and are not lost before being processed, the processing from step S240 onward can still be executed after waiting for the usage rate U to fall below the threshold Ulow and the flag F1 to be set to 0.
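The interplay of the two flags can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the upper threshold parameter `u_high` are hypothetical (this passage names only Ulow), and the two-threshold behaviour follows the suspend and resume conditions stated in the claims.

```python
def update_f1(f1, usage, u_high, u_low):
    """Return the new value of flag F1 (1 = usage rate excessive, 0 = normal).

    u_high is an assumed name for the upper threshold; u_low plays the role
    of Ulow. Between the two thresholds the previous state is kept.
    """
    if usage >= u_high:
        return 1
    if usage < u_low:
        return 0
    return f1


def should_run_derivation(f1, f2):
    """Decide whether to execute step S240 and the subsequent steps.

    f2 == 1 means a word derivation is already in progress.
    """
    if f1 == 0:
        return True   # step S230: YES, usage rate is normal
    return f2 == 1    # step S232: finish the word in progress, otherwise wait


# Overloaded and idle: only input and feature generation keep running.
print(should_run_derivation(f1=1, f2=0))
# Overloaded but mid-word: derivation continues until the word is output.
print(should_run_derivation(f1=1, f2=1))
```

Keeping F1 unchanged between the two thresholds avoids rapid toggling between suspension and resumption when the usage rate hovers near a single limit.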

As described above, the speech recognition apparatus according to Embodiment 2 of the present invention suspends and resumes the process of deriving the word corresponding to the external speech signal when it determines that the usage rate U of the CPU of the information processing unit 110 is excessive. This suppresses the problem caused by an excessive CPU usage rate U, namely a stall in the processing that inputs the external speech signal and generates feature values.
In addition, while the process of deriving a word corresponding to the external speech signal is in progress, that process is not suspended even if the CPU usage rate U is judged to be excessive. This suppresses the problem that would arise if derivation were suspended while the information on the two word hypotheses was still held in memory, namely excessive memory usage.
Furthermore, because the generated feature values can be stored temporarily, they are not lost before being processed even when the process of deriving the word corresponding to the external speech signal is suspended.

[Embodiment 3]
Next, a speech recognition apparatus according to Embodiment 3 of the present invention is described in detail with reference to the drawings.
The speech recognition apparatus according to Embodiment 3 likewise derives words corresponding to an external speech signal and outputs them sequentially. Its components are the same as those of the speech recognition apparatus according to Embodiment 2, so the same reference numerals are used and detailed description is omitted.

The operation of the speech recognition apparatus according to Embodiment 3 of the present invention is described with reference to FIG. 6.
Steps S300 to S330, which run from inputting the external speech signal through generating feature values, are the same as steps S200 to S230 described in Embodiment 2, so detailed description is omitted. In contrast, steps S340 to S370, which acquire a plurality of word hypotheses and select and output one of them, differ from steps S240 to S270 described in Embodiment 2 and are explained below.

In the speech recognition apparatus according to Embodiment 2, it is sufficient to acquire at least two word hypotheses in descending order of likelihood. In the speech recognition apparatus according to Embodiment 3, it is preferable to acquire more word hypotheses (for example, 5 or 10), each in association with its likelihood P (step S340).

After the plurality of word hypotheses have been acquired in association with their likelihoods P in step S340, the word selection unit 116b extracts the group of word hypotheses that have the same meaning as the single hypothesis with the highest likelihood P, and computes the hypothesis occupancy H, which is the ratio of this extracted group to the total number of word hypotheses (step S350).
The hypothesis occupancy H is explained using concrete examples.
Suppose that, as a result of generating feature values from the input speech signal, the input is judged to consist of the sounds "ki", "ka", and "i", and that five word hypotheses are acquired in descending order of likelihood: 「機械」 (machine), 「器械」 (instrument), 「機会」 (opportunity), 「奇怪」 (strange), and 「棋界」 (the world of shogi and go), all pronounced "kikai". Here the hypothesis 「器械」 is synonymous with 「機械」, the hypothesis with the highest likelihood P. The total number of word hypotheses is five, and the group having the same meaning as the hypothesis with the highest likelihood P contains two, so the hypothesis occupancy H is 2/5, that is, 40%.
As another example, suppose that analyzing the feature values generated from the input speech signal yields three word hypotheses in descending order of likelihood: "color", "colour", and "collar". Here the hypothesis "colour" is synonymous with "color", the hypothesis with the highest likelihood. The total number of word hypotheses is three, and the group having the same meaning as the hypothesis with the highest likelihood P contains two, so the hypothesis occupancy H is 2/3, that is, approximately 67%.
Whether one word hypothesis has the same meaning as another can be judged by referring to the recognition dictionary 142, which stores, for each word, data on its synonymous words.
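The computation of step S350 can be sketched in Python as follows. The function name, the dictionary layout standing in for the synonym data of the recognition dictionary 142, and the likelihood values are all assumptions made for illustration; only the word lists and their ordering come from the examples above.

```python
def hypothesis_occupancy(hypotheses, synonyms):
    """Return the hypothesis occupancy H.

    hypotheses: list of (word, likelihood) pairs.
    synonyms:   dict mapping a word to the set of words with the same meaning
                (a stand-in for the synonym data in the recognition dictionary).
    """
    best_word = max(hypotheses, key=lambda h: h[1])[0]
    same_meaning_group = {best_word} | synonyms.get(best_word, set())
    count = sum(1 for word, _ in hypotheses if word in same_meaning_group)
    return count / len(hypotheses)


# Likelihood values below are made up; only their ordering matters here.
kikai = [("機械", 0.30), ("器械", 0.25), ("機会", 0.20), ("奇怪", 0.15), ("棋界", 0.10)]
color = [("color", 0.40), ("colour", 0.38), ("collar", 0.22)]
synonym_table = {"機械": {"器械"}, "color": {"colour"}}

print(hypothesis_occupancy(kikai, synonym_table))  # 2 of 5 hypotheses
print(hypothesis_occupancy(color, synonym_table))  # 2 of 3 hypotheses
```

The group always includes the top hypothesis itself, which is why the first example counts two hypotheses rather than one.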

Next, the hypothesis occupancy H is compared with a threshold Hthre (step S360). Because the threshold Hthre is initialized to a relatively large value H0 (for example, 50% or 60%), the hypothesis occupancy H is normally less than Hthre at first (step S360: NO). In that case, the threshold Hthre is reduced by Δ (for example, a value several hundred to several thousand times smaller than the initial value H0) (step S362), and the process returns to step S300. The threshold is lowered in this way to relax, as time passes, the condition for deriving the word corresponding to the external speech, so that unprocessed feature values do not pile up needlessly while no word has been determined.

As steps S300 to S362 are repeated, the threshold Hthre is gradually lowered in step S362, so the probability that the hypothesis occupancy H reaches or exceeds the threshold Hthre rises gradually. When H becomes equal to or greater than Hthre (step S360: YES), the word hypothesis with the highest likelihood P is regarded as the word corresponding to the input speech signal and is output by the output unit 120; the flag F2, which was set to 1 in step S340, is reset to 0; and the threshold Hthre, which was changed in step S362, is reset to its initial value H0 (step S370). At this point, information such as the feature values generated in step S310 and the word hypotheses acquired in step S340 is deleted from memory so that the memory can be used efficiently.

The speech recognition apparatus according to Embodiment 3 uses the hypothesis occupancy H to select one of the acquired word hypotheses and output it as the word corresponding to the externally input speech signal. The rationale is that when H is relatively large, that is, when the acquired word hypotheses are dominated by hypotheses that have the same meaning as the hypothesis with the highest likelihood P, treating that hypothesis as the word corresponding to the speech signal rarely causes a problem. As a result, the word can be determined early even when the difference between the highest likelihood and the second highest likelihood is small. This suppresses a problem that arises when the word hypothesis is hard to determine: although the CPU usage rate U is excessive, step S370, which outputs the word, is executed late, the suspension of steps S340 to S380 is delayed accordingly, and the CPU usage rate U rises even further.
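The difference between the two selection rules can be illustrated with a small Python comparison. This is a sketch under assumptions: the function names, the likelihood values, and the threshold values 0.1 and 0.6 are made up for illustration and are not taken from the description.

```python
def margin_accepts(hypotheses, threshold):
    """Embodiment 2 style rule: the top likelihood must lead the runner-up."""
    top, second = sorted((p for _, p in hypotheses), reverse=True)[:2]
    return top - second >= threshold


def occupancy_accepts(hypotheses, synonyms, threshold):
    """Embodiment 3 style rule: same-meaning hypotheses must dominate the list."""
    best_word = max(hypotheses, key=lambda h: h[1])[0]
    group = {best_word} | synonyms.get(best_word, set())
    occupancy = sum(1 for word, _ in hypotheses if word in group) / len(hypotheses)
    return occupancy >= threshold


# Illustrative likelihoods: the top two hypotheses are nearly tied.
hyps = [("color", 0.40), ("colour", 0.38), ("collar", 0.22)]
syn = {"color": {"colour"}}

print(margin_accepts(hyps, threshold=0.1))          # margin 0.02 is too small
print(occupancy_accepts(hyps, syn, threshold=0.6))  # 2 of 3 share one meaning
```

In this example the likelihood-difference rule would keep deferring the decision, while the occupancy rule accepts "color" at once because the runner-up means the same thing.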

Next, whether processing needs to continue is determined based on whether the speech signal is still being input and whether any unprocessed speech signal remains (step S380). While the speech signal is still being input or an unprocessed speech signal remains, steps S300 to S370 are repeated (step S380: NO). Once input of the speech signal has ended and no unprocessed speech signal remains (step S380: YES), the series of processes ends.

As described above, the speech recognition apparatus according to Embodiment 3 of the present invention provides the same effects as the speech recognition apparatus according to Embodiment 2.
In addition, because the word corresponding to the external speech signal is derived and output using the hypothesis occupancy H, the word can be derived and output at a relatively early stage even when the difference between the highest likelihood and the second highest likelihood is small.
This suppresses the situation in which, although the CPU usage rate U is excessive, step S370, which outputs the word corresponding to the external speech signal, is executed late, the suspension of steps S340 to S380 is delayed accordingly, and the CPU usage rate U rises even further.

[Modification]
The speech recognition apparatuses according to Embodiments 2 and 3 described above derive the word corresponding to the external speech signal using the acoustic model 140 and the recognition dictionary 142. A language model may also be used in combination to take linguistic plausibility into account.

In the speech recognition apparatuses according to Embodiments 2 and 3 described above, the usage rate U is the usage rate of the CPU, but it may instead be the usage rate of the memory.

Furthermore, in the speech recognition apparatuses according to Embodiments 1 to 3 described above, the words derived by the word derivation units 16 and 116 are sequentially output to the output units 20 and 120. As shown in FIG. 7, the apparatus may further include a translation unit 219, which translates the words into another language before they are sequentially output to the output unit 220.

Embodiments 1 to 3 of the present invention have been described in the form of a speech recognition apparatus, but the invention may also take the form of a speech recognition method or of a computer-readable recording medium on which a program is recorded.

The present invention is not limited to the embodiments described above and can be implemented in various forms without departing from the gist of the present invention.

This application claims priority based on Japanese Patent Application No. 2009-235302, filed on October 9, 2009, the entire disclosure of which is incorporated herein.

The present invention is applicable to, for example, the speech recognition apparatus manufacturing industry.

Description of reference numerals: 10: information processing unit; 12: information generation unit; 14: temporary storage unit; 16: word derivation unit; 18: control unit; 20: output unit; 30: usage rate acquisition unit.

Claims

A speech recognition apparatus comprising:
an information processing unit that performs processing for deriving a word corresponding to an external speech signal;
an output unit that sequentially outputs words; and
a usage rate acquisition unit that acquires a usage rate of a resource of the information processing unit,
wherein the information processing unit comprises:
an information generation unit that generates predetermined information based on the speech signal;
a temporary storage unit that temporarily stores the predetermined information from the information generation unit;
a word derivation unit that derives a word corresponding to the speech signal based on the predetermined information temporarily stored in the temporary storage unit, thereby performing speech recognition, and outputs the word to the output unit; and
a control unit that suspends and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.

The speech recognition apparatus according to claim 1, wherein the resource is at least one of a CPU and a memory.

The speech recognition apparatus according to claim 1, wherein the control unit suspends the processing performed by the word derivation unit when the usage rate acquired by the usage rate acquisition unit is equal to or greater than a first predetermined value, and resumes the processing performed by the word derivation unit after waiting for the usage rate to fall below a second predetermined value that is smaller than the first predetermined value.

The speech recognition apparatus according to claim 1, wherein, while the word derivation unit is deriving a word corresponding to the speech signal, the control unit does not suspend the processing performed by the word derivation unit until the derivation of that word is complete.

The speech recognition apparatus according to claim 1, wherein the word derivation unit comprises:
a word hypothesis acquisition unit that acquires, based on the predetermined information temporarily stored in the temporary storage unit, a plurality of word hypotheses corresponding to the input speech signal, each in association with a likelihood representing its plausibility; and
a word selection unit that selects one of the word hypotheses as the word corresponding to the speech signal based on the likelihood set for each of the plurality of word hypotheses.

The speech recognition apparatus according to claim 5, wherein the word selection unit selects, from the plurality of word hypotheses, the two word hypotheses with the highest likelihoods, defers selection of the word corresponding to the speech signal when the difference between their likelihoods is less than a threshold, and selects the word hypothesis with the highest likelihood as the word corresponding to the speech signal when the difference between their likelihoods is equal to or greater than the threshold.

The speech recognition apparatus according to claim 6, wherein the word selection unit sets the threshold smaller as the elapsed time from the start of the process of deriving the word corresponding to the speech signal becomes longer.

The speech recognition apparatus according to claim 5, wherein the word selection unit selects the word hypothesis with the highest likelihood from the plurality of word hypotheses, extracts from the plurality of word hypotheses a group of word hypotheses having the same meaning as the selected word hypothesis, defers selection of the word corresponding to the speech signal when the hypothesis occupancy, which is the ratio of the extracted group of word hypotheses to the total number of the plurality of word hypotheses, is less than a predetermined occupancy, and selects the selected word hypothesis as the word corresponding to the speech signal when the hypothesis occupancy is equal to or greater than the predetermined occupancy.

The speech recognition apparatus according to claim 8, wherein the word selection unit sets the predetermined occupancy smaller as the elapsed time from the start of the process of deriving the word corresponding to the speech signal becomes longer.

The speech recognition apparatus according to claim 1, wherein the information generation unit generates a feature amount representing a feature of the speech signal based on the speech signal.

The speech recognition apparatus, wherein the information processing unit further comprises a translation unit that translates the word selected by the word selection unit and sequentially outputs the translated word to the output unit.

A speech recognition method comprising:
generating predetermined information based on an external speech signal;
temporarily storing the predetermined information;
deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, thereby performing speech recognition, and outputting the word;
acquiring a usage rate of a resource used for the processing of deriving the word corresponding to the speech signal and performing speech recognition; and
suspending and resuming the processing of performing speech recognition based on the acquired usage rate.

A computer-readable recording medium on which is recorded a program that causes a computer to execute:
generating predetermined information based on an external speech signal;
temporarily storing the predetermined information;
deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, thereby performing speech recognition, and outputting the word;
acquiring a usage rate of a resource used for the processing of deriving the word corresponding to the speech signal and performing speech recognition; and
suspending and resuming the processing of performing speech recognition based on the acquired usage rate.