WO2011043380A1 - Voice recognition device and voice recognition method - Google Patents

Voice recognition device and voice recognition method Download PDF

Info

Publication number
WO2011043380A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
unit
speech recognition
usage rate
speech
Prior art date
Application number
PCT/JP2010/067555
Other languages
French (fr)
Japanese (ja)
Inventor
健 花沢
長田 誠也
隆行 荒川
岡部 浩司
田中 大介
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2011535425A priority Critical patent/JPWO2011043380A1/en
Publication of WO2011043380A1 publication Critical patent/WO2011043380A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F3/16: Sound input; sound output

Definitions

  • the present invention relates to a speech recognition apparatus and speech recognition method, and more particularly to a speech recognition apparatus and method that sequentially output recognition results when speech is continuously input.
  • an object of the present invention is to provide a speech recognition apparatus and a speech recognition method capable of suppressing the loss of external audio signals even when the resource usage rate becomes relatively large.
  • a speech recognition apparatus includes an information processing unit that performs a process of deriving words corresponding to an external speech signal, an output unit that sequentially outputs the words, and a usage rate acquisition unit that acquires the usage rate of the resources of the information processing unit. The information processing unit includes an information generation unit that generates predetermined information based on the audio signal, a temporary storage unit that temporarily stores the predetermined information from the information generation unit, a word derivation unit that derives words from the temporarily stored information and outputs them to the output unit, and a control unit that suspends and resumes the processing performed by the word derivation unit based on the acquired usage rate.
  • the speech recognition method comprises the steps of generating predetermined information based on an external speech signal, temporarily storing the predetermined information, deriving a word corresponding to the speech signal from the temporarily stored information and outputting it, acquiring the usage rate of the resources used for the recognition, and suspending and resuming the recognition process based on the acquired usage rate.
  • since the process of deriving words corresponding to the external speech signal is suspended and resumed based on the usage rate of the resources of the information processing unit, the inconvenience caused by an excessive resource usage rate, that is, a delay in the process of generating predetermined information from the external audio signal, can be suppressed. Moreover, since the generated information is temporarily stored, it can be prevented from being lost unprocessed even when the word derivation process is interrupted.
  • the speech recognition apparatus derives words corresponding to an external speech signal and sequentially outputs the words.
  • the speech recognition apparatus according to the first embodiment of the present invention includes an information processing unit 10 that performs predetermined information processing based on an external speech signal, an output unit 20 that outputs the information processed by the information processing unit 10, and a usage rate acquisition unit 30 that acquires the usage rate of the resources constituting the information processing unit 10.
  • the information processing unit 10 includes an information generation unit 12 that generates predetermined information based on the external audio signal, a temporary storage unit 14 that temporarily stores the information generated by the information generation unit 12, a word derivation unit 16 that extracts the stored information and derives the word corresponding to the external speech signal, and a control unit 18 that suspends or resumes the processing performed by the word derivation unit 16 based on the signal from the usage rate acquisition unit 30.
  • the output unit 20 sequentially outputs the words derived by the word derivation unit 16 as sound or video.
  • the usage rate acquisition unit 30 acquires the usage rate of resources such as the CPU and memory (not shown) of the information processing unit 10 at predetermined intervals (for example, every several tens or hundreds of msec) and transmits it to the control unit 18.
  • each function of the speech recognition apparatus according to the first embodiment of the present invention can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
  • the information generation unit 12 receives an audio signal from the outside and the resource usage rate U from the usage rate acquisition unit 30 (step S100).
  • predetermined information is generated based on the input audio signal and stored in the temporary storage unit 14 (step S110).
  • the control unit 18 then determines whether the resource usage rate U acquired by the usage rate acquisition unit 30 is normal (step S120).
  • when it is determined that the usage rate U is smaller than a predetermined value and within the normal range (step S120: YES), the word derivation unit 16 derives the word corresponding to the audio signal based on the information stored in the temporary storage unit 14 (step S130), and the output unit 20 outputs the word (step S140). It is then determined whether processing needs to continue (step S150): while the audio signal is still being input or unprocessed audio remains, the process returns to step S100 and steps S100 to S140 are repeated (step S150: NO); after confirming that input has finished and no unprocessed audio remains (step S150: YES), the series of processes ends.
  • when it is determined that the usage rate U is equal to or greater than the predetermined value and thus abnormal (step S120: NO), the word derivation process of step S130 is not executed and the process returns to step S100. Steps S100 and S110 (inputting the external audio signal again, generating predetermined information, and temporarily storing it in the temporary storage unit 14) are therefore repeated, and only after confirming that the usage rate U has fallen below the predetermined value (step S120: YES) is the word corresponding to the external audio signal derived from the information stored in the temporary storage unit 14 and output from the output unit 20 (steps S130 and S140).
  • if the word derivation process were executed while the usage rate U is excessive, the resource usage rate U would increase further.
  • by interrupting the word derivation process, the inconvenience caused by such a further increase in the usage rate U, that is, a delay in steps S100 and S110 (inputting the external audio signal, generating predetermined information, and temporarily storing it in the temporary storage unit 14), can be prevented.
  • since the information generated in step S110 is temporarily held in the temporary storage unit 14 and thus not lost unprocessed, steps S130 and S140 can be executed after waiting for the usage rate U to return to normal.
  • the speech recognition apparatus according to the second embodiment includes an information processing unit 110 that performs predetermined information processing based on an external speech signal, an output unit 120 that outputs the information processed by the information processing unit 110, a usage rate acquisition unit 130 that acquires the usage rate of the CPU (not shown) included in the information processing unit 110, an acoustic model 140 that stores data on phonemes and syllables, and a recognition dictionary 142 that stores data on vocabulary.
  • the information processing unit 110 includes a word derivation unit 116 that derives words corresponding to the external speech signal by extracting feature amounts from the temporary storage unit 114 and collating them with the data stored in the acoustic model 140 and the recognition dictionary 142, and a control unit 118 that suspends or resumes the processing performed by the word derivation unit 116 based on the signal from the usage rate acquisition unit 130.
  • the word derivation unit 116 includes a word hypothesis acquisition unit 116a that acquires a plurality of word hypotheses together with their likelihoods, and a word selection unit 116b that selects one word hypothesis as the word corresponding to the voice signal based on the likelihood associated with each acquired hypothesis.
  • the output unit 120 sequentially outputs the words selected by the word selection unit 116b on the display.
  • the usage rate acquisition unit 130 acquires the usage rate U of the CPU of the information processing unit 110 every predetermined time (for example, every several tens of msec, every several hundreds of msec, etc.) and transmits it to the control unit 118.
  • the above-described functions of the speech recognition apparatus according to the second embodiment of the present invention can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
  • two word hypotheses are acquired in descending order of the likelihood P representing their certainty, and the flag F2, which indicates whether the process of deriving the word corresponding to the external speech is being executed, is set to the value 1 (step S240).
  • the word hypothesis acquisition unit 116a extracts the time-series feature amount sequence stored in the temporary storage unit 114 and collates it with the data stored in the acoustic model 140 and the recognition dictionary 142.
  • the likelihood difference ΔP is calculated by subtracting the likelihood P2 of the word hypothesis with the second largest likelihood from the likelihood P1 of the word hypothesis with the largest likelihood P among the two word hypotheses (step S250).
  • the likelihood difference ΔP is compared with the threshold ΔPthre (step S260). Since the threshold ΔPthre is set to a relatively large value ΔP0 as an initial value, the likelihood difference ΔP is usually smaller than the threshold at first (step S260: NO); in that case, a slightly smaller value (for example, the threshold reduced by one hundredth or one thousandth of the initial value ΔP0) is reset to the threshold ΔPthre (step S262), and the process returns to step S200.
  • as this is repeated, the feature amounts extracted from the temporary storage unit 114 increase, and the likelihood P1 of the word hypothesis with the maximum likelihood P also increases. Since the threshold ΔPthre is made gradually smaller in step S262, the probability that the likelihood difference ΔP reaches the threshold ΔPthre gradually increases.
  • when the likelihood difference ΔP becomes equal to or larger than the threshold ΔPthre (step S260: YES), the word hypothesis with the maximum likelihood P is regarded as the word corresponding to the input voice signal and is output by the output unit 120. The flag F2, set to the value 1 in step S240, is reset to the value 0, and the threshold ΔPthre, changed in step S262, is reset to the initial value ΔP0 (step S270).
  • the memory can be used effectively by deleting information such as the feature amounts generated in steps S200 to S270 and the acquired word hypotheses from the memory.
  • when the flag F1, which indicates whether the usage rate U is within the normal range, has the value 1 indicating that the usage rate U is excessive (step S230: NO), it is further determined whether the flag F2, which indicates whether the process of deriving the word corresponding to the external voice is being executed, has the value 0 (step S232). When the flag F2 has the value 1, that is, when the word derivation process is in progress (step S232: YES), the process of deriving the word corresponding to the external audio signal continues from step S240 onward.
  • when the flag F2 has the value 0 (step S232: NO), the process returns to step S200 without performing the processes from step S240 onward (step S280). Therefore, the external audio signal is input, feature amounts are generated and temporarily stored in the temporary storage unit 114, and steps S200 to S220 of setting the flag F1 based on the usage rate U are repeated. Only after it is confirmed that the usage rate U is less than the predetermined value Ulow and the flag F1 is set to the value 0 (step S230: YES) are the processes from step S240 onward performed.
  • when the flag F1 has the value 1 and the flag F2 has the value 0, the processes from step S240 onward are not executed, for the following two reasons.
  • the first reason is that if the word hypothesis acquisition unit 116a executed the process of step S240 in this state, the usage rate U would increase further.
  • the second reason is that while the process of deriving the word corresponding to the external speech signal is underway, its information cannot be erased from the memory; to use the memory efficiently, it is therefore desirable to keep the word derivation process stopped for the time being.
  • by interrupting these processes, the inconvenience caused by an excessive usage rate U, that is, a delay in steps S200 and S210 (inputting the external voice signal, generating feature amounts, and temporarily storing them in the temporary storage unit 114), can be prevented.
  • in the speech recognition apparatus according to the second embodiment, when the usage rate U of the CPU of the information processing unit 110 is determined to be excessive, the process of deriving words corresponding to the external speech signal is suspended or resumed, so the inconvenience caused by an excessive CPU usage rate U, that is, a delay in the process of inputting the external audio signal and generating feature amounts, can be suppressed. Furthermore, when the word derivation process is already being executed, it is not interrupted even if the CPU usage rate U is determined to be excessive; this suppresses the inconvenience of interrupting the derivation while its information remains stored in memory, that is, the memory usage rate becoming excessive. Moreover, since the generated feature amounts are temporarily stored, they are not lost unprocessed even when the word derivation process is interrupted.
  • steps S300 to S330, in which feature amounts are generated after an external audio signal is input, are the same as steps S200 to S230 described in the second embodiment of the present invention, and detailed description is therefore omitted.
  • steps S340 to S370, in which a plurality of word hypotheses are acquired and one word hypothesis is selected and output from among them, differ from steps S240 to S270 described in the second embodiment, so their contents are described below.
  • the same effect as the speech recognition apparatus of the second embodiment can be obtained. In addition, since the word corresponding to the external speech signal is derived and output using the concept of the hypothesis occupancy rate H, a word can be derived and output at a relatively early stage even when the difference between the likelihood of the most likely word hypothesis and that of the second most likely hypothesis is small. This suppresses the situation in which, although the CPU usage rate U is excessive, the timing of step S370 (outputting the word corresponding to the external audio signal) is delayed, the interruption of steps S340 to S380 comes too late, and the CPU usage rate U increases further.
  • the usage rate U has been described as the usage rate of the CPU, but may be a memory usage rate.
  • in the above embodiments, the words derived by the word derivation units 16 and 116 are sequentially output by the output units 20 and 120.
  • alternatively, a translation unit 219 may be further provided to translate the words into another language before they are sequentially output by the output unit 220.
  • the present invention has been described as a speech recognition apparatus, but it may also be embodied as a speech recognition method or as a computer-readable recording medium having a program recorded thereon.
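The incremental hypothesis-selection scheme of the second embodiment (steps S240 to S270) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the scoring function, the decay step, and all names are assumptions, and a real system would score hypotheses against an acoustic model and recognition dictionary rather than a callback.

```python
def select_word(score_hypotheses, feature_buffer, dP0=100.0, decay=None):
    """Select a word once the best hypothesis is sufficiently ahead
    of the runner-up (cf. steps S240-S270 of the second embodiment).

    score_hypotheses(features): returns a list of (word, likelihood) pairs.
    feature_buffer: iterable yielding progressively longer feature sequences.
    """
    # Step S262 shrinks the threshold by e.g. 1/100 of the initial value.
    decay = decay if decay is not None else dP0 / 100.0
    dP_thre = dP0                      # threshold starts relatively large
    for features in feature_buffer:
        # Acquire the two hypotheses with the highest likelihood P (step S240).
        hyps = sorted(score_hypotheses(features),
                      key=lambda wp: wp[1], reverse=True)[:2]
        if len(hyps) < 2:
            continue
        (w1, p1), (_, p2) = hyps
        dP = p1 - p2                   # likelihood difference (step S250)
        if dP >= dP_thre:              # step S260: YES
            return w1                  # output the most likely word
        dP_thre -= decay               # step S262: shrink the threshold
    return None                        # no hypothesis ever pulled far enough ahead
```

Because the threshold decays on every pass, the selection eventually fires even when the top two hypotheses stay close, which is what lets the apparatus emit words at fixed intervals instead of waiting for the utterance to end.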

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed is a voice recognition device comprising: an information processing unit (10) which carries out processing for deriving words corresponding to external voice signals; an output unit (20) which sequentially outputs words; and a usage rate acquisition unit (30) which acquires the usage rate of the resources of the information processing unit. The information processing unit (10) is provided with: an information generation unit (12) which generates prescribed information on the basis of voice signals; a temporary storage unit (14) which temporarily stores the prescribed information; a word derivation unit (16) which derives words corresponding to external voice signals on the basis of the prescribed information, carries out voice recognition and outputs words to the output unit (20); and a control unit (18) which discontinues and resumes the processing carried out by the word derivation unit (16) on the basis of usage rate acquired by the usage rate acquisition unit (30). As the processing carried out by the word derivation unit (16) is discontinued and resumed on the basis of the usage rate, inconveniences caused by excessive resource-usage rates can be suppressed.

Description

Speech recognition apparatus and speech recognition method
The present invention relates to a speech recognition apparatus and a speech recognition method, and more particularly to a speech recognition apparatus and method that sequentially output recognition results as speech is continuously input.
In recent years, technologies have become widespread that use a speech recognition system, which recognizes spoken utterances and converts them into symbol strings such as text, to generate subtitles for television programs, create meeting minutes, and perform speech translation.
However, an ordinary speech recognition system cannot output its most probable result until an utterance has finished, so a time lag arises between the actual utterance and the output of the recognition result. This makes such systems hard to use where real-time recognition is required, for example when displaying subtitles during a live television broadcast or when combined with a translation system for conversation between different languages.
To solve this problem, a continuous speech recognition apparatus has been proposed that performs recognition at fixed intervals even during an utterance and outputs the results sequentially (see, for example, Patent Document 1). At fixed intervals, this apparatus selects the most probable words corresponding to the continuously input speech and extracts and outputs those words that can be output stably, enabling real-time and highly stable speech recognition.
Japanese Patent No. 3834169
However, speech recognition processing generally uses a large amount of resources such as CPU and memory, so a resource shortage may occur. When such a shortage arises, performing recognition sequentially as in the continuous speech recognition apparatus of Patent Document 1 causes each stage of the recognition process to stall, which can lead to problems such as dropped speech input.
Such resource shortages are likely to occur when speech recognition is performed on a small device, such as a mobile phone, on which it is difficult to mount relatively high-performance resources. Even on a device with relatively high-performance resources, they can occur when speech recognition is combined with a processing-intensive system such as a speech synthesis or automatic translation system.
The present invention has been made to solve the above problems, and its object is to provide a speech recognition apparatus and a speech recognition method capable of suppressing the loss of external audio signals even when the resource usage rate becomes relatively large.
A speech recognition apparatus according to the present invention comprises an information processing unit that performs a process of deriving words corresponding to an external speech signal, an output unit that sequentially outputs the words, and a usage rate acquisition unit that acquires the usage rate of the resources of the information processing unit. The information processing unit comprises an information generation unit that generates predetermined information based on the speech signal, a temporary storage unit that temporarily stores the predetermined information from the information generation unit, a word derivation unit that performs speech recognition by deriving words corresponding to the speech signal from the temporarily stored information and outputs the words to the output unit, and a control unit that suspends and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.
A speech recognition method according to the present invention comprises the steps of generating predetermined information based on an external speech signal; temporarily storing the predetermined information; performing speech recognition by deriving a word corresponding to the speech signal from the temporarily stored information and outputting the word; acquiring the usage rate of the resources used for the speech recognition; and suspending and resuming the recognition process based on the acquired usage rate.
According to the speech recognition apparatus and method of the present invention, the process of deriving words corresponding to the external speech signal is suspended and resumed based on the usage rate of the resources of the information processing unit, so the inconvenience caused by an excessive resource usage rate, namely a delay in the process of generating predetermined information from the external audio signal, can be suppressed.
Moreover, since the generated information can be temporarily stored, it is not lost unprocessed even when the process of deriving words corresponding to the external audio signal is interrupted.
FIG. 1 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing an example of the process, executed by the control unit of the speech recognition apparatus according to Embodiment 2 of the present invention, of setting the flag F1.
FIG. 6 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a block diagram showing a configuration example of a speech recognition apparatus according to a modification of the present invention.
[First Embodiment]
The speech recognition apparatus according to the first embodiment of the present invention will now be described in detail with reference to the drawings.
The speech recognition apparatus according to the first embodiment derives words corresponding to an external speech signal and outputs them sequentially.
First, the configuration of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 1.
As shown in FIG. 1, the apparatus comprises an information processing unit 10 that performs predetermined information processing based on an external speech signal, an output unit 20 that outputs the information processed by the information processing unit 10, and a usage rate acquisition unit 30 that acquires the usage rate of the resources constituting the information processing unit 10.
The information processing unit 10 includes an information generation unit 12 that generates predetermined information from the external audio signal, a temporary storage unit 14 that temporarily stores the information generated by the information generation unit 12, a word derivation unit 16 that extracts the stored information and derives the word corresponding to the external speech signal, and a control unit 18 that suspends or resumes the processing performed by the word derivation unit 16 based on the signal from the usage rate acquisition unit 30.
The output unit 20 sequentially outputs the words derived by the word derivation unit 16 as sound or video.
The usage rate acquisition unit 30 acquires the usage rate of resources such as the CPU and memory (not shown) of the information processing unit 10 at predetermined intervals (for example, every several tens or hundreds of msec) and transmits it to the control unit 18.
Each function of the speech recognition apparatus according to the first embodiment can be realized by a computer having a CPU, memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
Next, the operation of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 2.
First, the information generation unit 12 receives an audio signal from the outside and the resource usage rate U from the usage rate acquisition unit 30 (step S100).
Next, predetermined information is generated based on the input audio signal and stored in the temporary storage unit 14 (step S110).
Subsequently, the control unit 18 determines whether the resource usage rate U acquired by the usage rate acquisition unit 30 is normal (step S120).
When it is determined that the usage rate U is smaller than the predetermined value and within the normal range (step S120: YES), the word derivation unit 16 derives the word corresponding to the audio signal based on the predetermined information stored in the temporary storage unit 14 (step S130), and the output unit 20 outputs the word (step S140).
Subsequently, it is determined whether the processing needs to be continued (step S150). When the audio signal is still being input, or when there is an unprocessed audio signal, the process returns to step S100 and the processing of steps S100 to S140 is repeated (step S150: NO); after it is confirmed that the input of the audio signal has ended and that no unprocessed audio signal remains (step S150: YES), the series of processes ends.
On the other hand, when it is determined that the usage rate U is equal to or greater than the predetermined value and is therefore abnormal (step S120: NO), the process returns to step S100 without the word derivation unit 16 executing the word derivation of step S130. The apparatus thus repeats the processing of steps S100 and S110, that is, it again inputs the external audio signal, generates predetermined information, and temporarily stores it in the temporary storage unit 14, and only after confirming that the usage rate U has become smaller than the predetermined value and is within the normal range (step S120: YES) does it derive the word corresponding to the external audio signal based on the predetermined information temporarily stored in the temporary storage unit 14 and output it from the output unit 20 (steps S130, S140).
The reason the processing of steps S100 to S120 is repeated without executing steps S130 and S140 when the usage rate U is equal to or greater than the predetermined value and judged abnormal is that, if the word derivation unit 16 were to execute the processing of step S130, the resource usage rate U might increase further. By suspending these processes, it is possible to prevent the inconvenience that would arise if the usage rate U grew even larger, namely that the processing of steps S100 and S110, in which the external audio signal is input and predetermined information is generated and temporarily stored in the temporary storage unit 14, would be delayed. Moreover, because the predetermined information generated in step S110 is temporarily stored in the temporary storage unit 14 and thus prevented from being lost unprocessed, the processing of steps S130 and S140 can be executed even after waiting for the usage rate U to return to normal.
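To make the flow of steps S100 to S150 concrete, the following is a minimal Python sketch (illustrative only, not the claimed implementation). The functions `get_usage_rate`, `generate_info`, `derive_word`, and `output` are hypothetical stand-ins for the usage rate acquisition unit 30, the information generation unit 12, the word derivation unit 16, and the output unit 20, and the threshold `U_MAX` is an assumed value.

```python
from collections import deque

U_MAX = 0.8  # assumed threshold above which the usage rate U is judged abnormal

def recognize(audio_frames, get_usage_rate, generate_info, derive_word, output):
    """Sketch of the step S100-S150 loop: generated information is buffered
    while the usage rate is excessive, and words are derived and output
    only while the usage rate is within the normal range."""
    buffer = deque()               # plays the role of the temporary storage unit 14
    pending = deque(audio_frames)
    while pending or buffer:
        if pending:
            frame = pending.popleft()            # step S100: input audio signal
            buffer.append(generate_info(frame))  # step S110: generate and store info
        if get_usage_rate() < U_MAX:             # step S120: usage rate normal?
            if buffer:
                word = derive_word(buffer)       # step S130: derive word
                buffer.clear()
                output(word)                     # step S140: output word
        # step S120: NO -> steps S130/S140 are skipped; buffered info survives
    # step S150: loop ends when no input remains and the buffer is empty
```

The buffered information is never discarded on a skipped pass, which mirrors the role the temporary storage unit 14 plays in the text above.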
As described above, according to the speech recognition apparatus of the first embodiment of the present invention, the process of deriving the word corresponding to the external speech signal is suspended or resumed based on the usage rate U of the resources of the information processing unit 10. This suppresses the inconvenience caused by the resource usage rate U becoming excessive, namely that the process of inputting the external audio signal and generating predetermined information is delayed.
Moreover, since the generated predetermined information can be temporarily stored, it does not vanish unprocessed even when the process of deriving the word corresponding to the external audio signal is suspended.
Second Embodiment
Next, a speech recognition apparatus according to a second embodiment of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to the second embodiment of the present invention also derives words corresponding to external speech signals and outputs them sequentially.
First, the configuration of the speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG.
The speech recognition apparatus according to the second embodiment of the present invention, as shown in FIG. 3, comprises an information processing unit 110 that performs predetermined information processing based on an external speech signal, an output unit 120 that outputs the information processed by the information processing unit 110, a usage rate acquisition unit 130 that acquires the usage rate of a CPU (not shown) included in the information processing unit 110, an acoustic model 140 that stores data on phonemes and syllables, and a recognition dictionary 142 that stores data on vocabulary. Here, as the external audio signal, an audio signal generated by a microphone (not shown) can be input, and an audio signal generated at a remote location can also be input via a network.
The information processing unit 110 includes a feature generation unit 112 that analyzes the external audio signal and generates feature amounts arranged in time series, a temporary storage unit 114 that temporarily stores the feature amounts generated by the feature generation unit 112, a word derivation unit 116 that derives a word corresponding to the external speech signal by extracting the feature amounts from the temporary storage unit 114 and collating them with the data stored in the acoustic model 140 and the recognition dictionary 142, and a control unit 118 that suspends or resumes the processing performed by the word derivation unit 116 based on the signal from the usage rate acquisition unit 130.
The word derivation unit 116 in turn includes a word hypothesis acquisition unit 116a that acquires a plurality of word hypotheses together with their likelihoods, and a word selection unit 116b that selects one word hypothesis as the word corresponding to the external speech signal based on the likelihood associated with each of the word hypotheses acquired by the word hypothesis acquisition unit 116a.
The output unit 120 sequentially outputs the words selected by the word selection unit 116b on a display.
The usage rate acquisition unit 130 acquires the usage rate U of the CPU of the information processing unit 110 at predetermined time intervals (for example, every several tens or several hundreds of msec) and transmits it to the control unit 118.
Each function of the speech recognition apparatus according to the second embodiment of the present invention described above can likewise be realized by having a computer equipped with a CPU, a memory, and various interfaces read a program recorded on a recording medium, through the cooperation of the computer's hardware resources and that software.
Next, the operation of the speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG. 4.
First, an audio signal from the outside and the CPU usage rate U from the usage rate acquisition unit 130 are input (step S200).
Next, the input audio signal is analyzed to generate a sequence of feature amounts arranged in time series, which is stored in the temporary storage unit 114 (step S210). Specifically, the input audio signal is divided into frames of several tens of msec each, and a feature amount such as a cepstrum is generated for each frame and stored in the temporary storage unit 114.
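As an illustration of the framing just described, the following Python sketch divides a signal into frames of a few tens of msec. The 20 ms frame length and the log-energy feature are assumed placeholders; the text mentions cepstral features, whose computation is omitted here.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=20):
    """Divide an audio signal into consecutive, non-overlapping frames of a
    few tens of msec (step S210). frame_ms=20 is an assumed example value."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """A stand-in per-frame feature (log energy); a real system would compute
    a cepstral feature here instead."""
    return math.log(sum(s * s for s in frame) + 1e-10)
```

A per-frame feature sequence would then be built as `[log_energy(f) for f in split_into_frames(...)]` and appended to the temporary store.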
Subsequently, the control unit 118 sets a flag F1 indicating whether the usage rate U is within the normal range based on the input usage rate U (step S220). Here, the flag F1 is set to the value 0 in the initial state, and is set to the value 1 when it is determined that the usage rate U is excessive.
The process of setting the flag F1 will be described with reference to FIG. 5. The usage rate U is compared with a predetermined threshold Ulow and a threshold Uhi larger than Ulow (step S221); when the usage rate U is at least Ulow and less than Uhi, the value of the flag F1 is kept as it is. In contrast, when the usage rate U is equal to or greater than the threshold Uhi, the usage rate U is judged to be excessive and the flag F1 is set to the value 1 (step S222). Conversely, when the usage rate U falls below the threshold Ulow, the usage rate U is judged to have a margin and the flag F1 is set to the value 0 (step S223). Here, a value such as 80% or 90% can be used as the threshold Uhi, and a value such as 50% or 60% as the threshold Ulow. Different values are used for the threshold Uhi, the criterion for setting the flag F1 to 1, and the threshold Ulow, the criterion for setting it to 0, in order to give hysteresis to the relationship between the usage rate U and the value of the flag F1, thereby preventing the flag F1 from being switched frequently between 0 and 1.
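The hysteresis of steps S221 to S223 can be sketched as follows; this is an illustrative Python fragment, with `U_LOW` and `U_HI` chosen as example values in the ranges the text suggests.

```python
U_LOW, U_HI = 0.5, 0.8  # example thresholds (the text suggests 50-60% and 80-90%)

def update_flag_f1(usage, f1):
    """Hysteresis of steps S221-S223: F1 is set to 1 only when the usage rate
    reaches U_HI, cleared to 0 only when it drops below U_LOW, and kept
    unchanged in between, so F1 does not flap between 0 and 1."""
    if usage >= U_HI:
        return 1      # step S222: usage rate judged excessive
    if usage < U_LOW:
        return 0      # step S223: usage rate has a margin
    return f1         # U_LOW <= usage < U_HI: keep the existing value
```

Note that for a usage rate between the two thresholds the returned value depends on the current flag, which is exactly the hysteresis the text describes.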
Next, the control unit 118 determines whether the value of the flag F1 is 0 (step S230).
When the flag F1 is the value 0 (step S230: YES), a process of deriving a word corresponding to the input external speech is executed.
First, the two word hypotheses with the highest likelihoods P, where the likelihood P represents the certainty of a hypothesis, are acquired in descending order, and a flag F2 indicating whether the process of deriving the word corresponding to the external speech is in progress is set to the value 1 (step S240).
Specifically, the word hypothesis acquisition unit 116a extracts the time-series feature amount sequence stored in the temporary storage unit 114, calculates the distance between this feature amount sequence and the data on phonemes and syllables stored in the acoustic model 140, and, based on the calculation result, acquires from the vocabulary stored in the recognition dictionary 142 the two word hypotheses with the highest likelihoods P, using a well-known technique such as the Hidden Markov Model (HMM); these are temporarily stored in a memory (not shown). Since the technique of calculating distances and acquiring word hypotheses with high likelihoods P in speech recognition is well known, its detailed description is omitted (see, for example, Non-Patent Document 1).
Next, since a likelihood P is set for each of the two word hypotheses acquired in step S240, a likelihood difference ΔP is calculated by subtracting the likelihood P2 of the word hypothesis with the second largest likelihood P from the likelihood P1 of the word hypothesis with the largest likelihood P (step S250).
Thereafter, the likelihood difference ΔP is compared with a threshold ΔPthre (step S260). Since the threshold ΔPthre is initially set to a relatively large value ΔP0, the likelihood difference ΔP is usually smaller than the threshold ΔPthre (step S260: NO); in that case the threshold ΔPthre is reset to the value obtained by subtracting Δ from it (Δ being, for example, a few hundredths or a few thousandths of the initial value ΔP0) (step S262), and the process returns to step S200. Resetting the threshold ΔPthre to ΔPthre minus Δ in this way relaxes, with the passage of time, the condition for deriving the word corresponding to the external speech, and thereby prevents unprocessed feature amounts from accumulating needlessly while no word is decided.
As the processing of steps S200 to S262 described above is repeated several times, the feature amounts extracted from the temporary storage unit 114 increase, so the likelihood P1 of the best word hypothesis grows; in addition, the threshold ΔPthre is gradually reduced by the processing of step S262, so the probability that the likelihood difference ΔP reaches the threshold ΔPthre gradually rises. As a result, when the likelihood difference ΔP becomes equal to or greater than the threshold ΔPthre (step S260: YES), the word hypothesis with the largest likelihood P is regarded as the word corresponding to the input audio signal and is output by the output unit 120; the flag F2, which was set to 1 in step S240, is reset to 0, and the threshold ΔPthre, which was changed in step S262, is reset to its initial value ΔP0 (step S270). At this time, the memory can be used effectively by deleting from it the information generated in the processing of steps S200 to S270, such as the feature amounts and the acquired word hypotheses.
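The decision rule of steps S250 to S270, with its gradually relaxed threshold, can be sketched as follows. This is a simplified illustration rather than the patented implementation: each element of `hypothesis_stream` stands for the two best (word, likelihood) hypotheses re-acquired on one pass over the growing feature sequence, and `DP0` and `DELTA` are assumed values.

```python
DP0 = 100.0          # assumed initial threshold ΔP0
DELTA = DP0 / 1000   # decrement Δ, a few thousandths of ΔP0 as the text suggests

def decide_words(hypothesis_stream, output):
    """Sketch of steps S250-S270: emit the best word once the likelihood
    difference P1 - P2 reaches a threshold that decays on every pass, so
    that a decision is eventually forced even for close hypotheses."""
    threshold = DP0
    for (w1, p1), (_w2, p2) in hypothesis_stream:  # two best hypotheses (step S240)
        if p1 - p2 >= threshold:                   # steps S250-S260: ΔP >= ΔPthre?
            output(w1)                             # step S270: output the best word
            threshold = DP0                        # step S270: reset ΔPthre to ΔP0
        else:
            threshold -= DELTA                     # step S262: relax the condition
```

Resetting the threshold after each emitted word restores the strict initial condition for the next word, as in step S270.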
Subsequently, it is determined whether the processing needs to be continued, based on whether the audio signal is still being input and whether there is an unprocessed audio signal (step S280). When the audio signal is still being input or an unprocessed audio signal remains, the processing of steps S200 to S270 is repeated (step S280: NO); after it is confirmed that the input of the audio signal has ended and that no unprocessed audio signal remains (step S280: YES), the series of processes ends.
On the other hand, when the flag F1, which indicates whether the usage rate U is within the normal range, has the value 1, indicating that the usage rate U is excessive (step S230: NO), it is further determined whether the flag F2, which indicates whether the process of deriving the word corresponding to the external speech is in progress, has the value 0 (step S232).
When the flag F2 has the value 1, that is, when the word derivation process is in progress (step S232: YES), the processing from step S240 onward for deriving the word corresponding to the external audio signal continues to be executed.
Then, only after it has been confirmed that the flag F2 has become 0, that is, that the word corresponding to the external audio signal has been output in the processing of step S270 (step S232: NO), does the process return to step S200 without executing the processing from step S240 onward.
Consequently, the processing of steps S200 to S220, in which the external audio signal is input, feature amounts are generated and temporarily stored in the temporary storage unit 114, and the flag F1 is set based on the usage rate U, is repeated; after it is confirmed that the usage rate U has fallen below the threshold Ulow and the flag F1 has been set to 0 (step S230: YES), the processing from step S240 onward is executed.
The reason that the processing from step S240 onward is not executed when the flag F1 has the value 1 and the flag F2 has the value 0 is twofold. The first reason is that, in particular, if the word hypothesis acquisition unit 116a were to execute the processing of step S240, the usage rate U would increase further. The second reason is that, while the process of deriving the word corresponding to the external audio signal is in progress, the information on the two acquired word hypotheses cannot be erased from the memory; to use the memory efficiently, it is therefore desirable to suspend the derivation only once it has completed.
By suspending these processes, it is possible to prevent the inconvenience caused by an excessive usage rate U, namely that the processing of steps S200 and S210, in which the external audio signal is input and feature amounts are generated and temporarily stored in the temporary storage unit 114, would be delayed. Moreover, since the feature amounts generated in step S210 are temporarily stored in the temporary storage unit 114 and thus prevented from being lost unprocessed, the processing from step S240 onward can be executed even after waiting for the usage rate U to fall below the threshold Ulow and for the flag F1 to be set to 0.
As described above, according to the speech recognition apparatus of the second embodiment of the present invention, when the usage rate U of the CPU of the information processing unit 110 is judged to be excessive, the process of deriving the word corresponding to the external speech signal is suspended and later resumed. This suppresses the inconvenience caused by the CPU usage rate U becoming excessive, namely that the process of inputting the external audio signal and generating feature amounts is delayed.
Furthermore, when the process of deriving the word corresponding to the external speech signal is in progress, the derivation is not suspended even if the CPU usage rate U is judged to be excessive. This prevents the inconvenience of suspending the derivation while the information on the two word hypotheses is still held in the memory, namely that the memory usage rate would become excessive.
Moreover, since the generated feature amounts can be temporarily stored, they do not vanish unprocessed even when the process of deriving the word corresponding to the external audio signal is suspended.
Third Embodiment
Next, a speech recognition apparatus according to a third embodiment of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to the third embodiment of the present invention also derives words corresponding to external speech signals and outputs them sequentially. Since its constituent elements are the same as those of the speech recognition apparatus according to the second embodiment, the same reference numerals are used and detailed description is omitted.
The operation of the speech recognition apparatus according to the third embodiment of the present invention will be described with reference to FIG.
The processing of steps S300 to S330, from inputting the external audio signal to generating feature amounts and setting the flag, is the same as the processing of steps S200 to S230 described in the second embodiment of the present invention, so its detailed description is omitted. In contrast, the processing of steps S340 to S370, in which a plurality of word hypotheses are acquired and one of them is selected and output, differs from the processing of steps S240 to S270 described in the second embodiment, and is therefore described below.
In the speech recognition apparatus according to the second embodiment of the present invention, it suffices to acquire at least the two word hypotheses with the highest likelihoods; in the speech recognition apparatus according to the third embodiment, however, it is desirable to acquire a larger number of word hypotheses (for example, five or ten) together with their likelihoods P (step S340).
After a plurality of word hypotheses have been acquired together with their likelihoods P in step S340, the word selection unit 116b extracts the group of word hypotheses having the same meaning as the single word hypothesis with the largest likelihood P, and calculates a hypothesis occupancy rate H, which is the ratio of this extracted group to the total number of word hypotheses (step S350).
The hypothesis occupancy rate H will be explained using concrete examples.
For example, suppose that, as a result of generating feature amounts from the input audio signal, it is determined that a speech consisting of the sounds "ki", "ka", "i" was input, and that five word hypotheses, all homophones, are acquired in descending order of likelihood: 機械 (machine), 器械 (instrument), 機会 (opportunity), 奇怪 (mystery), and 棋界 (the world of shogi and go). In this case, the word hypothesis 器械 is synonymous with the word hypothesis 機械, which has the largest likelihood P. Thus, while the total number of word hypotheses is five, the group of word hypotheses having the same meaning as the most likely one contains two, so the hypothesis occupancy rate H is 2/5, that is, 40%.
As another example, suppose that, as a result of generating feature amounts from the input audio signal and analyzing the speech, three word hypotheses are acquired in descending order of likelihood: "color", "colour", and "collar". In this case, the word hypothesis "colour" is synonymous with the word hypothesis "color", which has the largest likelihood. Thus, while the total number of word hypotheses is three, the group of word hypotheses having the same meaning as the most likely one contains two, so the hypothesis occupancy rate H is 2/3, that is, approximately 67%.
Whether one word hypothesis has the same meaning as another may be determined by referring to the synonym data that the recognition dictionary 142 stores for each word.
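The computation of the hypothesis occupancy rate H in step S350 can be sketched as follows; the `synonyms` mapping is a hypothetical stand-in for the synonym data stored per word in the recognition dictionary 142.

```python
def hypothesis_occupancy(hypotheses, synonyms):
    """Step S350: given word hypotheses sorted by descending likelihood,
    return the ratio H of hypotheses that are the top hypothesis itself or
    synonymous with it, over the total number of hypotheses."""
    top = hypotheses[0]
    group = {top} | set(synonyms.get(top, ()))   # the top word and its synonyms
    count = sum(1 for w in hypotheses if w in group)
    return count / len(hypotheses)
```

On the examples above this yields 2/5 = 40% for the five "kikai" hypotheses and 2/3 for "color"/"colour"/"collar".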
Next, the hypothesis occupancy rate H is compared with a threshold Hthre (step S360). Since the threshold Hthre is initially set to a relatively large value H0 (for example, 50% or 60%), the hypothesis occupancy rate H is usually smaller than the threshold Hthre (step S360: NO); in that case the threshold Hthre is reset to the value obtained by subtracting Δ from it (Δ being, for example, a few hundredths or a few thousandths of the initial value H0) (step S362), and the process returns to step S300. Resetting the threshold Hthre to Hthre minus Δ in this way relaxes, with the passage of time, the condition for deriving the word corresponding to the external speech, and thereby prevents unprocessed feature amounts from accumulating needlessly while no word is decided.
As the processing of steps S300 to S362 is repeated several times, the threshold Hthre is gradually reduced by the processing of step S362, so the probability that the hypothesis occupancy rate H becomes equal to or greater than the threshold Hthre gradually rises. As a result, when the hypothesis occupancy rate H becomes equal to or greater than the threshold Hthre (step S360: YES), the word hypothesis with the largest likelihood P is regarded as the word corresponding to the input audio signal and is output by the output unit 120; the flag F2, which was set to 1 in step S340, is reset to 0, and the threshold Hthre, which was changed in step S362, is reset to its initial value H0 (step S370). At this time, the memory can be used effectively by deleting from it the feature amounts generated in step S310 and the word hypotheses acquired in step S340.
In the speech recognition apparatus according to the third embodiment of the present invention, one of the acquired word hypotheses is output as the word corresponding to the externally input audio signal by using the concept of the hypothesis occupancy rate H. This is because, when the hypothesis occupancy rate H is relatively large, that is, when the acquired word hypotheses are dominated by hypotheses having the same meaning as the word hypothesis with the largest likelihood P, problems rarely arise even if that most likely hypothesis is regarded as the word corresponding to the external audio signal. As a result, even when the difference between the likelihood of the best word hypothesis and that of the second-best word hypothesis is small, the word corresponding to the external audio signal can be decided early. In such cases, where the word hypothesis corresponding to the external audio signal is hard to decide, this prevents a further increase in the CPU usage rate U that would otherwise result from delaying the word-outputting processing of step S370, and hence delaying the suspension of the processing of steps S340 to S380, even though the CPU usage rate U is excessive.
 Subsequently, it is determined whether the processing needs to continue, based on whether a speech signal is still being input and whether any unprocessed speech signal remains (step S380). While a speech signal is still being input or an unprocessed speech signal remains, the processing of steps S300 to S370 is repeated (step S380: NO); once it is confirmed that the input of the speech signal has ended and no unprocessed speech signal remains (step S380: YES), the series of processing ends.
 As described above, the speech recognition apparatus according to the third embodiment of the present invention can obtain the same effects as the speech recognition apparatus according to the second embodiment.
 In addition, since the word corresponding to the external speech signal is derived and output using the concept of the hypothesis occupancy rate H, the word can be derived and output at a relatively early stage even when the difference between the likelihood of the word hypothesis with the largest likelihood P and the likelihood of the word hypothesis with the second-largest likelihood P is small.
 This prevents the processing of step S370, which outputs the word corresponding to the external speech signal, from being executed late despite an excessive CPU usage rate U, which would delay the interruption of the processing of steps S340 to S380 and cause a further increase in the CPU usage rate U.
[Modification]
 In the speech recognition apparatuses according to the second and third embodiments of the present invention described above, a word corresponding to an external speech signal is derived using the acoustic model 140 and the recognition dictionary 142; however, a language model may be used in combination to take linguistic certainty into account.
 In the speech recognition apparatuses according to the second and third embodiments of the present invention described above, the usage rate U has been described as the CPU usage rate, but it may instead be a memory usage rate.
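A usage rate acquisition unit that can serve either variant might simply be parameterized over the resource. The interface below is an illustrative assumption: the samplers are injected callables, where a real system would wrap an OS facility that reports CPU or memory load as a percentage.

```python
def make_usage_getter(cpu_sampler, mem_sampler, resource="cpu"):
    """Build a zero-argument function that reports the usage rate U.

    cpu_sampler / mem_sampler: callables returning a percentage; which one
    backs U depends on the chosen resource, mirroring the variation above.
    """
    if resource not in ("cpu", "memory"):
        raise ValueError("resource must be 'cpu' or 'memory'")
    sampler = cpu_sampler if resource == "cpu" else mem_sampler
    return lambda: float(sampler())
```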
 Furthermore, in the speech recognition apparatuses according to the first to third embodiments of the present invention described above, the words derived by the word derivation units 16 and 116 are output sequentially to the output units 20 and 120; however, as shown in FIG. 7, a translation unit 219 may further be provided, so that the words are translated into another language by the translation unit 219 and then output sequentially to the output unit 220.
 Moreover, although the first to third embodiments of the present invention have been described in the form of a speech recognition apparatus, they may also be embodied as a speech recognition method or as a computer-readable recording medium on which a program is recorded.
 The present invention is not limited to the embodiments described above, and can be implemented in various forms without departing from the gist of the present invention.
 This application claims priority based on Japanese Patent Application No. 2009-235302 filed on October 9, 2009, the entire disclosure of which is incorporated herein.
 The present invention is applicable to, for example, the manufacturing industry of speech recognition apparatuses.
 10: information processing unit; 12: information generation unit; 14: temporary storage unit; 16: word derivation unit; 18: control unit; 20: output unit; 30: usage rate acquisition unit.

Claims (13)

  1.  A speech recognition apparatus comprising:
     an information processing unit that performs processing for deriving a word corresponding to an external speech signal;
     an output unit that sequentially outputs words; and
     a usage rate acquisition unit that acquires a usage rate of a resource of the information processing unit,
     wherein the information processing unit comprises:
     an information generation unit that generates predetermined information based on the speech signal;
     a temporary storage unit that temporarily stores the predetermined information from the information generation unit;
     a word derivation unit that derives a word corresponding to the speech signal based on the predetermined information temporarily stored by the temporary storage unit, performs speech recognition, and outputs the word to the output unit; and
     a control unit that interrupts and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.
  2.  The speech recognition apparatus according to claim 1, wherein the resource is at least one of a CPU and a memory.
  3.  The speech recognition apparatus according to claim 1, wherein the control unit interrupts the processing performed by the word derivation unit when the usage rate acquired by the usage rate acquisition unit is equal to or greater than a first predetermined value, and resumes the processing performed by the word derivation unit after waiting for the usage rate to fall below a second predetermined value that is smaller than the first predetermined value.
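A hedged sketch of the two-threshold behavior in claim 3: derivation is interrupted once U reaches the first predetermined value and resumed only after U drops below the smaller second value, so the controller does not oscillate while U hovers between the two. The class name, the concrete threshold values, and the boolean running flag are assumptions for illustration.

```python
class HysteresisController:
    """Interrupt at u_high, resume only below u_low (u_low < u_high)."""

    def __init__(self, u_high=80.0, u_low=60.0):
        assert u_low < u_high, "second predetermined value must be smaller"
        self.u_high = u_high
        self.u_low = u_low
        self.running = True

    def update(self, usage):
        """Feed one usage rate reading U; return whether derivation runs."""
        if self.running and usage >= self.u_high:
            self.running = False      # interrupt word derivation
        elif not self.running and usage < self.u_low:
            self.running = True       # resume word derivation
        return self.running
```

With u_high = 80 and u_low = 60, a reading of 70 keeps whatever state the controller is already in, which is exactly the wait-for-the-second-value behavior the claim describes.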
  4.  The speech recognition apparatus according to claim 1, wherein, while the word derivation unit is performing the processing of deriving a word corresponding to the speech signal, the control unit does not interrupt the processing performed by the word derivation unit until the word derivation unit finishes the processing of deriving the word.
  5.  The speech recognition apparatus according to claim 1, wherein the word derivation unit comprises:
     a word hypothesis acquisition unit that acquires, based on the predetermined information temporarily stored by the temporary storage unit, a plurality of word hypotheses corresponding to the input speech signal, each associated with a likelihood representing its certainty; and
     a word selection unit that selects, based on the likelihood set for each of the plurality of word hypotheses, one word hypothesis as the word corresponding to the speech signal.
  6.  The speech recognition apparatus according to claim 5, wherein the word selection unit selects the two word hypotheses with the largest likelihoods from among the plurality of word hypotheses, defers the selection of the word corresponding to the speech signal when the difference between the likelihoods set for them is less than a threshold, and selects the one word hypothesis with the maximum likelihood as the word corresponding to the speech signal when the difference between the likelihoods set for them is equal to or greater than the threshold.
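The selection rule of claim 6 can be sketched as follows; the (word, likelihood) tuple layout is an illustrative assumption.

```python
def select_by_margin(hypotheses, threshold):
    """Decide only when the top likelihood clearly beats the runner-up.

    hypotheses: list of (word, likelihood) tuples (layout assumed).
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    top, second = ranked[0], ranked[1]
    if top[1] - second[1] >= threshold:
        return top[0]    # margin large enough: select the max-likelihood word
    return None          # too close to call: defer the selection
```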
  7.  The speech recognition apparatus according to claim 6, wherein the word selection unit sets the threshold smaller as the elapsed time from the start of the processing of recognizing the word corresponding to the speech signal becomes longer.
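Claims 7 and 9 only require that the threshold (or the predetermined occupancy rate) shrink monotonically as elapsed time grows, so that a result is eventually forced out. The exponential schedule and constants below are assumptions; any monotonically decreasing function would satisfy the claim language.

```python
def threshold_at(t_elapsed, t0=0.5, half_life=2.0):
    """Threshold after t_elapsed seconds: starts at t0 and halves every
    half_life seconds (schedule and constants are illustrative)."""
    return t0 * 0.5 ** (t_elapsed / half_life)
```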
  8.  The speech recognition apparatus according to claim 5, wherein the word selection unit selects the one word hypothesis with the maximum likelihood from among the plurality of word hypotheses, extracts from the plurality of word hypotheses a group of word hypotheses having the same meaning as the selected word hypothesis, defers the selection of the word corresponding to the speech signal when the hypothesis occupancy rate, which is the ratio of the extracted group of word hypotheses to the total number of the plurality of word hypotheses, is less than a predetermined occupancy rate, and selects the selected word hypothesis as the word corresponding to the speech signal when the hypothesis occupancy rate of the extracted group of word hypotheses is equal to or greater than the predetermined occupancy rate.
  9.  The speech recognition apparatus according to claim 8, wherein the word selection unit sets the predetermined occupancy rate smaller as the elapsed time from the start of the processing of deriving the word corresponding to the speech signal becomes longer.
  10.  The speech recognition apparatus according to claim 1, wherein the information generation unit generates, based on the speech signal, a feature quantity representing a feature of the speech signal.
  11.  The speech recognition apparatus according to claim 1, wherein the information processing unit further comprises a translation unit that translates the word selected by the word selection unit and sequentially outputs the translated word to the output unit.
  12.  A speech recognition method comprising the steps of:
     generating predetermined information based on an external speech signal;
     temporarily storing the predetermined information;
     deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, performing speech recognition, and outputting the word;
     acquiring a usage rate of a resource used for deriving the word corresponding to the speech signal and performing speech recognition; and
     interrupting and resuming the speech recognition processing based on the acquired usage rate.
  13.  A computer-readable recording medium storing a program that causes a computer to execute the steps of:
     generating predetermined information based on an external speech signal;
     temporarily storing the predetermined information;
     deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, performing speech recognition, and outputting the word;
     acquiring a usage rate of a resource used for deriving the word corresponding to the speech signal and performing speech recognition; and
     interrupting and resuming the speech recognition processing based on the acquired usage rate.
PCT/JP2010/067555 2009-10-09 2010-10-06 Voice recognition device and voice recognition method WO2011043380A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011535425A JPWO2011043380A1 (en) 2009-10-09 2010-10-06 Speech recognition apparatus and speech recognition method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-235302 2009-10-09
JP2009235302 2009-10-09

Publications (1)

Publication Number Publication Date
WO2011043380A1 true WO2011043380A1 (en) 2011-04-14

Family

ID=43856833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/067555 WO2011043380A1 (en) 2009-10-09 2010-10-06 Voice recognition device and voice recognition method

Country Status (2)

Country Link
JP (1) JPWO2011043380A1 (en)
WO (1) WO2011043380A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0440557A (en) * 1990-06-06 1992-02-10 Seiko Epson Corp Portable speech recognition electronic dictionary
JP2000322087A (en) * 1999-05-13 2000-11-24 Nec Corp Multichannel input speech recognition device
JP2005516231A (en) * 2000-06-09 2005-06-02 スピーチワークス・インターナショナル・インコーポレーテッド Load-regulated speech recognition


Also Published As

Publication number Publication date
JPWO2011043380A1 (en) 2013-03-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 10822050; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2011535425; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 10822050; Country of ref document: EP; Kind code of ref document: A1)