WO2011043380A1 - Voice recognition device and voice recognition method - Google Patents

Voice recognition device and voice recognition method Download PDF

Info

Publication number
WO2011043380A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
unit
speech recognition
usage rate
speech
Prior art date
Application number
PCT/JP2010/067555
Other languages
French (fr)
Japanese (ja)
Inventor
健 花沢
長田 誠也
隆行 荒川
岡部 浩司
田中 大介
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2011535425A priority Critical patent/JPWO2011043380A1/en
Publication of WO2011043380A1 publication Critical patent/WO2011043380A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F3/16: Sound input; sound output

Definitions

  • the present invention relates to a speech recognition apparatus and speech recognition method, and more particularly to a speech recognition apparatus and method that sequentially output recognition results when speech is continuously input.
  • an object of the present invention is to provide a speech recognition apparatus and a speech recognition method capable of suppressing the loss of external audio signals even when the resource usage rate becomes relatively large.
  • a speech recognition apparatus includes an information processing unit that performs a process of deriving words corresponding to an external speech signal, an output unit that sequentially outputs the words, and a usage rate acquisition unit that acquires the usage rate of the resources of the information processing unit. The information processing unit includes an information generation unit that generates predetermined information based on the audio signal, a temporary storage unit that temporarily stores the predetermined information from the information generation unit, a word derivation unit that derives words from the temporarily stored information and outputs them to the output unit, and a control unit that suspends and resumes the processing performed by the word derivation unit based on the acquired usage rate.
  • the speech recognition method comprises the steps of generating predetermined information based on an external speech signal, temporarily storing the predetermined information, deriving a word corresponding to the speech signal from the temporarily stored information and outputting it, acquiring the usage rate of the resources used for the recognition, and suspending and resuming the recognition process based on the acquired usage rate.
  • since the process of deriving words corresponding to the external speech signal is suspended and resumed based on the usage rate of the resources of the information processing unit, the inconvenience caused by an excessive resource usage rate, that is, a delay in the process of generating predetermined information from the external audio signal, can be suppressed. Moreover, since the generated information is temporarily stored, it can be prevented from being lost unprocessed even when the word derivation process is interrupted.
  • the speech recognition apparatus derives words corresponding to an external speech signal and sequentially outputs the words.
  • the speech recognition apparatus according to the first embodiment of the present invention includes an information processing unit 10 that performs predetermined information processing based on an external speech signal, an output unit 20 that outputs the information processed by the information processing unit 10, and a usage rate acquisition unit 30 that acquires the usage rate of the resources constituting the information processing unit 10.
  • the information processing unit 10 includes an information generation unit 12 that generates predetermined information based on the external audio signal, a temporary storage unit 14 that temporarily stores the information generated by the information generation unit 12, a word derivation unit 16 that extracts the stored information and derives the word corresponding to the external speech signal, and a control unit 18 that suspends or resumes the processing performed by the word derivation unit 16 based on the signal from the usage rate acquisition unit 30.
  • the output unit 20 sequentially outputs the words derived by the word derivation unit 16 as sound or video.
  • the usage rate acquisition unit 30 acquires the usage rate of resources such as the CPU and memory (not shown) of the information processing unit 10 at predetermined intervals (for example, every several tens or hundreds of msec) and transmits it to the control unit 18.
  • each function of the speech recognition apparatus according to the first embodiment of the present invention can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
  • the information generation unit 12 receives an audio signal from the outside and the resource usage rate U from the usage rate acquisition unit 30 (step S100).
  • predetermined information is generated based on the input audio signal and stored in the temporary storage unit 14 (step S110).
  • the control unit 18 then determines whether the resource usage rate U acquired by the usage rate acquisition unit 30 is normal (step S120).
  • when it is determined that the usage rate U is smaller than a predetermined value and within the normal range (step S120: YES), the word derivation unit 16 derives the word corresponding to the audio signal based on the information stored in the temporary storage unit 14 (step S130), and the output unit 20 outputs the word (step S140). It is then determined whether processing needs to continue (step S150): while the audio signal is still being input or unprocessed audio remains, the process returns to step S100 and steps S100 to S140 are repeated (step S150: NO); after confirming that input has finished and no unprocessed audio remains (step S150: YES), the series of processes ends.
  • when it is determined that the usage rate U is equal to or greater than the predetermined value and thus abnormal (step S120: NO), the word derivation process of step S130 is not executed and the process returns to step S100. Steps S100 and S110 (inputting the external audio signal again, generating predetermined information, and temporarily storing it in the temporary storage unit 14) are therefore repeated, and only after confirming that the usage rate U has fallen below the predetermined value (step S120: YES) is the word corresponding to the external audio signal derived from the information stored in the temporary storage unit 14 and output from the output unit 20 (steps S130 and S140).
  • if the word derivation process were executed while the usage rate U is excessive, the resource usage rate U would increase further.
  • by interrupting the word derivation process, the inconvenience caused by such a further increase in the usage rate U, that is, a delay in steps S100 and S110 (inputting the external audio signal, generating predetermined information, and temporarily storing it in the temporary storage unit 14), can be prevented.
  • since the information generated in step S110 is temporarily held in the temporary storage unit 14 and thus not lost unprocessed, steps S130 and S140 can be executed after waiting for the usage rate U to return to normal.
  • the speech recognition apparatus according to the second embodiment includes an information processing unit 110 that performs predetermined information processing based on an external speech signal, an output unit 120 that outputs the information processed by the information processing unit 110, a usage rate acquisition unit 130 that acquires the usage rate of the CPU (not shown) included in the information processing unit 110, an acoustic model 140 that stores data on phonemes and syllables, and a recognition dictionary 142 that stores data on vocabulary.
  • the information processing unit 110 includes a word derivation unit 116 that derives words corresponding to the external speech signal by extracting feature amounts from the temporary storage unit 114 and collating them with the data stored in the acoustic model 140 and the recognition dictionary 142, and a control unit 118 that suspends or resumes the processing performed by the word derivation unit 116 based on the signal from the usage rate acquisition unit 130.
  • the word derivation unit 116 includes a word hypothesis acquisition unit 116a that acquires a plurality of word hypotheses together with their likelihoods, and a word selection unit 116b that selects one word hypothesis as the word corresponding to the voice signal based on the likelihood associated with each acquired hypothesis.
  • the output unit 120 sequentially outputs the words selected by the word selection unit 116b on the display.
  • the usage rate acquisition unit 130 acquires the usage rate U of the CPU of the information processing unit 110 every predetermined time (for example, every several tens of msec, every several hundreds of msec, etc.) and transmits it to the control unit 118.
  • the above-described functions of the speech recognition apparatus according to the second embodiment of the present invention can be realized by a computer having a CPU, a memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
  • two word hypotheses are acquired in descending order of the likelihood P representing their certainty, and the flag F2, which indicates whether the process of deriving the word corresponding to the external speech is being executed, is set to the value 1 (step S240).
  • the word hypothesis acquisition unit 116a extracts the time-series feature amount sequence stored in the temporary storage unit 114 and collates it with the data stored in the acoustic model 140 and the recognition dictionary 142.
  • the likelihood difference ΔP is calculated by subtracting the likelihood P2 of the word hypothesis with the second largest likelihood from the likelihood P1 of the word hypothesis with the largest likelihood P among the two word hypotheses (step S250).
  • the likelihood difference ΔP is compared with the threshold ΔPthre (step S260). Since the threshold ΔPthre is set to a relatively large value ΔP0 as an initial value, the likelihood difference ΔP is usually smaller than the threshold at first (step S260: NO); in that case, a slightly smaller value (for example, the threshold reduced by one hundredth or one thousandth of the initial value ΔP0) is reset to the threshold ΔPthre (step S262), and the process returns to step S200.
  • as this is repeated, the feature amounts extracted from the temporary storage unit 114 increase, and the likelihood P1 of the word hypothesis with the maximum likelihood P also increases. Since the threshold ΔPthre is made gradually smaller in step S262, the probability that the likelihood difference ΔP reaches the threshold ΔPthre gradually increases.
  • when the likelihood difference ΔP becomes equal to or larger than the threshold ΔPthre (step S260: YES), the word hypothesis with the maximum likelihood P is regarded as the word corresponding to the input voice signal and is output by the output unit 120. The flag F2, set to the value 1 in step S240, is reset to the value 0, and the threshold ΔPthre, changed in step S262, is reset to the initial value ΔP0 (step S270).
  • the memory can be used effectively by deleting information such as the feature amounts generated in steps S200 to S270 and the acquired word hypotheses from the memory.
  • when the flag F1, which indicates whether the usage rate U is within the normal range, has the value 1 indicating that the usage rate U is excessive (step S230: NO), it is further determined whether the flag F2, which indicates whether the process of deriving the word corresponding to the external voice is being executed, has the value 0 (step S232). When the flag F2 has the value 1, that is, when the word derivation process is in progress (step S232: YES), the process of deriving the word corresponding to the external audio signal continues from step S240 onward.
  • when the flag F2 has the value 0 (step S232: NO), the process returns to step S200 without performing the processes from step S240 onward (step S280). Therefore, the external audio signal is input, feature amounts are generated and temporarily stored in the temporary storage unit 114, and steps S200 to S220 of setting the flag F1 based on the usage rate U are repeated. Only after it is confirmed that the usage rate U is less than the predetermined value Ulow and the flag F1 is set to the value 0 (step S230: YES) are the processes from step S240 onward performed.
  • when the flag F1 has the value 1 and the flag F2 has the value 0, the processes from step S240 onward are not executed, for the following two reasons.
  • the first reason is that if the word hypothesis acquisition unit 116a executed the process of step S240 in this state, the usage rate U would increase further.
  • the second reason is that while the process of deriving the word corresponding to the external speech signal is underway, its information cannot be erased from the memory; to use the memory efficiently, it is therefore desirable to keep the word derivation process stopped for the time being.
  • by interrupting these processes, the inconvenience caused by an excessive usage rate U, that is, a delay in steps S200 and S210 (inputting the external voice signal, generating feature amounts, and temporarily storing them in the temporary storage unit 114), can be prevented.
  • in the speech recognition apparatus according to the second embodiment, when the usage rate U of the CPU of the information processing unit 110 is determined to be excessive, the process of deriving words corresponding to the external speech signal is suspended or resumed, so the inconvenience caused by an excessive CPU usage rate U, that is, a delay in the process of inputting the external audio signal and generating feature amounts, can be suppressed. Furthermore, when the word derivation process is already being executed, it is not interrupted even if the CPU usage rate U is determined to be excessive; this suppresses the inconvenience of interrupting the derivation while its information remains stored in memory, that is, the memory usage rate becoming excessive. Moreover, since the generated feature amounts are temporarily stored, they are not lost unprocessed even when the word derivation process is interrupted.
  • steps S300 to S330, in which feature amounts are generated after an external audio signal is input, are the same as steps S200 to S230 described in the second embodiment of the present invention, and detailed description is therefore omitted.
  • steps S340 to S370, in which a plurality of word hypotheses are acquired and one word hypothesis is selected and output from among them, differ from steps S240 to S270 described in the second embodiment, so their contents are described below.
  • the same effect as the speech recognition apparatus of the second embodiment can be obtained. In addition, since the word corresponding to the external speech signal is derived and output using the concept of the hypothesis occupancy rate H, a word can be derived and output at a relatively early stage even when the difference between the likelihood of the most likely word hypothesis and that of the second most likely hypothesis is small. This suppresses the situation in which, although the CPU usage rate U is excessive, the timing of step S370 (outputting the word corresponding to the external audio signal) is delayed, the interruption of steps S340 to S380 comes too late, and the CPU usage rate U increases further.
  • the usage rate U has been described as the usage rate of the CPU, but may be a memory usage rate.
  • in the above embodiments, the words derived by the word derivation units 16 and 116 are sequentially output by the output units 20 and 120.
  • alternatively, a translation unit 219 may be further provided to translate the words into another language before they are sequentially output by the output unit 220.
  • the present invention has been described as a speech recognition apparatus, but it may also be embodied as a speech recognition method or as a computer-readable recording medium having a program recorded thereon.
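The incremental hypothesis-selection scheme of the second embodiment (steps S240 to S270) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the scoring function, the decay step, and all names are assumptions, and a real system would score hypotheses against an acoustic model and recognition dictionary rather than a callback.

```python
def select_word(score_hypotheses, feature_buffer, dP0=100.0, decay=None):
    """Select a word once the best hypothesis is sufficiently ahead
    of the runner-up (cf. steps S240-S270 of the second embodiment).

    score_hypotheses(features): returns a list of (word, likelihood) pairs.
    feature_buffer: iterable yielding progressively longer feature sequences.
    """
    # Step S262 shrinks the threshold by e.g. 1/100 of the initial value.
    decay = decay if decay is not None else dP0 / 100.0
    dP_thre = dP0                      # threshold starts relatively large
    for features in feature_buffer:
        # Acquire the two hypotheses with the highest likelihood P (step S240).
        hyps = sorted(score_hypotheses(features),
                      key=lambda wp: wp[1], reverse=True)[:2]
        if len(hyps) < 2:
            continue
        (w1, p1), (_, p2) = hyps
        dP = p1 - p2                   # likelihood difference (step S250)
        if dP >= dP_thre:              # step S260: YES
            return w1                  # output the most likely word
        dP_thre -= decay               # step S262: shrink the threshold
    return None                        # no hypothesis ever pulled far enough ahead
```

Because the threshold decays on every pass, the selection eventually fires even when the top two hypotheses stay close, which is what lets the apparatus emit words at fixed intervals instead of waiting for the utterance to end.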

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed is a voice recognition device comprising: an information processing unit (10) which carries out processing for deriving words corresponding to external voice signals; an output unit (20) which sequentially outputs words; and a usage rate acquisition unit (30) which acquires the usage rate of the resources of the information processing unit. The information processing unit (10) is provided with: an information generation unit (12) which generates prescribed information on the basis of voice signals; a temporary storage unit (14) which temporarily stores the prescribed information; a word derivation unit (16) which derives words corresponding to external voice signals on the basis of the prescribed information, carries out voice recognition and outputs words to the output unit (20); and a control unit (18) which discontinues and resumes the processing carried out by the word derivation unit (16) on the basis of usage rate acquired by the usage rate acquisition unit (30). As the processing carried out by the word derivation unit (16) is discontinued and resumed on the basis of the usage rate, inconveniences caused by excessive resource-usage rates can be suppressed.

Description

Speech recognition apparatus and speech recognition method
The present invention relates to a speech recognition apparatus and a speech recognition method, and more particularly to a speech recognition apparatus and method that sequentially output recognition results as speech is continuously input.
In recent years, technologies have become widespread that use a speech recognition system, which recognizes spoken utterances and converts them into symbol strings such as text, to generate subtitles for television programs, create meeting minutes, and perform speech translation.
However, an ordinary speech recognition system cannot output its most probable result until an utterance has finished, so a time lag arises between the actual utterance and the output of the recognition result. This makes such systems hard to use where real-time recognition is required, for example when displaying subtitles during a live television broadcast or when combined with a translation system for conversation between different languages.
To solve this problem, a continuous speech recognition apparatus has been proposed that performs recognition at fixed intervals even during an utterance and outputs the results sequentially (see, for example, Patent Document 1). At fixed intervals, this apparatus selects the most probable words corresponding to the continuously input speech and extracts and outputs those words that can be output stably, enabling real-time and highly stable speech recognition.
Japanese Patent No. 3834169
However, speech recognition processing generally uses a large amount of resources such as CPU and memory, so a resource shortage may occur. When such a shortage arises, performing recognition sequentially as in the continuous speech recognition apparatus of Patent Document 1 causes each stage of the recognition process to stall, which can lead to problems such as dropped speech input.
Such resource shortages are likely to occur when speech recognition is performed on a small device, such as a mobile phone, on which it is difficult to mount relatively high-performance resources. Even on a device with relatively high-performance resources, they can occur when speech recognition is combined with a processing-intensive system such as a speech synthesis or automatic translation system.
The present invention has been made to solve the above problems, and its object is to provide a speech recognition apparatus and a speech recognition method capable of suppressing the loss of external audio signals even when the resource usage rate becomes relatively large.
A speech recognition apparatus according to the present invention comprises an information processing unit that performs a process of deriving words corresponding to an external speech signal, an output unit that sequentially outputs the words, and a usage rate acquisition unit that acquires the usage rate of the resources of the information processing unit. The information processing unit comprises an information generation unit that generates predetermined information based on the speech signal, a temporary storage unit that temporarily stores the predetermined information from the information generation unit, a word derivation unit that performs speech recognition by deriving words corresponding to the speech signal from the temporarily stored information and outputs the words to the output unit, and a control unit that suspends and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.
A speech recognition method according to the present invention comprises the steps of generating predetermined information based on an external speech signal; temporarily storing the predetermined information; performing speech recognition by deriving a word corresponding to the speech signal from the temporarily stored information and outputting the word; acquiring the usage rate of the resources used for the speech recognition; and suspending and resuming the recognition process based on the acquired usage rate.
According to the speech recognition apparatus and method of the present invention, the process of deriving words corresponding to the external speech signal is suspended and resumed based on the usage rate of the resources of the information processing unit, so the inconvenience caused by an excessive resource usage rate, namely a delay in the process of generating predetermined information from the external audio signal, can be suppressed.
Moreover, since the generated information can be temporarily stored, it is not lost unprocessed even when the process of deriving words corresponding to the external audio signal is interrupted.
FIG. 1 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing an example of the process, executed by the control unit of the speech recognition apparatus according to Embodiment 2 of the present invention, of setting the flag F1.
FIG. 6 is a flowchart showing an example of processing executed by the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a block diagram showing a configuration example of a speech recognition apparatus according to a modification of the present invention.
[First Embodiment]
The speech recognition apparatus according to the first embodiment of the present invention will now be described in detail with reference to the drawings.
The speech recognition apparatus according to the first embodiment derives words corresponding to an external speech signal and outputs them sequentially.
First, the configuration of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 1.
As shown in FIG. 1, the apparatus comprises an information processing unit 10 that performs predetermined information processing based on an external speech signal, an output unit 20 that outputs the information processed by the information processing unit 10, and a usage rate acquisition unit 30 that acquires the usage rate of the resources constituting the information processing unit 10.
The information processing unit 10 includes an information generation unit 12 that generates predetermined information from the external audio signal, a temporary storage unit 14 that temporarily stores the information generated by the information generation unit 12, a word derivation unit 16 that extracts the stored information and derives the word corresponding to the external speech signal, and a control unit 18 that suspends or resumes the processing performed by the word derivation unit 16 based on the signal from the usage rate acquisition unit 30.
The output unit 20 sequentially outputs the words derived by the word derivation unit 16 as sound or video.
The usage rate acquisition unit 30 acquires the usage rate of resources such as the CPU and memory (not shown) of the information processing unit 10 at predetermined intervals (for example, every several tens or hundreds of msec) and transmits it to the control unit 18.
Each function of the speech recognition apparatus according to the first embodiment can be realized by a computer having a CPU, memory, and various interfaces reading a program recorded on a recording medium, through the cooperation of the computer's hardware resources and the software.
Next, the operation of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 2.
First, the information generation unit 12 receives an audio signal from the outside and the resource usage rate U from the usage rate acquisition unit 30 (step S100).
Next, predetermined information is generated based on the input audio signal and stored in the temporary storage unit 14 (step S110).
Subsequently, the control unit 18 determines whether the resource usage rate U acquired by the usage rate acquisition unit 30 is normal (step S120).
When it is determined that the usage rate U is smaller than the predetermined value and within the normal range (step S120: YES), the word derivation unit 16 derives the word corresponding to the audio signal based on the predetermined information stored in the temporary storage unit 14 (step S130), and the output unit 20 outputs the word (step S140).
Subsequently, it is determined whether the processing needs to be continued (step S150). When the audio signal is still being input, or when there is an unprocessed audio signal, the process returns to step S100 and the processing of steps S100 to S140 is repeated (step S150: NO); after it is confirmed that the input of the audio signal has ended and that no unprocessed audio signal remains (step S150: YES), the series of processes ends.
On the other hand, when it is determined that the usage rate U is equal to or greater than the predetermined value and is therefore abnormal (step S120: NO), the process returns to step S100 without the word derivation unit 16 executing the word derivation of step S130. The apparatus thus repeats the processing of steps S100 and S110, that is, it again inputs the external audio signal, generates predetermined information, and temporarily stores it in the temporary storage unit 14, and only after confirming that the usage rate U has become smaller than the predetermined value and is within the normal range (step S120: YES) does it derive the word corresponding to the external audio signal based on the predetermined information temporarily stored in the temporary storage unit 14 and output it from the output unit 20 (steps S130, S140).
The reason the processing of steps S100 to S120 is repeated without executing steps S130 and S140 when the usage rate U is equal to or greater than the predetermined value and judged abnormal is that, if the word derivation unit 16 were to execute the processing of step S130, the resource usage rate U might increase further. By suspending these processes, it is possible to prevent the inconvenience that would arise if the usage rate U grew even larger, namely that the processing of steps S100 and S110, in which the external audio signal is input and predetermined information is generated and temporarily stored in the temporary storage unit 14, would be delayed. Moreover, because the predetermined information generated in step S110 is temporarily stored in the temporary storage unit 14 and thus prevented from being lost unprocessed, the processing of steps S130 and S140 can be executed even after waiting for the usage rate U to return to normal.
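To make the flow of steps S100 to S150 concrete, the following is a minimal Python sketch (illustrative only, not the claimed implementation). The functions `get_usage_rate`, `generate_info`, `derive_word`, and `output` are hypothetical stand-ins for the usage rate acquisition unit 30, the information generation unit 12, the word derivation unit 16, and the output unit 20, and the threshold `U_MAX` is an assumed value.

```python
from collections import deque

U_MAX = 0.8  # assumed threshold above which the usage rate U is judged abnormal

def recognize(audio_frames, get_usage_rate, generate_info, derive_word, output):
    """Sketch of the step S100-S150 loop: generated information is buffered
    while the usage rate is excessive, and words are derived and output
    only while the usage rate is within the normal range."""
    buffer = deque()               # plays the role of the temporary storage unit 14
    pending = deque(audio_frames)
    while pending or buffer:
        if pending:
            frame = pending.popleft()            # step S100: input audio signal
            buffer.append(generate_info(frame))  # step S110: generate and store info
        if get_usage_rate() < U_MAX:             # step S120: usage rate normal?
            if buffer:
                word = derive_word(buffer)       # step S130: derive word
                buffer.clear()
                output(word)                     # step S140: output word
        # step S120: NO -> steps S130/S140 are skipped; buffered info survives
    # step S150: loop ends when no input remains and the buffer is empty
```

The buffered information is never discarded on a skipped pass, which mirrors the role the temporary storage unit 14 plays in the text above.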
As described above, according to the speech recognition apparatus of the first embodiment of the present invention, the process of deriving the word corresponding to the external speech signal is suspended or resumed based on the usage rate U of the resources of the information processing unit 10. This suppresses the inconvenience caused by the resource usage rate U becoming excessive, namely that the process of inputting the external audio signal and generating predetermined information is delayed.
Moreover, since the generated predetermined information can be temporarily stored, it does not vanish unprocessed even when the process of deriving the word corresponding to the external audio signal is suspended.
Second Embodiment
Next, a speech recognition apparatus according to a second embodiment of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to the second embodiment of the present invention also derives words corresponding to external speech signals and outputs them sequentially.
First, the configuration of the speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG.
The speech recognition apparatus according to the second embodiment of the present invention, as shown in FIG. 3, comprises an information processing unit 110 that performs predetermined information processing based on an external speech signal, an output unit 120 that outputs the information processed by the information processing unit 110, a usage rate acquisition unit 130 that acquires the usage rate of a CPU (not shown) included in the information processing unit 110, an acoustic model 140 that stores data on phonemes and syllables, and a recognition dictionary 142 that stores data on vocabulary. Here, as the external audio signal, an audio signal generated by a microphone (not shown) can be input, and an audio signal generated at a remote location can also be input via a network.
The information processing unit 110 includes a feature generation unit 112 that analyzes the external audio signal and generates feature amounts arranged in time series, a temporary storage unit 114 that temporarily stores the feature amounts generated by the feature generation unit 112, a word derivation unit 116 that derives a word corresponding to the external speech signal by extracting the feature amounts from the temporary storage unit 114 and collating them with the data stored in the acoustic model 140 and the recognition dictionary 142, and a control unit 118 that suspends or resumes the processing performed by the word derivation unit 116 based on the signal from the usage rate acquisition unit 130.
The word derivation unit 116 in turn includes a word hypothesis acquisition unit 116a that acquires a plurality of word hypotheses together with their likelihoods, and a word selection unit 116b that selects one word hypothesis as the word corresponding to the external speech signal based on the likelihood associated with each of the word hypotheses acquired by the word hypothesis acquisition unit 116a.
The output unit 120 sequentially outputs the words selected by the word selection unit 116b on a display.
The usage rate acquisition unit 130 acquires the usage rate U of the CPU of the information processing unit 110 at predetermined time intervals (for example, every several tens or several hundreds of msec) and transmits it to the control unit 118.
Each function of the speech recognition apparatus according to the second embodiment of the present invention described above can likewise be realized by having a computer equipped with a CPU, a memory, and various interfaces read a program recorded on a recording medium, through the cooperation of the computer's hardware resources and that software.
Next, the operation of the speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG. 4.
First, an audio signal from the outside and the CPU usage rate U from the usage rate acquisition unit 130 are input (step S200).
Next, the input audio signal is analyzed to generate a sequence of feature amounts arranged in time series, which is stored in the temporary storage unit 114 (step S210). Specifically, the input audio signal is divided into frames of several tens of msec each, and a feature amount such as a cepstrum is generated for each frame and stored in the temporary storage unit 114.
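As an illustration of the framing just described, the following Python sketch divides a signal into frames of a few tens of msec. The 20 ms frame length and the log-energy feature are assumed placeholders; the text mentions cepstral features, whose computation is omitted here.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=20):
    """Divide an audio signal into consecutive, non-overlapping frames of a
    few tens of msec (step S210). frame_ms=20 is an assumed example value."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """A stand-in per-frame feature (log energy); a real system would compute
    a cepstral feature here instead."""
    return math.log(sum(s * s for s in frame) + 1e-10)
```

A per-frame feature sequence would then be built as `[log_energy(f) for f in split_into_frames(...)]` and appended to the temporary store.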
Subsequently, the control unit 118 sets a flag F1 indicating whether the usage rate U is within the normal range based on the input usage rate U (step S220). Here, the flag F1 is set to the value 0 in the initial state, and is set to the value 1 when it is determined that the usage rate U is excessive.
The process of setting the flag F1 will be described with reference to FIG. 5. The usage rate U is compared with a predetermined threshold Ulow and a threshold Uhi larger than Ulow (step S221); when the usage rate U is at least Ulow and less than Uhi, the value of the flag F1 is kept as it is. In contrast, when the usage rate U is equal to or greater than the threshold Uhi, the usage rate U is judged to be excessive and the flag F1 is set to the value 1 (step S222). Conversely, when the usage rate U falls below the threshold Ulow, the usage rate U is judged to have a margin and the flag F1 is set to the value 0 (step S223). Here, a value such as 80% or 90% can be used as the threshold Uhi, and a value such as 50% or 60% as the threshold Ulow. Different values are used for the threshold Uhi, the criterion for setting the flag F1 to 1, and the threshold Ulow, the criterion for setting it to 0, in order to give hysteresis to the relationship between the usage rate U and the value of the flag F1, thereby preventing the flag F1 from being switched frequently between 0 and 1.
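The hysteresis of steps S221 to S223 can be sketched as follows; this is an illustrative Python fragment, with `U_LOW` and `U_HI` chosen as example values in the ranges the text suggests.

```python
U_LOW, U_HI = 0.5, 0.8  # example thresholds (the text suggests 50-60% and 80-90%)

def update_flag_f1(usage, f1):
    """Hysteresis of steps S221-S223: F1 is set to 1 only when the usage rate
    reaches U_HI, cleared to 0 only when it drops below U_LOW, and kept
    unchanged in between, so F1 does not flap between 0 and 1."""
    if usage >= U_HI:
        return 1      # step S222: usage rate judged excessive
    if usage < U_LOW:
        return 0      # step S223: usage rate has a margin
    return f1         # U_LOW <= usage < U_HI: keep the existing value
```

Note that for a usage rate between the two thresholds the returned value depends on the current flag, which is exactly the hysteresis the text describes.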
Next, the control unit 118 determines whether the value of the flag F1 is 0 (step S230).
When the flag F1 is the value 0 (step S230: YES), a process of deriving a word corresponding to the input external speech is executed.
First, the two word hypotheses with the highest likelihoods P, where the likelihood P represents the certainty of a hypothesis, are acquired in descending order, and a flag F2 indicating whether the process of deriving the word corresponding to the external speech is in progress is set to the value 1 (step S240).
Specifically, the word hypothesis acquisition unit 116a extracts the time-series feature amount sequence stored in the temporary storage unit 114, calculates the distance between this feature amount sequence and the data on phonemes and syllables stored in the acoustic model 140, and, based on the calculation result, acquires from the vocabulary stored in the recognition dictionary 142 the two word hypotheses with the highest likelihoods P, using a well-known technique such as the Hidden Markov Model (HMM); these are temporarily stored in a memory (not shown). Since the technique of calculating distances and acquiring word hypotheses with high likelihoods P in speech recognition is well known, its detailed description is omitted (see, for example, Non-Patent Document 1).
Next, since a likelihood P is set for each of the two word hypotheses acquired in step S240, a likelihood difference ΔP is calculated by subtracting the likelihood P2 of the word hypothesis with the second largest likelihood P from the likelihood P1 of the word hypothesis with the largest likelihood P (step S250).
Thereafter, the likelihood difference ΔP is compared with a threshold ΔPthre (step S260). Since the threshold ΔPthre is initially set to a relatively large value ΔP0, the likelihood difference ΔP is usually smaller than the threshold ΔPthre (step S260: NO); in that case the threshold ΔPthre is reset to the value obtained by subtracting Δ from it (Δ being, for example, a few hundredths or a few thousandths of the initial value ΔP0) (step S262), and the process returns to step S200. Resetting the threshold ΔPthre to ΔPthre minus Δ in this way relaxes, with the passage of time, the condition for deriving the word corresponding to the external speech, and thereby prevents unprocessed feature amounts from accumulating needlessly while no word is decided.
As the processing of steps S200 to S262 described above is repeated several times, the feature amounts extracted from the temporary storage unit 114 increase, so the likelihood P1 of the best word hypothesis grows; in addition, the threshold ΔPthre is gradually reduced by the processing of step S262, so the probability that the likelihood difference ΔP reaches the threshold ΔPthre gradually rises. As a result, when the likelihood difference ΔP becomes equal to or greater than the threshold ΔPthre (step S260: YES), the word hypothesis with the largest likelihood P is regarded as the word corresponding to the input audio signal and is output by the output unit 120; the flag F2, which was set to 1 in step S240, is reset to 0, and the threshold ΔPthre, which was changed in step S262, is reset to its initial value ΔP0 (step S270). At this time, the memory can be used effectively by deleting from it the information generated in the processing of steps S200 to S270, such as the feature amounts and the acquired word hypotheses.
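The decision rule of steps S250 to S270, with its gradually relaxed threshold, can be sketched as follows. This is a simplified illustration rather than the patented implementation: each element of `hypothesis_stream` stands for the two best (word, likelihood) hypotheses re-acquired on one pass over the growing feature sequence, and `DP0` and `DELTA` are assumed values.

```python
DP0 = 100.0          # assumed initial threshold ΔP0
DELTA = DP0 / 1000   # decrement Δ, a few thousandths of ΔP0 as the text suggests

def decide_words(hypothesis_stream, output):
    """Sketch of steps S250-S270: emit the best word once the likelihood
    difference P1 - P2 reaches a threshold that decays on every pass, so
    that a decision is eventually forced even for close hypotheses."""
    threshold = DP0
    for (w1, p1), (_w2, p2) in hypothesis_stream:  # two best hypotheses (step S240)
        if p1 - p2 >= threshold:                   # steps S250-S260: ΔP >= ΔPthre?
            output(w1)                             # step S270: output the best word
            threshold = DP0                        # step S270: reset ΔPthre to ΔP0
        else:
            threshold -= DELTA                     # step S262: relax the condition
```

Resetting the threshold after each emitted word restores the strict initial condition for the next word, as in step S270.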
Subsequently, it is determined whether the processing needs to be continued, based on whether the audio signal is still being input and whether there is an unprocessed audio signal (step S280). When the audio signal is still being input or an unprocessed audio signal remains, the processing of steps S200 to S270 is repeated (step S280: NO); after it is confirmed that the input of the audio signal has ended and that no unprocessed audio signal remains (step S280: YES), the series of processes ends.
On the other hand, when the flag F1, which indicates whether the usage rate U is within the normal range, has the value 1, indicating that the usage rate U is excessive (step S230: NO), it is further determined whether the flag F2, which indicates whether the process of deriving the word corresponding to the external speech is in progress, has the value 0 (step S232).
When the flag F2 has the value 1, that is, when the word derivation process is in progress (step S232: YES), the processing from step S240 onward for deriving the word corresponding to the external audio signal continues to be executed.
Then, only after it has been confirmed that the flag F2 has become 0, that is, that the word corresponding to the external audio signal has been output in the processing of step S270 (step S232: NO), does the process return to step S200 without executing the processing from step S240 onward.
Consequently, the processing of steps S200 to S220, in which the external audio signal is input, feature amounts are generated and temporarily stored in the temporary storage unit 114, and the flag F1 is set based on the usage rate U, is repeated; after it is confirmed that the usage rate U has fallen below the threshold Ulow and the flag F1 has been set to 0 (step S230: YES), the processing from step S240 onward is executed.
The reason that the processing from step S240 onward is not executed when the flag F1 has the value 1 and the flag F2 has the value 0 is twofold. The first reason is that, in particular, if the word hypothesis acquisition unit 116a were to execute the processing of step S240, the usage rate U would increase further. The second reason is that, while the process of deriving the word corresponding to the external audio signal is in progress, the information on the two acquired word hypotheses cannot be erased from the memory; to use the memory efficiently, it is therefore desirable to suspend the derivation only once it has completed.
By suspending these processes, it is possible to prevent the inconvenience caused by an excessive usage rate U, namely that the processing of steps S200 and S210, in which the external audio signal is input and feature amounts are generated and temporarily stored in the temporary storage unit 114, would be delayed. Moreover, since the feature amounts generated in step S210 are temporarily stored in the temporary storage unit 114 and thus prevented from being lost unprocessed, the processing from step S240 onward can be executed even after waiting for the usage rate U to fall below the threshold Ulow and for the flag F1 to be set to 0.
As described above, according to the speech recognition apparatus of the second embodiment of the present invention, when the usage rate U of the CPU of the information processing unit 110 is judged to be excessive, the process of deriving the word corresponding to the external speech signal is suspended and later resumed. This suppresses the inconvenience caused by the CPU usage rate U becoming excessive, namely that the process of inputting the external audio signal and generating feature amounts is delayed.
Furthermore, when the process of deriving the word corresponding to the external speech signal is in progress, the derivation is not suspended even if the CPU usage rate U is judged to be excessive. This prevents the inconvenience of suspending the derivation while the information on the two word hypotheses is still held in the memory, namely that the memory usage rate would become excessive.
Moreover, since the generated feature amounts can be temporarily stored, they do not vanish unprocessed even when the process of deriving the word corresponding to the external audio signal is suspended.
Third Embodiment
Next, a speech recognition apparatus according to a third embodiment of the present invention will be described in detail with reference to the drawings.
The speech recognition apparatus according to the third embodiment of the present invention also derives words corresponding to external speech signals and outputs them sequentially. Since its constituent elements are the same as those of the speech recognition apparatus according to the second embodiment, the same reference numerals are used and detailed description is omitted.
The operation of the speech recognition apparatus according to the third embodiment of the present invention will be described with reference to FIG.
The processing of steps S300 to S330, from inputting the external audio signal to generating feature amounts and setting the flag, is the same as the processing of steps S200 to S230 described in the second embodiment of the present invention, so its detailed description is omitted. In contrast, the processing of steps S340 to S370, in which a plurality of word hypotheses are acquired and one of them is selected and output, differs from the processing of steps S240 to S270 described in the second embodiment, and is therefore described below.
In the speech recognition apparatus according to the second embodiment of the present invention, it suffices to acquire at least the two word hypotheses with the highest likelihoods; in the speech recognition apparatus according to the third embodiment, however, it is desirable to acquire a larger number of word hypotheses (for example, five or ten) together with their likelihoods P (step S340).
After a plurality of word hypotheses have been acquired together with their likelihoods P in step S340, the word selection unit 116b extracts the group of word hypotheses having the same meaning as the single word hypothesis with the largest likelihood P, and calculates a hypothesis occupancy rate H, which is the ratio of this extracted group to the total number of word hypotheses (step S350).
The hypothesis occupancy rate H will be explained using concrete examples.
For example, suppose that, as a result of generating feature amounts from the input audio signal, it is determined that a speech consisting of the sounds "ki", "ka", "i" was input, and that five word hypotheses, all homophones, are acquired in descending order of likelihood: 機械 (machine), 器械 (instrument), 機会 (opportunity), 奇怪 (mystery), and 棋界 (the world of shogi and go). In this case, the word hypothesis 器械 is synonymous with the word hypothesis 機械, which has the largest likelihood P. Thus, while the total number of word hypotheses is five, the group of word hypotheses having the same meaning as the most likely one contains two, so the hypothesis occupancy rate H is 2/5, that is, 40%.
As another example, suppose that, as a result of generating feature amounts from the input audio signal and analyzing the speech, three word hypotheses are acquired in descending order of likelihood: "color", "colour", and "collar". In this case, the word hypothesis "colour" is synonymous with the word hypothesis "color", which has the largest likelihood. Thus, while the total number of word hypotheses is three, the group of word hypotheses having the same meaning as the most likely one contains two, so the hypothesis occupancy rate H is 2/3, that is, approximately 67%.
Whether one word hypothesis has the same meaning as another may be determined by referring to the synonym data that the recognition dictionary 142 stores for each word.
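The computation of the hypothesis occupancy rate H in step S350 can be sketched as follows; the `synonyms` mapping is a hypothetical stand-in for the synonym data stored per word in the recognition dictionary 142.

```python
def hypothesis_occupancy(hypotheses, synonyms):
    """Step S350: given word hypotheses sorted by descending likelihood,
    return the ratio H of hypotheses that are the top hypothesis itself or
    synonymous with it, over the total number of hypotheses."""
    top = hypotheses[0]
    group = {top} | set(synonyms.get(top, ()))   # the top word and its synonyms
    count = sum(1 for w in hypotheses if w in group)
    return count / len(hypotheses)
```

On the examples above this yields 2/5 = 40% for the five "kikai" hypotheses and 2/3 for "color"/"colour"/"collar".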
Next, the hypothesis occupancy rate H is compared with a threshold Hthre (step S360). Since the threshold Hthre is initially set to a relatively large value H0 (for example, 50% or 60%), the hypothesis occupancy rate H is usually smaller than the threshold Hthre (step S360: NO); in that case the threshold Hthre is reset to the value obtained by subtracting Δ from it (Δ being, for example, a few hundredths or a few thousandths of the initial value H0) (step S362), and the process returns to step S300. Resetting the threshold Hthre to Hthre minus Δ in this way relaxes, with the passage of time, the condition for deriving the word corresponding to the external speech, and thereby prevents unprocessed feature amounts from accumulating needlessly while no word is decided.
As the processing of steps S300 to S362 is repeated several times, the threshold Hthre is gradually reduced by the processing of step S362, so the probability that the hypothesis occupancy rate H becomes equal to or greater than the threshold Hthre gradually rises. As a result, when the hypothesis occupancy rate H becomes equal to or greater than the threshold Hthre (step S360: YES), the word hypothesis with the largest likelihood P is regarded as the word corresponding to the input audio signal and is output by the output unit 120; the flag F2, which was set to 1 in step S340, is reset to 0, and the threshold Hthre, which was changed in step S362, is reset to its initial value H0 (step S370). At this time, the memory can be used effectively by deleting from it the feature amounts generated in step S310 and the word hypotheses acquired in step S340.
In the speech recognition apparatus according to the third embodiment of the present invention, one of the acquired word hypotheses is output as the word corresponding to the externally input audio signal by using the concept of the hypothesis occupancy rate H. This is because, when the hypothesis occupancy rate H is relatively large, that is, when the acquired word hypotheses are dominated by hypotheses having the same meaning as the word hypothesis with the largest likelihood P, problems rarely arise even if that most likely hypothesis is regarded as the word corresponding to the external audio signal. As a result, even when the difference between the likelihood of the best word hypothesis and that of the second-best word hypothesis is small, the word corresponding to the external audio signal can be decided early. In such cases, where the word hypothesis corresponding to the external audio signal is hard to decide, this prevents a further increase in the CPU usage rate U that would otherwise result from delaying the word-outputting processing of step S370, and hence delaying the suspension of the processing of steps S340 to S380, even though the CPU usage rate U is excessive.
 Subsequently, it is determined whether the processing needs to continue, based on whether a speech signal is still being input and whether any unprocessed speech signal remains (step S380). While a speech signal is still being input or an unprocessed speech signal remains, the processing of steps S300 to S370 is repeated (step S380: NO); once it is confirmed that the input of the speech signal has ended and no unprocessed speech signal remains (step S380: YES), the series of processing ends.
 As described above, the speech recognition apparatus according to the third embodiment of the present invention can obtain the same effects as the speech recognition apparatus according to the second embodiment.
 In addition, since the word corresponding to the external speech signal is derived and output using the concept of the hypothesis occupancy rate H, the word can be derived and output at a relatively early stage even when the difference between the likelihood of the word hypothesis with the largest likelihood P and the likelihood of the word hypothesis with the second-largest likelihood P is small.
 This prevents the processing of step S370, which outputs the word corresponding to the external speech signal, from being executed late despite an excessive CPU usage rate U, which would delay the interruption of the processing of steps S340 to S380 and cause a further increase in the CPU usage rate U.
[Modification]
 In the speech recognition apparatuses according to the second and third embodiments of the present invention described above, a word corresponding to an external speech signal is derived using the acoustic model 140 and the recognition dictionary 142; however, a language model may be used in combination to take linguistic certainty into account.
 In the speech recognition apparatuses according to the second and third embodiments of the present invention described above, the usage rate U has been described as the CPU usage rate, but it may instead be a memory usage rate.
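A usage rate acquisition unit that can serve either variant might simply be parameterized over the resource. The interface below is an illustrative assumption: the samplers are injected callables, where a real system would wrap an OS facility that reports CPU or memory load as a percentage.

```python
def make_usage_getter(cpu_sampler, mem_sampler, resource="cpu"):
    """Build a zero-argument function that reports the usage rate U.

    cpu_sampler / mem_sampler: callables returning a percentage; which one
    backs U depends on the chosen resource, mirroring the variation above.
    """
    if resource not in ("cpu", "memory"):
        raise ValueError("resource must be 'cpu' or 'memory'")
    sampler = cpu_sampler if resource == "cpu" else mem_sampler
    return lambda: float(sampler())
```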
 Furthermore, in the speech recognition apparatuses according to the first to third embodiments of the present invention described above, the words derived by the word derivation units 16 and 116 are output sequentially to the output units 20 and 120; however, as shown in FIG. 7, a translation unit 219 may further be provided, so that the words are translated into another language by the translation unit 219 and then output sequentially to the output unit 220.
 Moreover, although the first to third embodiments of the present invention have been described in the form of a speech recognition apparatus, they may also be embodied as a speech recognition method or as a computer-readable recording medium on which a program is recorded.
 The present invention is not limited to the embodiments described above, and can be implemented in various forms without departing from the gist of the present invention.
 This application claims priority based on Japanese Patent Application No. 2009-235302 filed on October 9, 2009, the entire disclosure of which is incorporated herein.
 The present invention is applicable to, for example, the manufacturing industry of speech recognition apparatuses.
 10: information processing unit; 12: information generation unit; 14: temporary storage unit; 16: word derivation unit; 18: control unit; 20: output unit; 30: usage rate acquisition unit.

Claims (13)

  1.  A speech recognition apparatus comprising:
     an information processing unit that performs processing for deriving a word corresponding to an external speech signal;
     an output unit that sequentially outputs words; and
     a usage rate acquisition unit that acquires a usage rate of a resource of the information processing unit,
     wherein the information processing unit comprises:
     an information generation unit that generates predetermined information based on the speech signal;
     a temporary storage unit that temporarily stores the predetermined information from the information generation unit;
     a word derivation unit that derives a word corresponding to the speech signal based on the predetermined information temporarily stored by the temporary storage unit, performs speech recognition, and outputs the word to the output unit; and
     a control unit that interrupts and resumes the processing performed by the word derivation unit based on the usage rate acquired by the usage rate acquisition unit.
  2.  The speech recognition apparatus according to claim 1, wherein the resource is at least one of a CPU and a memory.
  3.  The speech recognition apparatus according to claim 1, wherein the control unit interrupts the processing performed by the word derivation unit when the usage rate acquired by the usage rate acquisition unit is equal to or greater than a first predetermined value, and resumes the processing performed by the word derivation unit after waiting for the usage rate to fall below a second predetermined value that is smaller than the first predetermined value.
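A hedged sketch of the two-threshold behavior in claim 3: derivation is interrupted once U reaches the first predetermined value and resumed only after U drops below the smaller second value, so the controller does not oscillate while U hovers between the two. The class name, the concrete threshold values, and the boolean running flag are assumptions for illustration.

```python
class HysteresisController:
    """Interrupt at u_high, resume only below u_low (u_low < u_high)."""

    def __init__(self, u_high=80.0, u_low=60.0):
        assert u_low < u_high, "second predetermined value must be smaller"
        self.u_high = u_high
        self.u_low = u_low
        self.running = True

    def update(self, usage):
        """Feed one usage rate reading U; return whether derivation runs."""
        if self.running and usage >= self.u_high:
            self.running = False      # interrupt word derivation
        elif not self.running and usage < self.u_low:
            self.running = True       # resume word derivation
        return self.running
```

With u_high = 80 and u_low = 60, a reading of 70 keeps whatever state the controller is already in, which is exactly the wait-for-the-second-value behavior the claim describes.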
  4.  The speech recognition apparatus according to claim 1, wherein, while the word derivation unit is performing the processing of deriving a word corresponding to the speech signal, the control unit does not interrupt the processing performed by the word derivation unit until the word derivation unit finishes the processing of deriving the word.
  5.  The speech recognition apparatus according to claim 1, wherein the word derivation unit comprises:
     a word hypothesis acquisition unit that acquires, based on the predetermined information temporarily stored by the temporary storage unit, a plurality of word hypotheses corresponding to the input speech signal, each associated with a likelihood representing its certainty; and
     a word selection unit that selects, based on the likelihood set for each of the plurality of word hypotheses, one word hypothesis as the word corresponding to the speech signal.
  6.  The speech recognition apparatus according to claim 5, wherein the word selection unit selects the two word hypotheses with the largest likelihoods from among the plurality of word hypotheses, defers the selection of the word corresponding to the speech signal when the difference between the likelihoods set for them is less than a threshold, and selects the one word hypothesis with the maximum likelihood as the word corresponding to the speech signal when the difference between the likelihoods set for them is equal to or greater than the threshold.
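The selection rule of claim 6 can be sketched as follows; the (word, likelihood) tuple layout is an illustrative assumption.

```python
def select_by_margin(hypotheses, threshold):
    """Decide only when the top likelihood clearly beats the runner-up.

    hypotheses: list of (word, likelihood) tuples (layout assumed).
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    top, second = ranked[0], ranked[1]
    if top[1] - second[1] >= threshold:
        return top[0]    # margin large enough: select the max-likelihood word
    return None          # too close to call: defer the selection
```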
  7.  The speech recognition apparatus according to claim 6, wherein the word selection unit sets the threshold smaller as the elapsed time from the start of the processing of recognizing the word corresponding to the speech signal becomes longer.
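Claims 7 and 9 only require that the threshold (or the predetermined occupancy rate) shrink monotonically as elapsed time grows, so that a result is eventually forced out. The exponential schedule and constants below are assumptions; any monotonically decreasing function would satisfy the claim language.

```python
def threshold_at(t_elapsed, t0=0.5, half_life=2.0):
    """Threshold after t_elapsed seconds: starts at t0 and halves every
    half_life seconds (schedule and constants are illustrative)."""
    return t0 * 0.5 ** (t_elapsed / half_life)
```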
  8.  The speech recognition apparatus according to claim 5, wherein the word selection unit selects the one word hypothesis with the maximum likelihood from among the plurality of word hypotheses, extracts from the plurality of word hypotheses a group of word hypotheses having the same meaning as the selected word hypothesis, defers the selection of the word corresponding to the speech signal when the hypothesis occupancy rate, which is the ratio of the extracted group of word hypotheses to the total number of the plurality of word hypotheses, is less than a predetermined occupancy rate, and selects the selected word hypothesis as the word corresponding to the speech signal when the hypothesis occupancy rate of the extracted group of word hypotheses is equal to or greater than the predetermined occupancy rate.
  9.  The speech recognition apparatus according to claim 8, wherein the word selection unit sets the predetermined occupancy rate smaller as the elapsed time from the start of the processing of deriving the word corresponding to the speech signal becomes longer.
  10.  The speech recognition apparatus according to claim 1, wherein the information generation unit generates, based on the speech signal, a feature quantity representing a feature of the speech signal.
  11.  The speech recognition apparatus according to claim 1, wherein the information processing unit further comprises a translation unit that translates the word selected by the word selection unit and sequentially outputs the translated word to the output unit.
  12.  A speech recognition method comprising the steps of:
     generating predetermined information based on an external speech signal;
     temporarily storing the predetermined information;
     deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, performing speech recognition, and outputting the word;
     acquiring a usage rate of a resource used for deriving the word corresponding to the speech signal and performing speech recognition; and
     interrupting and resuming the speech recognition processing based on the acquired usage rate.
  13.  A computer-readable recording medium storing a program that causes a computer to execute the steps of:
     generating predetermined information based on an external speech signal;
     temporarily storing the predetermined information;
     deriving a word corresponding to the speech signal based on the temporarily stored predetermined information, performing speech recognition, and outputting the word;
     acquiring a usage rate of a resource used for deriving the word corresponding to the speech signal and performing speech recognition; and
     interrupting and resuming the speech recognition processing based on the acquired usage rate.
PCT/JP2010/067555 2009-10-09 2010-10-06 Voice recognition device and voice recognition method WO2011043380A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011535425A JPWO2011043380A1 (en) 2009-10-09 2010-10-06 Speech recognition apparatus and speech recognition method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-235302 2009-10-09
JP2009235302 2009-10-09

Publications (1)

Publication Number Publication Date
WO2011043380A1 true WO2011043380A1 (en) 2011-04-14

Family

ID=43856833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/067555 WO2011043380A1 (en) 2009-10-09 2010-10-06 Voice recognition device and voice recognition method

Country Status (2)

Country Link
JP (1) JPWO2011043380A1 (en)
WO (1) WO2011043380A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0440557A (en) * 1990-06-06 1992-02-10 Seiko Epson Corp Portable speech recognition electronic dictionary
JP2000322087A (en) * 1999-05-13 2000-11-24 Nec Corp Multichannel input speech recognition device
JP2005516231A (en) * 2000-06-09 2005-06-02 スピーチワークス・インターナショナル・インコーポレーテッド Load-regulated speech recognition


Also Published As

Publication number Publication date
JPWO2011043380A1 (en) 2013-03-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 10822050; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2011535425; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 10822050; Country of ref document: EP; Kind code of ref document: A1)