WO2011077924A1

WO2011077924A1 - Voice detection device, voice detection method, and voice detection program

Info

Publication number: WO2011077924A1
Application number: PCT/JP2010/071620
Authority: WO
Inventors: 田中　大介; 隆行荒川; 健花沢; 長田　誠也; 岡部　浩司
Original assignee: 日本電気株式会社
Priority date: 2009-12-24
Filing date: 2010-11-26
Publication date: 2011-06-30
Also published as: JP5621786B2; JPWO2011077924A1

Abstract

Provided are a voice detection device, a voice detection method, and a voice detection program for performing voice section detection with high accuracy even under a noise environment. A feature amount calculation unit (301) calculates a feature amount for each frame. A voice/non-voice judgment unit (302) compares the feature amount calculated and a voice detection threshold value with each other, thereby judging whether a section is a voice section or a non-voice section. A long section feature amount calculation unit (303) calculates a long section feature amount, which is the feature amount of the voice section or the non-voice section, on the basis of a statistic value of the feature amounts of a plurality of frames. A threshold value update unit (304) uses the long section feature amount calculated, to calculate a non-voice probability, which is such a probability that the voice section and the non-voice section are a section to which a signal based on non-voice is input, and update the voice detection threshold value on the basis of the non-voice probability calculated.

Description

Voice detection device, voice detection method, and voice detection program

The present invention relates to a voice detection device, a voice detection method, and a voice detection program for detecting a voice section.

Voice detection technology is intended to improve the efficiency of voice transmission by improving the compression rate of non-voice sections in mobile communication, etc. or not transmitting only that section, and noise in non-voice sections in noise cancellers, echo cancellers, etc. It is widely used for the purpose of estimating and determining the speech recognition, improving the speech recognition performance in the speech recognition system, and reducing the processing amount.
FIG. 14 is a block diagram illustrating a configuration example of a general voice detection device. Patent Document 1 discloses an invention corresponding to the voice detection device illustrated in FIG.
The general speech detection apparatus shown in FIG. 14 includes a waveform cutout unit 101 that cuts out and acquires an input signal in frame units, and a feature amount calculation unit that calculates a feature amount used for speech detection from the cut out input signal for each frame. 102, the calculated feature value and the threshold value stored in the threshold value storage unit 103 are compared for each frame to determine whether the input signal is a signal based on speech or a signal based on non-speech. Based on the voice / non-voice judgment unit 104, the judgment result holding unit 105 for each frame that holds the judgment results for each frame over a plurality of frames, and the section shaping rules stored in the section shaping rule storage unit 106. , A sound that shapes the determination results of a plurality of frames held in the determination result holding unit 105 and determines whether the frame is a speech segment or a non-speech segment / And a non-speech section shaping unit 107.
It should be noted that “acquiring and acquiring an input signal in units of frames” means that an input signal input from a certain time until a unit time elapses is extracted. Further, the frame is each time obtained by dividing the time during which the input signal is input into unit time. For example, when it is determined that an input signal based on speech or an input signal based on non-speech is input over a plurality of consecutive frames, the section shaping rule determines that these frames are divided into one speech segment or non-speech. This is a rule for determining a voice section.
Patent Document 1 discloses an example of the feature amount calculated by the feature amount calculation unit 102 by smoothing the fluctuation of the spectrum power and further smoothing the fluctuation. Non-Patent Document 1, section 43.3 discloses SNR (Signal to Noise ratio) values as examples of feature values, and section 4.3.5 averages SNR values. Are disclosed. B. of Non-Patent Document 2. Section 3.1.4 discloses the number of zero crossings as an example of a feature quantity, and Non-Patent Document 3 describes the likelihood using a voice GMM (Gaussian Mixture Model) and a silent GMM as examples of feature quantities. The degree ratio is disclosed.
The voice / non-voice determination unit 104 compares a threshold value determined in advance with an experiment and a feature value for each frame. If the feature value is equal to or higher than the threshold value, the voice / non-voice determination unit 104 determines that the input signal is based on voice. In this case, it is determined that the input signal is based on non-voice.
Patent Document 2 discloses a method of updating a threshold value for each utterance. FIG. 15 is a block diagram showing a voice detection device that changes a voice detection threshold. Patent Document 2 discloses an invention corresponding to the voice detection device illustrated in FIG. The voice detection threshold setting unit 18 determines the spectrum power for determining whether or not the voice section is based on the maximum value of the spectral power of the voice section and the average value of the spectral power of the background noise section that is not the voice section. The threshold value is calculated and updated to the calculated threshold value.

JP 2006-209069 A (paragraphs 0018 to 0059, FIG. 1) Japanese Patent Laid-Open No. 7-92989 (paragraphs 0008 to 0014, FIG. 1)

However, in order to set the threshold, the speech detection apparatus shown in FIG. 14 has a maximum spectral power in a section composed of a frame in which an average noise power and a speech signal are input from a plurality of frames in which only noise is input in advance. Therefore, it is not possible to cope with an environment in which noise and maximum spectral power constantly change.
The speech detection apparatus shown in FIG. 15 needs to perform speech detection to determine the threshold and obtain the spectral power of background noise. However, if the detection accuracy is low, the noise may not be estimated. For example, when the speech section continues from the beginning of the input signal, or when background noise exceeding the threshold value continues and the speech section is determined to be a speech section, the speech detection device uses the background noise spectrum power. It becomes difficult to obtain. Therefore, the voice detection device cannot determine and update the threshold value.
Therefore, in order to solve the above-described problem, the present invention provides a voice that can detect a voice section even when noise changes or noise or a voice section continues from the beginning of an input signal. An object of the present invention is to provide a detection device, a voice detection method, and a voice detection program.

The speech detection apparatus according to the present invention compares a feature amount and a threshold value with a feature amount calculation unit that calculates a feature amount of an input signal for each frame that is an input signal per unit time, and a signal based on speech over a plurality of frames. A speech / non-speech determination unit that determines whether the signal is an input speech segment or a non-speech segment in which a signal based on non-speech is input over a plurality of frames, and a speech segment calculated by a feature amount calculation unit Alternatively, a long-section feature quantity calculating unit that calculates a long-section feature quantity that is a feature quantity of a voice section or a non-speech section based on a statistical value of feature quantities of a plurality of frames constituting the non-voice section, and a long-section feature quantity Is used to calculate the non-speech probability, which is the probability that the speech section and the non-speech section are input to a signal based on non-speech, and based on the calculated non-speech probability, Characterized in that a threshold updating means for updating the voice detection threshold.
The voice detection method according to the present invention calculates a feature quantity of an input signal for each frame, which is an input signal within a unit time, compares the feature quantity with a threshold value, and receives a voice-based signal over a plurality of frames. It is determined whether it is a section or a non-speech section in which a signal based on non-speech is input over a plurality of frames, and based on statistical values of feature quantities of a plurality of frames constituting the speech section or the non-speech section This is a probability that a long segment feature value, which is a feature value of a voice segment or a non-speech segment, is calculated, and a voice segment and a non-speech segment are segments in which a signal based on non-speech is input using the long segment feature value. A non-speech probability is calculated, and a speech detection threshold is updated based on the calculated non-speech probability.
The voice detection program stored in the program recording medium according to the present invention is a computer that compares a feature amount calculation process for calculating a feature amount of an input signal for each frame, which is an input signal per unit time, with a threshold value. And a voice / non-voice determination process for determining whether a voice-based signal is input over a plurality of frames or a non-voice-based signal is input over a plurality of frames. , A long section that calculates a feature value of a long section, which is a feature quantity of a speech section or a non-speech section, based on a statistical value of feature quantities of a plurality of frames constituting the speech section or the non-speech section calculated by the feature amount calculation process Using feature amount calculation processing and long interval feature amounts, it is the probability that a speech segment and a non-speech segment are segments in which a signal based on non-speech is input. Calculating a speech probability, based on the calculated non-speech probabilities, characterized in that to perform the threshold updating process for updating the voice detection threshold value.

The present invention relates to a voice detection device, a voice detection method, and a voice detection program capable of performing voice segment detection with high accuracy even in a noise environment even when background noise exceeding a threshold value enters the head of input. I will provide a.

It is a block diagram which shows the structural example of 1st Embodiment of the audio | voice detection apparatus by this invention. It is a flowchart which shows operation | movement of the audio | voice detection apparatus of the 1st Embodiment of this invention. It is explanatory drawing which shows the function G of 1st Embodiment. It is explanatory drawing which shows the example which changes a threshold value. It is explanatory drawing which shows the example when the threshold value before an update is too small. It is explanatory drawing which shows the example when the threshold value before an update is too large. It is a block diagram which shows the structural example of 2nd Embodiment of the audio | voice detection apparatus by this invention. It is a block diagram which shows the structural example of 3rd Embodiment of the audio | voice detection apparatus by this invention. It is a block diagram which shows the other example of 3rd Embodiment of an audio | voice detection apparatus. It is explanatory drawing which shows the function for calculating | requiring the non-speech probability (alpha) in the 3rd Embodiment of this invention. It is explanatory drawing which shows the function for calculating | requiring the non-speech probability (alpha) in the 5th Embodiment of this invention. It is a block diagram which shows the structural example of 6th Embodiment of the audio | voice detection apparatus by this invention. It is a block diagram showing the outline of the present invention. It is a block diagram which shows the structural example of a common audio | voice detection apparatus. It is a block diagram which shows the audio | voice detection apparatus which changes the threshold value of an audio | voice detection.

Embodiment 1. FIG.
A first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of a first embodiment of a voice detection device according to the present invention. As shown in FIG. 1, the speech detection apparatus according to the first exemplary embodiment of the present invention includes a waveform cutout unit 101, a feature amount calculation unit 102, a threshold storage unit 103, a speech / non-speech determination unit 104, and a determination result holding unit. 105, a shaping rule storage unit 106, a voice / non-speech segment shaping unit 107, a long segment feature value computing unit 108, and a threshold updating unit 109.
The waveform cutout unit 101 cuts out and acquires an input signal in units of frames. Specifically, the waveform cutout unit 101 cuts out and acquires input signals for each predetermined unit time, for example. The feature amount calculation unit 102 calculates a feature amount used for speech detection from the input signal for each frame cut out by the waveform cutout unit 101. The threshold storage unit 103 stores a threshold for determining whether the input signal is an input signal based on voice or an input signal based on non-voice.
The voice / non-voice determination unit 104 compares the feature amount calculated by the feature amount calculation unit 102 with the threshold value stored in the threshold value storage unit 103 for each frame, and the input signal of the frame is an input signal based on the voice. It is determined whether there is an input signal based on non-voice. Note that a frame of an input signal based on voice is called a voice frame, and a frame of an input signal based on non-voice is called a non-voice frame. The determination result holding unit 105 holds the determination result for each frame by the voice / non-voice determination unit 104 over a plurality of frames.
The section shaping rule storage unit 106 stores section shaping rules. The speech / non-speech segment shaping unit 107 shapes the determination results of a plurality of frames held in the decision result holding unit 105 based on the segment shaping rules stored in the segment shaping rule storage unit 106, It is determined that it is a non-voice segment. Specifically, the speech / non-speech section shaping unit 107 determines, for example, that a plurality of frames are one speech section when a plurality of speech frames are consecutive. Further, when a plurality of non-voice frames are consecutive, the voice / non-voice section shaping unit 107 determines that the plurality of frames are one non-voice section. Note that the voice / non-voice section shaping unit 107 determines that a plurality of frames are one voice section when the ratio of the voice frames is larger than a predetermined ratio in a plurality of consecutive frames, It may be determined that the non-voice section is one when the ratio of frames is larger than a certain ratio.
The long section feature amount calculation unit 108 performs statistical processing on the feature amount for each frame calculated by the feature amount calculation unit 102 for the speech section and the non-speech section determined by the speech / non-speech section shaping unit 107. Calculate the amount.
The threshold update unit 109 calculates the non-speech probability for the speech segment and the non-speech segment determined by the speech / non-speech segment shaping unit 107, using the long segment feature amount calculated by the long segment feature amount calculator 108, The threshold value stored in the threshold value storage unit 103 is changed. The non-speech probability is a probability that the input signal in the section is an input signal based on non-speech, as will be described later.
The voice detection device is realized by, for example, a computer equipped with a voice detection program.
Next, the operation of the voice detection device according to the first exemplary embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a flowchart showing the operation of the voice detection device according to the first exemplary embodiment of the present invention.
First, the waveform cutout unit 101 cuts out collected time-series input sound data input from a microphone (not shown) for each frame of unit time (step S101). For example, when the input sound data is in a 16-bit Linear-PCM (Pulse Code Modulation) format with a sampling frequency of 8000 Hz, waveform data of 8000 points of input sound data per second is stored in each frame.
For example, the waveform cutout unit 101 sequentially cuts out the waveform data at a frame width of 200 points (25 milliseconds) and a frame shift of 80 points (10 milliseconds) according to a time series.
Next, the feature amount calculation unit 102 calculates a feature amount from the waveform cut out for each frame (step S102). The feature amount calculated by the feature amount calculation unit 102 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
The voice / non-voice determination unit 104 compares the threshold value stored in the threshold value storage unit 103 with the feature amount calculated by the feature amount calculation unit 102, and determines that the frame is an audio frame if the threshold value is exceeded. If not, it is determined that the frame is a non-voice frame (step S103). If the threshold value stored in the threshold value storage unit 103 is the same as the feature value calculated by the feature value calculation unit 102, the voice / non-voice determination unit 104 determines that the voice frame is a voice frame or a non-voice frame. May be determined in advance. Then, the voice / non-voice determination unit 104 determines a voice frame or a non-voice frame based on the determination.
The determination result holding unit 105 holds the result determined by the voice / non-voice determination unit 104 in the process of step S106 for a plurality of frames (step S104).
The voice / non-speech segment shaping unit 107 is configured to suppress the occurrence of a short-duration speech segment or a short-duration non-speech segment that occurs because the speech / non-speech determination unit 104 determines for each frame. Shaping is performed (step S105).
The long section feature amount calculation unit 108 calculates the feature amount calculation unit 102 in the process in step S102 for the shaped speech section and non-speech section obtained by the speech / non-speech section shaping unit 107 in the process in step S105. The feature amount for each frame is statistically processed to calculate the long interval feature amount (step S106). The long section feature amount is, for example, one or a combination of two or more of spectrum power, SNR, zero crossing, likelihood, and the like.
As an example of the statistical processing performed by the long section feature amount calculation unit 108, there is a method of calculating an average value of feature amounts for each frame in a shaped speech section. In addition to the method for calculating the average value, the long interval feature value calculation unit 108 uses a mode value method, a median value method, and the feature value for each frame is rearranged according to size, so that the feature value is large. A method using values in the vicinity of the upper 40% in order may be used. Note that the value of 40% is merely an example, and it may be a ratio arbitrarily determined by the user or the like. When the user or the like determines 50%, this corresponds to the method using the median.
The threshold update unit 109 calculates the non-speech probability α for the shaped speech segment using the long segment feature value calculated by the long segment feature value calculation unit 108 in the process of step S106 (step S107). Here, the non-speech probability is a probability that the input signal in the section is an input signal based on non-speech such as noise. Therefore, 1-α corresponds to the probability that the section is speech. α is calculated using the following equation.
<F> = Σωi × <fi> (1)
α = G [<F>] (2)
Here, <fi> is a long-section feature value obtained by performing the above-described statistical processing on the feature value fi for each frame. ωi is a weight applied to the long section feature <fi>. Then, in Formula (1), <F>, which is calculated by adding a plurality of types (for example, spectrum power, SNR, zero-crossing, likelihood, etc.) of long-section feature quantities <fi> and multiplying them by weights ωi, is integrated. Long section feature. G is a function having an integrated long section feature quantity (also simply referred to as a long section feature quantity) <F> as a variable. FIG. 3 is an explanatory diagram showing the function G of the present embodiment. The horizontal axis in FIG. 3 is the value of the long interval feature value, and the vertical axis is the non-speech probability α.
In the example illustrated in FIG. 3, the function G is a function with which the non-speech probability α is 1 when the long-section feature amount is 0. That is, G is a function whose non-speech probability is 100% when the long section feature amount is zero. G is a function for which the non-speech probability α is 0 when the long-section feature value is τ0. That is, G is a function whose non-speech probability is 0% when the long-section feature value is τ0. G is a function whose non-speech probability α is 1 when the long-section feature value is τmax. That is, G is a function whose non-speech probability is 100% when the long section feature amount is τmax.
The function shown in FIG. 3 is an example. The function may be another function as long as the function value increases as the long-section feature value increases from a moderate value or a monotonously decreasing (non-increasing) function. (1) ωi, and τ0 and τmax shown in FIG. If it is difficult to experimentally determine ωi, ωi may be set to an equal value (such as 1) for each long-section feature amount.
Next, the threshold update unit 109 updates the threshold stored in the threshold storage unit 103 using the non-speech probability α calculated in the process of step S107 (step S108). Specifically, the threshold update unit 109 updates the threshold as follows. First, the threshold update unit 109 calculates a threshold candidate θ ′ using the following equation.
θ ′ = α × Fmax + (1−α) × Fmin (3)
Here, Fmax is the maximum value of the feature amount for each frame in the speech section or the non-speech section. Fmin is a minimum value of the feature amount for each frame in the voice section or the non-voice section. α is a speech interval or a non-speech probability of a non-speech interval. Next, the threshold update unit 109 updates the threshold θ using the following equation using the threshold candidate θ ′.
θ ← θ + ε × (θ′−θ) (4)
Here, ε is a step size for adjusting the speed of updating the threshold. That is, the voice detection device according to the present invention can adjust the speed of the threshold update. Therefore, the voice detection device is either in the case where it is desired to greatly change the threshold according to the temporal fluctuation of the background noise or in the case where it is not desired to change the threshold depending on the temporary background noise. Can also respond.
FIG. 4 is an explanatory diagram illustrating an example of changing the threshold value. In the example shown in FIG. 4, the speech / non-speech segment shaping unit 107 causes each segment to be a speech segment or a non-speech segment in order of non-speech segment 1, speech segment 2, non-speech segment 3, speech segment 4, and non-speech segment 5. Has been determined.
The input signal is shown by the upper waveform in FIG. In FIG. 4, the maximum value and the minimum value of the feature amount of each speech segment and each non-speech segment are indicated by up and down arrows near the end of each speech segment and each non-speech segment. The transition of the threshold is indicated by a solid line that moves up and down in parallel with the vertical axis.
Here, when the speech / non-speech segment shaping unit 107 determines a speech segment or a non-speech segment, the threshold update unit 109 calculates a non-speech probability using equations (1) and (2), and formula (3) ) Is used to determine threshold candidates. The determined threshold value is changed using Equation (4).
Further, the threshold value can be updated using the average value of the threshold candidates for the past N utterances as shown in Equation (5) below.
θ ← 1 / N × Σθ ′ (5)
The threshold update unit 109 can also update the threshold only when the non-voice probability is greater than or less than a specific value. In addition, the long segment feature amount calculation unit 108 performs statistical processing on the feature amount for each of one or more speech sections or non-speech sections to calculate a long segment feature amount, and the threshold update unit 109 performs one or more It is also possible to update the threshold value for each voice interval or non-voice interval.
Also, if the initially set threshold is too large or too small, based on the determination result in the sound / non-voice determination unit 104, the voice / non-speech section shaping unit 107, for example, The section may be determined as a voice section or a non-voice section, and the threshold update unit 109 may not update the threshold.
In order to cope with such a case, the threshold value updating unit 109 reduces the threshold value by a certain value or determines a certain value when the speech / non-speech determination unit 104 does not determine a speech period or a non-speech period for a certain time or more. The threshold value may be increased, or the average value of the feature values calculated by the feature value calculation unit 102 during the certain time may be used as a threshold value.
After the threshold value is updated by the threshold update unit 109, the voice detection device performs the processing of steps S101 to S108 for the next voice segment or non-voice segment. In addition, the voice detection device can repeat the processing of steps S101 to S108 again for the same utterance.
FIG. 5 is an explanatory diagram illustrating an example in which the threshold before update is too small. In the example shown in FIG. 5, since the threshold value before update is too small, the voice detection device erroneously determines that the non-voice section 1 is a voice section.
FIG. 6 is an explanatory diagram illustrating an example when the threshold before update is too large. In the example illustrated in FIG. 6, since the threshold value before the update is too large, the voice detection device erroneously determines that the voice section 2 is a non-voice section.
The speech detection apparatus according to the present embodiment increases the non-speech probability α calculated using the long section feature amount even when the pre-update threshold illustrated in FIG. 5 is too small. As shown in FIG. 5, the non-speech probability α in the non-speech section 1 is 0.8. In such a case, when the threshold update unit 109 calculates the expression (3), the threshold candidate θ ′ approaches the maximum value of the long section feature amount of the non-speech section 1, and thus the threshold is updated to a larger value.
Further, the speech detection apparatus according to the present embodiment reduces the non-speech probability α calculated using the long section feature amount even when the pre-update threshold illustrated in FIG. 6 is too large. As shown in FIG. 6, the non-voice probability α of the voice section 2 is 0.2. In such a case, when the threshold update unit 109 calculates the expression (3), the threshold candidate θ ′ approaches the minimum value of the long section feature amount of the speech section 2, and thus the threshold is updated to a smaller value.
Therefore, the speech detection apparatus according to the present embodiment calculates the non-speech probability α in the long section feature quantity calculation unit 108 and sets an appropriate threshold value in the threshold update unit 109, so that the speech / non-speech determination unit 104 in the previous stage. Thus, it is possible to correctly detect a speech section to be recognized and to realize speech detection that is robust against noise that varies depending on the speech environment.
Embodiment 2. FIG.
A second embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram showing a configuration example of the second embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the second embodiment is a voice analysis unit that outputs a feature quantity that represents voice likeness by dividing an input signal for each frame. 110 is included. The voice analysis unit 110 has functions corresponding to the waveform cutout unit 101 and the feature amount calculation unit 102 in the configuration of the voice detection device according to the first embodiment shown in FIG.
The voice analysis unit 110 calculates the second feature amount independently of the feature amount calculation unit 102 in the process of step S102. The second feature amount calculated by the speech analysis unit 110 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
The voice analysis unit 110 calculates the second feature amount by analyzing the input signal in more detail using a parameter different from the parameter used when the feature amount calculation unit 102 calculates the feature amount. The voice analysis unit 110 calculates the second feature value for each of a plurality of utterances, or calculates the second feature value when instructed by the user. The second feature amount may be calculated at a timing different from the time of calculating.
Then, the long-section feature value calculation unit 108 performs the long-section feature value based on the feature value calculated by the feature value calculation unit 102 and the second feature value calculated by the speech analysis unit 110 in the process of step S106. Is calculated. Each feature amount described above may be easily detected depending on the environment in which the input signal is generated, or may be difficult to detect. Therefore, the long-section feature value calculation unit 108 calculates the long-section feature value using the second feature value calculated by the speech analysis unit 110, for example, when the feature value calculation unit 102 cannot calculate the feature value. To do. Further, the speech analysis unit 110 calculates a feature amount different from the feature amount calculated by the feature amount calculation unit 102, and the long-section feature amount calculation unit 108 is a second feature amount that is the feature amount calculated by the speech analysis unit 110. May be used to calculate the long-section feature value.
In the speech detection apparatus according to the present embodiment, since the speech analysis unit 110 can calculate various feature amounts independently of the feature amount calculation unit 102, feature amounts are calculated from various viewpoints, and more robust speech. Detection can be realized.
Embodiment 3. FIG.
A third embodiment of the present invention will be described with reference to the drawings. FIG. 8 is a block diagram showing a configuration example of the third embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the third embodiment outputs a recognition result corresponding to a voice section using a feature amount that seems to be voice. 111 is included.
FIG. 9 is a block diagram illustrating another example of the third embodiment of the voice detection device. In the example illustrated in FIG. 9, the voice recognition unit 111 performs voice recognition on a voice section in which voice is detected.
The voice detection apparatus according to the third embodiment shown in FIGS. 8 and 9 operates as follows. That is, the voice recognition unit 111 appropriately extracts a feature amount from the input voice signal. The speech recognition unit 111 is a word string with time information of the speech section by matching the feature amount of the word stored in the language model / speech recognition dictionary (not shown) with the extracted feature amount. Speech recognition for calculating a recognition result is performed, and a speech recognition result word string with time information is output.
The long segment feature value calculation unit 108 obtains the phoneme duration from the speech recognition result as the long segment feature value. The phoneme duration Ta is calculated by the following equation (6).
Ta = Tb / Nf (6)
Here, Tb is the number of frames for one word in the speech recognition result word string output by the speech recognition unit 111, and Nf is the number of phonemes of the word.
The threshold update unit 109 uses the long-section feature value calculated by the long-section feature value calculation unit 108 in step S106, that is, the phoneme duration length, for each section cut out by the speech / non-speech section shaping unit 107. Non-voice probability α is calculated.
Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function having a long-section feature value as a variable as shown in FIG. FIG. 10 is an explanatory diagram showing a function for obtaining the non-voice probability α in the third embodiment of the present invention. As shown in FIG. 10, the horizontal axis represents the value of the long section feature value, and the vertical axis represents the non-speech probability α. As shown in FIG. 10, the non-speech probability α is 1 when the long-section feature value is τmin or less and when it is τmax or more. In addition, the non-speech probability α is 0 when the long section feature amount is τ0 or more and τ1 or less. In the example shown in FIG. 10, the non-speech probability α monotonously decreases to τ0 when the long-section feature value exceeds τmin, and the non-speech probability α to τmax when the long-section feature value exceeds τ1. Increases monotonically.
It is assumed that τmin, τmax, τ0, and τ1 are appropriate values obtained in advance through experiments.
In the present embodiment, the long segment feature value calculation unit 108 uses phonemes as the unit for calculating the duration length, but other units such as syllables may be used. Further, the function shown in FIG. 10 is merely an example, and the present invention is not limited to this. The function may be defined as an arbitrary function whose function value increases as the distance from the medium value of the long interval feature amount increases.
The effect of this embodiment will be described. When background noise exceeding the threshold value continues for a long time, there is a property that a duration time extremely longer or shorter than a duration time obtained from a normal speech recognition result is likely to occur. More specifically, when the background noise continues for a long time, resulting in an extremely long voice section, the sound in the voice section is background noise, so there is almost no voice. Even if the voice recognition unit 111 recognizes the sound, a short word may be output as a recognition result. That is, appropriate speech recognition is not performed. In addition, when an extremely short sudden noise such as 2 to 3 frames is used as a speech section, it is impossible to utter a word in such a short time, so the sound in the speech section is non-speech. It is judged. Therefore, the sound in the speech section having a duration longer or shorter than the duration obtained from the normal speech recognition result has a property of being non-speech.
Since the speech detection apparatus according to the present embodiment calculates the non-speech probability α using such a property, it is possible to calculate the non-speech probability α with higher accuracy.
Embodiment 4 FIG.
A fourth embodiment of the present invention will be described. In the voice detection device of the fourth embodiment, the voice recognition unit 111 of the voice detection device of the third embodiment shown in FIGS. 8 and 9 performs continuous phoneme recognition instead of voice recognition. That is, the speech recognition unit 111 performs continuous phoneme recognition and outputs a phoneme string with time information. The long section feature amount calculation unit 108 obtains the duration time of each phoneme constituting the phoneme string output by the speech recognition unit 111. The operation of the threshold update unit 109 is the same as the operation in the third embodiment described above.
In this embodiment, as in the third embodiment, the unit for calculating the duration is a phoneme. However, a unit such as a syllable may be used.
In the speech detection device according to the present embodiment, since the speech recognition unit 111 performs continuous phoneme recognition, the duration of phonemes can be acquired more easily than the speech detection device according to the third embodiment that performs speech recognition. Then, the load for calculating the phoneme duration time is reduced, and the processing speed of the entire speech detection apparatus is increased. In the case of phoneme recognition, the speech recognition unit 111 performs recognition in units of phonemes, so that the phoneme length of the utterance section can be easily acquired. The prime number must be derived and divided by the time per utterance to calculate the duration of the phoneme. Therefore, it is important for the reduction of processing load that the voice detection device easily acquires the duration of phonemes.
Embodiment 5. FIG.
A fifth embodiment of the present invention will be described. The speech detection apparatus according to the fifth embodiment has the same configuration as that of the speech detection apparatus according to the third embodiment illustrated in FIG. 8 or FIG. 9, but the long interval feature value calculation unit 108 determines the reliability of the speech recognition result. Is used to calculate long-section feature values.
Specifically, for example, the voice recognition unit 111 appropriately extracts a feature amount from the input voice signal. The speech recognition unit 111 then matches the feature quantities of the words stored in the language model / speech recognition dictionary with the extracted feature quantities, and outputs a plurality of candidate speech recognition result scores. The score is, for example, a numerical value representing the degree of matching between the feature amount of the word stored in the language model / speech recognition dictionary and the extracted feature amount. The voice recognition unit 111 outputs a plurality of scores having a high degree.
Then, the long interval feature value calculation unit 108 calculates the difference between the score of the first candidate and the score of the second candidate in descending order of the scores of the speech recognition result output by the speech recognition unit 111. calculate. When the score difference is small, the reliability of the speech recognition result is considered low. When the score difference is large, the reliability of the speech recognition result is considered high. Note that the scale corresponding to the reliability of the speech recognition result may be another scale instead of the difference in scores.
The threshold update unit 109 uses the long-section feature amount calculated by the long-section feature amount calculation unit 108, that is, the reliability, to calculate the non-speech probability α for the speech section cut out by the speech / non-speech section shaping unit 107. calculate. Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function having a long-section feature value as a variable as shown in FIG.
FIG. 11 is an explanatory diagram showing a function for obtaining the non-speech probability α in the fifth embodiment of the present invention. As shown in FIG. 11, the horizontal axis represents the value of the long segment feature value, and the vertical axis represents the non-speech probability α. As shown in FIG. 11, the non-speech probability α is 0 when the long-section feature amount is τ0 or more. In addition, when the long-section feature value is 0 to less than τ0, the non-speech probability α monotonously decreases from 1 to 0. It is assumed that τ0 is an appropriate value obtained in advance through experiments. Moreover, the function shown in FIG. 11 is an example, and may be an arbitrary monotone decreasing function or a monotonic non-increasing function.
Since the speech detection apparatus according to the present embodiment operates to calculate the non-speech probability α using the property that a section with low reliability of the speech recognition result is likely to be a non-speech section, more accuracy is achieved. It is possible to calculate a high non-voice probability.
Embodiment 6. FIG.
A sixth embodiment of the present invention will be described with reference to the drawings. FIG. 12 is a block diagram showing a configuration example of the sixth embodiment of the speech detection device according to the present invention.
The voice detection device of the sixth embodiment is a combination of the first to fifth embodiments. The long section feature quantity calculation unit 108 calculates a long section feature quantity by combining one or more methods of the first to fifth embodiments. The speech detection apparatus calculates the non-speech probability α using the non-speech probability calculation methods of the first to fifth embodiments, and sets the product of each non-speech probability α as a non-speech probability. Further, the voice detection device may calculate the product after weighting each non-voice probability α and use it as the non-voice probability. Further, the speech detection apparatus may use the average value of each non-speech probability α or an appropriate weighted average value as the non-speech probability.
The speech detection apparatus according to the present embodiment can calculate a more accurate non-speech probability by combining the first to fifth embodiments.
Embodiment 7. FIG.
The seventh embodiment of the present invention is a voice recognition device including the voice detection devices of the first to fifth embodiments. The speech recognition apparatus performs a known speech recognition process on a section determined to be a speech section by the speech detection apparatuses of the first to fifth embodiments, and outputs a speech recognition result.
Since the speech recognition apparatus according to the present embodiment performs speech recognition processing on a segment determined to be a speech segment with high accuracy, execution of useless processing that performs speech recognition processing on a non-speech segment can be prevented. In addition, it is possible to perform speech recognition processing with high accuracy on the speech section, and prevent the speech recognition processing from being leaked.
Next, the outline of the present invention will be described. FIG. 13 is a block diagram showing an outline of the present invention. The voice detection device 300 according to the present invention includes a feature amount calculation unit 301 (corresponding to the feature amount calculation unit 102 shown in FIG. 1), a voice / non-voice determination unit 302 (the voice / non-voice determination unit 104 and the voice / non-voice determination unit 104 shown in FIG. 1). (Corresponding to the non-speech segment shaping unit 107), long segment feature value calculating unit 303 (corresponding to the long segment feature value calculating unit 108 shown in FIG. 1), and threshold updating unit 304 (corresponding to the threshold updating unit 109 shown in FIG. 1). including.
The feature amount calculation unit 301 calculates the feature amount of the input signal for each frame, which is an input signal for each predetermined unit time. The speech / non-speech determination unit 302 compares the feature amount calculated by the feature amount calculation unit 301 with a speech detection threshold value for determining whether or not the input signal is a signal based on speech. It is determined whether it is a speech section in which a signal based on speech is input or a non-speech section in which a signal based on non-speech is input over a plurality of frames.
The long section feature amount calculation unit 303 is a feature amount of the speech section or the non-speech section based on the statistical values of the feature amounts of a plurality of frames constituting the speech section or the non-speech section calculated by the feature amount calculation unit 301. The long interval feature value is calculated.
The threshold update unit 304 uses the long-section feature value calculated by the long-section feature value calculation unit 303 to use the long-section feature value as a probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input. And the voice detection threshold is updated based on the calculated non-voice probability.
The voice detection device 300 having the above configuration updates the voice detection threshold even when the head of the input signal is a signal based on background noise, and the feature amount exceeds the voice detection threshold, so that a high-precision voice can be obtained. Section detection can be performed.
Further, in each of the above embodiments, voice detection devices as shown in the following (1) to (11) are also disclosed.
(1) A voice in which the long section feature quantity calculation unit 303 performs statistical processing on one or more voice sections determined by the voice / non-voice judgment unit 302 or a feature quantity over a non-speech section to calculate a long section feature quantity Detection device.
(2) When the long section feature amount calculation unit 303 calculates the long section feature amount, the predetermined value is counted from the average value of the feature amount for each frame, the mode value, the median value, and the results arranged in descending order. A voice detection device using at least one of the methods using a value at a position reaching the ratio of.
(3) The voice detection device in which the threshold update unit 304 updates the voice detection threshold by using the maximum value and the minimum value of the feature amount in the voice section or the non-voice section and the non-voice probability.
(4) The threshold detection unit 304 obtains a value that internally divides the maximum value and the minimum value of the feature amount using the non-speech probability and updates the speech detection threshold so that the value is close to the internally divided value. apparatus.
(5) A long-section feature value is provided that includes a second feature value calculation unit (corresponding to the speech analysis unit 110 shown in FIG. 7) that calculates a second feature value that is different from the feature value calculated by the feature value calculation unit 304. A speech detection apparatus in which the calculation unit 303 calculates a long-section feature value using the feature value calculated by the feature value calculation unit 304 and the second feature value calculated by the second feature value calculation unit.
(6) A second feature quantity calculation unit (corresponding to the voice recognition unit 111 shown in FIG. 8) performs voice recognition on the input signal and outputs a voice recognition result. A speech detection device that calculates long-section feature values based on results.
(7) The speech detection apparatus in which the long section feature value calculation unit 303 calculates the reliability of the speech recognition result as the long section feature value.
(8) The voice recognition result based on the score, which is a value indicating a degree by which the second feature quantity calculation unit matches the feature quantity of the word stored in advance in the storage unit and the feature quantity of the input signal to be voice-recognized. The speech detection device that outputs the scores of the plurality of candidates, and the long interval feature amount calculation unit calculates the difference between the score of the first candidate and the score of the second candidate in the descending order as the reliability.
(9) The second feature amount calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long interval feature amount calculation unit 303 determines from the speech recognition result with time information. A voice detection device for calculating long-section feature values.
(10) The long section feature value calculation unit 303 is a voice detection device that calculates a duration length from time information as a long section feature value.
(11) The speech detection apparatus in which the long segment feature amount calculation unit 303 calculates the duration time in units of phonemes or syllables.
While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
This application claims the priority on the basis of Japanese application Japanese Patent Application No. 2009-291976 for which it applied on December 24, 2009, and takes in those the indications of all here.
(Additional remark 1) The feature-value calculation part which calculates the feature-value of the input signal for every flame | frame which is an input signal for every predetermined unit time, The said feature-value, and whether the said input signal is a signal based on an audio | voice. Compared to the voice detection threshold for determination, whether the signal is based on speech over a plurality of frames or whether it is a non-speech segment where signals based on non-speech are input over a plurality of frames A speech / non-speech determination unit and a feature value statistical value of a plurality of frames constituting the speech segment or the non-speech segment calculated by the feature amount calculation unit. A long-section feature quantity calculation unit that calculates a long-section feature quantity that is a feature quantity of the section, and the voice section and the non-speech section are based on non-speech using the long-section feature quantity. A speech detection apparatus comprising: a threshold update unit that calculates a non-speech probability that is a probability of being a section in which a speech is input, and updates the speech detection threshold based on the calculated non-speech probability .
(Additional remark 2) The long section feature-value calculation part calculates a long-section feature-value by performing a statistical process to the feature-value over the 1 or more audio | voice area determined by the audio | voice / non-voice determination part, or a non-voice section. The voice detection device according to 1.
(Additional remark 3) When calculating a long section feature-value, a long-section feature-value calculation part counts from the result arranged in the order of the average value of the feature-value for every frame, a mode value, a median, and a big order. The voice detection device according to supplementary note 1 or supplementary note 2, which uses at least one of methods using a value at a position that reaches the ratio of.
(Additional remark 4) A threshold value update part is described in any one of Additional remark 1 to Additional remark 3 which updates an audio | voice detection threshold value using the maximum value and minimum value of a feature-value in a audio | voice area or a non-voice area, and a non-voice probability. Voice detection device.
(Additional remark 5) A threshold value update part calculates | requires the value which divides the maximum value and minimum value of the said feature-value using a non-speech probability, and updates a speech detection threshold value so that it may become a value close | similar to the said internally divided value. The voice detection device according to attachment 4.
(Additional remark 6) It has the 2nd feature-value calculation part which calculates the 2nd feature-value different from the feature-value which the feature-value calculation part calculates, The long section feature-value calculation part calculated by the said feature-value calculation part The speech detection device according to any one of supplementary notes 1 to 5, wherein a long section feature amount is calculated using the feature amount and the second feature amount calculated by the second feature amount calculation unit.
(Supplementary Note 7) The second feature quantity calculator performs speech recognition on the input signal and outputs a speech recognition result, and the long section feature quantity calculator calculates the long section feature quantity based on the speech recognition result. The voice detection device according to appendix 6.
(Supplementary note 8) The speech detection device according to supplementary note 7, wherein the long section feature amount calculation unit calculates the reliability of the speech recognition result as the long section feature amount.
(Supplementary Note 9) The second feature amount calculation unit performs speech recognition based on a score that is a value indicating a degree of matching between the feature amount of the word stored in the storage unit in advance and the feature amount of the input signal to be recognized. The score of a plurality of candidate results is output, and the long interval feature value calculation unit calculates the difference between the score of the first candidate and the score of the second candidate in descending order of the degree as the reliability level 8 The voice detection device according to 1.
(Additional remark 10) The 2nd feature-value calculation part performs speech recognition to an input signal, and outputs the speech recognition result with time information, and a long section feature-value calculation part has the said voice recognition result with the said time information. The voice detection device according to appendix 6, wherein a long section feature amount is calculated from
(Supplementary note 11) The voice detection device according to supplementary note 10, wherein the long section feature amount calculation unit calculates a duration length from time information as the long section feature amount.
(Supplementary note 12) The speech detection device according to supplementary note 11, wherein the long section feature amount calculation unit calculates a duration length in units of phonemes or syllables.
(Supplementary note 13) The speech detection device according to any one of Supplementary note 1 to Supplementary note 12, including speech recognition performed on a speech section output by the speech detection device, and a speech recognition result being output. Voice recognition device.
(Additional remark 14) The audio | voice for calculating the feature-value of the input signal for every flame | frame which is an input signal within predetermined unit time, and determining whether the said feature-value and the said input signal are signals based on an audio | voice The detection threshold value is compared, and it is determined whether the signal is a voice segment in which a signal based on speech is input over a plurality of frames or a non-speech segment in which a signal based on non-speech is input over a plurality of frames, Based on the statistical values of the feature quantities of a plurality of frames constituting the speech section or the non-speech section, a long section feature quantity that is a feature quantity of the speech section or the non-speech section is calculated, and the long section feature quantity is calculated. And calculating a non-speech probability that is a probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and based on the calculated non-speech probability. Te, voice detection method and updates the voice detection threshold value.
(Additional remark 15) The audio | voice detection method of Additional remark 14 which performs a statistical process to the feature-value over one or more audio | voice area or a non-voice area, and calculates a long-section feature-value.

DESCRIPTION OF SYMBOLS 101 Waveform cut-out part 102,301 Feature-value calculation part 103 Threshold storage part 104,302 Voice / non-voice determination part 105 Determination result holding part 106 Shaping rule storage part 107 Voice / non-voice section shaping part 108,303 Long-section feature

quantity Calculation unit

109, 304 Threshold update unit 110 Speech analysis unit 111 Speech recognition unit 300 Speech detection device

Claims

A feature amount calculating means for calculating a feature amount of an input signal for each frame, which is an input signal for each unit time;
The feature amount is compared with a threshold value, and it is determined whether it is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames. Voice / non-voice judgment means to perform,
Based on a statistical value of feature quantities of a plurality of frames constituting the speech section or the non-speech section calculated by the feature quantity calculation unit, a long section feature quantity that is a feature quantity of the speech section or the non-speech section is calculated. Long-section feature value calculating means for calculating;
Using the long section feature amount, the speech section and the non-speech section calculate a non-speech probability that is a section in which a signal based on non-speech is input, and based on the calculated non-speech probability, Threshold updating means for updating the threshold;
A voice detection device.
2. The long section feature quantity calculating unit performs statistical processing on the plurality of voice sections determined by the voice / non-speech determination unit, or the feature quantity over the non-speech section, and calculates the long section feature quantity. The voice detection device according to 1.
The long section feature quantity calculating means calculates the long section feature quantity by counting from the average value, mode value, median value, and results arranged in descending order of the feature quantity for each frame. The voice detection device according to claim 1, wherein at least one of the methods using a value at a position that reaches the ratio of at least one of the methods is used.
The threshold value updating unit updates the voice detection threshold value using the maximum value and the minimum value of the feature amount and the non-voice probability in the voice section or the non-voice section. The voice detection device according to claim 1.
5. The threshold update unit obtains a value that internally divides the maximum value and the minimum value of the feature amount using the non-speech probability, and updates the threshold so that the value is close to the internally divided value. The voice detection device according to 1.
A second feature amount calculating means for calculating a second feature amount different from the feature amount calculated by the feature amount calculating means;
The long section feature quantity calculating means calculates the long section feature quantity using the feature quantity calculated by the feature quantity calculating means and the second feature quantity calculated by the second feature quantity calculating means. The voice detection device according to any one of claims 1 to 5.
The second feature amount calculating means performs voice recognition on the input signal and outputs a voice recognition result,
The speech detection apparatus according to claim 6, wherein the long section feature amount calculating unit calculates the long section feature amount based on the speech recognition result.
The speech detection apparatus according to claim 7, wherein the long section feature amount calculation unit calculates a reliability of the speech recognition result as the long section feature amount.
Calculate the feature value of the input signal for each frame that is the input signal within the unit time,
The feature amount is compared with a threshold value, and it is determined whether it is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames. And
Based on the statistical values of the feature quantities of a plurality of frames constituting the speech section or the non-speech section, a long section feature quantity that is a feature quantity of the speech section or the non-speech section is calculated,
Using the long section feature amount, the speech section and the non-speech section calculate a non-speech probability that is a section in which a signal based on non-speech is input, and based on the calculated non-speech probability, A voice detection method for updating the threshold.
On the computer,
A feature amount calculation process for calculating a feature amount of an input signal for each frame that is an input signal for each unit time;
The feature amount is compared with a threshold value, and it is determined whether it is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames. Voice / non-voice judgment processing,
Based on a statistical value of feature quantities of a plurality of frames constituting the speech section or the non-speech section calculated by the feature quantity computation process, a long section feature quantity that is a feature quantity of the speech section or the non-speech section is calculated. Long-section feature value calculation processing to be calculated;
Using the long section feature amount, the speech section and the non-speech section calculate a non-speech probability that is a section in which a signal based on non-speech is input, and based on the calculated non-speech probability, A threshold update process for updating the threshold;
A program recording medium for storing a voice detection program for executing the program.