WO2011077924A1 - Voice detection device, voice detection method, and voice detection program - Google Patents

Voice detection device, voice detection method, and voice detection program

Info

Publication number
WO2011077924A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
section
voice
feature
value
Prior art date
Application number
PCT/JP2010/071620
Other languages
English (en)
Japanese (ja)
Inventor
田中 大介
隆行 荒川
健 花沢
長田 誠也
岡部 浩司
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to JP2011547442A (granted as patent JP5621786B2)
Publication of WO2011077924A1


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals

Definitions

  • the present invention relates to a voice detection device, a voice detection method, and a voice detection program for detecting a voice section.
  • FIG. 14 is a block diagram illustrating a configuration example of a general voice detection device.
  • Patent Document 1 discloses an invention corresponding to the voice detection device illustrated in FIG.
  • The general speech detection apparatus shown in FIG. 14 includes a waveform cutout unit 101 that cuts out and acquires the input signal in frame units, a feature amount calculation unit 102 that calculates, for each frame, a feature amount used for speech detection from the cut-out input signal, a voice / non-voice determination unit 104 that compares, for each frame, the calculated feature value with the threshold value stored in the threshold value storage unit 103 to determine whether the input signal is a signal based on speech or a signal based on non-speech, a determination result holding unit 105 that holds the per-frame determination results over a plurality of frames, and a speech / non-speech section shaping unit 107 that shapes the held results according to the section shaping rules stored in the section shaping rule storage unit 106.
  • “Cutting out and acquiring an input signal in units of frames” means extracting the input signal from a certain time until a unit time elapses. A frame is each of the periods obtained by dividing the time during which the input signal is input into unit times. The section shaping rule is a rule for determining speech sections and non-speech sections; for example, when it is determined that an input signal based on speech, or one based on non-speech, is input over a plurality of consecutive frames, the rule groups these frames into one speech section or one non-speech section.
  • Patent Document 1 discloses, as an example of the feature amount calculated by the feature amount calculation unit 102, a value obtained by smoothing the fluctuation of the spectrum power and further smoothing that smoothed fluctuation.
  • Non-Patent Document 1 discloses, in Section 4.3.3, SNR (Signal to Noise Ratio) values as examples of feature values and, in Section 4.3.5, the averaging of SNR values; Section 3.1.4 discloses the number of zero crossings as an example of a feature quantity. Non-Patent Document 3 discloses, as an example of a feature quantity, the likelihood ratio computed using a speech GMM (Gaussian Mixture Model) and a silence GMM.
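As a hedged illustration (not part of the patent text), the two simplest of these per-frame feature quantities can be sketched as follows; the SNR sketch assumes an externally supplied noise-power estimate:

```python
import math

def zero_crossings(frame):
    """Number of sign changes between adjacent samples in one frame."""
    return sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )

def snr_db(frame, noise_power):
    """Per-frame SNR in dB against an assumed noise-power estimate."""
    signal_power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(signal_power / noise_power)

toy_frame = [1, -1, 1, -1]
print(zero_crossings(toy_frame))  # -> 3
```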
  • The voice / non-voice determination unit 104 compares a threshold value determined in advance by experiment with the feature value for each frame. If the feature value is equal to or higher than the threshold value, the voice / non-voice determination unit 104 determines that the input signal is based on voice; otherwise, it determines that the input signal is based on non-voice.
  • Patent Document 2 discloses a method of updating a threshold value for each utterance.
  • FIG. 15 is a block diagram showing a voice detection device that changes a voice detection threshold.
  • Patent Document 2 discloses an invention corresponding to the voice detection device illustrated in FIG.
  • The voice detection threshold setting unit 18 calculates the spectrum-power threshold for determining whether a section is a voice section, based on the maximum spectral power of the voice section and the average spectral power of the background-noise section that is not a voice section, and updates the threshold to the calculated value.
  • JP 2006-209069 A paragraphs 0018 to 0059, FIG. 1
  • Japanese Patent Laid-Open No. 7-92989 paragraphs 0008 to 0014, FIG. 1
  • The speech detection apparatus shown in FIG. 14 determines its threshold in advance from the average noise power of a plurality of frames in which only noise is input and the maximum spectral power of a section composed of frames in which a speech signal is input. Therefore, it cannot cope with an environment in which the noise and the maximum spectral power constantly change.
  • The speech detection apparatus shown in FIG. 15 needs to perform speech detection in order to determine the threshold and to obtain the spectral power of the background noise. However, if the detection accuracy is low, the noise may not be estimated correctly. For example, when a speech section continues from the beginning of the input signal, or when background noise exceeding the threshold value continues and is therefore determined to be a speech section, it becomes difficult for the speech detection device to obtain the spectral power of the background noise.
  • An object of the present invention is to provide a voice detection device, a voice detection method, and a voice detection program that can detect a voice section even when noise changes or when noise or a voice section continues from the beginning of the input signal.
  • The speech detection apparatus according to the present invention includes: a feature amount calculation unit that calculates a feature amount of the input signal for each frame, a frame being the input signal per unit time; a speech / non-speech determination unit that compares the feature amount with a threshold value to determine whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; a long-section feature quantity calculation unit that calculates a long-section feature quantity, which is a feature quantity of the speech section or the non-speech section, based on a statistical value of the feature quantities, calculated by the feature amount calculation unit, of the plurality of frames constituting the section; and a threshold updating unit that uses the long-section feature quantity to calculate the non-speech probability, which is the probability that the speech section or the non-speech section is a section in which a signal based on non-speech is input, and updates the voice detection threshold based on the calculated non-speech probability.
  • The voice detection method according to the present invention calculates a feature quantity of the input signal for each frame, a frame being the input signal per unit time; compares the feature quantity with a threshold value to determine whether a section is a speech section, in which a voice-based signal is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; calculates, based on a statistical value of the feature quantities of the plurality of frames constituting the speech section or the non-speech section, a long-section feature value, which is a feature value of that section; calculates, using the long-section feature value, the non-speech probability, which is the probability that the speech section or the non-speech section is a section in which a signal based on non-speech is input; and updates the speech detection threshold based on the calculated non-speech probability.
  • The voice detection program stored in the program recording medium according to the present invention causes a computer to execute: a feature amount calculation process for calculating a feature amount of the input signal for each frame, a frame being the input signal per unit time; a voice / non-voice determination process for comparing the feature amount with a threshold value to determine whether a section is a voice section, in which a voice-based signal is input over a plurality of frames, or a non-voice section, in which a non-voice-based signal is input over a plurality of frames; a long-section feature value calculation process for calculating a long-section feature value, which is a feature quantity of the speech section or the non-speech section, based on a statistical value of the feature quantities, calculated by the feature amount calculation process, of the plurality of frames constituting the section; and a threshold update process for calculating, using the long-section feature value, the non-speech probability, which is the probability that the section is one in which a signal based on non-speech is input, and updating the voice detection threshold based on the calculated non-speech probability.
  • The present invention thus provides a voice detection device, a voice detection method, and a voice detection program capable of detecting voice sections with high accuracy in a noisy environment, even when background noise exceeding the threshold value enters the head of the input.
  • FIG. 1 is a block diagram showing a configuration example of a first embodiment of a voice detection device according to the present invention.
  • The speech detection apparatus includes a waveform cutout unit 101, a feature amount calculation unit 102, a threshold storage unit 103, a speech / non-speech determination unit 104, a determination result holding unit 105, a section shaping rule storage unit 106, a speech / non-speech section shaping unit 107, a long-section feature value calculation unit 108, and a threshold update unit 109.
  • the waveform cutout unit 101 cuts out and acquires an input signal in units of frames.
  • the waveform cutout unit 101 cuts out and acquires input signals for each predetermined unit time, for example.
  • the feature amount calculation unit 102 calculates a feature amount used for speech detection from the input signal for each frame cut out by the waveform cutout unit 101.
  • the threshold storage unit 103 stores a threshold for determining whether the input signal is an input signal based on voice or an input signal based on non-voice.
  • the voice / non-voice determination unit 104 compares the feature amount calculated by the feature amount calculation unit 102 with the threshold value stored in the threshold value storage unit 103 for each frame, and the input signal of the frame is an input signal based on the voice. It is determined whether there is an input signal based on non-voice.
  • the determination result holding unit 105 holds the determination result for each frame by the voice / non-voice determination unit 104 over a plurality of frames.
  • the section shaping rule storage unit 106 stores section shaping rules.
  • The speech / non-speech section shaping unit 107 shapes the determination results of the plurality of frames held in the determination result holding unit 105 based on the section shaping rules stored in the section shaping rule storage unit 106, and determines speech sections and non-speech sections.
  • The speech / non-speech section shaping unit 107 determines, for example, that a plurality of consecutive speech frames constitute one speech section, and that a plurality of consecutive non-speech frames constitute one non-speech section. Alternatively, the speech / non-speech section shaping unit 107 may determine that a plurality of consecutive frames are one speech section when the ratio of speech frames among them is larger than a predetermined ratio, and one non-speech section when the ratio of non-speech frames is larger than a predetermined ratio.
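The simplest version of the shaping rule above can be sketched as follows; this is an illustrative simplification, assuming per-frame speech/non-speech labels are already available:

```python
from itertools import groupby

def shape_sections(frame_labels):
    """Merge runs of identical per-frame labels into sections.

    frame_labels: per-frame decisions, e.g. 's' (speech) / 'n' (non-speech).
    Returns (label, start_frame, end_frame_exclusive) tuples.
    """
    sections, pos = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        sections.append((label, pos, pos + n))
        pos += n
    return sections

labels = ['n', 'n', 's', 's', 's', 'n']
print(shape_sections(labels))
# -> [('n', 0, 2), ('s', 2, 5), ('n', 5, 6)]
```

The ratio-based variant mentioned in the text would additionally relabel a run according to the majority of its frames; that refinement is omitted here for brevity.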
  • The long-section feature amount calculation unit 108 performs statistical processing on the per-frame feature amounts calculated by the feature amount calculation unit 102 for the speech sections and non-speech sections determined by the speech / non-speech section shaping unit 107, and calculates the long-section feature amount.
  • the threshold update unit 109 calculates the non-speech probability for the speech segment and the non-speech segment determined by the speech / non-speech segment shaping unit 107, using the long segment feature amount calculated by the long segment feature amount calculator 108, The threshold value stored in the threshold value storage unit 103 is changed.
  • the non-speech probability is a probability that the input signal in the section is an input signal based on non-speech, as will be described later.
  • FIG. 2 is a flowchart showing the operation of the voice detection device according to the first exemplary embodiment of the present invention.
  • the waveform cutout unit 101 cuts out collected time-series input sound data input from a microphone (not shown) for each frame of unit time (step S101). For example, when the input sound data is in a 16-bit Linear-PCM (Pulse Code Modulation) format with a sampling frequency of 8000 Hz, waveform data of 8000 points of input sound data per second is stored in each frame.
  • the waveform cutout unit 101 sequentially cuts out the waveform data at a frame width of 200 points (25 milliseconds) and a frame shift of 80 points (10 milliseconds) according to a time series.
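The cut-out described above (8000 Hz sampling, 200-point frame width, 80-point frame shift) can be sketched as follows; this is an illustrative sketch, not the patented implementation:

```python
def cut_frames(samples, width=200, shift=80):
    """Slice a waveform into overlapping frames in time order.

    With 8000 Hz sampling, width=200 is 25 ms and shift=80 is 10 ms.
    """
    return [
        samples[start:start + width]
        for start in range(0, len(samples) - width + 1, shift)
    ]

one_second = list(range(8000))      # 8000 samples = 1 s at 8 kHz
frames = cut_frames(one_second)
print(len(frames), len(frames[0]))  # -> 98 200
```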
  • the feature amount calculation unit 102 calculates a feature amount from the waveform cut out for each frame (step S102).
  • the feature amount calculated by the feature amount calculation unit 102 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
  • The voice / non-voice determination unit 104 compares the threshold value stored in the threshold value storage unit 103 with the feature amount calculated by the feature amount calculation unit 102; if the feature amount exceeds the threshold value, it determines that the frame is a speech frame, and otherwise it determines that the frame is a non-speech frame (step S103). Whether a frame whose feature value is exactly equal to the threshold value is treated as a speech frame or a non-speech frame may be determined in advance, and the voice / non-voice determination unit 104 then follows that determination.
  • The determination result holding unit 105 holds the results determined by the voice / non-voice determination unit 104 in the process of step S103 for a plurality of frames (step S104).
  • The speech / non-speech section shaping unit 107 shapes the determination results so as to suppress short-duration speech sections or short-duration non-speech sections that occur because the speech / non-speech determination unit 104 decides frame by frame (step S105).
  • The long-section feature amount calculation unit 108 statistically processes the per-frame feature amounts calculated by the feature amount calculation unit 102 in step S102 for the shaped speech sections and non-speech sections obtained by the speech / non-speech section shaping unit 107 in step S105, and calculates the long-section feature amount (step S106).
  • the long section feature amount is, for example, one or a combination of two or more of spectrum power, SNR, zero crossing, likelihood, and the like.
  • As the calculation in the long-section feature amount calculation unit 108, there is, for example, a method of calculating the average value of the per-frame feature amounts in a shaped speech section. Alternatively, a method using the mode value or the median value may be used, or the per-frame feature values may be sorted by magnitude and the value near the top 40% in descending order may be used. Note that the value of 40% is merely an example, and it may be a ratio arbitrarily determined by the user or the like.
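The statistical options listed above can be sketched as follows; the method names and the 40% default come from the text, while everything else (function name, interface) is an illustrative assumption:

```python
import statistics

def long_section_feature(values, method="mean", top_ratio=0.4):
    """Collapse per-frame feature values of one section into one value."""
    if method == "mean":
        return statistics.mean(values)
    if method == "median":
        return statistics.median(values)
    if method == "mode":
        return statistics.mode(values)
    if method == "top_ratio":
        # Sort descending and take the value near the top `top_ratio`.
        ranked = sorted(values, reverse=True)
        return ranked[int(len(ranked) * top_ratio)]
    raise ValueError(f"unknown method: {method}")

frame_features = [5.0, 1.0, 3.0, 2.0, 4.0]
print(long_section_feature(frame_features))               # -> 3.0
print(long_section_feature(frame_features, "top_ratio"))  # -> 3.0
```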
  • The threshold update unit 109 calculates the non-speech probability α for the shaped section using the long-section feature value calculated by the long-section feature value calculation unit 108 in step S106 (step S107).
  • The non-speech probability α is the probability that the input signal in the section is an input signal based on non-speech such as noise. Therefore, 1 − α corresponds to the probability that the section is speech.
  • The non-speech probability α is calculated using the following equations.
  • <F> = Σ_i ω_i <f_i> (1)
  • α = G[<F>] (2)
  • <f_i> is the long-section feature value obtained by applying the above-described statistical processing to the per-frame feature value f_i.
  • ω_i is the weight applied to the long-section feature <f_i>.
  • <F>, which is calculated by multiplying the plural types of long-section feature quantities <f_i> (for example, spectrum power, SNR, zero crossings, likelihood, and so on) by the weights ω_i and summing them, is called the integrated long-section feature. G is a function having the integrated long-section feature quantity (also simply referred to as the long-section feature quantity) <F> as its variable.
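Equations (1) and (2) can be sketched in code as follows; the weights and the U-shaped G used here are illustrative assumptions, not values from the patent:

```python
def integrated_feature(long_features, weights):
    """<F> = sum_i w_i * <f_i>   (equation (1))."""
    return sum(w * f for w, f in zip(weights, long_features))

def G(F, eps0=10.0, eps_max=100.0):
    """Toy U-shaped G: alpha = 1 at <F> = 0, 0 at eps0, 1 at eps_max."""
    if F <= 0:
        return 1.0
    if F <= eps0:
        return 1.0 - F / eps0                 # falls from 1 to 0
    if F <= eps_max:
        return (F - eps0) / (eps_max - eps0)  # rises back to 1
    return 1.0

F = integrated_feature([4.0, 2.0], [0.5, 1.0])  # <F> = 0.5*4 + 1*2 = 4.0
print(F, G(F))  # -> 4.0 0.6
```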
  • FIG. 3 is an explanatory diagram showing the function G of the present embodiment. The horizontal axis in FIG. 3 is the value of the long-section feature value, and the vertical axis is the non-speech probability α.
  • The function G is a function for which the non-speech probability α is 1 when the long-section feature amount is 0; that is, the non-speech probability is 100% at zero. G gives α = 0 (non-speech probability 0%) when the long-section feature value is ε0, and α = 1 again (non-speech probability 100%) when the long-section feature value is εmax.
  • the function shown in FIG. 3 is an example.
  • The function may be any other function that is monotonically decreasing (non-increasing) up to a moderate value of the long-section feature and whose value increases as the long-section feature value increases beyond that moderate value.
  • The threshold update unit 109 updates the threshold stored in the threshold storage unit 103 using the non-speech probability α calculated in step S107 (step S108). Specifically, the threshold update unit 109 first calculates a threshold candidate θ′ using the following equation.
  • θ′ = α Fmax + (1 − α) Fmin (3)
  • Fmax is the maximum value of the per-frame feature amount in the speech section or the non-speech section.
  • Fmin is the minimum value of the per-frame feature amount in the speech section or the non-speech section.
  • α is the non-speech probability of the speech section or the non-speech section.
  • Next, the threshold update unit 109 updates the threshold θ from the threshold candidate θ′ using the following equation.
  • θ ← θ + β (θ′ − θ) (4)
  • β is a step size for adjusting the speed of updating the threshold. That is, the voice detection device according to the present invention can adjust the speed of the threshold update.
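Equations (3) and (4) can be sketched in code as follows; the numeric values are made up for illustration:

```python
def update_threshold(theta, alpha, f_max, f_min, beta=0.1):
    """One threshold-update step.

    theta: current threshold; alpha: non-speech probability of the
    section; f_max/f_min: per-frame feature extremes of the section;
    beta: step size controlling the update speed.
    """
    candidate = alpha * f_max + (1.0 - alpha) * f_min  # equation (3)
    return theta + beta * (candidate - theta)          # equation (4)

# A section judged mostly non-speech (alpha = 0.8) pulls the threshold
# candidate toward the section's maximum feature value, so the
# threshold grows: candidate = 8.0, theta moves from 2.0 toward it.
theta = update_threshold(theta=2.0, alpha=0.8, f_max=10.0, f_min=0.0)
print(theta)
```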
  • FIG. 4 is an explanatory diagram illustrating an example of changing the threshold value.
  • In the example of FIG. 4, the speech / non-speech section shaping unit 107 determines each section to be a speech section or a non-speech section in the order non-speech section 1, speech section 2, non-speech section 3, speech section 4, and non-speech section 5. The input signal is shown by the upper waveform in FIG. 4. In FIG. 4, the maximum and minimum values of the feature amount of each speech section and each non-speech section are indicated by up and down arrows near the end of each section, and the transition of the threshold is indicated by a solid line that moves up and down parallel to the vertical axis.
  • For each section, the threshold update unit 109 calculates the non-speech probability using equations (1) and (2), determines the threshold candidate using equation (3), and changes the threshold using equation (4).
  • The threshold value can also be updated using the average value of the threshold candidates for the past N utterances, as shown in Equation (5) below.
  • θ = (1/N) Σ_{n=1}^{N} θ′_n (5)
  • the threshold update unit 109 can also update the threshold only when the non-voice probability is greater than or less than a specific value.
  • the long segment feature amount calculation unit 108 performs statistical processing on the feature amount for each of one or more speech sections or non-speech sections to calculate a long segment feature amount, and the threshold update unit 109 performs one or more It is also possible to update the threshold value for each voice interval or non-voice interval.
  • The threshold update unit 109 may also leave the threshold unchanged for some sections even when the speech / non-speech section shaping unit 107 has determined them to be speech sections or non-speech sections.
  • When the speech / non-speech determination unit 104 has not determined a speech section or a non-speech section for a certain time or more, the threshold update unit 109 may decrease the threshold by a certain value, increase it by a certain value, or use the average of the feature values calculated by the feature value calculation unit 102 during that time as the threshold.
  • FIG. 5 is an explanatory diagram illustrating an example in which the threshold before update is too small. In the example shown in FIG. 5, since the threshold value before update is too small, the voice detection device erroneously determines that the non-voice section 1 is a voice section.
  • FIG. 6 is an explanatory diagram illustrating an example in which the threshold before update is too large. In the example illustrated in FIG. 6, since the threshold value before the update is too large, the voice detection device erroneously determines that speech section 2 is a non-speech section.
  • The speech detection apparatus increases the non-speech probability α calculated using the long-section feature amount when the pre-update threshold is too small, as illustrated in FIG. 5. As shown in FIG. 5, the non-speech probability α of non-speech section 1 is 0.8. In such a case, when the threshold update unit 109 evaluates equation (3), the threshold candidate θ′ approaches the maximum value of the long-section feature amount of non-speech section 1, and the threshold is therefore updated to a larger value.
  • Conversely, the speech detection apparatus reduces the non-speech probability α calculated using the long-section feature amount when the pre-update threshold is too large, as illustrated in FIG. 6. As shown in FIG. 6, the non-speech probability α of speech section 2 is 0.2. When the threshold update unit 109 evaluates equation (3), the threshold candidate θ′ approaches the minimum value of the long-section feature amount of speech section 2, and the threshold is therefore updated to a smaller value. Therefore, the speech detection apparatus according to the present embodiment calculates the non-speech probability α in the long-section feature quantity calculation unit 108 and sets an appropriate threshold in the threshold update unit 109, so that the determination accuracy of the speech / non-speech determination unit 104 in the preceding stage is improved.
  • FIG. 7 is a block diagram showing a configuration example of the second embodiment of the voice detection device according to the present invention.
  • The voice detection device of the second embodiment includes a voice analysis unit 110 that divides the input signal into frames and outputs a feature quantity representing voice-likeness.
  • the voice analysis unit 110 has functions corresponding to the waveform cutout unit 101 and the feature amount calculation unit 102 in the configuration of the voice detection device according to the first embodiment shown in FIG.
  • the voice analysis unit 110 calculates the second feature amount independently of the feature amount calculation unit 102 in the process of step S102.
  • the second feature amount calculated by the speech analysis unit 110 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
  • the voice analysis unit 110 calculates the second feature amount by analyzing the input signal in more detail using a parameter different from the parameter used when the feature amount calculation unit 102 calculates the feature amount.
  • The voice analysis unit 110 may calculate the second feature value once every plurality of utterances, may calculate the second feature value when instructed by the user, or may calculate the second feature amount at a timing different from the timing at which the feature amount calculation unit 102 calculates its feature amount.
  • the long-section feature value calculation unit 108 performs the long-section feature value based on the feature value calculated by the feature value calculation unit 102 and the second feature value calculated by the speech analysis unit 110 in the process of step S106. Is calculated.
  • Each feature amount described above may be easy or difficult to detect depending on the environment in which the input signal is generated. Therefore, the long-section feature value calculation unit 108 calculates the long-section feature value using the second feature value calculated by the speech analysis unit 110, for example, when the feature value calculation unit 102 cannot calculate its feature value.
  • Alternatively, the speech analysis unit 110 may calculate a feature amount different from the feature amount calculated by the feature amount calculation unit 102, and the long-section feature amount calculation unit 108 may calculate the long-section feature value using this second feature amount calculated by the speech analysis unit 110.
  • Since the speech analysis unit 110 can calculate various feature amounts independently of the feature amount calculation unit 102, feature amounts are calculated from various viewpoints, and more robust speech detection can be realized.
  • Embodiment 3. A third embodiment of the present invention will be described with reference to the drawings.
  • FIG. 8 is a block diagram showing a configuration example of the third embodiment of the voice detection device according to the present invention.
  • The voice detection device of the third embodiment includes a voice recognition unit 111 that outputs a recognition result corresponding to the voice section using the feature amount that appears to be voice.
  • FIG. 9 is a block diagram illustrating another example of the third embodiment of the voice detection device.
  • the voice recognition unit 111 performs voice recognition on a voice section in which voice is detected.
  • the voice detection apparatus according to the third embodiment shown in FIGS. 8 and 9 operates as follows. That is, the voice recognition unit 111 appropriately extracts a feature amount from the input voice signal.
  • The speech recognition unit 111 performs speech recognition by matching the feature amounts of the words stored in the language model / speech recognition dictionary (not shown) against the extracted feature amounts, calculating as the recognition result a word string with time information for the speech section, and outputs the speech recognition result word string with time information.
  • The long-section feature value calculation unit 108 obtains the phoneme duration length from the speech recognition result as the long-section feature value, where Tb is the number of frames for one word in the speech recognition result word string output by the speech recognition unit 111 and Nf is the number of phonemes of that word.
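A minimal sketch of this long-section feature, assuming the duration is the frame count Tb divided by the phoneme count Nf (the exact formula is implied rather than stated above):

```python
def phoneme_duration(tb_frames, nf_phonemes):
    """Average duration, in frames, of one phoneme in a recognized word.

    tb_frames: number of frames spanned by the word (Tb).
    nf_phonemes: number of phonemes in the word (Nf).
    """
    return tb_frames / nf_phonemes

# A word spanning 30 frames (10 ms shift -> ~300 ms) with 5 phonemes
# gives an average of 6 frames (~60 ms) per phoneme.
print(phoneme_duration(30, 5))  # -> 6.0
```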
  • The threshold update unit 109 uses the long-section feature value calculated by the long-section feature value calculation unit 108 in step S106, that is, the phoneme duration length, to calculate the non-speech probability α for each section cut out by the speech / non-speech section shaping unit 107.
  • The threshold update unit 109 obtains the non-speech probability α using, for example, a function having the long-section feature value as a variable, as shown in FIG. 10.
  • FIG. 10 is an explanatory diagram showing the function for obtaining the non-speech probability α in the third embodiment of the present invention. As shown in FIG. 10, the horizontal axis represents the value of the long-section feature value, and the vertical axis represents the non-speech probability α.
  • As shown in FIG. 10, the non-speech probability α is 1 when the long-section feature value is εmin or less and when it is εmax or more, and 0 when the long-section feature amount is between ε0 and ε1 inclusive. The non-speech probability α decreases monotonically from εmin to ε0 as the long-section feature value increases, and increases monotonically from ε1 to εmax. It is assumed that εmin, εmax, ε0, and ε1 are appropriate values obtained in advance through experiments.
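The function of FIG. 10 can be sketched as piecewise code; the breakpoint values below are placeholders for the experimentally determined εmin, ε0, ε1, and εmax:

```python
def nonspeech_probability(F, eps_min=2.0, eps0=4.0, eps1=8.0, eps_max=12.0):
    """Piecewise alpha(F) in the shape of FIG. 10.

    alpha = 1 outside [eps_min, eps_max], 0 on [eps0, eps1], and
    linear on the falling edge (eps_min..eps0) and rising edge
    (eps1..eps_max). Breakpoints here are illustrative placeholders.
    """
    if F <= eps_min or F >= eps_max:
        return 1.0
    if eps0 <= F <= eps1:
        return 0.0
    if F < eps0:                              # falling edge
        return (eps0 - F) / (eps0 - eps_min)
    return (F - eps1) / (eps_max - eps1)      # rising edge

print(nonspeech_probability(3.0))   # -> 0.5
print(nonspeech_probability(6.0))   # -> 0.0
print(nonspeech_probability(10.0))  # -> 0.5
```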
  • the long segment feature value calculation unit 108 uses phonemes as the unit for calculating the duration length, but other units such as syllables may be used.
  • the function shown in FIG. 10 is merely an example, and the present invention is not limited to this.
  • The function may be defined as an arbitrary function whose value increases as the distance from a medium value of the long-section feature amount increases. The effect of this embodiment will be described.
  • In the fourth embodiment, the voice recognition unit 111 of the voice detection device shown in FIGS. 8 and 9 performs continuous phoneme recognition instead of speech recognition. That is, the speech recognition unit 111 performs continuous phoneme recognition and outputs a phoneme string with time information.
  • the long section feature amount calculation unit 108 obtains the duration time of each phoneme constituting the phoneme string output by the speech recognition unit 111.
  • the operation of the threshold update unit 109 is the same as the operation in the third embodiment described above.
  • the unit for calculating the duration is a phoneme. However, a unit such as a syllable may be used.
  • Since the speech recognition unit 111 performs continuous phoneme recognition, the phoneme durations can be acquired more easily than in the speech detection device according to the third embodiment, which performs speech recognition. The load of calculating the phoneme duration is thus reduced, and the processing speed of the entire speech detection apparatus is increased.
  • When the speech recognition unit 111 performs recognition in units of phonemes, the phoneme length of the utterance section can be acquired easily; with recognition in units of words, the number of phonemes must first be derived and the time per utterance divided by it to calculate the phoneme duration. Therefore, easy acquisition of phoneme durations by the voice detection device is important for reducing the processing load.
  • Embodiment 5. A fifth embodiment of the present invention will be described.
  • In the fifth embodiment, the speech detection apparatus has the same configuration as the speech detection apparatus according to the third embodiment illustrated in FIG. 8 or FIG. 9, but the long-section feature value calculation unit 108 uses the reliability of the speech recognition result to calculate the long-section feature value. Specifically, the voice recognition unit 111 first extracts a feature value from the input voice signal. The speech recognition unit 111 then matches the feature values of the words stored in the language model/speech recognition dictionary against the extracted feature value, and outputs the scores of a plurality of candidate speech recognition results. The score is, for example, a numerical value representing the degree of matching between the feature value of a word stored in the language model/speech recognition dictionary and the extracted feature value.
  • The voice recognition unit 111 outputs the scores of a plurality of candidates in descending order of the degree of matching. The long-section feature value calculation unit 108 then calculates the difference between the score of the first candidate and the score of the second candidate, the candidates being ordered by descending score. When the score difference is small, the reliability of the speech recognition result is considered low; when the score difference is large, the reliability is considered high. Note that a scale other than the score difference may be used as the measure of reliability.
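A minimal sketch of this score-difference reliability measure, assuming the scores are plain numeric matching degrees (the list-of-floats input format is an assumption):

```python
# Sketch of the Embodiment 5 reliability measure: the difference between
# the scores of the top two recognition candidates. A small difference
# means the result is ambiguous, hence low reliability.

def recognition_reliability(candidate_scores):
    """Reliability = score(1st candidate) - score(2nd candidate),
    with candidates ranked in descending order of score."""
    ranked = sorted(candidate_scores, reverse=True)
    return ranked[0] - ranked[1]
```

For example, scores of 0.9 vs. 0.2 yield a large difference (confident result), while 0.5 vs. 0.45 yields a small one (ambiguous result).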
  • The threshold update unit 109 uses the long-section feature value calculated by the long-section feature value calculation unit 108, that is, the reliability, to calculate the non-speech probability α for the speech section cut out by the speech/non-speech section shaping unit 107. Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function of the long-section feature value as shown in FIG. 11.
  • FIG. 11 is an explanatory diagram showing the function for obtaining the non-speech probability α in the fifth embodiment of the present invention. In FIG. 11, the horizontal axis represents the value of the long-section feature value and the vertical axis represents the non-speech probability α.
  • As shown in FIG. 11, the non-speech probability α is 0 when the long-section feature value is θ0 or more.
  • Below θ0, the non-speech probability α decreases monotonically from 1 to 0. It is assumed that θ0 is an appropriate value obtained in advance through experiments.
  • The function shown in FIG. 11 is an example; any monotonically decreasing or monotonically non-increasing function may be used. Since the speech detection apparatus according to the present embodiment calculates the non-speech probability α using the property that a section with low reliability of the speech recognition result is likely to be a non-speech section, it can calculate the non-speech probability with higher accuracy.
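One function satisfying these constraints can be sketched as below; the piecewise-linear shape and the lower knee `theta1` are assumptions, since the text only requires a monotone non-increasing function that reaches 0 at θ0.

```python
# Hypothetical sketch of one alpha(x) consistent with FIG. 11:
# alpha = 0 for x >= theta0, alpha = 1 for x <= theta1, and a linear
# monotone decrease in between. theta1 is an assumed parameter; the
# patent only specifies monotone non-increase reaching 0 at theta0.

def non_speech_probability(x, theta1, theta0):
    if x <= theta1:
        return 1.0
    if x >= theta0:
        return 0.0
    # linear interpolation between (theta1, 1) and (theta0, 0)
    return (theta0 - x) / (theta0 - theta1)
```

Any other monotone non-increasing shape (e.g. a sigmoid) would satisfy the same requirement; the linear form is simply the easiest to tune from the experimentally obtained θ0.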
  • FIG. 12 is a block diagram showing a configuration example of the sixth embodiment of the speech detection device according to the present invention.
  • the voice detection device of the sixth embodiment is a combination of the first to fifth embodiments.
  • the long section feature quantity calculation unit 108 calculates a long section feature quantity by combining one or more methods of the first to fifth embodiments.
  • In the sixth embodiment, the speech detection apparatus calculates the non-speech probability α using one or more of the non-speech probability calculation methods of the first to fifth embodiments, and takes the product of the individual non-speech probabilities α as the combined non-speech probability. The voice detection device may also weight each non-speech probability α before taking the product.
  • Alternatively, the speech detection apparatus may use the average of the individual non-speech probabilities α, or an appropriately weighted average, as the non-speech probability.
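These combination rules can be sketched as follows; using exponents as weights for the product form is one assumed realization of "weighting each non-voice probability α", not a detail fixed by the text.

```python
# Sketch of Embodiment 6's combination rules: the final non-speech
# probability is either the (optionally weighted) product or the
# (optionally weighted) average of the per-method probabilities.

def combined_product(alphas, weights=None):
    """Product combination; weights act as exponents (assumption)."""
    weights = weights or [1.0] * len(alphas)
    prob = 1.0
    for a, w in zip(alphas, weights):
        prob *= a ** w
    return prob

def combined_average(alphas, weights=None):
    """Weighted-average combination of the per-method probabilities."""
    weights = weights or [1.0] * len(alphas)
    return sum(a * w for a, w in zip(alphas, weights)) / sum(weights)
```

The product form is stricter (any method reporting a low α pulls the result down), while the average form is more forgiving of a single outlier method.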
  • the speech detection apparatus according to the present embodiment can calculate a more accurate non-speech probability by combining the first to fifth embodiments.
  • Embodiment 7. The seventh embodiment of the present invention is a voice recognition device including the voice detection devices of the first to fifth embodiments.
  • the speech recognition apparatus performs a known speech recognition process on a section determined to be a speech section by the speech detection apparatuses of the first to fifth embodiments, and outputs a speech recognition result.
  • FIG. 13 is a block diagram showing an outline of the present invention.
  • The voice detection device 300 according to the present invention includes a feature amount calculation unit 301 (corresponding to the feature amount calculation unit 102 shown in FIG. 1), a voice/non-voice determination unit 302 (corresponding to the voice/non-voice determination unit 104 shown in FIG. 1), a long-section feature amount calculation unit 303, and a threshold update unit 304.
  • the feature amount calculation unit 301 calculates the feature amount of the input signal for each frame, which is an input signal for each predetermined unit time.
  • The speech/non-speech determination unit 302 compares the feature value calculated by the feature value calculation unit 301 with a speech detection threshold to determine whether the input signal is a signal based on speech, that is, whether each section is a speech section or a non-speech section.
  • Based on the statistical values of the feature values of the plurality of frames constituting the speech section or the non-speech section calculated by the feature value calculation unit 301, the long-section feature value calculation unit 303 calculates the long-section feature value, which is a feature value of the speech section or the non-speech section.
  • The threshold update unit 304 uses the long-section feature value calculated by the long-section feature value calculation unit 303 to calculate the non-speech probability, which is the probability that the speech section or the non-speech section is a section in which a signal based on non-speech was input, and updates the voice detection threshold based on the calculated non-speech probability.
  • The voice detection device 300 with the above configuration updates the voice detection threshold even when the head of the input signal is a signal based on background noise whose feature value exceeds the voice detection threshold, so that high-precision voice section detection can be performed. Each of the above embodiments also discloses voice detection devices as described in the following (1) to (11).
  • The voice detection device in which the long-section feature value calculation unit 303 calculates the long-section feature value using at least one of: the average value of the per-frame feature values, the mode, the median, or the value at the position that reaches a predetermined ratio when the per-frame feature values are arranged in descending order.
  • the voice detection device in which the threshold update unit 304 updates the voice detection threshold by using the maximum value and the minimum value of the feature amount in the voice section or the non-voice section and the non-voice probability.
  • The voice detection device in which the threshold update unit 304 obtains a value that internally divides the maximum value and the minimum value of the feature value using the non-speech probability, and updates the speech detection threshold so that it approaches the internally divided value.
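A hedged sketch of this update rule follows. Taking the internal division point as α·max + (1−α)·min, and moving the threshold toward it with a step size `rate`, are assumptions about details the text leaves open.

```python
# Hypothetical sketch of the threshold update: compute a value that
# internally divides [min_feat, max_feat] by the non-speech probability
# alpha, then move the current threshold toward that target. The
# division direction and the smoothing step `rate` are assumptions.

def update_threshold(threshold, min_feat, max_feat, alpha, rate=0.5):
    target = alpha * max_feat + (1.0 - alpha) * min_feat
    return threshold + rate * (target - threshold)
```

Intuitively, a high non-speech probability pushes the target (and hence the threshold) toward the maximum observed feature value, so that a noisy section whose features exceed the old threshold is no longer misjudged as speech.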
  • The voice detection device further including a second feature value calculation unit (corresponding to the speech analysis unit 110 shown in FIG. 7) that calculates a second feature value different from the feature value calculated by the feature value calculation unit 301, in which the long-section feature value calculation unit 303 calculates the long-section feature value using the feature value calculated by the feature value calculation unit 301 and the second feature value calculated by the second feature value calculation unit.
  • The voice detection device in which the second feature value calculation unit (corresponding to the voice recognition unit 111 shown in FIG. 8) performs voice recognition on the input signal and outputs a voice recognition result, and the long-section feature value calculation unit 303 calculates the long-section feature value based on the voice recognition result.
  • the speech detection apparatus in which the long section feature value calculation unit 303 calculates the reliability of the speech recognition result as the long section feature value.
  • The voice detection device in which the second feature value calculation unit outputs the scores of a plurality of candidate voice recognition results, the score being a value indicating the degree of matching between the feature values of words stored in advance in a storage unit and the feature value of the input signal to be recognized, and the long-section feature value calculation unit 303 calculates, as the reliability, the difference between the score of the first candidate and the score of the second candidate in descending order of the degree.
  • The voice detection device in which the second feature value calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature value calculation unit 303 calculates the long-section feature value from the speech recognition result with time information.
  • The voice detection device in which the long-section feature value calculation unit 303 calculates a duration from the time information as the long-section feature value.
  • The voice detection device in which the long-section feature value calculation unit 303 calculates the duration in units of phonemes or syllables. While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention. This application claims priority based on Japanese Patent Application No. 2009-291976 filed on December 24, 2009, the disclosure of which is incorporated herein in its entirety.
  • (Supplementary note 1) A voice detection device comprising: a feature value calculation unit that calculates a feature value of the input signal for each frame, a frame being the input signal per predetermined unit time; a voice/non-voice determination unit that compares the calculated feature value with a voice detection threshold to determine whether a section is a voice section in which a signal based on speech is input over a plurality of frames or a non-voice section in which a signal based on non-speech is input over a plurality of frames; a long-section feature value calculation unit that calculates a long-section feature value, which is a feature value of the voice section or the non-voice section, based on statistical values of the feature values of the plurality of frames constituting the section; and a threshold update unit that uses the long-section feature value to calculate a non-voice probability, which is the probability that the voice section and the non-voice section are sections in which a signal based on non-speech is input, and updates the voice detection threshold based on the calculated non-voice probability.
  • (Supplementary note 2) The voice detection device according to supplementary note 1, wherein the long-section feature value calculation unit calculates the long-section feature value by performing statistical processing on the feature values over one or more voice sections or non-voice sections.
  • (Supplementary note 3) The voice detection device according to supplementary note 1 or 2, wherein the long-section feature value calculation unit uses at least one of: the average value of the per-frame feature values, the mode, the median, or the value at the position that reaches a predetermined ratio when the per-frame feature values are arranged in descending order.
  • (Supplementary note 4) The voice detection device according to any one of supplementary notes 1 to 3, wherein the threshold update unit updates the voice detection threshold using the maximum value and the minimum value of the feature value in the voice section or the non-voice section and the non-voice probability.
  • (Supplementary note 5) The voice detection device according to supplementary note 4, wherein the threshold update unit obtains a value that internally divides the maximum value and the minimum value of the feature value using the non-voice probability, and updates the voice detection threshold so that it approaches the internally divided value.
  • (Supplementary note 6) The voice detection device according to any one of supplementary notes 1 to 5, further comprising a second feature value calculation unit that calculates a second feature value different from the feature value calculated by the feature value calculation unit, wherein the long-section feature value calculation unit calculates the long-section feature value using the feature value calculated by the feature value calculation unit and the second feature value calculated by the second feature value calculation unit.
  • (Supplementary note 7) The voice detection device according to supplementary note 6, wherein the second feature value calculation unit performs speech recognition on the input signal and outputs a speech recognition result, and the long-section feature value calculation unit calculates the long-section feature value based on the speech recognition result.
  • (Supplementary note 8) The voice detection device according to supplementary note 7, wherein the long-section feature value calculation unit calculates the reliability of the speech recognition result as the long-section feature value.
  • (Supplementary note 9) The voice detection device according to supplementary note 8, wherein the second feature value calculation unit outputs the scores of a plurality of candidate speech recognition results, the score being a value indicating the degree of matching between the feature values of words stored in advance in a storage unit and the feature value of the input signal to be recognized, and the long-section feature value calculation unit calculates, as the reliability, the difference between the score of the first candidate and the score of the second candidate in descending order of the degree.
  • (Supplementary note 10) The voice detection device according to supplementary note 7, wherein the second feature value calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature value calculation unit calculates the long-section feature value from the speech recognition result with time information.
  • (Supplementary note 11) The voice detection device according to supplementary note 10, wherein the long-section feature value calculation unit calculates a duration from the time information as the long-section feature value.
  • (Supplementary note 14) A voice detection method comprising: calculating a feature value of the input signal for each frame; comparing the feature value with a voice detection threshold to determine whether a section is a voice section in which a signal based on speech is input over a plurality of frames or a non-voice section in which a signal based on non-speech is input over a plurality of frames; calculating a long-section feature value, which is a feature value of the voice section or the non-voice section, based on statistical values of the feature values of the plurality of frames constituting the section; calculating, using the long-section feature value, a non-voice probability, which is the probability that the voice section and the non-voice section are sections in which a signal based on non-speech is input; and updating the voice detection threshold based on the calculated non-voice probability.
  • The voice detection method according to supplementary note 14, wherein the long-section feature value is calculated by performing statistical processing on the feature values over one or more voice sections or non-voice sections.

Abstract

A voice detection device, voice detection method, and voice detection program capable of detecting voice sections with high accuracy even in a noisy environment. A feature value calculation unit (301) calculates a feature value for each frame. A voice/non-voice determination unit (302) compares the calculated feature value with a voice detection threshold to determine whether a section is a voice section or a non-voice section. A long-section feature value calculation unit (303) calculates a long-section feature value, which is the feature value of the voice section or the non-voice section, based on a statistical value of the feature values of a plurality of frames. A threshold update unit (304) uses the calculated long-section feature value to calculate a non-voice probability, which is the probability that the voice section and the non-voice section are sections in which a signal based on non-voice is input, and updates the voice detection threshold based on the calculated non-voice probability.
PCT/JP2010/071620 2009-12-24 2010-11-26 Dispositif de détection vocale, procédé de détection vocale et programme de détection vocale WO2011077924A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011547442A JP5621786B2 (ja) 2009-12-24 2010-11-26 音声検出装置、音声検出方法、および音声検出プログラム

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009291976 2009-12-24
JP2009-291976 2009-12-24

Publications (1)

Publication Number Publication Date
WO2011077924A1 true WO2011077924A1 (fr) 2011-06-30

Family

ID=44195460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/071620 WO2011077924A1 (fr) 2009-12-24 2010-11-26 Dispositif de détection vocale, procédé de détection vocale et programme de détection vocale

Country Status (2)

Country Link
JP (1) JP5621786B2 (fr)
WO (1) WO2011077924A1 (fr)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06236195A (ja) * 1993-02-12 1994-08-23 Sony Corp 音声区間検出方法
JPH08305388A (ja) * 1995-04-28 1996-11-22 Matsushita Electric Ind Co Ltd 音声区間検出装置
JPH09212195A (ja) * 1995-12-12 1997-08-15 Nokia Mobile Phones Ltd 音声活性検出装置及び移動局並びに音声活性検出方法
JP2010032792A (ja) * 2008-07-29 2010-02-12 Nippon Telegr & Teleph Corp <Ntt> 発話区間話者分類装置とその方法と、その装置を用いた音声認識装置とその方法と、プログラムと記録媒体


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012048119A (ja) * 2010-08-30 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> 音声区間検出方法、音声認識方法、音声区間検出装置、音声認識装置、そのプログラム及び記録媒体
KR101804787B1 (ko) * 2016-09-28 2017-12-06 대한민국 음질특징을 이용한 화자인식장치 및 방법
JP2022008928A (ja) * 2018-03-15 2022-01-14 日本電気株式会社 信号処理システム、信号処理装置、信号処理方法、およびプログラム
JP7268711B2 (ja) 2018-03-15 2023-05-08 日本電気株式会社 信号処理システム、信号処理装置、信号処理方法、およびプログラム
US11842741B2 (en) 2018-03-15 2023-12-12 Nec Corporation Signal processing system, signal processing device, signal processing method, and recording medium
KR20200109072A (ko) * 2019-03-12 2020-09-22 울산과학기술원 음성 구간 검출장치 및 그 방법
KR102237286B1 (ko) 2019-03-12 2021-04-07 울산과학기술원 음성 구간 검출장치 및 그 방법

Also Published As

Publication number Publication date
JP5621786B2 (ja) 2014-11-12
JPWO2011077924A1 (ja) 2013-05-02

Similar Documents

Publication Publication Date Title
US10157610B2 (en) Method and system for acoustic data selection for training the parameters of an acoustic model
JP5621783B2 (ja) 音声認識システム、音声認識方法および音声認識プログラム
US8532991B2 (en) Speech models generated using competitive training, asymmetric training, and data boosting
US5692104A (en) Method and apparatus for detecting end points of speech activity
US8019602B2 (en) Automatic speech recognition learning using user corrections
JP4911034B2 (ja) 音声判別システム、音声判別方法及び音声判別用プログラム
JP2005043666A (ja) 音声認識装置
JP2011033680A (ja) 音声処理装置及び方法、並びにプログラム
Zhang et al. Improved modeling for F0 generation and V/U decision in HMM-based TTS
JP5621786B2 (ja) 音声検出装置、音声検出方法、および音声検出プログラム
JP4353202B2 (ja) 韻律識別装置及び方法、並びに音声認識装置及び方法
KR100744288B1 (ko) 음성 신호에서 음소를 분절하는 방법 및 그 시스템
KR101122590B1 (ko) 음성 데이터 분할에 의한 음성 인식 장치 및 방법
JP5282523B2 (ja) 基本周波数抽出方法、基本周波数抽出装置、およびプログラム
JP4490090B2 (ja) 有音無音判定装置および有音無音判定方法
JPH09325798A (ja) 音声認識装置
JP2011154341A (ja) 音声認識装置、音声認識方法および音声認識プログラム
US6823304B2 (en) Speech recognition apparatus and method performing speech recognition with feature parameter preceding lead voiced sound as feature parameter of lead consonant
JP4666129B2 (ja) 発声速度正規化分析を用いた音声認識装置
US8103512B2 (en) Method and system for aligning windows to extract peak feature from a voice signal
JP4839970B2 (ja) 韻律識別装置及び方法、並びに音声認識装置及び方法
JP2008026721A (ja) 音声認識装置、音声認識方法、および音声認識用プログラム
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
JP2006010739A (ja) 音声認識装置
JP6526602B2 (ja) 音声認識装置、その方法、及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10839159

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011547442

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10839159

Country of ref document: EP

Kind code of ref document: A1