WO2011070972A1 - Speech recognition system, method and program - Google Patents

Speech recognition system, method and program

Info

Publication number
WO2011070972A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
section
likelihood
length
Prior art date
Application number
PCT/JP2010/071619
Other languages
English (en)
Japanese (ja)
Inventor
隆行 荒川 (Takayuki Arakawa)
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to US13/514,894 (US9002709B2)
Priority to JP2011545189A (JP5621783B2)
Publication of WO2011070972A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection

Definitions

  • The present invention relates to a speech recognition system, a speech recognition method, and a speech recognition program for recognizing speech in an environment where background noise exists.
  • A general speech recognition system extracts a time series of feature amounts from the time-series data of an input sound collected by a microphone or the like, and calculates the likelihood of that feature amount time series using a model of the words and phonemes to be recognized together with a non-speech model covering sounds outside the recognition target. The speech recognition system then searches for a word string corresponding to the time series of the input sound based on the calculated likelihoods, and outputs the recognition result.
  • Patent Document 1 describes a speech recognition device that reduces the deterioration of speech recognition performance caused by silent portions.
  • FIG. 9 is an explanatory diagram showing the speech recognition device described in Patent Document 1.
  • The speech recognition apparatus described in Patent Document 1 includes a microphone 201 that collects the input sound, a framing unit 202 that extracts time-series data of the collected sound in predetermined time units, a noise observation section extraction unit 203 that extracts a noise section, an utterance switch 204 with which the user notifies the system of the start of an utterance, a feature amount extraction unit 205 that extracts a feature amount from each frame of audio data, a speech recognition unit 208 that performs speech recognition on the time series of feature amounts, and a silence model correction unit 207 that corrects the silence model among the acoustic models used by the speech recognition unit.
  • In this apparatus, the noise observation section extraction unit 203 estimates the background noise from the section immediately before the utterance switch 204 is pressed, and the silence model correction unit 207 adapts the silence model to the background noise environment based on the estimated background noise. With this configuration, the speech recognition apparatus makes it easier to judge silence other than the target speech, and thereby reduces misrecognition.
  • Patent Document 2 describes a speech recognition device that reduces the misrecognition rate for speech sections overlaid with background noise other than the data used when training the garbage model.
  • FIG. 10 is an explanatory diagram showing the speech recognition device described in Patent Document 2.
  • The speech recognition apparatus described in Patent Document 2 includes an analysis unit 302 that extracts a time series of feature amounts from the time-series data of the collected sound, a correction value calculation unit 303 that calculates a correction amount based on the feature amounts, a collation unit 304 that collates a recognition target word string against the time series of feature amounts, a garbage model 305 that models sound patterns corresponding to background noise, and a recognition target vocabulary model 306.
  • The correction value calculation unit 303 judges the speech-likeness from feature amounts such as the pitch frequency, the formant frequency, and the bandwidth, and calculates the corresponding likelihood correction value.
  • Non-Patent Document 1 describes a method for recognizing speech from speech data and a model used in speech recognition.
  • In order to suppress the adverse effects of sounds other than the recognition target, the speech recognition apparatus described in Patent Document 1 adapts the silence model to the background noise environment by estimating noise from the section immediately before the utterance switch is pressed.
  • However, the time during which the utterance switch is pressed does not necessarily correspond to the time during which the speech to be recognized is actually uttered.
  • Likewise, in order to suppress the adverse effects of sounds other than the recognition target, the speech recognition apparatus described in Patent Document 2 judges the speech-likeness from the pitch frequency, formant frequency, bandwidth feature amounts, and the like, and obtains a correction value for correcting the likelihood for the garbage model.
  • However, when these feature amounts are themselves affected by noise, the calculated correction value may adversely affect the judgment of speech-likeness.
  • In general, a speech recognition apparatus can discriminate a speech section (a section where a person is speaking) from the other, non-speech sections by using the fact that their power (volume) differs. That is, since the volume is low in sections where the person is not speaking and high in sections where the person is speaking, the apparatus can discriminate speech from non-speech by determining whether the volume is at or above a certain threshold. In a noisy environment, however, the volume of the noise is high even when a person is not speaking, and since the threshold for discriminating speech from non-speech depends on the noise volume, discriminating speech from non-speech becomes difficult.
  • Furthermore, the volume of speech tends to be high in sections spoken relatively loudly and clearly, and low at the beginning and end of an utterance.
  • Here, denote the speech volume by S, with maximum Smax and minimum Smin, and denote the noise volume by N, with maximum Nmax and minimum Nmin. Denote the threshold for discriminating speech from non-speech by θ.
  • If the threshold θ lies in the range Nmax < θ < Smin + Nmin, the relationship S > θ always holds in speech sections and N < θ always holds in non-speech sections, so the speech recognition apparatus can discriminate speech from non-speech. From this relationship, conditions required of the threshold θ include the following: since the minimum speech volume Smin is unknown until the utterance has finished, the maximum value that the threshold θ can take is unknown; for this reason, the user or the like wants to set θ as small as possible.
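  • As a purely illustrative example with assumed numbers (not from the original): if the noise volume never exceeds Nmax = 10 while the quietest speech frame still reaches Smin + Nmin = 30, any threshold with 10 < θ < 30 separates the two. But because Smin is unknown until the utterance ends, the upper end of this range is unknown in advance, which is why θ would be set near the lower end, for example θ = 12.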
  • An object of the present invention is therefore to provide a speech recognition system, a speech recognition method, and a speech recognition program capable of suppressing the adverse effects of sounds other than the recognition target and accurately estimating the target speech section.
  • The speech recognition system according to the present invention includes: speech determination means that calculates a speech feature amount based on a time-series input sound, compares the speech feature amount with a threshold to determine speech sections or non-speech sections, and determines as the first speech section a section obtained by adding a margin of a specified length before and after such a section; search means that determines the section subject to speech recognition as the second speech section, based on the speech likelihood and non-speech likelihood calculated from the speech recognition feature amount; and parameter update means that updates at least one of the threshold and the margin used by the speech determination means when determining the first speech section, according to the difference between the length of the first speech section and the length of the second speech section. The speech determination means determines the first speech section using the threshold or margin updated by the parameter update means.
  • The speech recognition method according to the present invention calculates a speech feature amount based on a time-series input sound, compares the speech feature amount with a threshold to determine speech sections or non-speech sections, and determines as the first speech section a section obtained by adding a margin of a specified length before and after such a section; determines, based on the speech likelihood and non-speech likelihood calculated from the speech recognition feature amount, the section subject to speech recognition as the second speech section; and updates at least one of the threshold and the margin used when determining the first speech section, according to the difference between the length of the first speech section and the length of the second speech section, the first speech section then being determined using the updated threshold or margin.
  • The speech recognition program stored in the program recording medium causes a computer to execute: speech determination processing that calculates a speech feature amount based on a time-series input sound, compares the speech feature amount with the threshold to determine speech sections or non-speech sections, and determines as the first speech section a section obtained by adding a margin of a specified length before and after such a section; search processing that determines the section subject to speech recognition as the second speech section, based on the speech likelihood and non-speech likelihood calculated from the speech recognition feature amount, the feature amount used in speech recognition processing; and parameter update processing that updates at least one of the threshold and the margin used when determining the first speech section in the speech determination processing, according to the difference between the length of the first speech section and the length of the second speech section. In the speech determination processing, the first speech section is determined using the threshold or margin updated in the parameter update processing.
  • The present invention can thus provide a speech recognition system, a speech recognition method, and a speech recognition program capable of suppressing the adverse effects of sounds other than the recognition target and accurately estimating the target speech section.
  • FIG. 1 is a block diagram showing an example of a speech recognition system according to the first embodiment of the present invention.
  • As illustrated in FIG. 1, the speech recognition system according to the present invention includes a microphone 101, a framing unit 102, a speech determination unit 103, a correction value calculation unit 104, a feature amount calculation unit 105, a non-speech model storage unit 106, a vocabulary/phoneme model storage unit 107, a search unit 108, and a parameter update unit 109.
  • The microphone 101 is a device that collects the input sound.
  • The framing unit 102 cuts the time-series input sound data collected by the microphone 101 into segments of a unit time each.
  • The speech determination unit 103 calculates a feature amount indicating speech-likeness (hereinafter sometimes called the speech feature amount) based on the time-series input sound data; that is, it obtains such a feature amount for each frame of input sound data cut out by the framing unit. The speech determination unit 103 then compares the speech feature amount with a threshold (hereinafter the threshold θ) defined as the value for classifying the input sound into speech or non-speech, and determines a section in which the calculated speech feature amount is larger than the threshold θ as the first speech section. In the following description, a section in which the speech feature amount is larger than the threshold θ is referred to as the first speech section.
  • The feature amount indicating speech-likeness (the speech feature amount) is, for example, the amplitude power, although it is not limited to the amplitude power. The speech determination unit 103 determines the first speech section by comparing this feature amount with the threshold θ.
  • The feature amount calculation unit 105 calculates the feature amount used for speech recognition (hereinafter sometimes called the speech recognition feature amount) from the speech data cut out for each frame.
  • The feature amount used for speech recognition (the speech recognition feature amount) is, for example, a cepstrum feature amount and its dynamic features; however, it is not limited to cepstrum features. Since methods for calculating such feature amounts are widely known, a detailed description is omitted; an illustrative sketch follows below.
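  • As an illustration only (the patent does not prescribe a specific front-end), a common choice for the speech recognition feature amount is mel-frequency cepstral coefficients (MFCCs) plus their delta features; the following minimal Python sketch uses the librosa library, with parameter values that are illustrative assumptions:

      # A hedged sketch of a typical cepstral front-end; MFCCs stand in for
      # the "cepstrum feature amount" and their deltas for the "dynamic
      # feature amount". Parameter values are illustrative assumptions.
      import librosa
      import numpy as np

      def speech_recognition_features(signal: np.ndarray, sr: int = 8000) -> np.ndarray:
          mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                      n_fft=int(0.025 * sr),       # 25 ms window
                                      hop_length=int(0.010 * sr))  # 10 ms shift
          delta = librosa.feature.delta(mfcc)  # first-order dynamic features
          return np.vstack([mfcc, delta]).T    # one feature vector per frame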
  • the non-speech model storage unit 106 stores a non-speech model representing a pattern other than speech that is a target of speech recognition. In the following description, a pattern other than speech that is subject to speech recognition may be referred to as a non-speech pattern.
  • the vocabulary / phoneme model storage unit 107 stores a vocabulary / phoneme model representing a vocabulary of speech or a phoneme pattern to be subjected to speech recognition.
  • the non-speech model storage unit 106 and the vocabulary / phoneme model storage unit 107 store a non-speech model and a vocabulary / phoneme model represented by a probability model such as a hidden Markov model, for example.
  • the model parameters may be learned in advance by the speech recognition apparatus using standard input sound data.
  • the non-voice model storage unit 106 and the vocabulary / phoneme model storage unit 107 are realized by a magnetic disk device, for example.
  • The search unit 108 calculates the likelihood of speech and the likelihood of non-speech based on the feature amounts (speech recognition feature amounts) used for speech recognition, and searches for the word string using these likelihoods and the above models. For example, the search unit 108 may search for the word string with the highest calculated speech likelihood.
  • the search unit 108 determines a section (hereinafter referred to as a second voice section) that is a target of speech recognition based on the calculated speech likelihood and non-speech likelihood. Specifically, the search unit 108 determines a section in which the speech likelihood calculated based on the speech recognition feature value is higher than the non-speech likelihood as the second speech section. As described above, the search unit 108 obtains a word string (recognition result) corresponding to the input sound and obtains a second speech section by using the feature amount, the vocabulary / phoneme model, and the non-speech model for each frame.
  • the speech likelihood is a numerical value representing the likelihood that a speech vocabulary or phoneme pattern represented by a vocabulary / phoneme model matches an input sound.
  • the non-speech likelihood is a numerical value representing the likelihood that the non-speech pattern represented by the non-speech model matches the input sound.
  • The parameter update unit 109 updates the threshold θ according to the difference between the length of the first speech section and the length of the second speech section. That is, the parameter update unit 109 compares the first speech section and the second speech section and updates the threshold θ used by the speech determination unit 103, which then determines the first speech section using the updated threshold θ. In this way, the speech determination unit 103 determines the first speech section using the value (parameter) updated by the parameter update unit 109.
  • In other words, the threshold θ updated by the parameter update unit 109 is a parameter used when the speech determination unit 103 determines the first speech section.
  • The correction value calculation unit 104 calculates a correction value used to correct the likelihood of speech or the likelihood of non-speech, according to the difference between the feature amount indicating speech-likeness (the speech feature amount) and the threshold θ. That is, the correction value calculation unit 104 calculates the likelihood correction value from the speech feature amount and the threshold θ.
  • The search unit 108 determines the second speech section based on the likelihoods corrected with this correction value.
  • The framing unit 102, the speech determination unit 103, the correction value calculation unit 104, the feature amount calculation unit 105, the search unit 108, and the parameter update unit 109 are realized by the CPU (Central Processing Unit) of a computer operating according to a program (the speech recognition program).
  • For example, the program may be stored in a storage unit (not shown) of the speech recognition apparatus, and the CPU may read the program and operate as the framing unit 102, the speech determination unit 103, the correction value calculation unit 104, the feature amount calculation unit 105, the search unit 108, and the parameter update unit 109 according to it.
  • FIG. 2 is a flowchart showing an example of the operation of the speech recognition system in the present embodiment.
  • the framing unit 102 cuts the collected time-series input sound data into frames for each unit time (step S101). For example, the framing unit 102 may sequentially cut out waveform data for a unit time while shifting a portion to be cut out from the input sound data by a predetermined time.
  • this unit time is referred to as a frame width, and this predetermined time is referred to as a frame shift.
  • For example, suppose the input sound data is 16-bit linear PCM (Pulse Code Modulation) with a sampling frequency of 8000 Hz. In this case, the data contains waveform values for 8000 points per second.
  • The framing unit 102 then sequentially cuts out the waveform data in time order with a frame width of 200 points (that is, 25 milliseconds) and a frame shift of 80 points (that is, 10 milliseconds), as in the sketch below.
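  • A minimal sketch of this framing step in Python (the function name and NumPy-based implementation are assumptions, not from the patent):

      # Cut a 1-D waveform into overlapping frames: frame width 200 samples
      # (25 ms at 8000 Hz), frame shift 80 samples (10 ms).
      import numpy as np

      def frame_signal(x: np.ndarray, frame_width: int = 200,
                       frame_shift: int = 80) -> np.ndarray:
          n_frames = 1 + max(0, (len(x) - frame_width) // frame_shift)
          return np.stack([x[i * frame_shift : i * frame_shift + frame_width]
                           for i in range(n_frames)])

      # One second of 8 kHz audio yields 98 frames of 200 samples each.
      frames = frame_signal(np.zeros(8000, dtype=np.int16))
      assert frames.shape == (98, 200)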
  • the speech determination unit 103 determines a first speech section by calculating a feature amount (that is, speech feature amount) indicating the speech likeness of the input sound data cut out for each frame and comparing it with a threshold value ⁇ .
  • a feature amount that is, speech feature amount
  • the value of the threshold ⁇ in the initial state for example, the user or the like may specify and set the value of the threshold ⁇ in advance, or may have a noise value estimated in a non-speech section before the utterance starts. A value larger than that value may be set for each.
  • The feature amount indicating speech-likeness can be expressed by, for example, the amplitude power.
  • In this case, the speech determination unit 103 calculates the amplitude power xt by Expression 1 below (the expression itself is not reproduced in this text; see the sketch that follows).
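  • Expression 1 is not reproduced in this text. A common definition of per-frame amplitude power, assumed here, is the mean of the squared sample amplitudes:

      # Amplitude power x_t of frame t, assumed to be the mean squared
      # amplitude of the frame's samples (the exact form of Expression 1
      # is not reproduced in this text).
      import numpy as np

      def amplitude_power(frame: np.ndarray) -> float:
          s = frame.astype(np.float64)
          return float(np.mean(s ** 2))

      def speech_feature(frames: np.ndarray) -> np.ndarray:
          # One speech feature amount x_t per frame.
          return np.array([amplitude_power(f) for f in frames])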
  • FIG. 3 is an explanatory diagram showing an example of a time series of input sound data, the feature amount indicating speech-likeness, and the time series of feature amounts used for speech recognition.
  • FIG. 3 shows, for an input utterance 3C ("Hello Hayashi"), the time series 3A of the feature amount indicating speech-likeness and the time series 3B of the feature amount used for speech recognition. As the time series 3A in FIG. 3 indicates, the input can be regarded as more speech-like where the amplitude power is larger than the threshold θ.
  • In that case, the speech determination unit 103 determines the section as a speech section (L1 in FIG. 3). Conversely, where the amplitude power is smaller than the threshold θ, the input is more likely to be non-speech, so the speech determination unit 103 determines that section as a non-speech section.
  • The case where the amplitude power is used as the feature amount indicating speech-likeness has been described above. Alternatively, the speech determination unit 103 may calculate, as the feature amount indicating speech-likeness, a signal-to-noise ratio (S/N ratio), the number of zero crossings, a likelihood ratio between a speech model and a non-speech model, a likelihood ratio based on Gaussian mixture models (GMM likelihood ratio), a pitch frequency, or a combination of these, and determine the speech section using such feature amounts.
  • Next, the correction value calculation unit 104 calculates a likelihood correction value from the feature amount indicating speech-likeness and the threshold θ (step S103).
  • The likelihood correction value is used to correct the likelihoods of the feature amounts against the vocabulary/phoneme model and the non-speech model, which the search unit 108 (described later) calculates when searching for a word string.
  • The correction value calculation unit 104 calculates the likelihood correction value for the vocabulary/phoneme model using, for example, Equation 2 below (not reproduced here; from the correction applied in Equation 5 later in this description, it can be reconstructed as w·(xt − θ)).
  • Here, w is a coefficient on the correction value and takes a positive real value.
  • w is a parameter that adjusts the amount by which one correction changes the log likelihood described later.
  • With a small w, the speech recognition apparatus can suppress the correction from changing the likelihood excessively and can vary the correction value stably.
  • The system administrator may determine an appropriate value of w in advance in consideration of this balance.
  • Similarly, the correction value calculation unit 104 calculates the likelihood correction value for the non-speech model using, for example, Equation 3 below (also not reproduced; by the same reconstruction, w·(θ − xt)).
  • In the above, the correction value is calculated as a linear function of the feature amount xt indicating speech-likeness.
  • However, the method by which the correction value calculation unit 104 calculates the correction value is not limited to a linear function of xt: any function may be used as long as it preserves the relationship that the calculated correction value is large when the feature amount xt is larger than the threshold θ and small when xt is smaller than the threshold θ.
  • In the above description, the correction value calculation unit 104 calculates both the likelihood correction value for the vocabulary/phoneme model and the likelihood correction value for the non-speech model. However, it does not have to calculate both: it may calculate only one of the correction values and set the other to zero (see the sketch below).
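  • A minimal sketch of the correction values under the reconstruction above (Equations 2 and 3 taken to be w·(xt − θ) and w·(θ − xt), matching the corrections applied in Equations 5 and 6; the value of w is an illustrative assumption):

      # Likelihood correction values (reconstructed Equations 2 and 3).
      def correction_values(x_t: float, theta: float, w: float = 0.1):
          corr_speech = w * (x_t - theta)     # added to vocabulary/phoneme log likelihood
          corr_nonspeech = w * (theta - x_t)  # added to non-speech log likelihood
          return corr_speech, corr_nonspeech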
  • the feature amount calculation unit 105 calculates a feature amount (speech recognition feature amount) used for speech recognition from the input sound data cut out for each frame (step S104).
  • Next, the search unit 108 searches for the word string corresponding to the time series of the input sound data, and determines the second speech section, using the per-frame feature amounts (speech recognition feature amounts), the vocabulary/phoneme model, and the non-speech model (step S105).
  • the search unit 108 searches for a word string using a hidden Markov model, for example, as a vocabulary / phoneme model and a non-speech model.
  • the parameters of each model may be parameters that the speech recognition apparatus has learned in advance using standard input sound data.
  • In doing so, the search unit 108 calculates the likelihood of speech and the likelihood of non-speech.
  • In general, the log likelihood is used as the distance measure between a feature amount and each model, so the case where the log likelihood is used is described here.
  • For example, the search unit 108 may calculate the log likelihoods of speech and non-speech based on Equation 4 below (not reproduced here; a sketch under an assumed Gaussian form follows). In Equation 4, logL(y; θ) is the log likelihood of speech (or non-speech) when a speech (or non-speech) pattern sequence y is given (the θ here denotes the model parameters, not the threshold), y(i) is the feature amount used for speech recognition (the speech recognition feature amount) in frame i, and μ and Σ are parameters set for each model (for a Gaussian observation model, presumably the mean and covariance).
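  • A minimal sketch under the assumption that Equation 4 sums per-frame Gaussian log densities, logL(y; θ) = Σi log N(y(i); μ, Σ), with a diagonal covariance (the actual form in the patent may differ):

      # Log likelihood of a feature sequence under a diagonal-covariance
      # Gaussian model (an assumed stand-in for Equation 4).
      import numpy as np

      def log_likelihood(y: np.ndarray, mu: np.ndarray, var: np.ndarray) -> float:
          # y: (n_frames, dim) speech recognition features; mu, var: (dim,)
          ll = -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
          return float(ll.sum())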
  • As described above, the search unit 108 calculates the likelihood of speech and the likelihood of non-speech based on the speech recognition feature amounts. The case where the search unit 108 calculates the log likelihood has been described here, but what is calculated as the likelihood is not limited to the log likelihood.
  • The log likelihood of the time series of per-frame feature amounts against the model representing each vocabulary/phoneme included in the vocabulary/phoneme model is denoted Ls(j, t), where j denotes one state of the vocabulary/phoneme model and t the frame.
  • The search unit 108 corrects the log likelihood Ls(j, t) using the correction value calculated by the correction value calculation unit 104, according to Equation 5 below.
  • Ls(j, t) ← Ls(j, t) + w·(xt − θ) (Equation 5)
  • Similarly, the log likelihood of the time series of per-frame feature amounts against each model representing non-speech included in the non-speech model is denoted Ln(j, t), where j denotes one state of the non-speech model.
  • The search unit 108 corrects the log likelihood Ln(j, t) using the correction value calculated by the correction value calculation unit 104, according to Equation 6 below (not reproduced here; by symmetry with Equation 5 and consistently with the correction values described above, Ln(j, t) ← Ln(j, t) + w·(θ − xt)).
  • The search unit 108 then searches, over the corrected log likelihood time series, for the speech vocabulary or phoneme pattern or the non-speech pattern that maximizes the log likelihood for the time series of the input sound data, thereby obtaining a word string such as the utterance 3C illustrated in FIG. 3. For example, when Equation 4 above is used, the search unit 108 obtains the pattern sequence y that maximizes logL(y; θ). At this time, the search unit 108 determines the sections in which the corrected log likelihood of the vocabulary/phoneme model is larger than the corrected log likelihood of the non-speech model to be the second speech section; in the example illustrated in FIG. 3, the portion where the time series 3B is drawn as a waveform is determined as the second speech section L2. In other words, the search unit 108 calculates the log likelihoods Ls and Ln, corrects them using the likelihood correction values, and determines the sections in which the corrected values satisfy Ls(j, t) > Ln(j, t) as the second speech section. Although the above description determines the second speech section from the log likelihoods of Equation 4, the search unit 108 may instead use methods such as A* search or beam search; in that case, it may determine the sections in which the calculated speech score is higher than the non-speech score as the second speech section (see the sketch below).
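  • A minimal sketch of the per-frame correction and second-section decision (Equation 5 as given above, Equation 6 in its assumed symmetric form; the frame-wise comparison simplifies the actual search over model states):

      # Correct per-frame log likelihoods and mark the second speech section
      # where the corrected speech likelihood exceeds the non-speech one.
      import numpy as np

      def second_speech_section(Ls: np.ndarray, Ln: np.ndarray, x: np.ndarray,
                                theta: float, w: float = 0.1) -> np.ndarray:
          Ls_corr = Ls + w * (x - theta)   # Equation 5
          Ln_corr = Ln + w * (theta - x)   # Equation 6 (assumed symmetric form)
          return Ls_corr > Ln_corr         # boolean mask over frames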
  • Next, the parameter update unit 109 compares the first speech section determined by the speech determination unit 103 with the second speech section determined by the search unit 108, and updates the threshold θ, a parameter used by the speech determination unit 103 (step S106). Specifically, the parameter update unit 109 updates the value of the threshold θ for determining the first speech section according to the length of the first speech section and the length of the second speech section.
  • FIG. 4 is an explanatory diagram illustrating an example in which the first voice segment is longer than the second voice segment.
  • FIG. 5 is an explanatory diagram illustrating an example in which the first voice segment is shorter than the second voice segment.
  • As illustrated in FIG. 4, when the length of the first speech section is longer than the length of the second speech section, the parameter update unit 109 updates the threshold θ to a larger value.
  • Conversely, as illustrated in FIG. 5, when the length of the first speech section is shorter than the length of the second speech section, the parameter update unit 109 updates the threshold θ to a smaller value.
  • For example, the parameter update unit 109 updates the threshold θ using Equation 7 below (not reproduced here; an update consistent with the behaviour above is θ ← θ + ε·(L1 − L2), as sketched below).
  • Here, ε is a positive value indicating the step size, a parameter that adjusts the amount by which the threshold θ is changed in one update.
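  • A minimal sketch of the update, assuming Equation 7 has the form θ ← θ + ε·(L1 − L2), which matches the qualitative behaviour described above:

      # Threshold update (assumed form of Equation 7): raise theta when the
      # power-based first section overshoots the search-based second section,
      # lower it when the first section undershoots.
      def update_threshold(theta: float, len_first: float, len_second: float,
                           eps: float = 0.01) -> float:
          return theta + eps * (len_first - len_second)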
  • Note that the parameter update unit 109 may instead update the threshold θ based on the lengths of the non-speech sections. In that case, the speech determination unit 103 determines the sections in which the speech feature amount is smaller than the threshold θ as the first section, the search unit 108 determines the sections in which the corrected non-speech likelihood Ln is higher than the corrected speech likelihood Ls as the second section, and the parameter update unit 109 updates the value of the threshold θ according to the difference between the lengths of these sections.
  • Alternatively, the parameter update unit 109 may compare the lengths of the speech sections (or non-speech sections) and update the threshold θ by a predetermined amount according to their relative magnitude: for example, consistently with the direction of update described above, θ ← θ − ε when the length L2 of the second speech section is greater than the length L1 of the first speech section, and θ ← θ + ε when L2 is smaller than L1.
  • The above description assumes that the parameter update unit 109 updates the threshold θ each time one utterance or one speech section is determined; however, the update timing is not limited to this. For example, the parameter update unit 109 may update the threshold θ in response to an instruction from the speaker. The processing from step S101 to step S106 is then repeated for the next utterance or the next speech section using the updated threshold θ.
  • Alternatively, the processing from step S102 to step S106 may be performed again for the same utterance using the updated threshold θ, and this may be repeated not just once but a plurality of times.
  • As described above, in this embodiment the speech determination unit 103 calculates the feature amount indicating speech-likeness based on the time-series input sound, compares it with the threshold θ to determine speech sections (or non-speech sections), and thereby determines the first speech section, while the search unit 108 determines the second speech section based on the speech likelihood and non-speech likelihood calculated from the feature amounts used for speech recognition.
  • The parameter update unit 109 updates the threshold θ according to the difference between the length of the first speech section and the length of the second speech section, and the speech determination unit 103 determines the first speech section using the updated threshold θ.
  • Therefore, the speech recognition apparatus can suppress the adverse effects of sounds other than the recognition target and accurately estimate the target utterance section. That is, the correction value calculation unit 104 calculates the likelihood correction value from the feature amount indicating speech-likeness and the threshold θ, and the search unit 108 identifies speech based on the likelihoods corrected with that value, which makes it easy for the search unit 108 to recognize the target speech correctly and to judge everything else as non-speech.
  • Moreover, the parameter update unit 109 compares the first speech section and the second speech section and updates the threshold used by the speech determination unit 103 based on the comparison result. Therefore, even if the threshold is not set correctly for the noise environment, or when the noise environment fluctuates over time, the likelihood correction value can still be obtained accurately, realizing speech recognition that is more robust to noise.
  • Note that the search unit 108 can determine the speech section more accurately than the speech determination unit 103 because it uses more information, such as the vocabulary/phoneme model and the non-speech model, when determining the speech section.
  • FIG. 6 is a block diagram illustrating an example of a speech recognition system according to the second embodiment of the present invention.
  • As illustrated in FIG. 6, the speech recognition system according to the second embodiment includes a microphone 101, a framing unit 102, a speech determination unit 113, a correction value calculation unit 104, a feature amount calculation unit 105, a non-speech model storage unit 106, a vocabulary/phoneme model storage unit 107, a search unit 108, and a parameter update unit 119. That is, compared with the configuration of the first embodiment, it includes the speech determination unit 113 in place of the speech determination unit 103, and the parameter update unit 119 in place of the parameter update unit 109.
  • The speech determination unit 113 calculates the feature amount indicating speech-likeness (that is, the speech feature amount) based on the time-series input sound. The speech determination unit 113 then compares the speech feature amount with the threshold θ for classifying the input sound into speech or non-speech, and determines as the first speech section the section obtained by adding a margin (hereinafter the margin m) before and after the speech section or non-speech section determined with the threshold θ.
  • That is, the speech determination unit 113 determines as the first speech section a section in which the margin m is added before and after a section whose feature amount indicating speech-likeness is larger than the threshold θ.
  • Note that the value of the threshold θ may be a predetermined fixed value, or may be updated as needed as shown in the first embodiment; the following description assumes a predetermined fixed value.
  • The parameter update unit 119 updates the margin m according to the difference between the length of the first speech section and the length of the second speech section. That is, the parameter update unit 119 compares the first and second speech sections and updates the length of the margin m used by the speech determination unit 113, which then determines the first speech section using the updated margin m. The margin m updated by the parameter update unit 119 is thus a parameter used when the speech determination unit 113 determines the first speech section.
  • The microphone 101, the framing unit 102, the correction value calculation unit 104, the feature amount calculation unit 105, the non-speech model storage unit 106, the vocabulary/phoneme model storage unit 107, and the search unit 108 are the same as in the first embodiment.
  • The framing unit 102, the speech determination unit 113, the correction value calculation unit 104, the feature amount calculation unit 105, the search unit 108, and the parameter update unit 119 are realized by the CPU of a computer operating according to a program (the speech recognition program), or each of them may be realized by dedicated hardware.
  • First, the framing unit 102 cuts out the input sound collected by the microphone 101 for each frame (step S101).
  • Next, the speech determination unit 113 calculates the feature amount indicating the speech-likeness (the speech feature amount) of the input sound data cut out for each frame.
  • The method for calculating the feature amount indicating speech-likeness is the same as in the first embodiment.
  • The speech determination unit 113 then compares the feature amount indicating speech-likeness with the threshold θ to obtain a temporary speech section; the method for obtaining the temporary speech section is the same as the method for obtaining the first speech section in the first embodiment. That is, the speech determination unit 113 takes the sections in which the feature amount indicating speech-likeness is larger than the threshold θ as temporary speech sections, and determines the section obtained by adding the margin m before and after each temporary speech section as the first speech section (step S102), as sketched below.
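  • A minimal sketch of this margin addition on a per-frame mask (the names and the dilation-style implementation are assumptions, not from the patent):

      # First speech section: frames whose speech feature exceeds theta,
      # widened by a margin of m frames on each side.
      import numpy as np

      def first_speech_section(x: np.ndarray, theta: float, m: int) -> np.ndarray:
          temp = x > theta                 # temporary speech section
          mask = temp.copy()
          for i in np.flatnonzero(temp):
              mask[max(0, i - m): i + m + 1] = True  # add margin m before and after
          return mask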
  • FIG. 7 is an explanatory diagram illustrating the addition of margins to temporary speech sections.
  • In the example illustrated in FIG. 7, the speech determination unit 113 compares the feature amount indicated by the time series 7A with the threshold θ, and takes the portions larger than the threshold θ as the temporary speech sections 71 and 72.
  • The speech determination unit 113 then determines the sections extended by the margins 73a, 73b, and 73c before and after the temporary speech sections as the first speech section.
  • The subsequent processing, in which the correction value calculation unit 104 calculates the likelihood correction value, the feature amount calculation unit 105 calculates the feature amount used for speech recognition, and the search unit 108 searches for the word string and determines the second speech section, is the same as the processing in steps S103 to S105 of the first embodiment.
  • Next, the parameter update unit 119 compares the first speech section determined by the speech determination unit 113 with the second speech section determined by the search unit 108, and updates the margin m, a parameter used by the speech determination unit 113 (step S106).
  • Specifically, the parameter update unit 119 updates the value of the margin m added to the temporary speech section according to the length of the first speech section and the length of the second speech section.
  • The operation in which the parameter update unit 119 updates the value of the margin m is described with reference to FIGS. 4 and 5. As illustrated in FIG. 4, when the length L1 of the first speech section is longer than the length L2 of the second speech section, the parameter update unit 119 updates the margin m to be shorter.
  • Conversely, as illustrated in FIG. 5, when the length L1 of the first speech section is shorter than the length L2 of the second speech section, the parameter update unit 119 updates the margin m to be longer.
  • For example, the parameter update unit 119 updates the margin m using Equation 8 below (not reproduced here; an update consistent with this behaviour is m ← m + ε·(L2 − L1), as sketched below).
  • Here, ε is a positive value indicating the step size, a parameter that adjusts the amount by which the length of the margin m is changed in one update.
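  • A minimal sketch of the update, assuming Equation 8 has the form m ← m + ε·(L2 − L1), which matches the qualitative behaviour described above (the clamp at zero is an added implementation choice):

      # Margin update (assumed form of Equation 8): shorten the margin when
      # the first section overshoots the second, lengthen it otherwise.
      def update_margin(m: float, len_first: float, len_second: float,
                        eps: float = 0.01) -> float:
          return max(0.0, m + eps * (len_second - len_first))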
  • Note that the parameter update unit 119 may instead update the margin m based on the lengths of the non-speech sections.
  • In that case, the speech determination unit 113 determines as the first section the section obtained by adding the margin m to a temporary section in which the feature amount is smaller than the threshold θ, and the search unit 108 determines the sections in which the corrected non-speech likelihood Ln is higher than the corrected speech likelihood Ls as the second section.
  • The parameter update unit 119 may also update not only the length of the margin m but also the value of the threshold θ described in the first embodiment. Specifically, when the length of the first speech section is longer than the length of the second speech section, the parameter update unit 119 updates the margin m to be shorter and the threshold θ to a larger value.
  • Conversely, when the length of the first speech section is shorter than the length of the second speech section, the parameter update unit 119 updates the margin m to be longer and the threshold θ to a smaller value. The method of updating the threshold is the same as described in the first embodiment. In the above, the parameter update unit 119 updates the margin m according to the difference between the lengths of the speech sections; alternatively, it may compare the lengths of the speech or non-speech sections and update the margin m by a predetermined amount according to their relative magnitude, for example correcting m ← m + ε when the length L2 of the second speech section is greater than the length L1 of the first speech section, and m ← m − ε when L2 is smaller than L1.
  • The above description assumes that the parameter update unit 119 updates the margin m each time one utterance or one speech section is determined; however, the update timing is not limited to this. For example, the parameter update unit 119 may update the margin m in response to an instruction from the speaker. The parameter update unit 119 then repeats the processing from step S101 to step S106 for the next utterance or the next speech section using the updated margin m.
  • Alternatively, the processing from step S102 to step S106 may be performed again for the same utterance using the updated margin m, and this may be repeated not just once but a plurality of times. Next, the effects of this embodiment are described.
  • As described above, in this embodiment the speech determination unit 113 determines as the first speech section the section obtained by adding the margin m before and after a section in which the speech feature amount is larger than the threshold θ, and the parameter update unit 119 updates the length of the margin m added before and after that section. The speech determination unit 113 then determines the section with the updated margin m added before and after it as the first speech section.
  • Even with this configuration, the speech recognition apparatus can suppress the adverse effects of sounds other than the recognition target and accurately estimate the target speech section.
  • In general, consonants have less power than vowels and are easily confused with noise, so the beginning and end of a speech section are easily lost.
  • The speech recognition apparatus can prevent such loss by treating the section whose beginning and end are easily lost as a temporary speech section and adding the margin m to it. Note that if the margin m is set too long, sounds other than the speech recognition target may be recognized as speech; the length of the margin m should therefore be set appropriately for the background noise.
  • In this embodiment, the parameter update unit 119 appropriately updates the length of the margin m based on the length of the first speech section and the length of the second speech section, so that robust speech recognition can be realized and the object of the present invention achieved.
  • FIG. 8 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
  • As illustrated in FIG. 8, the speech recognition system according to the present invention includes: speech determination means 81 (for example, the speech determination unit 103) that calculates a speech feature amount (for example, the amplitude power), a feature amount indicating speech-likeness, based on a time-series input sound (for example, input sound data cut out for each frame), compares the speech feature amount with a threshold (for example, the threshold θ) defined as the value for classifying the input sound into speech or non-speech to determine speech sections (for example, sections where the speech feature amount is larger than the threshold θ) or non-speech sections (for example, sections where the speech feature amount is smaller than the threshold θ), and determines those sections, or sections with a margin of a specified length (for example, the margin m) added before and after them, as the first speech section; search means 82 (for example, the search unit 108) that determines, as the second speech section, the section subject to speech recognition (for example, a section where the likelihood of speech is higher than the likelihood of non-speech) based on the speech likelihood and non-speech likelihood calculated from the speech recognition feature amount, the feature amount used for speech recognition (for example, calculated using Equation 4); and parameter update means 83 (for example, the parameter update unit 109 or the parameter update unit 119) that updates at least one of the threshold and the margin used by the speech determination means 81 when determining the first speech section, according to the difference between the length of the first speech section and the length of the second speech section. The speech determination means 81 determines the first speech section using the threshold or margin updated by the parameter update means 83.
  • Part or all of the above embodiments can also be described as follows: a speech recognition system comprising speech determination means (for example, the speech determination unit 103) that calculates a speech feature amount (for example, the amplitude power), a feature amount indicating speech-likeness, based on a time-series input sound (for example, input sound data cut out for each frame), compares the speech feature amount with a threshold (for example, the threshold θ) defined as the value for classifying the input sound into speech or non-speech to determine speech sections (for example, sections where the speech feature amount is larger than the threshold θ) or non-speech sections (for example, sections where the speech feature amount is smaller than the threshold θ), and determines those sections, or sections with a margin of a specified length (for example, the margin m) added before and after them, as the first speech section; search means (for example, the search unit 108) that determines, as the second speech section, the section subject to speech recognition (for example, a section where the likelihood of speech is higher than the likelihood of non-speech) based on the speech likelihood and non-speech likelihood calculated from the speech recognition feature amount (for example, calculated using Equation 4); and parameter update means (for example, the parameter update unit 109 or the parameter update unit 119) that updates at least one of the threshold and the margin according to the difference between the lengths of the first and second speech sections, wherein the speech determination means determines the first speech section using the threshold or margin updated by the parameter update means.
  • The speech recognition system may be configured so that the parameter update means increases the threshold when the length of the first speech section is longer than the length of the second speech section, and decreases the threshold when the length of the first speech section is shorter than the length of the second speech section.
  • The speech recognition system may be configured so that the parameter update means shortens the length of the margin when the length of the first speech section is longer than the length of the second speech section, and lengthens the margin when the length of the first speech section is shorter than the length of the second speech section.
  • The speech recognition system may further comprise vocabulary/phoneme model storage means (for example, the vocabulary/phoneme model storage unit 107) for storing a vocabulary/phoneme model representing the vocabulary or phoneme patterns of the speech subject to speech recognition, and non-speech model storage means (for example, the non-speech model storage unit 106) for storing a non-speech model representing the patterns of non-speech subject to speech recognition. In that case, the search means calculates, based on the speech recognition feature amount, the likelihood of the vocabulary/phoneme model as the likelihood of speech and the likelihood of the non-speech model as the likelihood of non-speech; when the maximum speech likelihood is greater than the maximum non-speech likelihood, it searches for the speech vocabulary or phoneme pattern that maximizes the speech likelihood, and when the maximum non-speech likelihood is greater than the maximum speech likelihood, it searches for the non-speech pattern that maximizes the non-speech likelihood.
  • The speech recognition system may further comprise correction value calculation means (for example, the correction value calculation unit 104) that calculates, according to the difference between the speech feature amount and the threshold, a correction value used to correct at least one of the likelihood of the vocabulary/phoneme model and the likelihood of the non-speech model (for example, applied using Equation 5 or Equation 6), and the search means may determine the second speech section based on the likelihoods corrected with the correction value.
  • The speech recognition system may be configured so that the correction value calculation means calculates the value obtained by subtracting the threshold from the speech feature amount as the correction value for the likelihood of the vocabulary/phoneme model (for example, using Equation 2), and the value obtained by subtracting the speech feature amount from the threshold as the correction value for the likelihood of the non-speech model (for example, using Equation 3).
  • The speech recognition system may be configured so that the speech determination means calculates, as the speech feature amount based on the time-series input sound, the amplitude power, the signal-to-noise ratio, the number of zero crossings, a likelihood ratio based on a Gaussian mixture model, the pitch frequency, or a combination of these. While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments; various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within its scope. This application claims priority based on Japanese Patent Application No. 2009-280927 filed on December 10, 2009, the entire disclosure of which is incorporated herein.

Abstract

The invention relates to a speech recognition system capable of correctly estimating the utterance sections to be recognized while suppressing the negative influences of sounds that are not to be recognized. Speech determination means (81) calculates speech feature amounts based on a time-series input sound, and discriminates speech sections or non-speech sections from other sections by comparing the speech feature amounts with a threshold value determined as the value by which the sections of the input sound are categorized. The speech determination means (81) then determines, as first speech sections, those discriminated sections or the sections obtained by adding a margin of a designated length before and after each discriminated section. Based on the speech and non-speech likelihoods calculated from the speech feature amounts, search means (82) determines, as second speech sections, the sections to which speech recognition is to be applied. Parameter update means (83) updates at least one of the threshold value and the margin according to the differences between the lengths of the respective first speech sections and the lengths of the corresponding second speech sections. The speech determination means (81) determines the first speech sections using whichever of the threshold value or the margin has been updated by the parameter update means (83).
PCT/JP2010/071619 2009-12-10 2010-11-26 Speech recognition system, method and program WO2011070972A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/514,894 US9002709B2 (en) 2009-12-10 2010-11-26 Voice recognition system and voice recognition method
JP2011545189A JP5621783B2 (ja) 2009-12-10 2010-11-26 Speech recognition system, speech recognition method and speech recognition program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-280927 2009-12-10
JP2009280927 2009-12-10

Publications (1)

Publication Number Publication Date
WO2011070972A1 true WO2011070972A1 (fr) 2011-06-16

Family

ID=44145517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/071619 WO2011070972A1 (fr) 2009-12-10 2010-11-26 Speech recognition system, method and program

Country Status (3)

Country Link
US (1) US9002709B2 (fr)
JP (1) JP5621783B2 (fr)
WO (1) WO2011070972A1 (fr)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012020717A1 (fr) * 2010-08-10 2012-02-16 日本電気株式会社 Dispositif de détermination d'intervalle de parole, procédé de détermination d'intervalle de parole, et programme de détermination d'intervalle de parole
JP2013013092A (ja) * 2011-06-29 2013-01-17 Gracenote Inc 双方向ストリーミングコンテンツ処理方法、装置、及びシステム
JP2013228459A (ja) * 2012-04-24 2013-11-07 Nippon Telegr & Teleph Corp <Ntt> 音声聴取装置とその方法とプログラム
CN103561643A (zh) * 2012-04-24 2014-02-05 松下电器产业株式会社 语音辨别能力判定装置、语音辨别能力判定系统、助听器增益决定装置、语音辨别能力判定方法及其程序
JP2014142626A (ja) * 2013-01-24 2014-08-07 ▲華▼▲為▼終端有限公司 音声識別方法および装置
JP2014142627A (ja) * 2013-01-24 2014-08-07 ▲華▼▲為▼終端有限公司 音声識別方法および装置
WO2015059946A1 (fr) * 2013-10-22 2015-04-30 日本電気株式会社 Dispositif de détection de la parole, procédé de détection de la parole et programme
JP2015206906A (ja) * 2014-04-21 2015-11-19 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation 音声検索方法、音声検索装置、並びに、音声検索装置用のプログラム
JPWO2016143125A1 (ja) * 2015-03-12 2017-06-01 三菱電機株式会社 音声区間検出装置および音声区間検出方法
JP2018156044A (ja) * 2017-03-21 2018-10-04 株式会社東芝 音声認識装置、音声認識方法及び音声認識プログラム
WO2023181107A1 (fr) * 2022-03-22 2023-09-28 日本電気株式会社 Dispositif de détection vocale, procédé de détection vocale, et support d'enregistrement

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6235280B2 (ja) * 2013-09-19 2017-11-22 株式会社東芝 音声同時処理装置、方法およびプログラム
US9633019B2 (en) 2015-01-05 2017-04-25 International Business Machines Corporation Augmenting an information request
CN106601238A (zh) * 2015-10-14 2017-04-26 阿里巴巴集团控股有限公司 一种应用操作的处理方法和装置
US9984688B2 (en) 2016-09-28 2018-05-29 Visteon Global Technologies, Inc. Dynamically adjusting a voice recognition system
US10811007B2 (en) * 2018-06-08 2020-10-20 International Business Machines Corporation Filtering audio-based interference from voice commands using natural language processing
WO2022198474A1 (fr) 2021-03-24 2022-09-29 Sas Institute Inc. Structure d'analyse de parole avec support pour grands corpus de n-grammes
US11049502B1 (en) * 2020-03-18 2021-06-29 Sas Institute Inc. Speech audio pre-processing segmentation
CN113409763B (zh) * 2021-07-20 2022-10-25 北京声智科技有限公司 Speech correction method and apparatus, and electronic device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4700392A (en) * 1983-08-26 1987-10-13 Nec Corporation Speech signal detector having adaptive threshold values
US5305422A (en) * 1992-02-28 1994-04-19 Panasonic Technologies, Inc. Method for determining boundaries of isolated words within a speech signal
US6471420B1 (en) * 1994-05-13 2002-10-29 Matsushita Electric Industrial Co., Ltd. Voice selection apparatus voice response apparatus, and game apparatus using word tables from which selected words are output as voice selections
JP3255584B2 (ja) * 1997-01-20 2002-02-12 ロジック株式会社 Sound detection device and method
US6718302B1 (en) * 1997-10-20 2004-04-06 Sony Corporation Method for utilizing validity constraints in a speech endpoint detector
JP4577543B2 (ja) 2000-11-21 2010-11-10 ソニー株式会社 Model adaptation device, model adaptation method, recording medium, and speech recognition device
JP2007017736A (ja) 2005-07-08 2007-01-25 Mitsubishi Electric Corp Speech recognition device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH056193A (ja) * 1990-08-15 1993-01-14 Ricoh Co Ltd Speech section detection system and speech recognition device
JPH0643895A (ja) * 1992-07-22 1994-02-18 Nec Corp Speech recognition device
JPH10254475A (ja) * 1997-03-14 1998-09-25 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method
JP2001013988A (ja) * 1999-06-29 2001-01-19 Toshiba Corp Speech recognition method and device
JP2002091468A (ja) * 2000-09-12 2002-03-27 Pioneer Electronic Corp Speech recognition system
JP2005181458A (ja) * 2003-12-16 2005-07-07 Canon Inc Signal detection device and method, and noise tracking device and method
WO2009069662A1 (fr) * 2007-11-27 2009-06-04 Nec Corporation Speech detection system, speech detection method, and speech detection program
JP2009175179A (ja) * 2008-01-21 2009-08-06 Denso Corp Speech recognition device, program, and utterance signal extraction method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2012020717A1 (ja) * 2010-08-10 2013-10-28 日本電気株式会社 Speech section determination device, speech section determination method, and speech section determination program
WO2012020717A1 (fr) * 2010-08-10 2012-02-16 日本電気株式会社 Speech section determination device, speech section determination method, and speech section determination program
JP5725028B2 (ja) * 2010-08-10 2015-05-27 日本電気株式会社 Speech section determination device, speech section determination method, and speech section determination program
US9160837B2 (en) 2011-06-29 2015-10-13 Gracenote, Inc. Interactive streaming content apparatus, systems and methods
JP2013013092A (ja) * 2011-06-29 2013-01-17 Gracenote Inc Interactive streaming content processing method, apparatus, and system
US11935507B2 (en) 2011-06-29 2024-03-19 Gracenote, Inc. Machine-control of a device based on machine-detected transitions
US11417302B2 (en) 2011-06-29 2022-08-16 Gracenote, Inc. Machine-control of a device based on machine-detected transitions
US10783863B2 (en) 2011-06-29 2020-09-22 Gracenote, Inc. Machine-control of a device based on machine-detected transitions
US10134373B2 (en) 2011-06-29 2018-11-20 Gracenote, Inc. Machine-control of a device based on machine-detected transitions
US9479880B2 (en) 2012-04-24 2016-10-25 Panasonic Intellectual Property Management Co., Ltd. Speech-sound distinguishing ability determination apparatus, speech-sound distinguishing ability determination system, hearing aid gain determination apparatus, speech-sound distinguishing ability determination method, and program thereof
CN103561643B (zh) * 2012-04-24 2016-10-05 松下知识产权经营株式会社 Speech discrimination ability determination device, system and method, and hearing aid gain determination device
JP2013228459A (ja) * 2012-04-24 2013-11-07 Nippon Telegr & Teleph Corp <Ntt> Speech listening device, method, and program
CN103561643A (zh) * 2012-04-24 2014-02-05 松下电器产业株式会社 Speech discrimination ability determination device, speech discrimination ability determination system, hearing aid gain determination device, speech discrimination ability determination method, and program therefor
JP2014142626A (ja) * 2013-01-24 2014-08-07 華為終端有限公司 Voice identification method and apparatus
US9607619B2 (en) 2013-01-24 2017-03-28 Huawei Device Co., Ltd. Voice identification method and apparatus
US9666186B2 (en) 2013-01-24 2017-05-30 Huawei Device Co., Ltd. Voice identification method and apparatus
JP2014142627A (ja) * 2013-01-24 2014-08-07 華為終端有限公司 Voice identification method and apparatus
WO2015059946A1 (fr) * 2013-10-22 2015-04-30 日本電気株式会社 Speech detection device, speech detection method, and program
JPWO2015059946A1 (ja) * 2013-10-22 2017-03-09 日本電気株式会社 Speech detection device, speech detection method, and program
JP2015206906A (ja) * 2014-04-21 2015-11-19 International Business Machines Corporation Speech search method, speech search device, and program for a speech search device
JPWO2016143125A1 (ja) * 2015-03-12 2017-06-01 三菱電機株式会社 Speech section detection device and speech section detection method
US10579327B2 (en) 2017-03-21 2020-03-03 Kabushiki Kaisha Toshiba Speech recognition device, speech recognition method and storage medium using recognition results to adjust volume level threshold
JP2018156044A (ja) * 2017-03-21 2018-10-04 株式会社東芝 Speech recognition device, speech recognition method, and speech recognition program
WO2023181107A1 (fr) * 2022-03-22 2023-09-28 日本電気株式会社 Voice detection device, voice detection method, and recording medium

Also Published As

Publication number Publication date
US20120239401A1 (en) 2012-09-20
US9002709B2 (en) 2015-04-07
JPWO2011070972A1 (ja) 2013-04-22
JP5621783B2 (ja) 2014-11-12

Similar Documents

Publication Publication Date Title
JP5621783B2 (ja) Speech recognition system, speech recognition method, and speech recognition program
US9536525B2 (en) Speaker indexing device and speaker indexing method
US8880409B2 (en) System and method for automatic temporal alignment between music audio signal and lyrics
US9165555B2 (en) Low latency real-time vocal tract length normalization
JP4911034B2 (ja) Speech discrimination system, speech discrimination method, and speech discrimination program
JP5949550B2 (ja) Speech recognition device, speech recognition method, and program
US8532991B2 (en) Speech models generated using competitive training, asymmetric training, and data boosting
EP1557822B1 (fr) Adaptation of automatic speech recognition using user corrections
US20140046662A1 (en) Method and system for acoustic data selection for training the parameters of an acoustic model
EP1675102A2 (fr) Method for extracting a feature vector for speech recognition
US20110238417A1 (en) Speech detection apparatus
JPWO2010128560A1 (ja) Speech recognition device, speech recognition method, and speech recognition program
WO2010070839A1 (fr) Sound detection device and program, and parameter adjustment method
JP6481939B2 (ja) Speech recognition device and speech recognition program
JP5621786B2 (ja) Voice detection device, voice detection method, and voice detection program
JPH11184491 (ja) Speech recognition device
JP2008026721A (ja) Speech recognition device, speech recognition method, and speech recognition program
JP4749990B2 (ja) Speech recognition device
JP4576612B2 (ja) Speech recognition method and speech recognition device
Wang et al. Improved Mandarin speech recognition by lattice rescoring with enhanced tone models
JP2014092751A (ja) Acoustic model generation device, method, and program
JP2006071956A (ja) Speech signal processing device and program
JP2004163448A (ja) Speech recognition device, method, and program therefor
JP2009025388A (ja) Speech recognition device
Wang et al. An Algorithm for Voiced/Unvoiced decision and pitch estimation in speech feature extraction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10835893

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011545189

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 13514894

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10835893

Country of ref document: EP

Kind code of ref document: A1