WO2005020212A1 - Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device - Google Patents
- Publication number
- WO2005020212A1, PCT/JP2004/010841 (JP 2004010841 W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- band
- level
- input signal
- signal
- normalization
- Prior art date
Links
- 238000012545 processing Methods 0.000 title claims description 43
- 238000004458 analytical method Methods 0.000 title abstract description 65
- 238000010606 normalization Methods 0.000 claims abstract description 97
- 238000000034 method Methods 0.000 claims description 96
- 238000004364 calculation method Methods 0.000 claims description 25
- 238000006243 chemical reaction Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 14
- 238000009826 distribution Methods 0.000 claims description 11
- 238000000605 extraction Methods 0.000 abstract description 3
- 238000001228 spectrum Methods 0.000 description 43
- 238000010586 diagram Methods 0.000 description 29
- 230000008859 change Effects 0.000 description 16
- 238000001514 detection method Methods 0.000 description 14
- 238000004891 communication Methods 0.000 description 9
- 230000000694 effects Effects 0.000 description 8
- 230000006978 adaptation Effects 0.000 description 5
- 230000007423 decrease Effects 0.000 description 5
- 230000003595 spectral effect Effects 0.000 description 5
- 239000000654 additive Substances 0.000 description 3
- 230000000996 additive effect Effects 0.000 description 3
- 230000007613 environmental effect Effects 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 230000015654 memory Effects 0.000 description 3
- 230000002411 adverse Effects 0.000 description 2
- 238000012937 correction Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000010561 standard procedure Methods 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- 230000009466 transformation Effects 0.000 description 2
- 230000005534 acoustic noise Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000002238 attenuated effect Effects 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000001816 cooling Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000002542 deteriorative effect Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 238000003860 storage Methods 0.000 description 1
- 238000001308 synthesis method Methods 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
- 230000003936 working memory Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Definitions
- Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
- the present invention relates to a signal analysis device that analyzes input speech and acoustic signals, a signal processing device, and a speech recognition device using the signal analysis device.
- the present invention also relates to a signal analysis program, a signal processing program, and a speech recognition program for causing a computer to execute such processing.
- the present invention also relates to a recording medium on which such a program is recorded.
- the present invention also relates to an electronic device equipped with such a signal analyzer.
- a section of about several tens of milliseconds is cut out of the input signal while shifting at intervals of about several to several tens of milliseconds, and is used as an analysis frame. Acoustic parameters are then calculated from the waveform of the input signal in each analysis frame, forming a time series of acoustic parameters.
- the time series of acoustic parameters is matched against time-series patterns (standard patterns) of pre-registered acoustic parameters, and the standard pattern most similar to the input is output as the recognition result.
- acoustic parameters are calculated in advance from a large amount of data, the statistics of the acoustic parameters calculated for each speech unit are obtained, and a probabilistic acoustic model is created.
- a stochastic acoustic model for each voice unit is connected to create a word model or a sentence model.
- the likelihoods of the word models or sentence models are calculated and compared, and the word model or sentence model with the highest likelihood is taken as the recognition result.
- units such as phonemes, syllables, or words are used as sound units.
- Non-Patent Document 1 describes such a signal analysis method.
- MFCC (Mel Frequency Cepstrum Coefficient)
- FIG. 1 is a flowchart showing a procedure for obtaining the MFCC.
- the MFCC analysis method is described below with reference to FIG.
- a speech waveform is input to the signal analyzer for each analysis frame (step S101), and a Hamming window function is applied so that abrupt changes do not occur at both ends of the cut-out section of the frame (step S102).
- the energy on the linear frequency axis in each frame is obtained using FFT (Fast Fourier Transform) (step S103).
- the energy on the linear frequency axis is grouped for each equally divided band on the mel frequency axis, and converted into band energy (step S104).
- the converted band energy is logarithmically converted for each band (step S105).
- the MFCC is obtained by applying a cosine transform to the logarithmic band energies (step S106).
- the obtained MFCC is output from the signal analyzer (step S107).
- the cepstrum coefficient obtained by dividing equally on the mel frequency axis is called a mel frequency cepstrum coefficient (MFCC).
- the mel frequency is a frequency unit that accords with human auditory characteristics, with finer resolution at low frequencies than at high frequencies. For this reason, MFCCs are known to give better speech recognition performance than cepstra of the same order computed on the linear frequency axis.
- steps S101 to S105 may be referred to collectively as a frequency analysis step (step S201), and steps S106 to S107 as a parameter conversion step (step S202).
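The MFCC procedure of steps S101 through S107 can be sketched as follows. This is a hedged illustration, not the document's implementation: the sampling rate, FFT length, number of mel bands, and the HTK-style mel formula are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Hz to mel scale (a common formula; an assumption here)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr):
    """Triangular filters equally spaced on the mel frequency axis (step S104)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        l, c, r = bins[b], bins[b + 1], bins[b + 2]
        for k in range(l, c):
            fb[b, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[b, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    """Steps S102-S106: window, FFT, band energy, log, cosine transform."""
    frame = frame * np.hamming(len(frame))        # S102: Hamming window
    spec = np.abs(np.fft.rfft(frame)) ** 2        # S103: power spectrum
    band_energy = fb @ spec                       # S104: mel band energies
    log_e = np.log(band_energy + 1e-10)           # S105: log conversion
    n = len(log_e)
    # S106: cosine transform (DCT-II) of the log band energies
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                 * np.arange(n_ceps)[:, None])
    return dct @ log_e

sr, n_fft = 16000, 512
fb = mel_filterbank(24, n_fft, sr)
frame = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr)  # dummy 440 Hz frame
coeffs = mfcc_frame(frame, fb)
print(coeffs.shape)  # (13,)
```

The split into a frequency analysis step (S201) and a parameter conversion step (S202) corresponds to everything up to `log_e` versus the final cosine transform.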
- Non-Patent Document 2 discloses a spectral subtraction (SS) method as a method of reducing the influence of additive noise.
- in the SS method, the input audio signal is frequency-analyzed to obtain an input amplitude spectrum or power (squared) spectrum. The noise spectrum estimated in a noise section is multiplied by a predetermined spectrum subtraction coefficient, and the scaled noise estimate is subtracted from the input spectrum to suppress noise components.
- Patent Document 1 discloses an improvement of the SS method that suppresses noise components with reduced band division.
- Non-Patent Document 1 discloses the CMS method (Cepstrum Mean Subtraction) as a method for reducing the influence of multiplicative distortion.
- This method is also called the CMN method (Cepstrum Mean Normalization). It is based on the assumption that the multiplicative distortion appears as the long-term average of the cepstrum of the uttered speech. Specifically, by subtracting the average of the cepstrum coefficients of the input speech from each cepstrum coefficient, the influence of distortion due to the characteristics of the acoustic path, such as a line or a microphone, can be reduced. This is equivalent to subtracting the distortion in the log-spectrum domain, since the cepstrum is obtained from the log spectrum by a cosine transform.
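The CMS/CMN idea can be shown in a few lines: a constant channel distortion is additive in the cepstral domain, so subtracting the long-term average removes it. The data below are dummy values for illustration.

```python
import numpy as np

def cms(cepstra):
    """cepstra: (n_frames, n_ceps) array; subtract the per-coefficient mean
    over the utterance (the long-term cepstrum average)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
clean = rng.normal(size=(100, 13))      # dummy cepstrum sequence
channel = np.linspace(0.5, -0.5, 13)    # constant channel/microphone distortion
distorted = clean + channel             # multiplicative distortion = cepstral offset
restored = cms(distorted)
# After CMS the constant offset is gone: the result equals CMS of the clean data.
print(np.allclose(restored, cms(clean)))  # True
```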
- Non-Patent Document 3 and Patent Document 2 propose the E-CMN method, an improvement of the CMS method.
- The E-CMN method calculates the cepstrum average separately for voice sections and non-voice sections, and performs normalization separately for each. This method can reduce the misrecognition rate.
- in any of the above methods (the SS method, its simplified variants, and the E-CMN method), voice detection is required to distinguish voice sections from non-voice sections.
- standard voice detection techniques are disclosed in the voice communication standards used for mobile phones. Voice detection is generally performed by temporally dividing the input signal into speech sections and noise sections based on short-time changes in the energy of the input signal, changes in spectrum shape, and so on.
- Patent Document 1: Japanese Patent Application Laid-Open No. 2001-228893
- Patent Document 2: Japanese Patent Application Laid-Open No. H10-254494
- Non-Patent Document 1: Kiyohiro Kano et al. (eds.), "Speech Recognition System", 1st edition, Ohmsha, May 15, 2001, p. 13-15
- Non-Patent Document 3: Makoto Shokai and 2 others, "Model adaptation method E-CMN/PMC based on cepstrum mean normalization and HMM composition, and its application to in-car speech recognition", IEICE Transactions, The Institute of Electronics, Information and Communication Engineers, 1997, Vol. J80-D-II, No. 10, p. 2
- the signal analysis methods described above, and the speech recognition methods that use them, have the following problems.
- since the SS method subtracts the spectrum of a noise section from that of a speech section, it performs well when estimating input speech in an environment with little noise, provided the noise power is low and the noise spectral shape itself does not change.
- however, because noise sections are erroneously matched as speech sections, high recognition accuracy cannot be obtained overall. To prevent this drop in recognition accuracy, some form of adaptive spectrum correction is required.
- One of the methods is to normalize the noise spectrum like the E-CMN method.
- the E-CMN method has the following problems.
- in the E-CMN method, the cepstrum average is obtained independently for the speech section and the noise section, so the line characteristics can be normalized more accurately.
- since the spectrum shape of the noise section can be flattened, the matching accuracy for noise sections, which the SS method cannot address, can be improved.
- however, the separation of voice and noise sections depends on the accuracy of voice section detection. In a high-noise environment, the segmentation accuracy of voice sections drops and normalization is performed with an erroneous cepstrum average value, which adversely affects recognition accuracy.
- the problems of voice detection are described below.
- noise spectrum estimation is performed on sections determined to be noise during voice section detection.
- when no noise section precedes the speech or the noise changes during speech, a noise section long enough for estimating the noise spectrum cannot be obtained. As a result, noise adaptation either cannot be performed or is performed with an incorrect noise estimate.
- an object of the present invention is to provide a signal analysis device and a signal processing device with high speech recognition accuracy even in a high noise environment.
- Another object of the present invention is to provide a signal analysis device and a signal processing device that can provide stable speech recognition accuracy even when speech without any noise section is input or when the noise level changes gradually during speech.
- Still another object of the present invention is to provide a speech recognition device that is not easily affected by noise and distortion of acoustic characteristics.
- Still another object of the present invention is to provide a speech recognition device improved so that speech recognition accuracy at a low SN ratio is improved.
- Still another object of the present invention is to provide an electronic device equipped with such a voice recognition device.
- Still another object of the present invention is to provide a signal analysis program, a signal processing program, and a speech recognition program improved so as to perform speech recognition that is not easily affected by noise and distortion of acoustic characteristics.
- Still another object of the present invention is to provide a signal analysis program, a signal processing program, and a speech recognition program that are improved so that the speech recognition accuracy at a low SN ratio is improved.
- Still another object of the present invention is to provide a recording medium on which such a program is recorded.
- a signal analysis device of the present invention includes frequency band dividing means for dividing an input signal into signals in a plurality of frequency bands, band energy extracting means for extracting band energy for each band from the input signal divided into each band, and normalizing means for normalizing the extracted band energy for each band.
- signal analysis can be performed without explicitly detecting a voice section, so that a voice section detection error can be avoided.
- here, normalization means reducing the shift of the energy distribution of the input signal caused by environmental factors such as the type and magnitude of noise, line characteristics, and microphone input sensitivity. Specifically, normalization is performed by subtracting the average of the energy distribution of the signal to attenuate the DC component, subtracting the value of the environmental noise, and so on. In this specification, normalization also includes controlling the spread of the energy distribution of the input level.
- in one configuration, the normalizing means is a low-frequency cutoff filter that attenuates the DC component of the band energy extracted for each band. Since the low-frequency components, including the DC component, of the input signal are attenuated, normalization can be performed with the simplest configuration.
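A low-frequency cutoff applied to the time series of band energies in one band can be sketched as a first-order DC blocker. The one-pole coefficient is an illustrative assumption, not a value from the document.

```python
import numpy as np

def dc_removal(x, a=0.95):
    """y[t] = x[t] - x[t-1] + a * y[t-1]: a standard first-order
    DC-blocking (high-pass) filter applied sample by sample."""
    y = np.zeros(len(x))
    prev_x = prev_y = 0.0
    for t, xt in enumerate(x):
        y[t] = xt - prev_x + a * prev_y
        prev_x, prev_y = xt, y[t]
    return y

# A constant (DC) band-energy offset decays toward zero after filtering.
band_energy = np.full(200, 5.0)
out = dc_removal(band_energy)
print(abs(out[-1]) < 1e-3)  # True: the DC component has been attenuated
```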
- the signal analysis device of the present invention may further include level calculation means for calculating, from the sequence of extracted band energies, a conditional average value for each band as a first level.
- in this case, the normalizing means subtracts, from the band energy extracted for each band, a value obtained by multiplying the first level of the same band by a predetermined coefficient.
- the band energy of the input signal is thus normalized using the calculated first level; that is, more accurate normalization can be performed for each band.
- the conditional average values of the band energy are not limited to a signal level and a noise level; they may be the level of noise A and the level of noise B, or the level of a specific signal X and the level of a specific signal Y. Depending on the usage environment and purpose of the signal analyzer, they can be classified into two, three, or more levels. In this specification, the level serving as the reference among these is referred to as the first level, and the other levels as the second level, third level, and so on. The noise level is mainly used as the first level, and the audio level as the second level.
- the noise level denotes the average of the set of input band energies whose relatively low energy is presumed to represent noise, and the audio level denotes the average of the set whose relatively high energy is presumed to represent speech.
- the average of the energy divided under a condition such as its magnitude, as in the noise level or the audio level, is referred to as a "conditional average".
- in an actual energy distribution, the set with relatively low energy and the set with relatively high energy are not cleanly separated, so it is desirable to handle the intermediate input range continuously. In other words, it is desirable to determine the likelihood of speech and the likelihood of noise from the magnitude of the band energy in each frame using continuous values, without detecting speech sections.
- the level calculation means may calculate, in addition to the first level, at least one conditional average value per band different from the first level as a second level from the extracted band energy series. The normalizing means then subtracts, from the band energy extracted for each band, values obtained by multiplying the first level and the second level of the same band by predetermined coefficients.
- the band energy is thus normalized using both the calculated first and second levels, and since the amount subtracted from the band energy is determined using the calculated audio level, more accurate normalization is possible. Note that the conditional average values per band different from the first level are not limited to one; there may be several.
- the signal analyzer may further include normalization coefficient obtaining means that obtains one or more normalization coefficients according to the value of the band energy. A value obtained by multiplying the first level, or the first level and the other levels, of the same band by the normalization coefficients is subtracted from the band energy extracted for each band.
- one or more normalization coefficients are obtained according to how the band energy compares with the conditional average values of each band.
- the amount of subtraction can thus be adjusted to reflect one or more conditional averages, enabling more accurate normalization.
- the signal processing device of the present invention sequentially normalizes a signal, and includes level calculation means for updating and storing a first level based on an input signal, normalizing means for subtracting from the input signal a value obtained by multiplying the first level by a predetermined coefficient, and update coefficient obtaining means for obtaining an update coefficient based on the difference between the first level and the input signal.
- the level calculation means makes the first level a conditional average value of the input signal by using the update coefficient to bring the first level closer to the input signal.
- with this configuration, a one-dimensional input signal, such as the input signal of a specific frequency band, can be normalized. That is, by detecting the spread of the energy distribution of the input from the input energy level, obtaining the update coefficient, and bringing the first level closer to the input signal, fluctuations of the energy distribution of the input signal due to the environment can be suppressed.
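The sequential scheme above can be sketched as follows. The specific update-coefficient schedule (fast adaptation for inputs at or below the stored level, slow adaptation above it) is an illustrative assumption, not the document's actual rule; the class and parameter names are hypothetical.

```python
import numpy as np

class SequentialNormalizer:
    def __init__(self, init_level=0.0, norm_coef=1.0):
        self.level = init_level      # the "first level" (e.g. a noise level)
        self.norm_coef = norm_coef   # predetermined normalization coefficient

    def update_coef(self, x):
        # Inputs at or below the level are likely noise: adapt quickly.
        # Inputs well above it are likely speech: adapt slowly.
        return 0.2 if x <= self.level else 0.01

    def step(self, x):
        mu = self.update_coef(x)
        self.level += mu * (x - self.level)  # bring the level closer to the input
        return x - self.norm_coef * self.level

rng = np.random.default_rng(2)
noise = rng.normal(10.0, 0.1, size=500)  # simulated noise floor around 10
sn = SequentialNormalizer()
out = [sn.step(x) for x in noise]
# After adaptation the stored level tracks the noise floor,
# so the normalized outputs hover near zero.
print(abs(sn.level - 10.0) < 0.5)  # True
```

Because the level is a running, conditionally weighted average, no explicit voice/noise segmentation is needed, which is the point of the sequential formulation.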
- the signal processing device includes a normalization coefficient acquisition unit that acquires a normalization coefficient based on a difference between the first level and the input signal.
- the normalization means subtracts a value obtained by multiplying the first level by a normalization coefficient from an input signal, and controls a subtraction amount according to the input level.
- the level calculation means updates and stores a plurality of levels based on an input signal.
- the normalizing means subtracts a value obtained by multiplying each of the plurality of levels by a predetermined coefficient from the input signal.
- the update coefficient obtaining means obtains update coefficients of a plurality of levels based on a difference between the first level and the input signal.
- the level calculation means updates the obtained plurality of levels using the update coefficients of the plurality of levels, and performs normalization suitable for the distribution of the input.
- the level calculation means updates and stores a plurality of levels based on the input signal.
- the normalizing means subtracts a value obtained by multiplying each of the plurality of levels by a predetermined coefficient from the input signal.
- the update coefficient obtaining means obtains update coefficients of a plurality of levels based on a difference between the first level and the input signal.
- the normalization coefficient obtaining means obtains a plurality of normalization coefficients corresponding to a plurality of levels.
- the normalizing means subtracts from the input signal the values obtained by multiplying each of the plurality of levels by its corresponding normalization coefficient, performing normalization appropriate to the input level.
- the level calculators and normalizers for each band of the signal analyzer may be configured using this signal processing device. That is, the signal processing device of the present invention can be used as the signal processing means for each band.
- the signal analyzer performs processing by setting a predetermined coefficient to a different value between a band belonging to a low frequency and a band belonging to a high frequency. According to this configuration, when the energy distribution of the signal differs depending on the frequency, such as in the noise region and the voice region, normalization can be performed accurately.
- the above signal analyzer processes band energy for each band obtained from the input signal at each time as logarithmic energy. Performing logarithmic normalization can eliminate the effects of distortion due to line characteristics.
- the speech recognition device of the present invention comprises the above signal analyzer, parameter conversion means for obtaining acoustic parameters from the band energy normalized for each band by the signal analyzer, and voice recognition means for recognizing the voice contained in the input signal using the obtained acoustic parameters.
- the speech recognition apparatus normalizes each band energy in the process of extracting MFCC (Mel Frequency Cepstrum Coefficients) parameters.
- since the conversion from normalized band energy to cepstrum parameters is linear, normalized cepstrum coefficients can be obtained directly.
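The linearity claim above is easy to verify: subtracting a per-band level before the cosine transform gives the same result as transforming first and subtracting the transformed level afterwards. The dimensions and data below are illustrative.

```python
import numpy as np

def dct_matrix(n_ceps, n_bands):
    """DCT-II basis mapping log band energies to cepstrum coefficients."""
    k = np.arange(n_ceps)[:, None]
    n = (np.arange(n_bands) + 0.5)[None, :]
    return np.cos(np.pi * k * n / n_bands)

n_bands, n_ceps = 24, 13
D = dct_matrix(n_ceps, n_bands)
rng = np.random.default_rng(3)
log_energy = rng.normal(size=n_bands)
level = rng.normal(size=n_bands)   # per-band level to subtract

normalized_then_dct = D @ (log_energy - level)
dct_then_normalized = D @ log_energy - D @ level
print(np.allclose(normalized_then_dct, dct_then_normalized))  # True
```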
- a signal analysis program of the present invention causes a computer to execute: a frequency band dividing step of dividing an input signal into signals of a plurality of frequency bands; a band energy extracting step of extracting band energy for each band from the input signal divided into each band; and a normalizing step of normalizing the extracted band energy for each band to obtain normalized band energy for each band.
- a signal analysis program of the present invention causes a computer to sequentially perform signal normalization, and includes: a level calculating step of updating and storing a first level based on the input signal; a normalizing step of subtracting from the input signal a value obtained by multiplying the first level by a predetermined coefficient; and an update coefficient obtaining step of obtaining an update coefficient based on the difference between the first level and the input signal.
- the level calculating step makes the first level a conditional average value of the input signal by using the update coefficient to bring the first level closer to the input signal.
- a voice recognition program of the present invention for causing a computer to perform voice recognition includes: a frequency band dividing step of dividing an input signal into signals of a plurality of frequency bands; a band energy extracting step of extracting band energy for each band from the input signal divided into each band; a normalizing step of normalizing the extracted band energy for each band to obtain normalized band energy for each band; a parameter conversion step of obtaining acoustic parameters from the band energy normalized for each band; and a voice recognition step of recognizing the voice contained in the input signal using the obtained acoustic parameters.
- the invention according to yet another aspect of the present invention relates to a recording medium recording a signal analysis program to be executed by a computer.
- the signal analysis program includes: a frequency band dividing step of dividing an input signal into signals of a plurality of frequency bands; and a band energy extracting step of extracting band energy for each band with respect to the input signal divided into each band. And normalizing the extracted band energy for each band to obtain a normalized band energy for each band.
- the invention according to yet another aspect of the present invention relates to a computer-readable recording medium recording a signal analysis program for causing a computer to sequentially perform signal normalization.
- the signal analysis program includes a level calculating step of updating and storing a first level based on an input signal, a normalizing step of subtracting from the input signal a value obtained by multiplying the first level by a predetermined coefficient, and an update coefficient obtaining step of obtaining an update coefficient based on the difference between the first level and the input signal.
- the level calculating step makes the first level a conditional average value of the input signal by using the update coefficient to bring the first level closer to the input signal.
- yet another aspect of the present invention relates to a computer-readable recording medium storing a speech recognition program for causing a computer to execute speech recognition.
- the speech recognition program includes a frequency band dividing step of dividing an input signal into signals of a plurality of frequency bands, a band energy extracting step of extracting band energy for each band from the input signal divided into each band, a normalizing step of normalizing the extracted band energy for each band, a parameter conversion step of obtaining acoustic parameters from the normalized band energy, and a voice recognition step of recognizing the voice contained in the input signal using the obtained acoustic parameters.
- the invention according to yet another aspect of the present invention relates to an electronic device provided with a speech recognition device.
- the speech recognition device includes the signal analysis device, parameter conversion means for obtaining acoustic parameters from the band energy normalized for each band by the signal analysis device, and voice recognition means for recognizing the voice contained in the input signal using the obtained acoustic parameters.
- the signal analysis device includes a frequency band dividing unit that divides an input signal into signals in a plurality of frequency bands, and a band energy extraction unit that extracts band energy for each band from the input signal divided into each band.
- normalizing means for normalizing the extracted band energy for each band and obtaining a normalized band energy for each band.
- a function of the electronic device is selected and executed based on the result of recognition, by the voice recognition device, of the voice signal contained in the input signal.
- the electronic device of the present invention is thus less susceptible to distortion due to noise or line characteristics, and is well suited for use as a voice-operated remote controller in the home or in electronic devices such as mobile phones.
- the signal analyzer of the present invention has the following effects.
- the signal analyzer uses the band energy obtained for each frequency band from the input signal at each time, calculates a conditional band energy level separately for each band, and normalizes each band energy. In other words, even within a voice utterance section, a band in which noise energy is dominant is processed as noise, and only bands in which voice energy is dominant are processed as voice. As a result, the line characteristics of the input signal can be normalized more accurately.
- when the noise level and the audio level are obtained separately for each band and the per-band noise level or audio level is normalized, similar effects can be obtained.
- in the signal analyzer of the present invention, speech and noise are determined from the extracted band energy sequence; even within an utterance section, some bands are determined to be noise. Therefore, if the utterance consists of phonemes with different spectral shapes, the estimation of the noise level over almost the entire frequency band is completed within the voice utterance section. That is, the signal analysis device of the present invention can estimate the noise level even when there is no noise section.
- the signal analyzer of the present invention is particularly preferably used for a portable device that is driven by a battery.
- in such devices, the input signal is often analyzed only while the user is speaking in order to reduce battery consumption. Even in such a usage mode, in which no noise section exists, the noise spectrum can be estimated equivalently, so not only distortion of line characteristics but also distortion due to noise can be normalized.
- FIG. 1 is a flowchart showing a procedure for obtaining an MFCC.
- FIG. 2 is a block diagram showing a configuration of a conventional signal analyzer for performing MFCC analysis.
- FIG. 3 is a block diagram showing a configuration of a signal analyzer that performs analysis by a conventional E-CMN method using the MFCC method.
- FIG. 4 is a diagram showing a configuration of a signal analyzer of the present invention.
- FIG. 5 is a flowchart showing a flow of a signal analysis process according to the present invention.
- FIG. 6 is a diagram showing a correspondence relationship between input band energy by signal analysis processing according to the present invention, an update coefficient, a normalization coefficient, and a normalized band energy.
- FIG. 7 is a diagram showing a configuration of a signal analysis unit of the present invention using a low-frequency cutoff filter as a normalization unit.
- FIG. 8 is a diagram showing an example of an acoustic signal including a speech waveform.
- FIG. 9 is a simplified diagram of a spectrogram of an acoustic signal including the speech waveform shown in FIG.
- FIG. 10 is a diagram showing a range in which a normalization process is performed when the spectrum shown in FIG. 9 is normalized by using a conventional E-CMN method.
- FIG. 11 is a diagram showing a range in which a normalization process is performed when the scale shown in FIG. 9 is normalized using the signal analyzer of the present invention.
- FIG. 12 is a diagram showing how noise spectrum adaptation by the E-CMN method proceeds when an acoustic signal including the speech waveform shown in FIG. 8 is input.
- FIG. 13 is a diagram showing a situation where adaptation of a noise spectrum by the signal analyzer of the present invention proceeds when an acoustic signal including the speech waveform shown in FIG. 8 is input.
- FIG. 14 is a block diagram showing an example of a speech recognition system using the speech recognition device of the present invention.
- FIG. 15 is a diagram showing a configuration of a speech recognition device of the present invention and an electronic device including the speech recognition device of the present invention.
- FIG. 2 is a block diagram showing a configuration of a conventional signal analyzer for performing MFCC analysis.
- reference numeral 101 indicates frequency analysis means
- reference numeral 102 indicates parameter conversion means.
- the frequency analysis means 101 performs the processing of the frequency analysis step (step S201) of FIG. 1, and the parameter conversion means 102 performs the processing of the parameter conversion step (step S202) of FIG.
- FIG. 3 is a block diagram showing a configuration of a signal analyzer that performs analysis by the conventional E-CMN method using the MFCC method.
- To the configuration of FIG. 2, a voice section detection means 203 for detecting a voice section from an input signal, an average updating means 201, and a subtraction processing means 202 are added.
- The input voice is processed by the frequency analysis means 101 and the parameter calculation means 102, so that the MFCC can be obtained.
- the input voice is processed by the voice section detection means 203, and the voice section is detected.
- The average updating means 201 updates the average cepstrum obtained from the parameter calculation means 102 using the voice section information obtained by the voice section detection means 203. Specifically, the average cepstrum of the voice is updated in the voice section, and the average cepstrum of the noise is updated in the non-voice section.
- Using the voice section information obtained by the voice section detection means 203, the subtraction processing means 202 subtracts the average cepstrum of the voice from the current cepstrum output from the parameter calculation means 102 if the section is a voice section, and subtracts the average cepstrum of the noise from the current cepstrum if the section is a non-voice section.
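As an illustrative sketch only (the patent text specifies no code), the processing of the average updating means 201 and the subtraction processing means 202 in the conventional E-CMN scheme might look as follows. The class name, the exponential-averaging constant `ALPHA`, and the exact update rule are assumptions for illustration, not part of the source.

```python
import numpy as np

ALPHA = 0.98  # exponential-averaging constant (illustrative value)

class ECMNNormalizer:
    """Sketch of E-CMN-style normalization: keep separate running mean
    cepstra for voice and non-voice sections, and subtract the matching
    mean from each frame's cepstrum."""

    def __init__(self, dim):
        self.mean_voice = np.zeros(dim)  # average cepstrum of voice
        self.mean_noise = np.zeros(dim)  # average cepstrum of noise

    def process(self, cepstrum, is_voice):
        c = np.asarray(cepstrum, dtype=float)
        if is_voice:
            # update the voice mean in voice sections, then subtract it
            self.mean_voice = ALPHA * self.mean_voice + (1.0 - ALPHA) * c
            return c - self.mean_voice
        # update the noise mean in non-voice sections, then subtract it
        self.mean_noise = ALPHA * self.mean_noise + (1.0 - ALPHA) * c
        return c - self.mean_noise
```

Note that, as the surrounding text points out, this whole-frame voice/non-voice switch is exactly what the invention replaces with a per-band decision.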
- Voice sections are generally detected using the short-time signal power of each frame or the outline of the spectrum of each frame.
- Standard methods used for voice calls, such as those in mobile phones, can be used.
- FIG. 4 is a diagram showing a configuration of the signal analyzer of the present invention.
- Between the frequency analysis means 101 and the parameter calculation means 102 of the signal analyzer used for the MFCC analysis of FIG. 2, update coefficient acquisition means 301, level calculation means 302, normalization means 303, and normalization coefficient acquisition means 304 are provided for each band.
- The frequency analysis means 101 of the present invention includes frequency band dividing means 305 for dividing an input signal into signals of a plurality of frequency bands, and band energy extracting means 306 for extracting the band energy of each band from the divided input signal.
- The update coefficient acquisition means 301 compares the noise level or the voice level calculated before the time at which the signal is input with the band energy of each band obtained by the frequency analysis means 101, and obtains an update coefficient used by the level calculation means 302 to update the level.
- Using the update coefficient obtained by the update coefficient acquisition means 301, the level calculation means 302 updates and stores the noise level or the voice level based on the difference between the input energy and that level. A specific method will be described later.
- The normalization coefficient acquisition means 304 compares the noise level and the voice level calculated before the time at which the signal is input with the current band energy of each band obtained by the frequency analysis means 101, and calculates a normalization coefficient used by the normalization means 303.
- Using the normalization coefficient obtained by the normalization coefficient acquisition means 304 and the voice level or noise level obtained by the level calculation means 302, the normalization means 303 normalizes the current band energy obtained by the frequency analysis means 101 and outputs the result.
- Here, the update coefficient acquisition means 301 and the normalization coefficient acquisition means 304 are shown as separate components. However, since both perform similar processing, the update coefficient acquisition means 301 and the normalization coefficient acquisition means 304 may share the same configuration.
- FIG. 5 is a flowchart showing a flow of a signal analysis process according to the present invention.
- FIG. 6 is a diagram showing the correspondence between the input band energy by the signal analysis processing according to the present invention, the update coefficient, the normalization coefficient, and the normalized band energy.
- the signal analysis processing of the present invention will be described in detail with reference to FIGS.
- update coefficients and normalization coefficients are applied.
- the signal analyzer of the present invention operates in principle as long as the input signal can be divided into two or more bands.
- The system operates whether the frequency axis is a Bark frequency axis or a linear frequency axis.
- The number of band divisions and the frequency scale in the frequency analysis should conform to the MFCC; an appropriate number of divisions is about 10 to 30.
- For example, for voice sampled at 11 kHz, it is effective to divide the signal into 24 bands on the mel frequency axis and convert it into a 12-dimensional cepstrum.
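To make this analysis setup concrete, the following sketch (not part of the patent text) computes 24 mel-spaced log band energies for one 11 kHz frame and converts them to a 12-dimensional cepstrum with a DCT. The function names, triangular-filter construction, and FFT size are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # common mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_energies(frame, sample_rate=11000, n_bands=24, n_fft=256):
    """Log band energies (dB) of one frame on a mel-spaced triangular filterbank."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sample_rate)
    # n_bands triangular filters need n_bands + 2 edge frequencies
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2))
    energies = np.empty(n_bands)
    for b in range(n_bands):
        lo, ctr, hi = edges[b], edges[b + 1], edges[b + 2]
        up = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)    # rising slope
        down = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)  # falling slope
        weights = np.minimum(up, down)
        energies[b] = 10.0 * np.log10(np.sum(weights * spectrum) + 1e-10)
    return energies

def band_energies_to_cepstrum(energies, n_ceps=12):
    """DCT-II of the log band energies -> cepstral coefficients (MFCC-style)."""
    n = len(energies)
    k = np.arange(n)
    return np.array([np.sum(energies * np.cos(np.pi * q * (k + 0.5) / n))
                     for q in range(1, n_ceps + 1)])
```

In the invention, the per-band level update and normalization described below are inserted between the band-energy computation and the cepstrum conversion.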
- The processing performed in steps S203 to S205 is performed independently for each band.
- In the coefficient acquisition step (step S203), the update coefficient and the normalization coefficient are obtained based on the difference between the band energy obtained in the frequency analysis step (step S201) and the noise level of each band obtained before the input time of the input signal.
- the noise level of the band is updated based on the update coefficient obtained for each band (step S204).
- Then, using the updated level, the band energy of that band is normalized (step S205).
- the normalized band energy is converted into cepstrum coefficients (step S202).
- the normalized band energy is converted into cepstrum coefficients generally used for speech recognition.
- However, a configuration may be employed in which the normalized band energy is output as it is, without conversion into cepstrum coefficients.
- FIG. 6 (c) is a diagram showing the relationship between the input band energy and the update coefficient.
- Let the noise level at time t be N(t), the input band energy be E(t), and the update coefficient be a(t).
- Then the noise level N(t) can be updated, for example, as follows. The units of the noise level and the input band energy are decibels (dB).
- N(t) = (1 − a(t)) · N(t−1) + a(t) · E(t) ... Equation (1)
- A indicates the maximum update coefficient, and is a value of 0 or more and 1 or less.
- A is, for example, a value of about 0.02.
- R indicates the boundary range between noise energy and voice energy, and is, for example, about 2 dB.
- This update coefficient can be used to update the noise level because it yields an average value when particularly low energy is distributed in the time series of the input band energy.
- The ratio a(t)/A can be considered an index indicating how noise-like the input is.
- When the noise level is updated using the above expression, it follows downward changes of the noise relatively quickly, while the following speed gradually decreases for upward changes. If the noise increases by more than R (dB) within one frame, the level does not follow at all.
- The speed at which changes in the noise are tracked can be controlled by the parameters A and R. If A is set to 0.02, upward noise changes slower than about 0.5 Hz are followed.
- The rate of change of voice has many components ranging from several Hz to several tens of Hz, because several to several tens of phonemes are uttered per second.
- background noise is often slower than that.
- If the noise energy increases abruptly due to sudden noise, it cannot be followed thereafter. For this reason, by setting the lower limit of a(t) to a very small non-zero value, for example about 0.001, the level can be made to follow after several seconds.
- the tracking speed can be confirmed by inputting artificial data in which the energy change speed for each band is adjusted.
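The noise-level update of Equation (1) can be sketched as follows. Equation (1) itself is from the text; the piecewise shape of the update coefficient a(t) outside the equation is reconstructed from the surrounding description (maximum A, boundary range R, lower limit of about 0.001) and is an assumption.

```python
A = 0.02        # maximum update coefficient (between 0 and 1)
R = 2.0         # boundary range between noise and voice energy, in dB
A_FLOOR = 0.001 # lower limit of a(t), so sudden noise is eventually followed

def update_coefficient(E, N_prev):
    """Update coefficient a(t): large when the input looks like noise,
    small when it looks like voice. The piecewise form is an assumption
    reconstructed from the description, not an equation from the source."""
    if E <= N_prev:
        return A                  # at or below the noise level: full update
    if E >= N_prev + R:
        return A_FLOOR            # well above the noise level: almost none
    # linear taper inside the boundary range
    return max(A * (1.0 - (E - N_prev) / R), A_FLOOR)

def update_noise_level(E, N_prev):
    """Equation (1): N(t) = (1 - a(t)) * N(t-1) + a(t) * E(t), in dB."""
    a = update_coefficient(E, N_prev)
    return (1.0 - a) * N_prev + a * E
```

As the text describes, this follows downward changes quickly (a(t) = A) and upward changes only very slowly (a(t) near the floor).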
- The voice level S(t) is updated, for example, using the following update coefficient b(t).
- b(t) = C (when N(t−1) + R ≤ E(t))
- b(t) = C · (E(t) − N(t−1)) / R (when N(t−1) < E(t) < N(t−1) + R) ... Equation (2)
- C indicates the maximum update coefficient, and is a value of 1 or less.
- C is, for example, a value of about 0.02 similarly to A above.
- R indicates the boundary range between noise energy and voice energy, and may be the same value as, or a different value from, the boundary range used at the noise level. With this update coefficient, an average value can be obtained when particularly high energy is distributed in the time series of the input band energy, so that the voice level can be updated.
- Alternatively, a fixed value may be used without updating. In this case, it is effective to calculate an average voice level from a large amount of voice data and use it.
- As shown in Equations 1 and 2, the detection of both sections can take intermediate values rather than binary ones (FIG. 6(c)).
- the determination between the noise section and the voice section is performed for each frequency band. For this reason, the noise section and the speech section determined in each band are different from the noise section and the speech section determined in other bands. Further, the noise section and the speech section determined in each band are different from the actual speech section of the speaker.
- the update coefficient does not need to be common to all frequency bands. By holding a different update coefficient for each band in advance, an optimal update coefficient can be applied for each band.
- The minimum value of the input energy up to the input time may be used as the noise level, and the maximum value of the input energy up to the input time may be used as the voice level. This is based on the fact that noise has low energy and voice has high energy.
- The method of obtaining the noise level and the voice level is not limited to this example. Any method that can obtain a low value and a high value within the energy distribution range can be used to determine the noise level and the voice level.
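The simple min/max alternative just described can be sketched as follows; this is an illustration of that one sentence, not of the invention's full update scheme, and the function name is an assumption.

```python
import numpy as np

def min_max_levels(band_energy_sequence):
    """Running minimum and maximum of one band's energy time series (dB),
    used as simple estimates of the noise level and the voice level:
    noise is assumed to have low energy, voice high energy."""
    seq = np.asarray(band_energy_sequence, dtype=float)
    noise_level = np.minimum.accumulate(seq)  # lowest energy seen so far
    voice_level = np.maximum.accumulate(seq)  # highest energy seen so far
    return noise_level, voice_level
```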
- FIG. 6 (b) is a diagram illustrating the relationship between the input band energy and the normalization coefficient.
- Let the noise level at time t be N(t), the input band energy be E(t), and the normalization coefficient be β(t).
- The normalized band energy E′(t) is then obtained, for example, as follows. The units of the noise level and the energy are decibels (dB).
- β(t) = B · (1 − (E(t) − N(t−1)) / R) (when N(t−1) < E(t) < N(t−1) + R)
- β(t) = B (when E(t) ≤ N(t−1)) ... Equation (3)
- B indicates the maximum subtraction amount, and is a value of 1 or less.
- B is, for example, about 0.5.
- R indicates the boundary range between noise energy and voice energy, and is set to, for example, about 2 dB. R may be the same value as, or a different value from, the boundary range used for the update coefficient.
- Further, the band energy may be normalized using the voice level S(t) in addition to the noise level.
- The following describes a method for doing this.
- The normalization of the band energy using the voice level can be calculated, for example, by the following equation.
- γ(t) = D · (E(t) − N(t−1)) / R (when N(t−1) ≤ E(t) ≤ N(t−1) + R) ... Equation (4)
- D indicates the maximum subtraction amount, and is a value of 1 or less.
- D is a value of, for example, about 0.5, similarly to B above.
- R indicates the boundary range between noise energy and voice energy, and may be the same value as, or a different value from, the boundary range used at the noise level.
- Here, the normalization coefficient for the voice level was obtained using the difference between the input energy E(t) and the noise level N(t). With this method, the decrease in recognition accuracy due to the characteristics of the speaker and the line can be reduced.
- The normalization coefficient for the voice level can also be obtained using the difference between the input energy E(t) and the voice level S(t).
- To normalize the band energy, the method of subtracting from the input band energy the voice level or the noise level multiplied by a coefficient obtained by a predetermined calculation has been described.
- However, normalization is not limited to this method; the effect of normalization may also be obtained by a method such as dividing the input energy by the voice level.
- the normalization method may be changed as appropriate according to the dynamic range of the input or the magnitude of the environmental change.
- FIG. 6 (a) is a diagram showing the relationship between the normalized band energy normalized using Equations 3 and 4, and the input band energy.
- E′(t) = E(t) − β(t) · N(t−1) − γ(t) · S(t−1)
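The normalization described in this passage can be sketched as follows. The in-range branches of β(t) and γ(t) follow Equations (3) and (4); the branches outside the boundary range (β = 0 well above the noise level, γ = D there, and γ = 0 at or below it) are assumptions made so that the two coefficients cross over smoothly, and the combination E′(t) = E(t) − β(t)·N(t−1) − γ(t)·S(t−1) is a reconstruction of the garbled formula in the source.

```python
B = 0.5  # maximum subtraction amount for the noise level
D = 0.5  # maximum subtraction amount for the voice level
R = 2.0  # boundary range between noise and voice energy, in dB

def normalization_coefficients(E, N_prev):
    """Beta (noise side, Equation 3) and gamma (voice side, Equation 4).
    The out-of-range branches are assumptions, not from the source."""
    if E <= N_prev:
        return B, 0.0            # clearly noise: subtract the noise level
    if E >= N_prev + R:
        return 0.0, D            # clearly voice: subtract the voice level
    beta = B * (1.0 - (E - N_prev) / R)   # taper down inside the range
    gamma = D * (E - N_prev) / R          # taper up inside the range
    return beta, gamma

def normalize_band_energy(E, N_prev, S_prev):
    """E'(t) = E(t) - beta(t) * N(t-1) - gamma(t) * S(t-1), all in dB."""
    beta, gamma = normalization_coefficients(E, N_prev)
    return E - beta * N_prev - gamma * S_prev
```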
- FIG. 7 is a diagram showing a configuration of the signal analysis means of the present invention using the low-frequency cutoff filter 307 as the normalization means.
- The low-frequency cutoff filter is preferably a filter that cuts off frequencies below the rate of change of the spectrum due to voice, which is about 1 Hz to 10 Hz, that is, frequencies below about 1 Hz.
- Let t be the frame, the input to the low-frequency cutoff filter be x(t), and the output be x′(t).
- Using a low-frequency cutoff filter having different characteristics for each band makes the system better suited to the usage environment, and performance can be improved.
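A low-frequency cutoff filter applied to one band's energy sequence could be sketched as below. The first-order high-pass form, the 100 Hz frame rate, and the class name are illustrative assumptions; the patent does not specify the filter structure.

```python
import math

class LowFrequencyCutoffFilter:
    """First-order high-pass filter applied to the band-energy sequence of
    one band: blocks slowly varying components (line characteristics,
    stationary noise) and passes the faster spectral changes of speech."""

    def __init__(self, cutoff_hz=1.0, frame_rate_hz=100.0):
        # pole location giving roughly the requested cutoff frequency
        self.a = math.exp(-2.0 * math.pi * cutoff_hz / frame_rate_hz)
        self.prev_in = 0.0
        self.prev_out = 0.0

    def step(self, x):
        # y(t) = a * y(t-1) + x(t) - x(t-1): rejects DC, passes fast changes
        y = self.a * self.prev_out + x - self.prev_in
        self.prev_in = x
        self.prev_out = y
        return y
```

A constant input (a stationary level) decays toward zero at the output, which is the normalization effect the text attributes to filter 307.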
- FIG. 8 is a diagram showing an example of an acoustic signal including a speech waveform.
- the horizontal axis represents time
- the vertical axis represents amplitude.
- The section from time t1 to t2 represents the utterance section, and noise is included throughout the entire displayed time.
- FIG. 9 is a simplified diagram of a spectrogram of an acoustic signal including the speech waveform shown in FIG.
- the horizontal axis represents time
- the vertical axis represents frequency
- The interval from time t1 to t2 represents the utterance interval.
- Although the power of the actual speech spectrum takes continuous values, the region where the energy is relatively higher than that of the other parts is enclosed by a closed curve and shaded.
- FIG. 10 is a diagram showing a range in which the normalization process is performed when the spectrum shown in FIG. 9 is normalized using the conventional E-CMN method.
- the horizontal axis indicates time, and each section on the horizontal axis indicates an analysis frame.
- the vertical axis indicates frequency, and each segment on the vertical axis indicates a frequency band.
- The region where the energy is relatively higher than that of the other parts is enclosed by a closed curve.
- the shaded part is the applicable range as a voice section, and the other parts are the applicable range as a noise section.
- The speech cepstrum coefficient is updated in the section from time t1 to t2, which is determined to be the speech section, and the noise cepstrum coefficient is updated in the other sections.
- The cepstrum in each section is normalized using the updated cepstrum coefficients. Therefore, if noise is included in the voice section, the cepstrum coefficients are erroneously updated.
- FIG. 11 is a diagram showing a range in which a normalization process is performed when the spectrum shown in FIG. 9 is normalized using the signal analyzer of the present invention.
- the horizontal axis indicates time, and each section of the horizontal axis indicates an analysis frame.
- the vertical axis indicates frequency, and each segment on the vertical axis indicates a frequency band.
- the shaded part is the applicable range as a voice section, and the other part is the applicable range as a noise section.
- a speech section and a noise section are determined for each band.
- In bands and frames within the utterance section (t1–t2) having higher energy than the surrounding noise, the voice level of that band is updated, and the band energy is normalized using the updated voice level.
- Bands with low energy are determined to be noise sections even within the speech utterance section (t1–t2), so the noise level of those bands is updated, and the band energy is normalized using this updated noise level.
- Unlike the related art, a clear non-voice section need not exist. If the speech includes a plurality of types of phonemes, the noise level can be updated for all the bands included in the speech section.
- FIG. 12 is a diagram showing how noise spectrum adaptation by the E-CMN method proceeds when an acoustic signal including the speech waveform shown in FIG. 8 is input.
- FIG. 13 is a diagram showing a situation where the adaptation of the noise spectrum by the signal analysis device of the present invention proceeds when an acoustic signal including the audio waveform shown in FIG. 8 is input.
- shaded portions indicate bands and frames in which noise has been correctly estimated.
- With the signal analysis device of the present invention, errors in the estimation of the noise spectrum remain only in bands where speech is dominant.
- The noise spectrum estimation is completed at a time t3 earlier than the time t2 at which the utterance ends. Therefore, between times t3 and t2, the noise spectrum can be correctly normalized.
- Thus, using the signal analyzer of the present invention enables correct normalization faster than using the E-CMN method. Also, in a band in which noise estimation is completed, if the power of the band increases after the completion time, the speech power is estimated there, so that accurate speech recognition can be performed even in the middle of an utterance.
- As described above, the signal analysis device of the present invention can estimate the noise spectrum even while speech is being produced. As a result, even if the noise spectrum changes during speech, it is possible to adapt to the change and perform normalization, provided that the change progresses slowly. Therefore, applying the signal analysis device of the present invention to a speech recognition device yields a speech recognition device capable of more stable speech recognition.
- FIG. 14 is a block diagram showing an example of a speech recognition system using the speech recognition device of the present invention.
- the speech recognition system generally includes an acoustic model learning device 401 and a speech recognition device 402.
- The voice database 403 is used for learning the acoustic model. It is typically stored on a fixed disk of a personal computer or workstation.
- Reference numeral 404 is a signal analysis means using the signal analysis device of the present invention. In practice, it is used in the configuration shown in FIG. 4, or in that configuration with the addition of a part that calculates the amount of change in the acoustic parameters over time.
- Reference numeral 405 is an acoustic model learning unit.
- Using the language database 406, which records the utterance contents of the voice database, the output of the signal analysis means 404 is statistically analyzed to obtain statistics for each speech unit, such as each phoneme or each syllable.
- a hidden Markov model is generally used as a model.
- Reference numeral 407 denotes an acoustic model obtained by the acoustic model learning means 405.
- Reference numeral 408 is a language dictionary created separately.
- the language dictionary 408 includes a word dictionary in which words are represented by phoneme strings, and grammar data that specifies connection restrictions between words.
- the language dictionary 408 may be created manually, or the connection probability between words may be statistically obtained from sentences contained in the language database 406.
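The two ingredients of the language dictionary 408 just described can be sketched as follows; the example words, their phoneme strings, and the bigram estimator are hypothetical illustrations, not contents of the patent.

```python
from collections import Counter

# Hypothetical word dictionary: each word mapped to its phoneme string.
word_dictionary = {
    "on":  ["o", "n"],
    "off": ["o", "f", "u"],
    "tv":  ["t", "e", "r", "e", "b", "i"],
}

def bigram_probabilities(sentences):
    """Estimate word-connection probabilities P(w2 | w1) from corpus
    sentences, as one statistical alternative to hand-written grammar
    rules for restricting word connections."""
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in sentences:
        for w1, w2 in zip(sentence, sentence[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
    return {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}
```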
- Reference numeral 409 is a signal analysis unit that performs the same signal analysis as the signal analysis device 404.
- Reference numeral 410 denotes a likelihood calculating means, which calculates the likelihood of each voice unit with respect to the input signal at each time from each statistic of the acoustic model 407 and the acoustic parameters obtained by the signal analyzing means 409.
- Reference numeral 411 is a matching means that calculates the likelihood of each linguistic hypothesis from the obtained time series of the likelihoods of the speech units, and outputs candidates in descending order of likelihood.
- Depending on the speech recognition method, there may be implementations that do not clearly separate the likelihood calculation means and the matching means.
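As a minimal sketch of the per-frame computation performed by the likelihood calculation means 410 (not part of the patent text), the log likelihood of each speech unit under a single diagonal-covariance Gaussian could be computed as follows; the text notes that in practice a hidden Markov model, typically with mixture densities, is used.

```python
import numpy as np

def frame_log_likelihoods(frame_features, means, variances):
    """Log likelihood of one acoustic-parameter frame under a diagonal-
    covariance Gaussian per speech unit (one row of means/variances each).
    A simplified stand-in for the HMM state likelihoods in the text."""
    x = np.asarray(frame_features, dtype=float)
    diff = x - means                               # shape (n_units, dim)
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2.0 * np.pi * variances), axis=1)
```

The matching means then combines these per-frame scores over time with the dictionary and grammar constraints to rank linguistic hypotheses.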
- FIG. 15 is a diagram illustrating a configuration of a speech recognition device of the present invention and an electronic device including the speech recognition device of the present invention.
- Reference numeral 501 indicates a data and address bus in a digital device such as a personal computer. Each processing means is connected to this bus and performs its processing.
- Reference numeral 502 indicates a plurality of input means such as a button, a keyboard, and a microphone. The voice input is not limited to a microphone; the voice may be converted into an electric signal by another device and input via a communication line.
- Reference numeral 503 denotes a CPU that controls the device according to the instruction from the input means 502 and recognizes the input voice.
- Reference numeral 504 is a working memory for processing by the CPU and a program memory including a speech recognition program.
- Reference numeral 505 denotes an output device such as a display, a buzzer, a speaker, and a lamp.
- The result of speech recognition may be displayed as candidates, or the recognition result may be subjected to some processing and the processed result displayed. If the electronic device is a mobile phone, a wireless communication unit (not shown) is added to these processing blocks. For personal computers and portable information devices, communication means and external storage devices are added.
- Examples of selecting and executing a function based on the result recognized by the voice recognition device include switching a TV channel, playing or stopping a video device, and setting the temperature of an air conditioner.
- For an information terminal, examples include communication control, program execution control, and character input.
- The control programs of these devices, including the signal analysis program or the speech recognition program, are realized by an information processing program recorded on a program recording medium.
- The program recording medium in the above-described embodiment is a program medium including a ROM (read-only memory) provided separately from a RAM (random-access memory).
- Alternatively, a program medium that is mounted on an external auxiliary recording device and read out may be used.
- The program reading means for reading the information processing program from the program medium may have a configuration of directly accessing and reading the program medium, or a configuration in which the program is downloaded to a program storage area (not shown) provided in the RAM and the program storage area is then accessed and read. In the latter case, a download program for downloading from the program medium to the program storage area of the RAM is assumed to be stored in the main unit in advance.
- The program medium is configured to be separable from the main body, and may be a tape system such as a magnetic tape or a cassette tape, a magnetic disk system such as a floppy disk or a hard disk, an optical disk system such as a CD (Compact Disc), MO (Magneto-Optical disc), MD (MiniDisc), or DVD (Digital Versatile Disc), a card system such as an IC (Integrated Circuit) card, or a semiconductor memory system such as a mask ROM, an EPROM (ultraviolet-erasable ROM), an EEPROM (electrically erasable ROM), or a flash ROM.
- the speech recognition device or the electronic device in the above-described embodiment includes a modem and can be connected to a communication network including the Internet.
- The program medium may also be a medium that carries the program fluidly, such as by downloading it from a communication network.
- In this case, the download program for downloading from the communication network is assumed to be stored in the main unit in advance, or to be installed from another recording medium.
- An electronic device using the present invention is less susceptible to noise and to distortion due to line characteristics. As a result, it can be used as a voice-operated remote controller in the home or as an electronic device such as a mobile phone.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2004800241642A CN1839427B (zh) | 2003-08-22 | 2004-07-29 | 信号分析装置、信号处理装置、语音识别装置和电子设备 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003-299346 | 2003-08-22 | ||
JP2003299346A JP4301896B2 (ja) | 2003-08-22 | 2003-08-22 | 信号分析装置、音声認識装置、プログラム、記録媒体、並びに電子機器 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005020212A1 true WO2005020212A1 (ja) | 2005-03-03 |
Family
ID=34213754
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2004/010841 WO2005020212A1 (ja) | 2003-08-22 | 2004-07-29 | 信号分析装置、信号処理装置、音声認識装置、信号分析プログラム、信号処理プログラム、音声認識プログラム、記録媒体および電子機器 |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP4301896B2 (ja) |
CN (1) | CN1839427B (ja) |
WO (1) | WO2005020212A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110797008A (zh) * | 2018-07-16 | 2020-02-14 | 阿里巴巴集团控股有限公司 | 一种远场语音识别方法、语音识别模型训练方法和服务器 |
US10897534B1 (en) | 2019-09-13 | 2021-01-19 | International Business Machines Corporation | Optimization for a call that waits in queue |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5019414B2 (ja) * | 2006-02-09 | 2012-09-05 | 株式会社リコー | 定着装置及び画像形成装置 |
JP4869420B2 (ja) * | 2010-03-25 | 2012-02-08 | 株式会社東芝 | 音情報判定装置、及び音情報判定方法 |
JP5724361B2 (ja) * | 2010-12-17 | 2015-05-27 | 富士通株式会社 | 音声認識装置、音声認識方法および音声認識プログラム |
US9992745B2 (en) * | 2011-11-01 | 2018-06-05 | Qualcomm Incorporated | Extraction and analysis of buffered audio data using multiple codec rates each greater than a low-power processor rate |
IN2014CN04097A (ja) | 2011-12-07 | 2015-07-10 | Qualcomm Inc | |
JP6127422B2 (ja) | 2012-09-25 | 2017-05-17 | セイコーエプソン株式会社 | 音声認識装置及び方法、並びに、半導体集積回路装置 |
US10629184B2 (en) | 2014-12-22 | 2020-04-21 | Intel Corporation | Cepstral variance normalization for audio feature extraction |
CN104900237B (zh) * | 2015-04-24 | 2019-07-05 | 上海聚力传媒技术有限公司 | 一种用于对音频信息进行降噪处理的方法、装置和系统 |
US11763834B2 (en) * | 2017-07-19 | 2023-09-19 | Nippon Telegraph And Telephone Corporation | Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method |
CN108461081B (zh) * | 2018-03-21 | 2020-07-31 | 北京金山安全软件有限公司 | 语音控制的方法、装置、设备和存储介质 |
JP7421869B2 (ja) * | 2019-04-26 | 2024-01-25 | 株式会社スクウェア・エニックス | 情報処理プログラム、情報処理装置、情報処理方法及び学習済モデル生成方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH03230200A (ja) * | 1990-02-05 | 1991-10-14 | Sekisui Chem Co Ltd | 音声認識方法 |
JPH10133692A (ja) * | 1996-10-28 | 1998-05-22 | Hitachi Ltd | 録音装置及びカメラ一体型映像音声記録装置 |
JP2002014694A (ja) * | 2000-06-30 | 2002-01-18 | Toyota Central Res & Dev Lab Inc | 音声認識装置 |
JP2003195894A (ja) * | 2001-12-27 | 2003-07-09 | Mitsubishi Electric Corp | 符号化装置、復号化装置、符号化方法、及び復号化方法 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3230200B2 (ja) * | 1995-06-26 | 2001-11-19 | 農林水産省蚕糸・昆虫農業技術研究所長 | 改質蛋白質繊維又はその繊維製品の製造法 |
JP3574123B2 (ja) * | 2001-03-28 | 2004-10-06 | 三菱電機株式会社 | 雑音抑圧装置 |
-
2003
- 2003-08-22 JP JP2003299346A patent/JP4301896B2/ja not_active Expired - Fee Related
-
2004
- 2004-07-29 WO PCT/JP2004/010841 patent/WO2005020212A1/ja active Application Filing
- 2004-07-29 CN CN2004800241642A patent/CN1839427B/zh not_active Expired - Fee Related
Non-Patent Citations (2)
Title |
---|
AKABANE, T. ET AL.: "Filter Bank Shutsuryoku no Seiki o Mochiita Zatsuon ni Ganken na Onsei Ninshiki", THE ACOUSTICAL SOCIETY OF JAPAN (ASJ) 2004 NEN SHUNKI KENKYU HAPPYOKAI KOEN RONBUNSHU-I., 17 March 2004 (2004-03-17), pages 119 - 120 * |
SHOKYO, M. ET AL.: "Onsei Kyocho Shuho E-CMN/CSS no Jidosha Kankyonai deno Onsei Ninshiki ni Okeru Hyoka", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS RONBUNSHI D-11, vol. J-81-D-II, no. 1, 25 January 1998 (1998-01-25), pages 1 - 9 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110797008A (zh) * | 2018-07-16 | 2020-02-14 | 阿里巴巴集团控股有限公司 | 一种远场语音识别方法、语音识别模型训练方法和服务器 |
CN110797008B (zh) * | 2018-07-16 | 2024-03-29 | 阿里巴巴集团控股有限公司 | 一种远场语音识别方法、语音识别模型训练方法和服务器 |
US10897534B1 (en) | 2019-09-13 | 2021-01-19 | International Business Machines Corporation | Optimization for a call that waits in queue |
WO2021047209A1 (en) * | 2019-09-13 | 2021-03-18 | International Business Machines Corporation | Optimization for a call that waits in queue |
GB2600847A (en) * | 2019-09-13 | 2022-05-11 | Ibm | Optimization for a call that waits in queue |
GB2600847B (en) * | 2019-09-13 | 2022-12-07 | Ibm | Optimization for a call that waits in queue |
Also Published As
Publication number | Publication date |
---|---|
JP2005070367A (ja) | 2005-03-17 |
CN1839427B (zh) | 2010-04-28 |
CN1839427A (zh) | 2006-09-27 |
JP4301896B2 (ja) | 2009-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hilger et al. | Quantile based histogram equalization for noise robust large vocabulary speech recognition | |
US7133826B2 (en) | Method and apparatus using spectral addition for speaker recognition | |
EP1355296B1 (en) | Keyword detection in a speech signal | |
CN112951259B (zh) | Audio noise reduction method, apparatus, electronic device, and computer-readable storage medium | |
US8473282B2 (en) | Sound processing device and program | |
JP2000132177A (ja) | Speech processing apparatus and method | |
IL125649A (en) | Method and device for detecting signal of a sound sampled from noise | |
JP3451146B2 (ja) | Noise removal system and method using spectral subtraction | |
US6182036B1 (en) | Method of extracting features in a voice recognition system | |
JP4301896B2 (ja) | Signal analysis device, speech recognition device, program, recording medium, and electronic apparatus | |
CN110268471B (zh) | Method and device for ASR with embedded noise reduction | |
JP2000132181A (ja) | Speech processing apparatus and method | |
US10446173B2 (en) | Apparatus, method for detecting speech production interval, and non-transitory computer-readable storage medium for storing speech production interval detection computer program | |
US7236930B2 (en) | Method to extend operating range of joint additive and convolutive compensating algorithms | |
JP2000122688A (ja) | Speech processing apparatus and method | |
Motlíček | Feature extraction in speech coding and recognition | |
KR102051966B1 (ko) | Apparatus and method for improving speech recognition | |
KR20070061216A (ko) | Sound quality enhancement system using GMM | |
JP2003271190A (ja) | Noise removal method, noise removal device, and speech recognition device using the same | |
Oonishi et al. | A noise-robust speech recognition approach incorporating normalized speech/non-speech likelihood into hypothesis scores | |
Gouda et al. | Robust Automatic Speech Recognition system based on using adaptive time-frequency masking | |
Seyedin et al. | A new subband-weighted MVDR-based front-end for robust speech recognition | |
CN118379986B (zh) | Keyword-based non-standard speech recognition method, apparatus, device, and medium | |
Fan et al. | Power-normalized PLP (PNPLP) feature for robust speech recognition | |
Dutta et al. | A comparative study on feature dependency of the Manipuri language based phonetic engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 200480024164.2 Country of ref document: CN |
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DPEN | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101) | ||
122 | Ep: pct application non-entry in european phase |