US9728182B2 - Method and system for generating advanced feature discrimination vectors for use in speech recognition - Google Patents
- Publication number
- US9728182B2 (application US14/217,198)
- Authority
- US
- United States
- Prior art keywords
- oscillator peaks
- oscillator
- peaks
- sample
- extracted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Definitions
- This application relates generally to speech recognition systems, and more particularly to generating feature vectors for application in speech recognition that are less susceptible to the variations in speech characteristics between individual speakers.
- Speech recognition can be generally defined as the ability of a computer or machine to identify and respond to the sounds produced in human speech. Speech recognition processes are often referred to generally as “automatic speech recognition” (“ASR”), “computer speech recognition”, and/or “speech to text.” Voice recognition is a related process that generally refers to finding the identity of the person who is speaking, in contrast to determining what the speaker is saying.
- Speech recognition systems can be broadly categorized as isolated-word recognition systems and continuous speech recognition systems.
- Isolated-word recognition systems handle speech with short pauses between spoken words, typically involve a restricted vocabulary that they must recognize, and are often employed in command/control type applications.
- Continuous speech recognition systems involve the recognition and transcription of naturally spoken speech (often performed in real time), and thus require a more universal vocabulary and the ability to discriminate words that can often run together when spoken naturally with the words that are spoken immediately before and after.
- Examples of isolated-word recognition systems include machines deployed in call centers that initiate and receive calls and navigate humans through menu options to avoid or minimize human interaction.
- Cell phones employ such systems to perform functions such as name-dialing, answering calls, Internet navigation, and other simple menu options.
- Voice-control of menu options also finds application in, for example, computers, televisions and vehicles.
- Continuous speech recognition systems are typically employed in applications such as voice to text, speaker recognition and natural language translation.
- A typical speech recognition system consists of: a) a front-end section for extracting a set of spectral-temporal speech features from a temporal sample of the time-domain speech signal from which speech is to be recognized; b) an intermediate section that consists of statistical acoustic speech models that represent a distribution of the speech features that occur for each of a set of speech sounds when uttered. These speech sounds are referred to as phonemes, which can be defined as the smallest unit of speech that can be used to make one word different from another. Such models can also be used to represent sub-phonemes; and c) a speech decoder that uses various language rules and word models by which to determine from the combination of detected sub-phonemes and phonemes what words are being spoken. Often the prediction can be enhanced by considering the typical order in which various words are used in the language in which the speech is uttered.
- The intermediate and decoder sections are often lumped together and referred to as a speech recognition engine.
- The most basic task in speech recognition is also arguably the most difficult one.
- Variables that contribute to the variations in speech from one speaker to another include, for example, the time duration of the spoken word. Not only does this vary from person to person, it even varies for the same person each time the same word is spoken. To make things more complicated, the variation in the duration of a word is not even uniform over the various sounds (i.e. phonemes and sub-phonemes) that form the word.
- Speaker variability also lies in the fact that the content of one's speech is highly dependent upon a person's anatomical proportions and functionality. As is well known in the art, there are numerous resonances in the human body that contribute to the human voice, and these resonances are directly related to the speaker's anatomy. Gender is a very obvious manifestation of these factors, as the fundamental frequency of speech uttered by men is typically much lower overall when compared to the fundamental frequency of speech uttered by women. In addition, the emotional state and overall health of a speaker will also cause variations on top of the anatomical ones.
- Speakers also develop accents, which can have a major effect on speech characteristics and on speech recognition performance. These accents range from national to regional accents and can include very different pronunciations of certain words. Because of the mobility of the general population, these accents are often melded together.
- A similar issue refers generally to a situation in which a conceptually isolated speech sound is influenced by, and becomes more like, a preceding or following speech sound.
- Another technique is to use individualized training, where the statistical distributions of the models are tailored (through a learning process) to a particular user's voice characteristics to aid in recognizing what that person is saying.
- Such systems are referred to as speaker dependent systems.
- Of course, it is far more desirable to render systems that are speaker independent, which require more generalized statistical models of speech that do not depend on or otherwise employ individualized training for a particular speaker (referred to as “speaker independent” systems).
- Many developers of such speaker independent systems gather vast amounts of speech from as many speakers as possible to create a massive corpus with the goal of creating models that are intended to statistically represent distributions of these many variables over virtually entire populations for all possible sounds.
- One of the downsides of this approach is clearly the vast amount of data that must be gathered and maintained. Another is the question of whether such models that have become so generalized as to represent every speaker in a given population can lose their ability to even distinguish speech.
- A general methodology commonly employed by known speech recognition systems as discussed above can be illustrated by the simple and high-level representation of a known speech recognition system 100 as is illustrated in FIG. 1.
- Speech is captured with a transducer (e.g. a microphone) at block 104 in the form of a time domain analog audio signal 101, and is partitioned for analysis using a continuous series of overlapping windows of short time duration (e.g. they are each advanced in time by less than the duration of each window).
- The portion of the audio signal 101 falling within each window is sampled using an analog to digital converter (ADC) that samples the analog signal at a predetermined sampling rate over each window, and therefore converts the analog time domain signal into a digital time domain audio signal.
- The digital audio signal is then converted, on a frame by frame basis, into a frequency domain representation of the portion of the time domain signal that falls within each window using any of a number of transforms such as the Fast Fourier Transform (FFT), the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) or possibly other related transforms.
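The windowing and transform steps described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system: the function name `frame_spectra`, the parameters `window_len` and `hop_len`, and the choice of a Hann taper are assumptions introduced here, and the signal is assumed to be already digitized.

```python
import numpy as np

def frame_spectra(signal, window_len, hop_len):
    """Split a 1-D digital signal into overlapping windows and return one
    frequency-domain frame (real FFT) per window."""
    if not 0 < hop_len < window_len:
        # Advancing by less than the window length is what makes windows overlap.
        raise ValueError("hop_len must be positive and smaller than window_len")
    if len(signal) < window_len:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - window_len) // hop_len
    taper = np.hanning(window_len)  # assumed taper; the text does not name one
    frames = np.stack([
        signal[i * hop_len : i * hop_len + window_len] * taper
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)
```

Each row of the returned array corresponds to one window, so later stages can process the signal frame by frame.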
- Speech recognition engine 110 can compare the feature vectors on a frame by frame basis to the statistical models that represent the typical distribution of such features for phonemes and sub-phonemes. Because of an overlap in the statistical distributions of the models, this comparison process typically leads to a statistical prediction of the likelihood that the feature vectors represent the spectral constituents of any one or more of the phonemes or sub-phonemes. Thus, there may be a number of possible matches for each feature vector, and each of those possible matches can be ranked using a probability score.
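The ranking of candidate matches by probability score can be illustrated with a small sketch. The model form is an assumption made only for illustration: each phoneme is represented here by a diagonal Gaussian (a mean and a variance per feature), which the text does not specify, and `rank_phonemes` is a hypothetical helper name.

```python
import numpy as np

def rank_phonemes(feature, models):
    """Rank phoneme labels for one feature vector.

    models maps a label to (mean, variance) arrays of a diagonal Gaussian;
    this model form is an illustrative assumption.
    """
    labels = list(models)
    log_liks = []
    for label in labels:
        mean, var = models[label]
        # Log-likelihood of the feature under a diagonal Gaussian.
        log_liks.append(
            -0.5 * np.sum(np.log(2 * np.pi * var) + (feature - mean) ** 2 / var)
        )
    log_liks = np.array(log_liks)
    # Convert to normalized scores, subtracting the maximum for stability.
    probs = np.exp(log_liks - log_liks.max())
    probs /= probs.sum()
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
```

Because the distributions overlap, more than one label usually receives a non-zero score, which is why the result is a ranked list rather than a single answer.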
- The probabilities and perhaps even groupings of the extracted feature vectors are fed to a back-end portion of the speech recognition engine 110 of the speech recognition system 100, where they are further processed to predict through statistical probabilities what words and phrases are being uttered over the course of several consecutive overlapping windows. From there, the engine 110 outputs its best guess of what the speech is, and that output 112 can be used for any purpose that suits the application. For example, the output 112 can be transcribed text, or control outputs based on recognized menu commands as discussed above.
- Cepstral coefficients are derived from an inverse discrete Fourier transform (IDFT) of the logarithm of the short-term power spectrum of a speech segment defined by a window. Put another way, cepstral coefficients encode the shape of the log-spectrum of the signal segment.
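The definition above (an inverse transform of the log power spectrum) can be written out as a short sketch. The function name `real_cepstrum`, the number of retained coefficients, and the small `eps` guard against a zero power value are assumptions added for illustration.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13, eps=1e-12):
    """Cepstral coefficients of one windowed frame: the inverse DFT of the
    logarithm of the short-term power spectrum."""
    power = np.abs(np.fft.fft(frame)) ** 2
    log_power = np.log(power + eps)  # eps avoids log(0); an added safeguard
    cepstrum = np.fft.ifft(log_power).real
    return cepstrum[:n_coeffs]
```

One property worth noting: scaling the frame by a constant gain adds a constant to the log power spectrum, which changes only the zeroth coefficient. The remaining coefficients describe the shape of the log-spectrum and are unaffected by volume.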
- A widely used form of cepstral coefficients is the Mel Frequency Cepstral Coefficients (MFCC). To obtain MFCC features, the spectral magnitudes of FFT frequency bins are averaged within frequency bands spaced according to the Mel scale, which is based on a model of human auditory perception. The scale is approximately linear up to about 1000 Hz and approximates the sensitivity of the human ear.
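The band-averaging step can be sketched as below. The conversion uses a commonly cited Mel formula, m = 2595 log10(1 + f/700); the text does not state which formula is intended, so this choice is an assumption. Plain rectangular bands are used here for simplicity, whereas practical MFCC front ends often use overlapping triangular filters. The names `hz_to_mel`, `mel_to_hz` and `mel_band_means` are illustrative.

```python
import numpy as np

def hz_to_mel(hz):
    # A commonly used Mel-scale approximation (assumed here).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_band_means(magnitudes, sample_rate, n_bands=20):
    """Average real-FFT magnitudes of one frame within bands that are
    equally wide on the Mel scale."""
    freqs = np.linspace(0.0, sample_rate / 2.0, len(magnitudes))
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 1))
    means = np.zeros(n_bands)
    for b in range(n_bands):
        lower = freqs >= edges[b]
        # The last band includes its upper edge so the Nyquist bin is counted.
        upper = freqs <= edges[b + 1] if b == n_bands - 1 else freqs < edges[b + 1]
        in_band = lower & upper
        if in_band.any():
            means[b] = magnitudes[in_band].mean()
    return means
```

With this formula 1000 Hz maps to roughly 1000 mel, and the bands become progressively wider in hertz at higher frequencies.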
- Cepstral coefficients are primarily concerned with capturing and encoding the power distribution of the speech signal over a range of frequencies.
- Statistical models must be used to account for the variability between speakers who are uttering the same sounds (e.g. words, phonemes, phrases or utterances).
- These variations in speaker characteristics make it very difficult to discriminate between speech phonemes uttered by different individuals based on spectral power alone, because those varying characteristics (such as the fundamental frequency of a speaker and the duration of that speaker's speech) are not directly reflected in the spectral power.
- One of the few variables that may be renormalized out (i.e. made constant for all speakers) for the MFCCs is the volume of the speech.
- Oscillator peaks are derived to represent the presence, for example, of short-term stable sinusoidal components in each frame of the audio signal. Recent innovations regarding the identification and analysis of such oscillator peaks have made them a more practical means by which to encode the spectral constituents of an audio signal of interest. For example, in the publication by Kevin M. Short and Ricardo A.
- The foregoing improvements permit the underlying signal elements to be represented as essentially delta functions with only a few parameters, and these parameters are determined at a super-resolution that is much finer than the transform resolution of a typical and previously known approach to such analysis. Consequently, one can, for example, look at frequencies of the oscillator peaks on a resolution that is on a fractional period basis, whereas the original transform analysis results in only integer period output.
- This improved resolution allows for the examination of single excitation periods of an audio signal as it would be produced by the vocal tract, and then one can examine how the effects of the vocal tract (or other environmental conditions) will alter the single excitation period over time.
- The present invention is a method and system for generating advanced feature discrimination vectors (AFDVs) from highly accurate features in the form of oscillator peaks, which can be renormalized in accordance with embodiments of the invention to facilitate a more direct comparison of the spectral structure of a measured speech signal with similarly encoded speech samples that have been correlated to known speech sounds (e.g. phonemes and sub-phonemes, sibilants, fricatives and plosives).
- The vectors may be used more effectively by transforming or renormalizing them to a comparison coordinate system that may be consistent for different speakers.
- The renormalized format permits effective comparison of a given speaker's utterances to speech that has been similarly encoded for a known corpus of other speakers, which allows for the accurate prediction of phonemes and sub-phoneme sounds that are present in the speech signal of interest, notwithstanding the wide variation in speaker characteristics.
- Various embodiments of the method of the invention are able to eliminate variations in the fundamental frequency of speakers, as well as the speed (i.e. duration) of their speech. This is accomplished by renormalizing the oscillator peaks with respect to fundamental frequency and the duration of the utterance such that the AFDVs of the invention no longer reflect those variations from one speaker to another. Once renormalized in accordance with embodiments of the method of the invention, the AFDVs can be compared without the need for models that must statistically account for wide variations in those variables, thereby rendering the comparison process more direct and increasing the accuracy and robustness of the speech recognition system so employing embodiments of the invention.
- Various embodiments of the invention can produce AFDVs of the invention for use in identifying voiced sounds in conjunction with known feature vectors such as MFCCs.
- Other embodiments can be extended to produce AFDVs for unvoiced and semi-voiced sounds as well.
- FIG. 1 illustrates a high-level block diagram of a known speech recognition system;
- FIG. 2 illustrates a high-level block diagram of a speech recognition system employing an embodiment of the invention;
- FIGS. 3A and 3B illustrate the periodic nature of the glottal pulse for voiced sounds of human speech;
- FIG. 4A illustrates one window of an input audio signal in both time and frequency domain;
- FIG. 4B is a close approximation of a single period of the audio signal of FIG. 4A;
- FIG. 4C illustrates a concatenation of the single period of FIG. 4B to produce a close approximation to the full sample of the audio signal of FIG. 4A;
- FIG. 5A illustrates an embodiment of spectral structure representing one “glottal pulse” period of the voiced sound from a sampled window of the audio signal of FIG. 4A, renormalized in accordance with the method of the invention;
- FIG. 5B is an embodiment of an n-slot comparator stack employed to form an alignment for the spectral structure illustrated in FIG. 5A in accordance with the invention;
- FIG. 6A illustrates an embodiment of a 3 harmonic spectral structure representing one “glottal pulse” period of the voiced sound from a sampled window of an audio signal, renormalized in accordance with the method of the invention;
- FIG. 6B is an embodiment of an n-slot comparator stack employed to form an alignment for the spectral structure illustrated in FIG. 6A in accordance with the invention;
- FIG. 7A illustrates an embodiment of a 2 harmonic spectral structure representing one “glottal pulse” period of the voiced sound from a sampled window of an audio signal, renormalized in accordance with the method of the invention;
- FIG. 7B is an embodiment of an n-slot comparator stack employed to form an alignment for the spectral structure illustrated in FIG. 7A in accordance with the invention;
- FIG. 8 illustrates a flow diagram of an embodiment of the renormalization method of the invention;
- FIG. 9 illustrates a flow diagram of the detailed renormalization steps that occur within the flow diagram of FIG. 8;
- FIG. 10 is an illustration of various frequency zones of a power spectrum that has been averaged over many utterances for a single speaker;
- FIG. 11 illustrates an embodiment of a speech recognition system that can employ the AFDVs of the invention for identifying voiced sounds in conjunction with more traditional feature vectors (e.g. MFCCs) used for identifying unvoiced and semi-voiced sounds;
- FIG. 12 illustrates an embodiment of a speech recognition system that employs the method of the invention to generate AFDVs of the invention for identifying voiced, unvoiced and semi-voiced sounds.
- A method of renormalizing high-resolution oscillator peaks, extracted from windowed samples of an audio signal, is disclosed that is able to generate feature vectors for which variations in both fundamental frequency and time duration of speech are eliminated.
- This renormalization process enables the feature vectors of the invention, referred to herein as advanced feature discrimination vectors (AFDVs), to be aligned within a common coordinate space, free of those variations in frequency and time duration that occur between speakers and even over speech by a single speaker, to facilitate a simple and accurate determination of matches between those AFDVs generated from a sample of the audio signal and AFDVs generated for known speech at the phoneme and sub-phoneme level.
- This renormalization method of the invention can be applied to harmonic groupings of oscillator peaks that are characteristic of voiced sounds, as well as to oscillator peaks that are non-harmonically related, characteristic of unvoiced sounds such as sibilants.
- The coordinate system for comparing the AFDVs of the invention can be subdivided, in accordance with predetermined zones of frequencies, to handle cases of semi-voiced sounds that register power similar to voiced components as well as unvoiced components.
- A technique for normalizing power while maintaining the ratio of power between the subdivisions is disclosed, to provide additional information by which to identify the semi-voiced phonemes and sub-phonemes.
- The term renormalization is used to distinguish between the type of normalization that, for example, reduces power to a value of one for purposes of scaling magnitude, and the creation of shifted and scaled versions of data in frequency and/or time, where the intention is that these renormalized values allow the comparison of corresponding renormalized values for different datasets from different speakers and different utterances in a way that eliminates the effects of certain gross influences, in this case frequency scale and time scale.
- FIG. 2 illustrates a block diagram of a speech recognition system 200 employing the method and system of the invention.
- A speech recognition system employing the system and method of the invention can employ a front-end section 204 that extracts features from the input audio signal 202, for each of a plurality of short time windows of the signal that overlap each other by some fixed fraction of their period.
- The feature data extracted by the front end 204 from each window of the input audio signal 202 are oscillator peaks 209.
- The detected audio signal 202 is then processed into uniform segments defined by an overlapping time-domain window.
- Each window is sampled at a predetermined sampling rate and converted to a digital representation of the analog signal by an analog to digital converter (ADC).
- The finite number of samples for each “short” window is that number that is appropriate to a given context/application and may include between several tens and several thousands of samples, depending on the desired sample rate.
- The digital signal is converted to a frequency domain representation thereof via a transform such as a Fast Fourier Transform (FFT), the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) or possibly other related transforms.
- The oscillator peaks can preferably (but not necessarily) be identified with high resolution using the Complex Spectral Phase Evolution (CSPE) methods.
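The general idea behind phase-evolution frequency estimation can be illustrated with a short sketch. It is a simplified reading of the CSPE principle as generally described in the literature, not the method of the referenced publication or of this patent: the spectrum of a window is compared with the spectrum of the same window advanced by one sample, and the phase advance in each bin gives a frequency estimate that is not restricted to integer bin positions. The function name and the Hann taper are assumptions.

```python
import numpy as np

def cspe_frequency_bins(signal, n_fft):
    """Estimate, for every FFT bin, the frequency (in fractional bins) of the
    component dominating that bin, from the phase advance over one sample.

    signal must contain at least n_fft + 1 samples.
    """
    if len(signal) < n_fft + 1:
        raise ValueError("need n_fft + 1 samples")
    taper = np.hanning(n_fft)  # assumed taper, used to limit spectral leakage
    spec_a = np.fft.fft(signal[:n_fft] * taper)
    spec_b = np.fft.fft(signal[1 : n_fft + 1] * taper)
    # A sinusoid at q bins rotates by 2*pi*q/n_fft radians per sample.
    phase_advance = np.angle(np.conj(spec_a) * spec_b)
    return n_fft * phase_advance / (2.0 * np.pi)
```

For a tone lying between two bin centers, reading the estimate at the bin with the largest magnitude recovers the fractional position, which is the kind of finer-than-transform resolution the text refers to.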
- The feature data are then renormalized in accordance with the method of the invention as will be described in more detail, and assembled into a frame of vectors for each window, and provided to a speech recognition engine 214 for use in recognizing speech embedded within the acoustic signal.
- The speech recognition engine is able to use the extracted feature vectors to predict what sounds, words and phrases are uttered and converts those predictions into extracted speech output 216, which can be, as previously described above, in various forms as required by a specific application.
- Tracking techniques are disclosed in the above-referenced application that can be used when an audio signal contains sounds from multiple sources, to identify the oscillator peaks with each source.
- Speech from one speaker can be isolated from environmental noise and other speakers to make speech recognition of a particular speaker of interest much more robust.
- While techniques in utilizing the tracking of oscillator peaks to preferentially extract a set of oscillator peaks associated with a given source are not required to practice the present invention in generating AFDVs, they can be invaluable in improving the value of those AFDVs in applications such as automated speech recognition.
- The present method of the invention at 210 is able to renormalize the oscillator peak representations of those short-term stabilized oscillators that are determined to be harmonically related to one another with regard to both frequency and time duration.
- The method of the invention is able to generate feature vectors from harmonically related oscillator peaks extracted from the audio signal for each window, which can be compared to speech of any other speaker in a comparison space that is completely independent of any variations in fundamental frequency and time duration between speakers.
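One way to picture frequency renormalization is to express each harmonically related peak frequency as a multiple of the fundamental. The sketch below is only an illustration of that idea under the assumption that the fundamental frequency is already known; the function name is hypothetical, and the detailed steps of the method are not reproduced here.

```python
import numpy as np

def renormalize_frequencies(peak_freqs_hz, f0_hz):
    """Express oscillator peak frequencies as multiples of the fundamental,
    so the result no longer depends on the speaker's pitch."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    return np.asarray(peak_freqs_hz, dtype=float) / f0_hz

# Two voices with the same harmonic structure but different pitch
# map to the same renormalized coordinates.
low_voice = renormalize_frequencies([110.0, 220.0, 330.0, 440.0], 110.0)
high_voice = renormalize_frequencies([220.0, 440.0, 660.0, 880.0], 220.0)
```

Here `low_voice` and `high_voice` both come out as the harmonic positions 1, 2, 3 and 4.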
- Voiced sounds are typically vowel sounds such as when saying the letter E (“ee”).
- Unvoiced sounds are sometimes referred to as sibilants or turbulent sounds, and correspond to sounds such as the S sound at the end of a word like hiss.
- Semi-voiced sounds are sounds referred to as fricatives or plosives, and tend to have a combination of unvoiced and voiced sounds. An example would result from saying the letter P. It has a combination of the consonant beginning “puh,” and the vowel sound like “ee.”
- Voiced sounds are produced by a repeating sequence of opening and closing of glottal folds, often referred to as the glottal pulse, and can have a frequency of between about 40 Hz for a low frequency male voice to about 600 Hz for female children's voices. This frequency, referred to as the fundamental frequency of the sound, is therefore obviously speaker dependent, and will further vary depending upon the phoneme being uttered and the linguistic and emotional context in which it is uttered.
- FIGS. 3A and 3B illustrate the periodic nature of the glottal pulse for voiced sounds of human speech.
- The spectrum of voiced sounds is shaped by the resonance of the vocal tract filter and contains the harmonics of the quasi-periodic glottal excitation, and has most of its power in the lower frequency bands.
- The spectrum of unvoiced sounds is non-harmonic and usually has more energy in higher frequency bands.
- The lower plot 402 illustrates one window of an audio signal (202, FIG. 2) that demonstrates the periodicity of a voiced speech sound in accordance with the glottal pulse of the person uttering the voiced speech. This utterance exhibits approximately nine periods or repeated cycles 406 over the window.
- The spectral representation of the window of signal is illustrated as oscillator peaks, as determined by the conversion processes discussed above at block 208 of FIG. 2. It should be noted that the oscillator peaks illustrated herein are those determined by the CSPE-based oscillator method described above.
- This plot illustrates that a first oscillation 408a occurs at a frequency that is directly related to the periodicity of the utterance of the signal over the window.
- The frequency oscillator peak 408a is essentially at the fundamental frequency f0 of that utterance over the window.
- Each period, as produced by the glottal pulse, has a number of local maxima that correspond to the harmonic resonances of the voiced sound. These local maxima will vary in number and magnitude for each type of voiced sound, and are correlated with the type of sound being uttered. With reference to plot 402 of FIG. 4A, one can see that there are four local maxima 410, 412, 414 and 416 in each period. This signal structure is related to the four oscillator peaks 408a, 408b, 408c and 408d respectively of spectral plot 404. Regardless of how f0 evolves over time between adjacent window samples of the signal (202, FIG.
- This renormalization method of the invention results in the ability to create a common coordinate system by which these oscillator peak features may be compared between all speakers, without the need to consider statistical distributions of spectral power over as many speakers (or even all speakers in the world) as might be represented by an “infinite corpus,” to account for the variations in frequency among speakers, or even variations for a given speaker due to emotion and linguistic context.
- The above-described renormalization method of the invention also serves to renormalize time duration variance in the speech signal over the sample window as well. Because some people speak very fast, and others might speak very slowly (such as with a drawl), this time variation must also be statistically modeled over many speakers when employing only the known technique of using the evolution of spectral power as the discriminating feature for a speech recognition process. Put another way, by extracting a single period of the oscillation in accordance with the method of the invention, the extracted single period can be recreated over any desired number of periods such that slow or fast speech can be easily compared between AFDVs generated by the method of the invention.
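The idea of taking a single period and recreating it over any desired number of periods can be illustrated in the time domain. This sketch makes several simplifying assumptions that are not drawn from the text: the fundamental frequency is known, the period is an integer number of samples, and the period is obtained by slicing the window rather than from oscillator peaks. The function names are hypothetical.

```python
import numpy as np

def extract_period(window, sample_rate, f0_hz):
    """Return the first full period of a periodic window (simplified:
    assumes the period is close to a whole number of samples)."""
    period_len = int(round(sample_rate / f0_hz))
    return np.asarray(window)[:period_len]

def repeat_period(period, n_periods):
    """Concatenate copies of a single period to any desired duration."""
    return np.tile(period, n_periods)

# A 100 Hz tone sampled at 8000 Hz has an 80-sample period; nine periods
# fill a 720-sample window.
sample_rate, f0 = 8000, 100.0
window = np.sin(2 * np.pi * f0 * np.arange(720) / sample_rate)
single = extract_period(window, sample_rate, f0)
rebuilt = repeat_period(single, 9)
```

In this idealized case `rebuilt` reproduces the original window, and the same single period could equally be repeated three or thirty times to represent faster or slower speech.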
- a means for comparing the AFDVs is to establish an n slot comparator stack 504 of FIGS. 5B, 6B and 7B .
- the number of slots n is twelve.
- the spectral structure of the sounds often consists of 1, 2, 4 or sometimes 6 oscillator peaks.
- a twelve slot comparator stack 504 is able to evenly distribute and form an alignment for each of the spectral structures as illustrated in FIGS. 5A, 6A and 7A . Further, the distribution of the spectral structures would create an alignment where elements with 4 oscillator peaks would largely be unique when compared to elements with 3 oscillator peaks.
- the renormalized spectral structure 502 representing one “glottal pulse” period of the voiced sound from a sampled window of audio signal ( 202 , FIG. 2 ) is illustrative of that of the example of FIG. 4A , having a spectral structure of four oscillator peaks 408 a - d . These peaks can then be formed into an aligned AFDV that evenly distributes the oscillator peak features into the comparator stack 504 of FIG.
- the renormalized spectral structure 602 representing one period of the glottal pulse period of a voiced sound from a sampled window of audio signal ( 202 , FIG. 2 ) has three oscillator peaks 608 a - c . These peaks can then be formed into an aligned AFDV that evenly distributes the oscillator peak features into the comparator stack 504 of FIG. 6B such that the oscillator peak of the highest magnitude frequency bin (B 3 ) 608 c of the AFDV is located in slot 506 a .
- the oscillator peak occupying the next frequency bin (B 2 ) 608 b of the renormalized AFDV is located in slot 506 e and the oscillator peak falling into the lowest frequency bin (B 1 ) 608 a of the renormalized AFDV is placed or aligned in slot 506 f , thus occupying the 4th, 8th and 12th slots of the comparator stack.
- the renormalized spectral structure 702 (generated at 210 , FIG. 2 ) representing one glottal pulse period of a voiced sound derived from a sampled window of audio signal ( 202 , FIG. 2 ) has two oscillator peaks 708 a - b . These peaks can then be formed (at 210 , FIG. 2 ) into an aligned AFDV that evenly distributes the oscillator peak features into the comparator stack 504 of FIG. 7B such that the oscillator peak of the highest frequency bin (B 2 ) of the AFDV is located in slot 506 a . The remaining frequency bin of the renormalized AFDV is located in slot 506 c , thus occupying only the 6th and 12th slots of the comparator stack 504 .
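The even distribution described in the examples above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the peak count is assumed to divide the number of slots, and the lowest frequency bin is assumed to map to the lowest-numbered occupied slot.

```python
def place_in_stack(peak_magnitudes, n_slots=12):
    """Spread k renormalized peaks evenly over an n-slot comparator stack.

    peak_magnitudes[0] is the lowest frequency bin (B1). Peak i (1-based)
    is written to slot i * n_slots / k, so the highest bin always lands in
    slot n_slots. Assumption: k divides n_slots exactly.
    """
    k = len(peak_magnitudes)
    if k == 0 or n_slots % k != 0:
        raise ValueError("peak count must be a non-zero divisor of n_slots")
    step = n_slots // k
    stack = [0.0] * n_slots
    for i, magnitude in enumerate(peak_magnitudes, start=1):
        stack[i * step - 1] = magnitude  # slots are numbered from 1 in the text
    return stack


def occupied_slots(stack):
    """Return the 1-based slot numbers that hold a non-zero entry."""
    return [i + 1 for i, value in enumerate(stack) if value != 0.0]
```

With three peaks the sketch fills the 4th, 8th and 12th slots, and with two peaks the 6th and 12th, matching the examples above. Under the same rule, four peaks would fall into slots 3, 6, 9 and 12.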
- the AFDVs, like other known feature vectors, can be normalized with respect to power to eliminate variation in the volume of different speakers.
- One such technique is to normalize the overall power of the oscillator peaks of the AFDV to 1.
- Those of skill in the art will recognize that one could also normalize the magnitude of the oscillator peak located at the highest slot location of the comparator stack for each structure to a value of one.
- Those of skill in the art will recognize that because every case has a peak in the n th slot of the comparator stack 504 , it provides little or no discriminatory benefit in performing the comparison, and could therefore be removed.
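The two power normalizations mentioned above might look as follows. This is a sketch under stated assumptions: "overall power" is read as the sum of squared slot magnitudes, the "highest slot location" is taken to be the last element of the list, and both function names are hypothetical.

```python
import math


def normalize_total_power(stack):
    """Scale the stack so the sum of squared magnitudes equals 1."""
    norm = math.sqrt(sum(v * v for v in stack))
    if norm == 0.0:
        return list(stack)  # nothing to scale in an empty structure
    return [v / norm for v in stack]


def normalize_to_top_slot(stack):
    """Scale the stack so the peak in the last (n-th) slot has magnitude 1."""
    top = stack[-1]
    if top == 0.0:
        raise ValueError("the n-th slot holds no peak")
    return [v / top for v in stack]
```

After the second variant the n-th slot always holds the value one, which is consistent with the observation above that this slot carries little discriminatory information.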
- Once normalized for power, one may then consider the comparator stack 504 as a vector, and a comparison between the oscillator peaks of each vector and a library of such vectors can be performed. To do so, vectors of known speech sounds can be analyzed and transformed to the same renormalized state in a similarly configured comparator stack, thus building up a library of vectors from the comparator stacks 504 . Then, comparison between an unknown speech sound and the library of known speech sounds can be performed by taking a dot product between the AFDV of the unknown sound and the AFDVs of the library to identify which one of the AFDVs in the library is closest to the extracted and renormalized AFDV of the unknown sound.
- the phoneme or sound associated with the AFDV in the library that is identified as most like the extracted AFDV can be chosen as the most likely sound or phoneme being uttered.
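The dot-product comparison against a library can be illustrated with a short sketch. The library contents, labels and function names below are invented for illustration. The sketch assumes every vector is scaled to unit length before comparison, so that the largest dot product identifies the closest entry.

```python
import math


def unit(vector):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def closest_sound(afdv, library):
    """Return (label, score) of the library entry with the largest dot product."""
    query = unit(afdv)
    best_label, best_score = None, -math.inf
    for label, reference in library.items():
        score = sum(q * r for q, r in zip(query, unit(reference)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical 4-slot library; labels and values are made up for this example.
example_library = {
    "sound_a": [1.0, 0.0, 0.0, 1.0],
    "sound_b": [0.0, 1.0, 0.0, 1.0],
}
```

Calling `closest_sound([0.9, 0.1, 0.0, 1.0], example_library)` selects `"sound_a"`, because that reference shares more of the query's power distribution across slots.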
- An example of the normalization procedure by which to enable the comparison of the AFDVs with the dot product will be provided later below.
- the speech recognition engine can also look at the evolution of power over time for those peaks as the sound is uttered. For example, if a person is saying “WELL,” the strongest power may typically start out at the lower frequency oscillators of the spectral structure, and then eventually move toward the higher peaks. At the same time, it is typical that the fundamental frequency will change over the duration of the utterance and hence it moves around in frequency. Because of the renormalization, the oscillator peaks remain stationary in the slots of the stack, which makes it easier to monitor the evolution of the power through those frequencies and can provide additional information regarding the phoneme being uttered.
- FIG. 8 illustrates a flow diagram of an embodiment of the renormalization method of the invention that generates the AFDVs of the invention.
- an audio signal is received and is sampled into overlapping windows at 800 .
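Splitting a sampled signal into overlapping windows can be sketched as below. The window and hop sizes are free parameters that the text does not specify, and the function name is hypothetical.

```python
def overlapping_windows(samples, window_size, hop_size):
    """Split a sequence of samples into windows that overlap when hop < window.

    Only full-length windows are returned; a trailing partial window is dropped.
    """
    if window_size <= 0 or hop_size <= 0:
        raise ValueError("window_size and hop_size must be positive")
    last_start = len(samples) - window_size
    return [samples[start:start + window_size]
            for start in range(0, last_start + 1, hop_size)]
```

For example, ten samples with a window of four and a hop of two yield four windows, each sharing half of its samples with its neighbor.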
- the windows of the time domain signal are then converted to a spectral representation of the sample from each window in the form of high resolution oscillator peaks that are determined with sufficient accuracy and at high enough resolution to enable their representation as essentially delta functions.
- the high-resolution frequency analysis can, for example, employ the CSPE method disclosed in the publication by Kevin M. Short and Ricardo A.
- short-term stabilized oscillators are identified from the oscillator peaks. It should be noted that if certain of the enhancements to the CSPE method as disclosed in U.S. patent application Ser. No. 13/886,902 are employed, frequency and amplitude modulated oscillators can also be identified at this step and used as features in the method of the invention.
- a tracker can be optionally used to track the evolution of the identified oscillators. This can be used, for example, if the speech to be recognized is uttered in a noisy environment in which one or more additional sources of sound exist. By tracking the evolution of the oscillators over time, oscillators that evolve coherently may be determined to be from the same source. Thus, the sound to be analyzed may be further focused to the source of interest, thereby removing the sounds that emanate from the other sources present in the environment. This can further simplify the speech recognition problem: current systems must not only statistically model the speech to account for variations but, to the extent that the speech is not from a clean signal, must also model various types of noise.
- harmonically related oscillators are identified and grouped together, for purposes of identifying the harmonic components of the system.
- the harmonically related oscillator peaks are identified (e.g. peaks 408 a - d , FIG. 4A ).
- this function can be performed by a pitch detector that is able to identify those oscillators that are related to one another as multiples of the fundamental frequency f 0 of the signal.
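A simple way to picture the harmonic grouping step is shown below. The sketch assumes the fundamental frequency f0 has already been estimated by some pitch detector, which is not implemented here. The relative tolerance and the function name are illustrative choices, not values from the text.

```python
def group_harmonics(peak_freqs_hz, f0_hz, tolerance=0.03):
    """Map harmonic numbers to the peaks lying near integer multiples of f0.

    A peak at frequency f is accepted as harmonic k when its relative
    deviation from k * f0 is within `tolerance`. If several peaks compete
    for the same harmonic, the closest one is kept.
    """
    groups = {}
    for freq in peak_freqs_hz:
        k = round(freq / f0_hz)
        if k < 1:
            continue
        target = k * f0_hz
        if abs(freq - target) / target > tolerance:
            continue  # too far from any multiple of f0
        if k not in groups or abs(freq - target) < abs(groups[k] - target):
            groups[k] = freq
    return groups
```

Peaks that fail the tolerance test are left ungrouped, so they remain available for any separate handling of non-harmonic content.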
- Once the harmonic structure (e.g. 502 , FIG. 5A ) for a single period of the signal is known, an AFDV can be generated for each window of the signal at 210 .
- the AFDVs are aligned through a common coordinate system at 820 so that they can be compared, for example, to a library or corpus of AFDVs for known speech sounds to determine the most likely sound being uttered during the window from which the AFDV was extracted. See the discussion above with respect to an embodiment that employs a comparator stack 504 and using the dot product to identify a match.
- FIG. 9 illustrates a flow diagram of the renormalization steps that occur at 210 of FIG. 8 .
- harmonically grouped oscillator peaks have been identified such as through a pitch detector, and as illustrated by the harmonically related oscillator peaks 408 a - d of FIG. 4A .
- these oscillators are placed in consecutive frequency bins as previously discussed to essentially create a single excitation period of the signal that renormalizes both frequency and time.
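The effect of placing harmonically related oscillators into consecutive bins can be illustrated as follows. This is a minimal sketch that assumes the group is supplied as (frequency, magnitude) pairs and that ordering by frequency is enough to assign bins 1, 2, 3 and so on. The function name is hypothetical.

```python
def renormalize_to_bins(harmonic_peaks):
    """Return magnitudes ordered by frequency; element i corresponds to bin i + 1.

    harmonic_peaks is a list of (frequency_hz, magnitude) pairs belonging to
    one harmonic group. The absolute frequencies are discarded, so speakers
    with different fundamentals map onto the same bin layout.
    """
    return [magnitude for _, magnitude in sorted(harmonic_peaks)]


# Two hypothetical speakers with the same magnitude pattern but different pitch.
low_voice = [(120.0, 1.0), (240.0, 0.5), (360.0, 0.25)]
high_voice = [(200.0, 1.0), (400.0, 0.5), (600.0, 0.25)]
```

Both example groups renormalize to the same list of magnitudes, which is the property that lets structures from different speakers be compared directly.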
- the AFDVs are placed into an established comparison coordinate structure such as, for example, the comparator stack 504 , as illustrated by FIGS. 5B, 6B and 7B .
- the AFDV for each window can be compared to a library or corpus of known speech that has itself been coded into AFDVs of the invention.
- the comparison can be any technique well-known in the art, such as by a dot product between the AFDV extracted from a window and the library of known AFDVs as previously described.
- the result of that comparison can then be an output of an identification of the most likely phoneme or sub-phoneme based on the closest match.
- Other types of comparisons may be used here, such as, but not limited to, a Bayesian decision, a Mahalanobis distance, or a weighted or curved space distance metric.
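As one example of an alternative metric, a Mahalanobis distance can be sketched as below. To keep the example dependency-free it assumes a diagonal covariance, that is, independent per-slot variances, which is a simplification of the general form. The function name and the idea of storing a mean and variances per library entry are assumptions.

```python
import math


def mahalanobis_diag(vector, mean, variances):
    """Mahalanobis distance for a diagonal covariance matrix.

    Each squared deviation from the mean is divided by that slot's variance,
    so slots that vary widely among known examples count for less.
    """
    if not (len(vector) == len(mean) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return math.sqrt(sum((x - m) ** 2 / var
                         for x, m, var in zip(vector, mean, variances)))
```

Unlike the dot product, a smaller value indicates a closer match here, so the library entry with the minimum distance would be selected.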
- In FIG. 10 one can see a power spectrum 1000 that has been averaged over many utterances for a single speaker. As illustrated, there are several zones that are delimited by the large dots in the figure. These zones roughly correspond to the resonances of the speaker's vocal tract and are affected by the size, gender, mouth shape, etc., of the given speaker.
- zone 1 spans approximately 0 Hz to 1100 Hz and zone 2 spans approximately 1100 Hz to 2200 Hz.
- the individual spectra that are included in the average would tend to be produced by voiced phonemes (and the individual spectra would have harmonically grouped oscillator peaks as illustrated in the examples above).
- the signal power in the individual spectra would largely be confined to zones 4 and 5 as delineated by dots 1006 and 1008 .
- voiced sounds behave in a mostly periodic (and therefore harmonic) manner in accordance with the glottal pulse and are typically vowel type phonemes and sub-phonemes.
- the renormalization method of the invention as set forth above has been demonstrated using examples of speech where the dominant signal has been a voiced phoneme like a vowel sound.
- unvoiced phonemes e.g. sibilants
- sibilants are primarily turbulent in nature, they tend to lack clearly defined, well-behaved harmonic structure such as that exhibited by voiced sounds.
- the individual spectra tend to be smeared out and when analyzed as oscillator peaks, there are groupings of the peaks, but they do not exhibit the even spacing of the harmonics that one would expect for voiced phonemes.
- semi-voiced phonemes such as fricatives and plosives
- For semi-voiced phonemes there can be signal power in most of the zones, often with oscillator peaks grouped harmonically in zones 1 and 2 and less harmonically grouped oscillator peaks in zones 3, 4 and 5.
- the renormalization method of the invention can be used to generate AFDVs of the invention for identifying voiced sounds, to be used in conjunction with known techniques for extracting known feature vectors such as MFCCs as previously discussed.
- FIG. 11 illustrates a non-limiting embodiment of a speech recognition system 1100 that can employ the AFDVs of the invention to improve the robustness of identifying voiced sounds, while more traditional feature vectors (e.g. MFCCs) can be used for identifying unvoiced sounds such as sibilants.
- MFCCs feature vectors
- the method of the invention and the AFDVs of the invention generated therefrom can still be extended to apply more broadly to identifying both unvoiced and semi-voiced sounds as well.
- a non-limiting embodiment of a speech recognition system 1200 that can employ the AFDVs of the invention for identifying voiced sounds, as well as unvoiced and semi-voiced sounds, will be discussed below with reference to FIG. 12 .
- a non-limiting embodiment of a speech recognition system 1100 employs the method of the invention to generate AFDVs of the invention for improving the robustness of identifying voiced sounds, while extracting known feature vectors such as MFCCs for identification of sounds in the higher frequencies.
- the two different feature vector types can also be used coextensively to identify semi-voiced sounds that include both voiced and unvoiced components.
- Oscillator Peaks 209 are extracted from signal 202 as previously discussed, and it is determined at decision block 1102 whether the extracted oscillator peaks are voiced in nature. This can be determined by a number of ways, including whether they can be grouped harmonically and whether most of the power of the oscillator peaks falls within the first two zones of FIG. 10 . If yes, AFDVs 212 of the invention are generated in accordance with the renormalization method of the invention at 210 . They are then normalized with respect to power at 1124 and provided to speech recognition engine 214 for use in identifying voiced sounds such as vowels.
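The voiced/unvoiced decision at block 1102 could be approximated as in the sketch below. The 2200 Hz limit follows the approximate upper edge of zone 2 given earlier. The minimum number of harmonics, the power fraction threshold and the function name are illustrative assumptions rather than values from the text.

```python
def is_voiced(peaks, harmonic_count, zone_limit_hz=2200.0,
              min_power_fraction=0.5, min_harmonics=2):
    """Heuristic voiced test for one window.

    peaks is a list of (frequency_hz, magnitude) pairs. harmonic_count is the
    number of peaks that could be grouped harmonically. The window counts as
    voiced when enough peaks are harmonic and most of the power (sum of
    squared magnitudes) lies below zone_limit_hz.
    """
    total_power = sum(m * m for _, m in peaks)
    if total_power == 0.0:
        return False
    low_power = sum(m * m for f, m in peaks if f < zone_limit_hz)
    return (harmonic_count >= min_harmonics
            and low_power / total_power > min_power_fraction)
```

Windows that fail this test would follow the other branch of the decision block, where conventional feature vectors are generated instead.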
- the conventional feature vectors such as MFCCs 1112 can be generated at 1114 , normalized as to power at 1126 , and then are provided to speech recognition engine 214 for identifying unvoiced and possibly semi-voiced sounds.
- the standard feature vectors such as MFCCs can be used to identify unvoiced sounds such as sibilants, while the AFDVs 212 of the invention can be used to simply improve the robustness of identifying voiced sounds.
- Those of skill in the art will recognize that it might be beneficial to employ both features in combination to improve the identification of semi-voiced sounds such as plosives and fricatives. This can be accomplished at least in part by maintaining the ratio of spectral power between the two for each window of sampled signal. A technique for accomplishing this result is set forth in detail below.
- FIG. 12 illustrates a non-limiting embodiment of a speech recognition system 1200 that employs the method of the invention to generate AFDVs of the invention for identifying all three categories of sound.
- Oscillator Peaks 209 are extracted from signal 202 as previously discussed, and AFDVs 212 of the invention are generated in accordance with the renormalization method of the invention at 210 .
- a unique comparator stack 504 i.e. establishing comparison coordinates
- the first subdivision of the comparator stack is employed as previously described for identified groups of oscillator peaks that are harmonically related. With respect to those higher frequency, non-harmonically related oscillator peaks, the oscillator peaks will be smeared over the spectrum as they will not be as well-behaved as the harmonically related oscillator peaks for voiced sounds. Thus, for example, one can establish contributions to comparator stack slots for nearby oscillator peaks based on their weighted average to establish entries in the slots in much the same way as performed for MFCC features.
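One way to form weighted-average entries from smeared, non-harmonic peaks is sketched below. The band edges, the use of power (squared magnitude) as the weight, and the function name are all assumptions made for illustration. The text only states that nearby peaks contribute to a slot through a weighted average.

```python
def weighted_band_entries(peaks, band_edges_hz):
    """Summarize the peaks in each band by a power-weighted mean frequency.

    peaks is a list of (frequency_hz, magnitude) pairs. band_edges_hz is an
    ascending list of edges defining bands [edge[i], edge[i + 1]). Returns one
    (mean_frequency_hz, total_power) tuple per non-empty band.
    """
    entries = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = [(f, m * m) for f, m in peaks if low <= f < high]
        power = sum(p for _, p in in_band)
        if power == 0.0:
            continue  # no energy in this band, so no entry
        mean_freq = sum(f * p for f, p in in_band) / power
        entries.append((mean_freq, power))
    return entries
```

The resulting entries can then be treated like individual oscillator peaks when they are shifted into adjacent bins and distributed into the second subdivision of the stack.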
- any group of oscillator peaks that is not harmonically related could be renormalized as before, but this is often undesirable at the higher frequencies.
- the weighted average frequency entries may then be renormalized by shifting them to adjacent frequency bins starting with bin 1 as previously described above for harmonically related oscillator peaks. These bins can then be distributed into the second subdivision of the stack in the same manner as for harmonically related oscillator peaks as described above to establish a common coordinate comparison space for the non-harmonically related oscillator peaks as well.
- the power normalization performed at block 1216 of FIG. 12 should be performed such that the overall ratio of power between the subdivisions is maintained.
- a preferred embodiment for the process for balancing the power between subdivisions as performed at block 1216 is given below.
- the subscript “Z” is added to indicate that the result is a zone-based representation of the information in the oscillator peak representation.
- the foregoing technique can also be applied when the two feature vectors are a mix between an AFDV of the invention, and a conventional feature vector such as an MFCC.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Telephone Function (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
Description
and apply the rescaling to the separate zones that have been created in accordance with the method of the invention. Thus, the normalized magnitude is 1, since
Note that the magnitude information about the separate zones is retained in α and β.
and $\vec{V}_{HF}$
giving
The subscript “Z” is added to indicate that the result is a zone-based representation of the information in the oscillator peak representation.
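The equations for this balancing step did not survive in the text above, so the following is only one plausible reading, offered as a sketch. It assumes α and β are the magnitudes of a low-frequency and a high-frequency zone vector, that each zone is first scaled to unit length, and that the zones are then reweighted so the combined vector has magnitude 1 while the ratio α : β is preserved. The names `v_lf`, `v_hf` and `balance_zones` are hypothetical.

```python
import math


def balance_zones(v_lf, v_hf):
    """Combine two zone vectors into one unit-length vector.

    Returns (combined, a, b) where a and b are the rescaled zone weights
    satisfying a**2 + b**2 == 1 and a / b == alpha / beta.
    """
    alpha = math.sqrt(sum(x * x for x in v_lf))  # magnitude of the low zone
    beta = math.sqrt(sum(x * x for x in v_hf))   # magnitude of the high zone
    total = math.hypot(alpha, beta)
    if total == 0.0:
        raise ValueError("both zones are empty")
    a, b = alpha / total, beta / total
    lf_unit = [x / alpha for x in v_lf] if alpha else list(v_lf)
    hf_unit = [x / beta for x in v_hf] if beta else list(v_hf)
    combined = [a * x for x in lf_unit] + [b * x for x in hf_unit]
    return combined, a, b
```

Because the weights retain the relative zone magnitudes, a comparison on the combined vector still reflects how power was split between the subdivisions in the original window.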
Claims (19)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/217,198 US9728182B2 (en) | 2013-03-15 | 2014-03-17 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US15/638,627 US10410623B2 (en) | 2013-03-15 | 2017-06-30 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US16/520,104 US11056097B2 (en) | 2013-03-15 | 2019-07-23 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361786888P | 2013-03-15 | 2013-03-15 | |
| US201361914002P | 2013-12-10 | 2013-12-10 | |
| US14/217,198 US9728182B2 (en) | 2013-03-15 | 2014-03-17 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/638,627 Continuation US10410623B2 (en) | 2013-03-15 | 2017-06-30 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20160284343A1 US20160284343A1 (en) | 2016-09-29 |
| US9728182B2 true US9728182B2 (en) | 2017-08-08 |
Family
ID=56974274
Family Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/217,198 Active US9728182B2 (en) | 2013-03-15 | 2014-03-17 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US15/638,627 Active US10410623B2 (en) | 2013-03-15 | 2017-06-30 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US16/520,104 Active 2034-05-28 US11056097B2 (en) | 2013-03-15 | 2019-07-23 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
Family Applications After (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/638,627 Active US10410623B2 (en) | 2013-03-15 | 2017-06-30 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US16/520,104 Active 2034-05-28 US11056097B2 (en) | 2013-03-15 | 2019-07-23 | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
Country Status (3)
| Country | Link |
|---|---|
| US (3) | US9728182B2 (en) |
| EP (1) | EP3042377B1 (en) |
| WO (1) | WO2014145960A2 (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190049479A1 (en) * | 2017-08-14 | 2019-02-14 | Google Inc. | Systems and methods of motion detection using dynamic thresholds and data filtering |
| US10410623B2 (en) | 2013-03-15 | 2019-09-10 | Xmos Inc. | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US10497381B2 (en) | 2012-05-04 | 2019-12-03 | Xmos Inc. | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| US20200176010A1 (en) * | 2018-11-30 | 2020-06-04 | International Business Machines Corporation | Avoiding speech collisions among participants during teleconferences |
| US10957336B2 (en) | 2012-05-04 | 2021-03-23 | Xmos Inc. | Systems and methods for source signal separation |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10008201B2 (en) * | 2015-09-28 | 2018-06-26 | GM Global Technology Operations LLC | Streamlined navigational speech recognition |
| CN106296890B (en) * | 2016-07-22 | 2019-06-04 | 北京小米移动软件有限公司 | Unlocking method, device and mobile terminal of mobile terminal |
| US10062378B1 (en) * | 2017-02-24 | 2018-08-28 | International Business Machines Corporation | Sound identification utilizing periodic indications |
| CN107180640B (en) * | 2017-04-13 | 2020-06-12 | 广东工业大学 | Phase-correlated high-density stacked window frequency spectrum calculation method |
| GB201719734D0 (en) * | 2017-10-30 | 2018-01-10 | Cirrus Logic Int Semiconductor Ltd | Speaker identification |
| CN111383646B (en) * | 2018-12-28 | 2020-12-08 | 广州市百果园信息技术有限公司 | Voice signal transformation method, device, equipment and storage medium |
| CN109658943B (en) * | 2019-01-23 | 2023-04-14 | 平安科技(深圳)有限公司 | Audio noise detection method and device, storage medium and mobile terminal |
| CA3127443A1 (en) * | 2019-01-23 | 2020-07-30 | Sound Genetics, Inc. | Systems and methods for pre-filtering audio content based on prominence of frequency content |
| CN110232928B (en) * | 2019-06-13 | 2021-05-25 | 思必驰科技股份有限公司 | Text-independent speaker verification method and device |
| CN113450781B (en) * | 2020-03-25 | 2022-08-09 | 阿里巴巴集团控股有限公司 | Speech processing method, speech encoder, speech decoder and speech recognition system |
| CN111710349B (en) * | 2020-06-23 | 2023-07-04 | 长沙理工大学 | A speech emotion recognition method, system, computer equipment and storage medium |
| WO2022045395A1 (en) * | 2020-08-27 | 2022-03-03 | 임재윤 | Audio data correction method and device for removing plosives |
| US12002451B1 (en) * | 2021-07-01 | 2024-06-04 | Amazon Technologies, Inc. | Automatic speech recognition |
| CN113851147B (en) * | 2021-10-19 | 2025-05-13 | 北京百度网讯科技有限公司 | Audio recognition method, audio recognition model training method, device, and electronic device |
Citations (30)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6381571B1 (en) | 1998-05-01 | 2002-04-30 | Texas Instruments Incorporated | Sequential determination of utterance log-spectral mean by maximum a posteriori probability estimation |
| JP2002168950A (en) | 2000-12-01 | 2002-06-14 | Mitsubishi Electric Corp | Wave source detecting device and wave source detecting method |
| US6526378B1 (en) | 1997-12-08 | 2003-02-25 | Mitsubishi Denki Kabushiki Kaisha | Method and apparatus for processing sound signal |
| US6535666B1 (en) | 1995-06-02 | 2003-03-18 | Trw Inc. | Method and apparatus for separating signals transmitted over a waveguide |
| WO2003041052A1 (en) | 2001-11-05 | 2003-05-15 | Motorola, Inc. | Improve speech recognition by dynamical noise model adaptation |
| RU2216052C2 (en) | 1998-07-14 | 2003-11-10 | Интел Корпорейшн | Automatic speech recognition |
| US20040230428A1 (en) | 2003-03-31 | 2004-11-18 | Samsung Electronics Co. Ltd. | Method and apparatus for blind source separation using two sensors |
| WO2005029467A1 (en) | 2003-09-17 | 2005-03-31 | Kitakyushu Foundation For The Advancement Of Industry, Science And Technology | A method for recovering target speech based on amplitude distributions of separated signals |
| US20050091042A1 (en) | 2000-04-26 | 2005-04-28 | Microsoft Corporation | Sound source separation using convolutional mixing and a priori sound source knowledge |
| JP2005229453A (en) | 2004-02-16 | 2005-08-25 | Motorola Inc | Method and device of tuning propagation model |
| US20060053002A1 (en) | 2002-12-11 | 2006-03-09 | Erik Visser | System and method for speech processing using independent component analysis under stability restraints |
| US20060056647A1 (en) | 2004-09-13 | 2006-03-16 | Bhiksha Ramakrishnan | Separating multiple audio signals recorded as a single mixed signal |
| US20060153059A1 (en) | 2002-12-18 | 2006-07-13 | Qinetiq Limited | Signal separation |
| US20060269057A1 (en) * | 2005-05-26 | 2006-11-30 | Groove Mobile, Inc. | Systems and methods for high resolution signal analysis and chaotic data compression |
| WO2007118583A1 (en) | 2006-04-13 | 2007-10-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio signal decorrelator |
| EP1926321A1 (en) | 2006-11-27 | 2008-05-28 | Matsushita Electric Industrial Co., Ltd. | Hybrid texture representation |
| US20080228470A1 (en) | 2007-02-21 | 2008-09-18 | Atsuo Hiroe | Signal separating device, signal separating method, and computer program |
| US7457756B1 (en) | 2005-06-09 | 2008-11-25 | The United States Of America As Represented By The Director Of The National Security Agency | Method of generating time-frequency signal representation preserving phase information |
| JP2009089315A (en) | 2007-10-03 | 2009-04-23 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic signal estimation device, acoustic signal synthesis device, acoustic signal estimation synthesis device, acoustic signal estimation method, acoustic signal synthesis method, acoustic signal estimation synthesis method, program using these methods, and recording medium |
| US20090222259A1 (en) * | 2008-02-29 | 2009-09-03 | Kabushiki Kaisha Toshiba | Apparatus, method and computer program product for feature extraction |
| JP2009533912A (en) | 2006-04-13 | 2009-09-17 | フラウンホッファー−ゲゼルシャフト ツァ フェルダールング デァ アンゲヴァンテン フォアシュンク エー.ファオ | Audio signal correlation separator, multi-channel audio signal processor, audio signal processor, method and computer program for deriving output audio signal from input audio signal |
| US7729912B1 (en) | 2003-12-23 | 2010-06-01 | At&T Intellectual Property Ii, L.P. | System and method for latency reduction for automatic speech recognition using partial multi-pass results |
| US20100302971A1 (en) | 2006-08-25 | 2010-12-02 | Space Systems/Loral, Inc. | Ground-based beamforming for satellite communications systems |
| US20100322437A1 (en) | 2009-06-23 | 2010-12-23 | Fujitsu Limited | Signal processing apparatus and signal processing method |
| US20110035215A1 (en) * | 2007-08-28 | 2011-02-10 | Haim Sompolinsky | Method, device and system for speech recognition |
| WO2013166439A1 (en) | 2012-05-04 | 2013-11-07 | Setem Technologies, Llc | Systems and methods for source signal separation |
| WO2014145960A2 (en) | 2013-03-15 | 2014-09-18 | Short Kevin M | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US20150287422A1 (en) | 2012-05-04 | 2015-10-08 | Kaonyx Labs, LLC | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| WO2015157458A1 (en) | 2014-04-09 | 2015-10-15 | Kaonyx Labs, LLC | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| US20150348536A1 (en) * | 2012-11-13 | 2015-12-03 | Yoichi Ando | Method and device for recognizing speech |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4559605A (en) | 1983-09-16 | 1985-12-17 | The Boeing Company | Method and apparatus for random array beamforming |
| DE69725172T2 (en) * | 1996-03-08 | 2004-04-08 | Motorola, Inc., Schaumburg | METHOD AND DEVICE FOR DETECTING NOISE SAMPLE SAMPLES FROM A NOISE |
| US6233550B1 (en) * | 1997-08-29 | 2001-05-15 | The Regents Of The University Of California | Method and apparatus for hybrid coding of speech at 4kbps |
| US6253175B1 (en) * | 1998-11-30 | 2001-06-26 | International Business Machines Corporation | Wavelet-based energy binning cepstal features for automatic speech recognition |
| EP1700266A4 (en) | 2003-12-19 | 2010-01-20 | Creative Tech Ltd | Method and system to process a digital image |
| US20050156780A1 (en) | 2004-01-16 | 2005-07-21 | Ghz Tr Corporation | Methods and apparatus for automotive radar sensors |
| US9253560B2 (en) * | 2008-09-16 | 2016-02-02 | Personics Holdings, Llc | Sound library and method |
| GB2466242B (en) * | 2008-12-15 | 2013-01-02 | Audio Analytic Ltd | Sound identification systems |
| US9286911B2 (en) * | 2008-12-15 | 2016-03-15 | Audio Analytic Ltd | Sound identification systems |
| RU2419890C1 (en) * | 2009-09-24 | 2011-05-27 | Общество с ограниченной ответственностью "Центр речевых технологий" | Method of identifying speaker from arbitrary speech phonograms based on formant equalisation |
| KR101561755B1 (en) | 2011-03-03 | 2015-10-19 | 사이퍼 엘엘씨 | System for autonomous detection and separation of common elements within data, and methods and devices associated therewith |
| US9042867B2 (en) * | 2012-02-24 | 2015-05-26 | Agnitio S.L. | System and method for speaker recognition on mobile devices |
-
2014
- 2014-03-17 US US14/217,198 patent/US9728182B2/en active Active
- 2014-03-17 WO PCT/US2014/030819 patent/WO2014145960A2/en not_active Ceased
- 2014-03-17 EP EP14763371.3A patent/EP3042377B1/en active Active
-
2017
- 2017-06-30 US US15/638,627 patent/US10410623B2/en active Active
-
2019
- 2019-07-23 US US16/520,104 patent/US11056097B2/en active Active
Patent Citations (42)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6535666B1 (en) | 1995-06-02 | 2003-03-18 | Trw Inc. | Method and apparatus for separating signals transmitted over a waveguide |
| US6526378B1 (en) | 1997-12-08 | 2003-02-25 | Mitsubishi Denki Kabushiki Kaisha | Method and apparatus for processing sound signal |
| US6381571B1 (en) | 1998-05-01 | 2002-04-30 | Texas Instruments Incorporated | Sequential determination of utterance log-spectral mean by maximum a posteriori probability estimation |
| RU2216052C2 (en) | 1998-07-14 | 2003-11-10 | Интел Корпорейшн | Automatic speech recognition |
| US20050091042A1 (en) | 2000-04-26 | 2005-04-28 | Microsoft Corporation | Sound source separation using convolutional mixing and a priori sound source knowledge |
| JP2002168950A (en) | 2000-12-01 | 2002-06-14 | Mitsubishi Electric Corp | Wave source detecting device and wave source detecting method |
| WO2003041052A1 (en) | 2001-11-05 | 2003-05-15 | Motorola, Inc. | Improve speech recognition by dynamical noise model adaptation |
| US20060053002A1 (en) | 2002-12-11 | 2006-03-09 | Erik Visser | System and method for speech processing using independent component analysis under stability restraints |
| US20060153059A1 (en) | 2002-12-18 | 2006-07-13 | Qinetiq Limited | Signal separation |
| US20040230428A1 (en) | 2003-03-31 | 2004-11-18 | Samsung Electronics Co. Ltd. | Method and apparatus for blind source separation using two sensors |
| WO2005029467A1 (en) | 2003-09-17 | 2005-03-31 | Kitakyushu Foundation For The Advancement Of Industry, Science And Technology | A method for recovering target speech based on amplitude distributions of separated signals |
| US7729912B1 (en) | 2003-12-23 | 2010-06-01 | At&T Intellectual Property Ii, L.P. | System and method for latency reduction for automatic speech recognition using partial multi-pass results |
| JP2005229453A (en) | 2004-02-16 | 2005-08-25 | Motorola Inc | Method and device of tuning propagation model |
| US20060056647A1 (en) | 2004-09-13 | 2006-03-16 | Bhiksha Ramakrishnan | Separating multiple audio signals recorded as a single mixed signal |
| US7454333B2 (en) | 2004-09-13 | 2008-11-18 | Mitsubishi Electric Research Lab, Inc. | Separating multiple audio signals recorded as a single mixed signal |
| US20060269057A1 (en) * | 2005-05-26 | 2006-11-30 | Groove Mobile, Inc. | Systems and methods for high resolution signal analysis and chaotic data compression |
| US7457756B1 (en) | 2005-06-09 | 2008-11-25 | The United States Of America As Represented By The Director Of The National Security Agency | Method of generating time-frequency signal representation preserving phase information |
| JP2009533912A (en) | 2006-04-13 | 2009-09-17 | フラウンホッファー−ゲゼルシャフト ツァ フェルダールング デァ アンゲヴァンテン フォアシュンク エー.ファオ | Audio signal correlation separator, multi-channel audio signal processor, audio signal processor, method and computer program for deriving output audio signal from input audio signal |
| WO2007118583A1 (en) | 2006-04-13 | 2007-10-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio signal decorrelator |
| US20100302971A1 (en) | 2006-08-25 | 2010-12-02 | Space Systems/Loral, Inc. | Ground-based beamforming for satellite communications systems |
| EP1926321A1 (en) | 2006-11-27 | 2008-05-28 | Matsushita Electric Industrial Co., Ltd. | Hybrid texture representation |
| US20080228470A1 (en) | 2007-02-21 | 2008-09-18 | Atsuo Hiroe | Signal separating device, signal separating method, and computer program |
| US20110035215A1 (en) * | 2007-08-28 | 2011-02-10 | Haim Sompolinsky | Method, device and system for speech recognition |
| JP2009089315A (en) | 2007-10-03 | 2009-04-23 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic signal estimation device, acoustic signal synthesis device, acoustic signal estimation synthesis device, acoustic signal estimation method, acoustic signal synthesis method, acoustic signal estimation synthesis method, program using these methods, and recording medium |
| US20090222259A1 (en) * | 2008-02-29 | 2009-09-03 | Kabushiki Kaisha Toshiba | Apparatus, method and computer program product for feature extraction |
| US8638952B2 (en) | 2009-06-23 | 2014-01-28 | Fujitsu Limited | Signal processing apparatus and signal processing method |
| US20100322437A1 (en) | 2009-06-23 | 2010-12-23 | Fujitsu Limited | Signal processing apparatus and signal processing method |
| JP2011007861A (en) | 2009-06-23 | 2011-01-13 | Fujitsu Ltd | Signal processing apparatus, signal processing method and signal processing program |
| US20140316771A1 (en) | 2012-05-04 | 2014-10-23 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US20160071528A9 (en) | 2012-05-04 | 2016-03-10 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US8694306B1 (en) | 2012-05-04 | 2014-04-08 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US20140163991A1 (en) | 2012-05-04 | 2014-06-12 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US20170004844A1 (en) | 2012-05-04 | 2017-01-05 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| WO2013166439A1 (en) | 2012-05-04 | 2013-11-07 | Setem Technologies, Llc | Systems and methods for source signal separation |
| US9495975B2 (en) | 2012-05-04 | 2016-11-15 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US20150287422A1 (en) | 2012-05-04 | 2015-10-08 | Kaonyx Labs, LLC | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| US9443535B2 (en) | 2012-05-04 | 2016-09-13 | Kaonyx Labs LLC | Systems and methods for source signal separation |
| US20140079248A1 (en) | 2012-05-04 | 2014-03-20 | Kaonyx Labs LLC | Systems and Methods for Source Signal Separation |
| US20150348536A1 (en) * | 2012-11-13 | 2015-12-03 | Yoichi Ando | Method and device for recognizing speech |
| WO2014145960A3 (en) | 2013-03-15 | 2015-03-05 | Short Kevin M | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| WO2014145960A2 (en) | 2013-03-15 | 2014-09-18 | Short Kevin M | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| WO2015157458A1 (en) | 2014-04-09 | 2015-10-15 | Kaonyx Labs, LLC | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
Non-Patent Citations (13)
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10497381B2 (en) | 2012-05-04 | 2019-12-03 | Xmos Inc. | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| US10957336B2 (en) | 2012-05-04 | 2021-03-23 | Xmos Inc. | Systems and methods for source signal separation |
| US10978088B2 (en) | 2012-05-04 | 2021-04-13 | Xmos Inc. | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
| US10410623B2 (en) | 2013-03-15 | 2019-09-10 | Xmos Inc. | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US11056097B2 (en) * | 2013-03-15 | 2021-07-06 | Xmos Inc. | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| US20190049479A1 (en) * | 2017-08-14 | 2019-02-14 | Google Inc. | Systems and methods of motion detection using dynamic thresholds and data filtering |
| US10942196B2 (en) * | 2017-08-14 | 2021-03-09 | Google Llc | Systems and methods of motion detection using dynamic thresholds and data filtering |
| US20200176010A1 (en) * | 2018-11-30 | 2020-06-04 | International Business Machines Corporation | Avoiding speech collisions among participants during teleconferences |
| US11017790B2 (en) * | 2018-11-30 | 2021-05-25 | International Business Machines Corporation | Avoiding speech collisions among participants during teleconferences |
Also Published As
| Publication number | Publication date |
|---|---|
| US20160284343A1 (en) | 2016-09-29 |
| US10410623B2 (en) | 2019-09-10 |
| US11056097B2 (en) | 2021-07-06 |
| EP3042377A4 (en) | 2017-08-30 |
| WO2014145960A3 (en) | 2015-03-05 |
| US20200160839A1 (en) | 2020-05-21 |
| EP3042377A2 (en) | 2016-07-13 |
| EP3042377B1 (en) | 2023-01-11 |
| US20170301343A1 (en) | 2017-10-19 |
| WO2014145960A2 (en) | 2014-09-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11056097B2 (en) | | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
| Vergin et al. | | Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition |
| Bezoui et al. | | Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC) |
| Mouaz et al. | | Speech recognition of Moroccan dialect using hidden Markov models |
| JPH09500223A (en) | | Multilingual speech recognition system |
| JP2015068897A (en) | | Utterance evaluation method and apparatus, and computer program for evaluating utterance |
| Pao et al. | | Combining acoustic features for improved emotion recognition in mandarin speech |
| Priyadarshani et al. | | Dynamic time warping based speech recognition for isolated Sinhala words |
| Czap et al. | | Intensity feature for speech stress detection |
| Bhatt et al. | | Effects of the dynamic and energy based feature extraction on hindi speech recognition |
| Zolnay et al. | | Using multiple acoustic feature sets for speech recognition |
| CN115019775B (en) | | A language identification method based on language distinguishing features of phonemes |
| Cherif et al. | | Pitch detection and formant analysis of Arabic speech processing |
| Stasiak et al. | | Fundamental frequency extraction in speech emotion recognition |
| Sinha et al. | | Continuous density hidden markov model for hindi speech recognition |
| Saratxaga et al. | | Using harmonic phase information to improve ASR rate. |
| Hung et al. | | Automatic identification of vietnamese dialects |
| Tripathi et al. | | Robust vowel region detection method for multimode speech |
| Al-hazaimeh et al. | | Cross correlation–new based technique for speaker recognition |
| Grewal et al. | | Isolated word recognition system for English language |
| Lingam | | Speaker based language independent isolated speech recognition system |
| Heo et al. | | Classification based on speech rhythm via a temporal alignment of spoken sentences |
| Wang et al. | | Improved Mandarin speech recognition by lattice rescoring with enhanced tone models |
| Neiberg et al. | | Classification of affective speech using normalized time-frequency cepstra |
| CN119724171B (en) | | Vocabulary recognition method, device, electronic device and medium based on speech model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: SETEM TECHNOLOGIES, INC., MASSACHUSETTS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SHORT, KEVIN M.;HONE, BRIAN T.;SIGNING DATES FROM 20160919 TO 20161019;REEL/FRAME:040069/0954 |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | AS | Assignment | Owner name: XMOS INC., NEW HAMPSHIRE Free format text: CHANGE OF NAME;ASSIGNOR:SETEM TECHNOLOGIES, INC.;REEL/FRAME:045137/0094 Effective date: 20171227 |
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | FEPP | Fee payment procedure | Free format text: PETITION RELATED TO MAINTENANCE FEES GRANTED (ORIGINAL EVENT CODE: PTGR); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |