WO2016126753A1 - Determining features of harmonic signals - Google Patents

Determining features of harmonic signals

Info

Publication number
WO2016126753A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
frequency
computing
chirp rate
pitch
Prior art date
Application number
PCT/US2016/016261
Other languages
English (en)
Inventor
David Carlson BRADLEY
Yao Huang MORIN
Massimo Mascaro
Janis I. INTOY
Sean Michael O'connor
Ellisha Natalie MARONGELLI
Robert Nicholas HILTON
Original Assignee
Knuedge Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/969,038 (external priority; patent US9842611B2)
Priority claimed from US14/969,029 (external priority; patent US9870785B2)
Priority claimed from US14/969,022 (external priority; patent US9548067B2)
Priority claimed from US14/969,036 (external priority; patent US9922668B2)
Application filed by Knuedge Incorporated
Priority to EP16706703.2A (patent EP3254282A1)
Priority to CN201680017664.6A (patent CN107430850A)
Publication of WO2016126753A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • a harmonic signal may have a fundamental frequency and one or more overtones. Harmonic signals include, for example, speech and music.
  • a harmonic signal may have a fundamental frequency, which may be referred to as the first harmonic.
  • a harmonic signal may include other harmonics that may occur at multiples of the first harmonic. For example, if the fundamental frequency is f at a certain time, then the other harmonics may have frequencies of 2f, 3f, and so forth.
  • the fundamental frequency of a harmonic signal may change over time. For example, when a person is speaking, the fundamental frequency of the speech may increase at the end of a question. A change in the frequency of a signal may be referred to as a chirp rate.
  • the chirp rate of a harmonic signal may be different for different harmonics. For example, if the first harmonic has a chirp rate of c, then the other harmonics may have chirp rates of 2c, 3c, and so forth.
  • inventive features may include:
  • a computer-implemented method for estimating pitch comprising: obtaining a frequency representation of a first portion of a signal;
  • the plurality of frequency portions comprising a first frequency portion and a second frequency portion
  • computing the first score comprises computing a likelihood or a log likelihood of each correlation of the plurality of correlations.
  • computing the second pitch estimate comprises performing a golden section search or a gradient descent using the first score.
  • each frequency portion of the plurality of frequency portions is centered at a multiple of the first pitch.
  • a system for estimating features of a harmonic signal comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to: obtain a frequency representation of a first portion of a signal;
  • the plurality of frequency portions comprising a first frequency portion and a second frequency portion
  • the plurality of correlations further comprises (i) a second correlation between the first frequency portion and a reversed version of the second frequency portion, and (ii) a third correlation between the first frequency portion and a reversed version of the first frequency portion.
  • computing the first score comprises computing a Fisher transformation of each correlation of the plurality of correlations.
  • each frequency portion of the plurality of frequency portions is centered at a multiple of the first pitch.
  • the plurality of frequency portions comprising a third frequency portion and a fourth frequency portion
  • One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising:
  • the plurality of frequency portions comprising a first frequency portion and a second frequency portion
  • the plurality of correlations further comprises (i) a second correlation between the first frequency portion and a reversed version of the second frequency portion, and (ii) a third correlation between the first frequency portion and a reversed version of the first frequency portion.
  • the plurality of correlations further comprises (i) a correlation between each pair of the plurality of frequency portions, (ii) a correlation between each pair of the plurality of frequency portions, wherein one of the pair has been reversed, and (iii) a correlation between each frequency portion and a reversed version of itself.
  • inventive features may include:
  • a computer-implemented method for estimating fractional chirp rate comprising:
  • the method further comprises computing a log-likelihood ratio for a plurality of frequencies of the first frequency representation, and wherein the log-likelihood ratio is a ratio of a log-likelihood that a harmonic is present at a frequency and a log-likelihood that a harmonic is not present at the frequency.
  • computing the estimated fractional chirp rate comprises selecting a fractional chirp rate corresponding to a highest score.
  • a system for estimating fractional chirp rate comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to:
  • the one or more computing devices are further configured to compute a log-likelihood ratio for a plurality of frequencies of the first frequency representation, and wherein the log-likelihood ratio is a ratio of a log-likelihood that a harmonic is present at a frequency and a log-likelihood that a harmonic is not present at the frequency.
  • the one or more computing devices are further configured to perform at least one of speech recognition, speaker verification, speaker identification, or signal reconstruction using at least one of the estimated fractional chirp rate or the estimated pitch.
  • the second frequency representation is created by modifying the third frequency representation using the second fractional chirp rate.
  • inventive features may include:
  • estimating the pitch of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances.
  • estimating the pitch of the first portion of the signal comprises estimating the pitch using the histogram.
  • a system for estimating pitch comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to:
  • the one or more computing devices are further configured to compute a histogram using the plurality of peak-to-peak distances, and estimate the pitch of the first portion of the signal using the histogram.
  • the one or more computing devices are further configured to compute the first frequency representation using a first smoothing kernel.
  • the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.
  • One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising:
  • estimating the pitch of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances.
  • inventive features may include:
  • computing the estimated fractional chirp rate comprises computing a plurality of scores, wherein the plurality of scores comprise a first score and a second score, the first score is computed using a first fractional chirp rate, the second score is computed using a second fractional chirp rate, and the estimated fractional chirp rate is computed by selecting a highest score.
  • a system for estimating features of a harmonic signal comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to:
  • One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising:
  • Fig. 1 illustrates examples of harmonic signals with different fractional chirp rates.
  • Fig. 4 illustrates a representation of a harmonic signal over frequency and fractional chirp rate.
  • Fig. 5 illustrates two examples of a generalized spectrum of a signal.
  • Fig. 6 illustrates a pitch velocity transform of a speech signal.
  • Fig. 7 illustrates two examples of generalized spectra of a speech signal.
  • Fig. 9A illustrates peak-to-peak distances for a single threshold in an LLR spectrum of a speech signal.
  • Fig. 9B illustrates peak-to-peak distances for multiple thresholds in an LLR spectrum of a speech signal.
  • Fig. 10A illustrates frequency portions of a frequency representation of a speech signal for a first pitch estimate.
  • Fig. 11 is a flowchart showing an example implementation of computing features from a signal.
  • Fig. 14 is a flowchart showing an example implementation of estimating a pitch of a signal using correlations.
  • Fig. 15 is an exemplary computing device that may be used to estimate features of signals.
  • the properties of a harmonic signal may be determined at regular intervals, such as every 10 milliseconds. These properties may be used for processing speech or other signals, for example, as features for performing automatic speech recognition or speaker verification or identification. These properties may also be used to perform a signal reconstruction to reduce the noise level of the harmonic signal.
  • the pitch of a harmonic signal may change over time.
  • the pitch of a voice or the note of a musical instrument may change over time.
  • each of the harmonics will have a chirp rate, and the chirp rate of each harmonic may be different.
  • the rate of change of the pitch may be referred to as pitch velocity or described by a fractional chirp rate.
  • a window (e.g., a Gaussian, Hamming, or Hann window)
  • Fig. 1 illustrates examples of four harmonic signals with different fractional chirp rates as a function of time and frequency.
  • Fig. 1 does not represent actual signals but provides a conceptual illustration of how chirplets (Gaussian signals with a specified time, frequency, chirp rate, and duration) would appear in a time-frequency representation, such as a spectrogram.
  • Harmonic signal 110 is centered at a time t1 and has four harmonics.
  • the first harmonic has a frequency of f
  • the second, third, and fourth harmonics have frequencies of 2f, 3f, and 4f, respectively.
  • Each of the harmonics has a chirp rate of 0 since the frequency of the harmonics is not changing over time. Accordingly, the fractional chirp rate of harmonic signal 110 is 0.
  • Harmonic signal 120 is centered at time t2 and has four harmonics.
  • the first harmonic has a frequency of 2f
  • the second, third, and fourth harmonics have frequencies of 4f, 6f, and 8f, respectively.
  • the first harmonic has a chirp rate of c that is positive since the frequency is increasing over time.
  • the second, third, and fourth harmonics have chirp rates of 2c, 3c, and 4c, respectively. Accordingly, the fractional chirp rate of harmonic signal 120 is c/2f.
  • Harmonic signal 130 is centered at time t3 and has four harmonics.
  • the first harmonic has a frequency of f
  • the second, third, and fourth harmonics have frequencies of 2f, 3f, and 4f, respectively.
  • the first harmonic also has a chirp rate of c
  • the second, third, and fourth harmonics have chirp rates of 2c, 3c, and 4c, respectively. Accordingly, the fractional chirp rate of harmonic signal 130 is c/f, which is twice that of harmonic signal 120.
  • Harmonic signal 140 is centered at time t4 and has four harmonics.
  • the first harmonic has a frequency of f, and the second, third, and fourth harmonics have frequencies of 2f, 3f, and 4f, respectively.
  • the first harmonic has a chirp rate of 2c as the rate of change of frequency is double that of harmonic signal 130.
  • the second, third, and fourth harmonics have chirp rates of 4c, 6c, and 8c, respectively. Accordingly, the fractional chirp rate of harmonic signal 140 is 2c/f, which is twice that of harmonic signal 130.
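The relationship described for Fig. 1 can be checked with a short sketch. It assumes that the fractional chirp rate of a harmonic is its chirp rate divided by its frequency, as the values quoted above imply. The numeric values of `f` and `c` and all names are hypothetical and chosen only for illustration.

```python
def fractional_chirp_rate(frequency, chirp_rate):
    """Chirp rate divided by frequency (assumed definition)."""
    return chirp_rate / frequency


f, c = 100.0, 50.0  # hypothetical values: Hz and Hz per second

# (frequency, chirp rate) of the four harmonics of each signal in Fig. 1
signals = {
    110: [(k * f, 0.0) for k in range(1, 5)],
    120: [(2 * k * f, k * c) for k in range(1, 5)],
    130: [(k * f, k * c) for k in range(1, 5)],
    140: [(k * f, 2 * k * c) for k in range(1, 5)],
}

# Fractional chirp rate of every harmonic, grouped by signal
rates = {
    name: [fractional_chirp_rate(freq, chirp) for freq, chirp in harmonics]
    for name, harmonics in signals.items()
}
```

Within each signal every harmonic yields the same value, which is why a single fractional chirp rate can describe the whole harmonic signal even though the chirp rates of the individual harmonics differ.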
  • Fig. 3 illustrates examples of four harmonic signals as a function of frequency and chirp rate, which will be referred to herein as a frequency-chirp distribution or representation.
  • Fig. 3 does not represent actual signals but provides a conceptual illustration of how the harmonic signals of Fig. 1 would appear in a representation of frequency and chirp rate. In computing a frequency-chirp
  • the frequency-chirp distribution may represent an entire signal and not a portion of the signal at a particular time.
  • Fig. 3 may be constructed conceptually by reviewing the frequency and chirp rate of the harmonics of the harmonic signals of Fig. 1.
  • each of the chirp rates is 0, and the frequencies of the four harmonics are f, 2f, 3f, and 4f, respectively.
  • the four harmonics of harmonic signal 110 are represented in these locations in Fig. 3.
  • the harmonics of harmonic signals 120, 130, and 140 are represented in Fig. 3 according to their respective frequencies and chirp rates from Fig. 1.
  • a frequency-chirp distribution may be computed using techniques similar to computing a time-frequency distribution, such as a spectrogram. For example, in some implementations, a frequency-chirp distribution may be computed using an inner product. Let FC(f, c) represent a frequency-chirp distribution where f corresponds to a frequency variable and c corresponds to a chirp rate variable. A frequency-chirp rate distribution may be computed using inner products as
  • $FC(f, c) = \langle x, \psi(f, c) \rangle$
  • $\psi(f, c)$ is a function parameterized by frequency $f$ and chirp rate $c$.
  • $\psi(f, c)$ may represent a chirplet, such as
  • $\sigma$ corresponds to a duration or spread of the chirplet and $t_0$ is a location of the chirplet in time.
  • $FC(f, c)$ for multiple values of $f$ and $c$.
  • a frequency-chirp distribution is not limited to the above example, and may be computed in other ways. For example, a frequency-chirp distribution may be computed as the real part, imaginary part, magnitude, or magnitude squared of an inner product, may be computed using measures of similarity other than an inner product, or may be computed using non-linear functions of the signal.
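A minimal sketch of the inner-product computation follows. The exact chirplet expression is not reproduced in the text above, so the Gaussian chirplet used here is an assumed, commonly used form. The function names, the units (Hz for frequency, Hz per second for chirp rate), and the factor of 2π in the phase are illustrative choices, not the patent's definition.

```python
import numpy as np


def chirplet(t, f, c, t0, sigma):
    """Gaussian chirplet centered at time t0 (assumed form).

    f is in Hz and c in Hz per second; sigma is the spread in seconds.
    """
    tau = t - t0
    envelope = np.exp(-0.5 * (tau / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
    phase = 2 * np.pi * (f * tau + 0.5 * c * tau ** 2)
    return envelope * np.exp(1j * phase)


def frequency_chirp(x, t, freqs, chirp_rates, t0, sigma):
    """Return FC[i, j] = <x, psi(freqs[j], chirp_rates[i])>."""
    fc = np.empty((len(chirp_rates), len(freqs)), dtype=complex)
    for i, c in enumerate(chirp_rates):
        for j, f in enumerate(freqs):
            # np.vdot conjugates its first argument: sum(x * conj(psi)).
            fc[i, j] = np.vdot(chirplet(t, f, c, t0, sigma), x)
    return fc
```

For a signal whose frequency sweeps linearly through `f` at time `t0` with chirp rate `c`, the magnitude of `FC` is expected to be largest near the matching `(f, c)` pair, which is the behavior Fig. 3 illustrates conceptually.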
  • Harmonic signal 110 has a fractional chirp rate of 0
  • harmonic signal 120 has a fractional chirp rate of c/2f
  • harmonic signal 130 has a fractional chirp rate of c/f
  • harmonic signal 140 has a fractional chirp rate of 2c/f.
  • the dashed and dotted lines in Fig. 3 thus indicate lines of constant fractional chirp rate.
  • a harmonic centered on the dash-dotted line will have a fractional chirp rate of c/2f
  • a harmonic centered on the dotted line will have a fractional chirp rate of c/f
  • a harmonic centered on the dashed line will have a fractional chirp rate of 2c/f.
  • any radial line in Fig. 3 corresponds to a constant fractional chirp rate.
  • PVT pitch-velocity transform
  • a PVT may be denoted as $P(f, \chi)$, where $f$ corresponds to a frequency variable and $\chi$ corresponds to a fractional chirp rate variable.
  • Fig. 4 shows a conceptual example of a PVT created from the frequency-chirp distribution of Fig. 3. Because each harmonic of a harmonic signal has the same fractional chirp rate, they are aligned horizontally as shown in Fig. 4.
  • a PVT may be computed from a frequency-chirp distribution.
  • a PVT may be computed as
  • a PVT may also be computed using techniques similar to computing a time-frequency distribution, such as a spectrogram. For example, in some
  • a PVT may be computed using an inner product.
  • a frequency-chirp rate distribution may be computed as
  • $\psi()$ is a function as described above.
  • $\psi()$ such as a chirplet
  • $P(f, \chi)$ for multiple values of $f$ and $\chi$.
  • a PVT is not limited to the above example, and a PVT may be computed in other ways.
  • a PVT may be computed as the real part, imaginary part, magnitude, or magnitude squared of an inner product, may be computed using measures of similarity other than an inner product, or may be computed using non-linear functions of the signal.
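One way to sketch a PVT computed with inner products is to evaluate a chirplet whose chirp rate is the product of the fractional chirp rate and the frequency. The Gaussian chirplet form, the units, and the function names below are assumptions made for illustration; they are not taken from the patent text.

```python
import numpy as np


def chirplet(t, f, c, t0, sigma):
    """Gaussian chirplet (assumed form); f in Hz, c in Hz per second."""
    tau = t - t0
    envelope = np.exp(-0.5 * (tau / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
    return envelope * np.exp(2j * np.pi * (f * tau + 0.5 * c * tau ** 2))


def pitch_velocity_transform(x, t, freqs, chis, t0, sigma):
    """Return P[i, j] = <x, psi(freqs[j], chis[i] * freqs[j])>.

    Each row corresponds to one fractional chirp rate chi (in 1/s), so the
    chirp rate of the chirplet grows in proportion to its frequency.
    """
    pvt = np.empty((len(chis), len(freqs)), dtype=complex)
    for i, chi in enumerate(chis):
        for j, f in enumerate(freqs):
            pvt[i, j] = np.vdot(chirplet(t, f, chi * f, t0, sigma), x)
    return pvt
```

Because every harmonic of a harmonic signal shares one fractional chirp rate, a single row of this array matches all of the harmonics at once, which is the horizontal alignment shown conceptually in Fig. 4.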
  • the PVT for a specified value of a fractional chirp rate is a function of frequency and may be considered to be a spectrum or a generalized spectrum of the signal. Accordingly, for each value of a fractional chirp rate, a generalized spectrum may be determined from the PVT that is associated with a particular fractional chirp rate.
  • the generalized spectra may be referred to as $X_\chi(f)$. As described below, these generalized spectra need not be computed from a PVT and may be computed in other ways.
  • the PVT for a specified fractional chirp rate corresponds to a slice of the PVT, which will be referred to herein as a row of the PVT (if the PVT was presented in a different orientation, this could also be referred to as a column and the orientation of the PVT is not a limiting feature of the techniques described herein).
  • a chirplet will be used for the function $\psi()$ in the following discussion, but any appropriate function may be used for $\psi()$.
  • the PVT corresponds to an inner product of the signal with a Gaussian where the chirp rate of the Gaussian increases as the frequency of the Gaussian increases.
  • the chirp rate may be the product of the fractional chirp rate and the frequency.
  • the PVT may have an effect similar to slowing down or reducing the fractional chirp rate of the signal (or conversely, speeding up or increasing the fractional chirp rate of the signal).
  • each row of the PVT corresponds to a generalized spectrum where the fractional chirp rate of the signal has been modified by a value corresponding to the row of the PVT.
  • the generalized spectrum may correspond to removing the fractional chirp rate of the signal and the generalized spectrum for this value of the fractional chirp rate may be referred to as a stationary spectrum of the signal or a best row of the PVT.
  • the four peaks (521, 522, 523, 524) illustrate a generalized spectrum for a fractional chirp rate that is different from the fractional chirp rate of the signal.
  • the peaks may be shorter and wider.
  • Fig. 6 illustrates a PVT of the signal from Fig. 2 at approximately 0.21 seconds.
  • the signal has a pitch of approximately 230 Hz and a fractional chirp rate of approximately 4.
  • the PVT shows features of the signal for each of the harmonics. For example, the PVT shows the first harmonic at approximately 230 Hz on the frequency axis and 4 on the fractional chirp rate axis. Similarly, the PVT shows the second harmonic at approximately 460 Hz on the frequency axis and 4 on the fractional chirp rate axis, and so forth. At frequencies between the harmonics, the PVT has lower values because the signal energy is lower in these regions. At fractional chirp rates different from 4, the PVT has lower values because the fractional chirp rate of the PVT does not match the fractional chirp rate of the signal.
  • Fig. 7 illustrates two generalized spectra corresponding to rows of the PVT of Fig. 6.
  • the solid line corresponds to a generalized spectrum where the fractional chirp rate matches the fractional chirp rate of the signal (a fractional chirp rate of about 4) or the stationary spectrum.
  • the dashed line corresponds to a generalized spectrum with a fractional chirp rate of zero, which will be referred to as the zero generalized spectrum (and may correspond to a short-time Fourier transform of the signal).
  • the peaks of the stationary spectrum are higher and narrower than the peaks of the zero generalized spectrum.
  • the peak 711 of the stationary spectrum is about twice the height and one-third the width of peak 721 of the zero generalized spectrum.
  • the difference between the peak 712 of the stationary spectrum and peak 722 of the zero generalized spectrum is even greater.
  • the peak 713 of the stationary spectrum is clearly visible, but the peak of the zero generalized spectrum is not visible.
  • the fractional chirp rate of a signal may be estimated from the PVT using the following:
  • is an estimate of the fractional chirp rate.
  • the function g() may be computed for multiple rows of the PVT, and the row producing the highest value of g() may be selected as corresponding to an estimated fractional chirp rate of the signal.
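The row-selection step can be sketched as follows. The text above does not state which function g() is used, so the default here (a sum of fourth powers of the magnitude, which favors tall, narrow peaks) is purely an assumption, as are the function and parameter names.

```python
import numpy as np


def peakiness(row):
    """Hypothetical choice of g(): rewards rows with tall, narrow peaks."""
    return float(np.sum(np.abs(row) ** 4))


def estimate_fractional_chirp_rate(pvt, chi_values, g=peakiness):
    """Score each row of the PVT with g() and return the best chi and all scores."""
    scores = np.array([g(row) for row in pvt])
    return chi_values[int(np.argmax(scores))], scores
```

Any other scoring function can be passed as `g`; the selection logic stays the same.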
  • the estimate of the fractional chirp rate may also be computed from a frequency chirp distribution, such as the frequency chirp distribution described above:
  • each of the PVT, the frequency chirp rate distribution, and the generalized spectrum may be computed using a variety of techniques.
  • these quantities may be determined by computing an inner product of a signal with a chirplet, but the techniques described herein are not limited to that particular implementation.
  • functions other than chirplets may be used and measures of similarity other than an inner product may be used.
  • a generalized spectrum may be modified before being used to determine the fractional chirp rate of the signal.
  • a log likelihood ratio (LLR) spectrum may be computed from the generalized spectrum, and the LLR spectrum may be denoted as $LLR_\chi(f)$.
  • An LLR spectrum may use hypothesis testing techniques to improve a determination of whether a harmonic is present at a frequency of a spectrum. For example, to determine whether a harmonic is present at the frequencies of the stationary spectrum shown in Fig. 7, one could compare the value of the spectrum to a threshold. Using an LLR spectrum may improve this determination.
  • An LLR spectrum may be computed using a log likelihood ratio of two hypotheses: (1) a harmonic is present at a frequency of the signal, and (2) a harmonic is not present at a frequency of the signal. For each of the two hypotheses, a likelihood may be computed. The two likelihoods may be compared to determine whether a harmonic is present, such as by computing a ratio of the logs of the two likelihoods.
  • the log likelihood for a harmonic being present at a frequency of the signal may be computed by fitting a Gaussian to the signal spectrum at the frequency and then computing a residual sum of squares between the Gaussian and the signal.
  • the Gaussian may be centered at the frequency, and then an amplitude of the Gaussian may be computed using any suitable techniques for estimating these parameters.
  • a spread in frequency or duration of the Gaussian may match a window used to compute signal spectrum or the spread of the Gaussian may also be determined during the fitting process. For example, when fitting a Gaussian to peak 711 of the stationary spectrum in Fig. 7, the amplitude of the Gaussian may be
  • $\sigma_{\text{noise}}^2$ is an estimated noise variance
  • $X$ is a spectrum
  • $h$ is a Hermitian transpose
  • $G_f$ is a best fitting Gaussian to the spectrum at frequency $f$
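The following is a minimal sketch of an LLR spectrum under the assumption of additive Gaussian noise, in which the log likelihood ratio reduces to a difference of residual sums of squares scaled by the noise variance. The function name, the fixed Gaussian width, and the least-squares amplitude fit are illustrative assumptions and may differ from the expression used in the patent.

```python
import numpy as np


def llr_spectrum(X, freqs, width, noise_var):
    """Log likelihood ratio of 'harmonic present' vs 'harmonic absent' per frequency.

    Under the 'absent' hypothesis the residual is X itself; under the
    'present' hypothesis it is X minus the best fitting Gaussian.
    """
    X = np.asarray(X)
    energy = np.vdot(X, X).real  # X^h X
    llr = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        g = np.exp(-0.5 * ((freqs - f) / width) ** 2)  # Gaussian centered at f
        amp = np.vdot(g, X) / np.vdot(g, g).real       # least-squares amplitude
        resid = X - amp * g
        llr[k] = (energy - np.vdot(resid, resid).real) / (2.0 * noise_var)
    return llr
```

With this formulation the value is large where a Gaussian explains much of the spectrum and close to zero where it explains almost none, which is what makes it usable with the thresholds discussed for the coarse pitch estimate.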
  • the estimate of the fractional chirp rate may also be computed using the LLR spectrum:
  • $X_\chi(f)$ or its magnitude, or real or imaginary parts may be used in place of
  • fractional chirp rate of a signal may be determined using any combinations of the above techniques or any similar techniques known to one of skill in the art.
  • the pitch estimate may be different from the true pitch by an octave, which may be referred to as an octave error.
  • For example, if the true pitch is 300 Hz, the pitch estimate may be 150 Hz or 600 Hz.
  • a two-step approach may be used to estimate pitch. First, a coarse pitch estimate may be determined to obtain an estimate that may be less accurate but less susceptible to octave errors, and second, a precise pitch estimate may be used to refine the coarse pitch estimate.
  • LLR spectrum (corresponding to the estimate of the fractional chirp rate).
  • the LLR spectrum will be used as an example spectrum, but the techniques described herein are not limited to the LLR spectrum and any appropriate spectrum may be used.
  • peaks may be selected from the LLR spectrum using thresholds. For example, a standard deviation (or variance) of the noise in the spectrum may be determined and a threshold may be computed or selected using the standard deviation of the noise, such as setting the threshold to a multiple or fraction of the standard deviation (e.g., set a threshold to twice the standard deviation of the noise).
  • peak-to-peak distances may be determined. For example, Fig. 9A shows peak-to-peak distances for a threshold of approximately 0.3. At this threshold, the first 5 peak-to-peak distances are about 230 Hz, the sixth is about 460 Hz, the seventh and eighth are about 230 Hz, and the ninth is about 690 Hz.
  • multiple thresholds may be used, as illustrated in Fig. 9B.
  • thresholds may be selected using the heights of the peaks in the LLR spectrum, such as the ten highest peaks or all peaks above a second threshold (e.g., above twice the standard deviation of the noise). Peak-to-peak distances may be computed for each of the thresholds.
  • peak-to-peak distance 901 is determined using the tallest peak as a threshold
  • peak-to-peak distances 911 and 912 are determined using the second tallest peak as a threshold
  • peak-to-peak distances 921, 922, and 923 are determined using the third tallest peak as a threshold, and so forth.
  • a most frequently occurring peak-to-peak distance may be selected as the coarse pitch estimate, for example, by using a histogram.
  • peak-to-peak distances may be computed for multiple time frames for determining a coarse pitch estimate. For example, to determine a coarse pitch estimate for a particular frame, peak-to-peak distances may be computed for the current frame, five previous frames, and five subsequent frames. The peak-to-peak distances for all of the frames may be pooled together in determining a coarse pitch estimate, such as computing a histogram for all of the peak-to-peak distances.
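The histogram-based coarse estimate can be sketched as below. The peak picker, the histogram bin width, and the upper pitch limit are assumptions chosen for illustration, and the function names are hypothetical. Distances from several frames could be pooled by extending the same list.

```python
import numpy as np


def peak_frequencies(spectrum, freqs, threshold):
    """Frequencies of local maxima of the spectrum that exceed the threshold."""
    s = np.asarray(spectrum)
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]) & (s[1:-1] > threshold)
    return freqs[1:-1][is_peak]


def coarse_pitch(spectrum, freqs, thresholds, bin_width=10.0, max_pitch=1000.0):
    """Most frequent peak-to-peak distance, pooled over all thresholds."""
    distances = []
    for threshold in thresholds:
        distances.extend(np.diff(peak_frequencies(spectrum, freqs, threshold)))
    if not distances:
        return None
    bins = np.arange(0.0, max_pitch + bin_width, bin_width)
    counts, edges = np.histogram(distances, bins=bins)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])  # center of the fullest bin
```

A missing or weak harmonic produces a distance of two or three times the pitch, but such distances are outvoted in the histogram, which is why this estimate tends to be less susceptible to octave errors even though its resolution is limited by the bin width.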
  • a probability density function (PDF) may be estimated from the cumulative distribution function (CDF) by computing a derivative of the CDF, and any appropriate techniques may be used for computing the derivative.
  • the coarse pitch estimate may then be determined as the pitch value corresponding to the peak of the PDF.
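A sketch of the CDF-based variant follows. It assumes an empirical CDF evaluated on a regular grid, Gaussian smoothing before differentiation, and a central-difference derivative. These choices, the smoothing width, and the function name are illustrative assumptions, since the text allows any appropriate technique.

```python
import numpy as np


def pitch_from_cdf(distances, grid, smooth_hz=5.0):
    """Estimate the pitch as the peak of the derivative of a smoothed CDF."""
    d = np.sort(np.asarray(distances, dtype=float))
    # Empirical CDF: fraction of distances less than or equal to each grid value
    cdf = np.searchsorted(d, grid, side="right") / len(d)

    # Smooth the step-shaped CDF with a normalized Gaussian kernel
    step = grid[1] - grid[0]
    half = int(round(4 * smooth_hz / step))
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) * step / smooth_hz) ** 2)
    kernel /= kernel.sum()
    smooth = np.convolve(np.pad(cdf, half, mode="edge"), kernel, mode="valid")

    pdf = np.gradient(smooth, step)  # derivative of the CDF
    return grid[int(np.argmax(pdf))]
```

Smoothing is needed because the raw empirical CDF is a staircase whose derivative would consist of isolated spikes.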
  • multiple preliminary coarse pitch estimates may be determined, and an actual coarse pitch estimate may be determined using the preliminary pitch estimates. For example, an average of the preliminary coarse pitch estimates or a most common coarse pitch estimate may be selected as the actual coarse pitch estimate. For example, a coarse pitch estimate may be computed for each of a group of threshold values. For high threshold values, the coarse pitch estimate may be too high, and for low threshold values, the coarse pitch estimate may be too low. For thresholds in between, the coarse pitch estimate may be more accurate. To determine an actual coarse pitch estimate, a histogram may be computed of the multiple preliminary coarse pitch estimates, and the actual coarse pitch estimate may correspond to the frequency of the mode of the histogram. In some implementations, outliers may be removed from the histogram to improve the actual coarse pitch estimate.
  • Fig. 10B illustrates portions of a spectrum for a second pitch estimate, where the pitch estimate is slightly lower than the true pitch of the signal.
  • the pitch estimate may be 228 Hz and the actual pitch may be 230 Hz.
  • a portion of the spectrum for each harmonic can be identified using multiples of the pitch estimate.
  • the portion is slightly to the left of the true position of the harmonic and the offset increases as the harmonic number increases.
  • Portion 1020 is about 2 Hz to the left of the true position of the first harmonic
  • portion 1021 is about 4 Hz to the left of the true position of the second harmonic
  • portions 1022-1027 are each increasingly further to the left as the harmonic number increases.
  • portion 1027 is about 16 Hz to the left of the true position of the eighth harmonic.
  • a frequency portion may be compared to a reversed version of itself since the shape of a harmonic is generally symmetric. For an accurate pitch estimate, a harmonic will be centered in a frequency portion, and thus reversing the portion will provide a similar shape. For an inaccurate pitch estimate, the harmonic will not be centered in the frequency portion, and reversing the portion will result in a different shape.
  • a first frequency portion can be compared to a reversed version of a second frequency portion.
  • the frequency portions may have any appropriate width. In some implementations, the frequency portions may partition the spectrum, may overlap adjacent portions, or may have gaps between them (as shown in Figs.
  • the frequency portions used may correspond to any frequency representation, such as a spectrum of a signal or a real part, imaginary part, magnitude, or magnitude squared of a spectrum of a signal.
  • the frequency portions may also be normalized to remove differences that are less relevant to determining pitch. For example, for each frequency portion a mean and a standard deviation may be determined, and the frequency portion may be normalized by subtracting the mean value and then dividing by the standard deviation (e.g., a z-score).
  • Correlations may be used to measure whether two frequency portions have similar shapes and to determine if a harmonic is centered at the expected frequency.
  • the frequency portions for a pitch estimate may be determined as described above, and a correlation may be performed by computing an inner product of two frequency portions.
  • Correlations that may be performed include the following: a correlation of a first frequency portion with a second frequency portion, a correlation of a first frequency portion with a reversed version of itself, and a correlation of a first frequency portion with a reversed version of a second frequency portion.
  • the correlations may have higher values for more accurate pitch estimates and lower values for less accurate pitch estimates.
  • For an accurate pitch estimate, the frequency portions will have a greater similarity to each other and to reversed versions of each other (e.g., each harmonic being centered in a frequency portion), and thus the correlations may be higher.
  • For an inaccurate pitch estimate, the frequency portions will have less similarity to each other and to reversed versions of each other (e.g., each harmonic being off center by an amount corresponding to the harmonic number), and thus the correlations may be lower.
  • Each of the correlations may be computed, for example, by performing an inner product of the two frequency portions (or of a frequency portion with a reversed version of that frequency portion or of another frequency portion).
  • the correlation may also be normalized by dividing by N − 1, where N is the number of samples in each frequency portion.
  • a Pearson product-moment correlation coefficient may be used.
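The normalization and correlation steps described above can be sketched as follows. This is a minimal illustration, assuming NumPy and synthetic Gaussian-shaped portions; the function names `zscore` and `correlation` are hypothetical and do not come from the source.

```python
import numpy as np

def zscore(portion):
    """Subtract the mean and divide by the sample standard deviation."""
    portion = np.asarray(portion, dtype=float)
    return (portion - portion.mean()) / portion.std(ddof=1)

def correlation(a, b, reverse_b=False):
    """Inner product of two z-scored portions, divided by N - 1."""
    a, b = zscore(a), zscore(b)
    if reverse_b:
        b = b[::-1]
    return float(np.dot(a, b) / (len(a) - 1))

# Synthetic portions (assumed shapes): one bump centered, one off center.
bins = np.arange(21)
centered = np.exp(-0.5 * ((bins - 10) / 2.0) ** 2)
off_center = np.exp(-0.5 * ((bins - 14) / 2.0) ** 2)

r_centered = correlation(centered, centered, reverse_b=True)
r_off = correlation(off_center, off_center, reverse_b=True)
```

With z-scored inputs, the inner product divided by N − 1 equals the Pearson product-moment coefficient, so a centered harmonic gives a correlation of 1 with its own reversal and an off-center harmonic gives a lower value.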
  • Some or all of the above correlations may be used to determine a score for an accuracy of a pitch estimate.
  • For example, where eight frequency portions are used, eight correlations may be computed for the correlation of a frequency portion with a reversed version of itself, 28 correlations may be computed between a frequency portion and another frequency portion, and 28 correlations may be computed between a frequency portion and a reversed version of another frequency portion.
  • These correlations may be combined in any appropriate way to get an overall score for the accuracy of a pitch estimate. For example, the correlations may be added or multiplied to get an overall score.
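As a rough sketch of how the three families of correlations might be enumerated and summed, the snippet below uses eight synthetic portions. The helper names and the noisy test data are assumptions made for illustration only.

```python
from itertools import combinations
import numpy as np

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def all_correlations(portions):
    """Return self-reversed, pairwise, and pairwise-reversed correlations."""
    self_rev = [pearson(p, p[::-1]) for p in portions]
    pairs = list(combinations(range(len(portions)), 2))
    cross = [pearson(portions[i], portions[j]) for i, j in pairs]
    cross_rev = [pearson(portions[i], portions[j][::-1]) for i, j in pairs]
    return self_rev, cross, cross_rev

# Eight nearly identical, centered bumps stand in for well-aligned harmonics.
rng = np.random.default_rng(0)
bins = np.arange(21)
bump = np.exp(-0.5 * ((bins - 10) / 2.0) ** 2)
portions = [bump + 0.01 * rng.standard_normal(bins.size) for _ in range(8)]

self_rev, cross, cross_rev = all_correlations(portions)
overall_score = sum(self_rev) + sum(cross) + sum(cross_rev)  # simple additive combination
```

Eight portions yield 8 + 28 + 28 = 64 correlations, since there are 28 unordered pairs among eight items.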
  • the correlations may be combined using the Fisher transformation.
  • the Fisher transformation of an individual correlation, r, may be computed as $F(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{arctanh}(r)$.
  • the Fisher transformation may be approximated as
  • the Fisher transformation of an individual correlation may have a probability density function that is approximately Gaussian with a standard deviation of $1/\sqrt{N-3}$, where N is the number of samples in each portion. Accordingly, using the above approximation, the probability density function of the Fisher transformation of an individual correlation, f(r), may be represented as
  • An overall score may then be computed by computing f(r) for each correlation and multiplying them together. Accordingly, if there are M correlations, then an overall score, S, may be computed as a likelihood $S = \prod_{i=1}^{M} f(r_i)$.
  • the score, S, may be computed as a log likelihood $S = \sum_{i=1}^{M} \log f(r_i)$.
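One possible way to turn Fisher-transformed correlations into a log likelihood score is sketched below. The Gaussian standard deviation follows the text; the center of the Gaussian, controlled here by the parameter `expected_r`, is a hypothetical design choice that is not specified in the source.

```python
import numpy as np

def fisher(r):
    """Fisher transformation, arctanh(r), with r clipped away from +/-1."""
    return np.arctanh(np.clip(r, -0.999999, 0.999999))

def log_likelihood_score(correlations, n_samples, expected_r=0.9):
    """Sum of Gaussian log densities of the Fisher-transformed correlations.

    expected_r is an assumed parameter (not from the source): the correlation
    value on which the Gaussian density is centered.
    """
    sigma = 1.0 / np.sqrt(n_samples - 3)
    z = fisher(np.asarray(correlations, dtype=float))
    mu = fisher(expected_r)
    log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
    return float(np.sum(log_pdf))

good = log_likelihood_score([0.9, 0.85, 0.92], n_samples=21)
poor = log_likelihood_score([0.2, -0.1, 0.4], n_samples=21)
```

Summing log densities rather than multiplying densities avoids numerical underflow when many correlations are combined.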
  • These scores may be used to obtain a precise pitch estimate through an iterative procedure, such as a golden section search or any kind of gradient descent algorithm.
  • the precise pitch estimate may be initialized with the coarse pitch estimate.
  • a score may be computed for the current precise pitch estimate and for other pitch values near the precise pitch estimate. If the score for another pitch value is higher than the score of the current pitch estimate, then the current pitch estimate may be set to that other pitch value. This process may be repeated until an appropriate stopping condition has been reached.
  • the process of determining the precise pitch estimate may be constrained, for example, by requiring the precise pitch estimate to be within a range of the coarse pitch estimate.
  • the range may be determined using any appropriate techniques. For example, the range may be determined from a variance or a confidence interval of the coarse pitch estimate, such as determining a confidence interval of the coarse pitch estimate using bootstrapping techniques. The range may be determined from the confidence interval, such as a multiple of the confidence interval. In determining the precise pitch estimate, the search may be limited so that the precise pitch estimate never goes outside of the specified range.
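A constrained golden section search of the kind mentioned above might look like the following sketch. The score function here is a toy stand-in (a parabola) and the names `golden_section_maximize` and `refine_pitch` are illustrative assumptions; in practice the score would be the combined correlation score.

```python
import math

def golden_section_maximize(score, lo, hi, tol=1e-3, max_iter=100):
    """Maximize a unimodal score on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = score(c), score(d)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if fc > fd:            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = score(c)
        else:                  # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = score(d)
    return (a + b) / 2

def refine_pitch(score, coarse_pitch, search_range):
    """Search only within coarse_pitch +/- search_range."""
    return golden_section_maximize(
        score, coarse_pitch - search_range, coarse_pitch + search_range
    )

# Toy score with a single maximum at 231.7 Hz (assumed for illustration).
toy_score = lambda p: -(p - 231.7) ** 2
precise = refine_pitch(toy_score, coarse_pitch=230.0, search_range=5.0)
```

Because the bracket never leaves the initial interval, the result cannot fall outside the range derived from the coarse estimate.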
  • each of the harmonics may be modeled as a chirplet, where the frequency and chirp rate of the chirplet are set using the estimated pitch and estimated fractional chirp rate.
  • the frequency of the k-th harmonic may be k times the estimated pitch.
  • the chirp rate of the harmonic may be the fractional chirp rate times the frequency of the chirplet. Any appropriate duration may be used for the chirplet.
  • the amplitudes of the harmonics may be estimated using any appropriate techniques, including, for example, maximum likelihood estimation.
  • a vector of harmonic amplitudes, a, may be estimated as $\hat{a} = (M M^{h})^{-1} M x$, where M is a matrix in which each row corresponds to a chirplet for one harmonic with parameters as described above, the number of rows of M corresponds to the number of harmonic amplitudes to be estimated, h denotes the Hermitian transpose, and x is a time series representation of the signal.
  • the estimate of the harmonic amplitudes may be complex valued, and in some implementations, other functions of the amplitudes may be used, such as a magnitude, magnitude squared, real part, or imaginary part.
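The amplitude estimation can be illustrated with a least-squares fit, which coincides with maximum likelihood estimation under an assumption of white Gaussian noise. In this sketch the chirplets are stored as columns of a matrix `C` rather than rows, the Gaussian window width `sigma` and the 50 ms frame are assumed values, and all function names are hypothetical.

```python
import numpy as np

def chirplet(t, freq, chirp_rate, sigma):
    """Gaussian-windowed complex exponential with linearly varying frequency."""
    window = np.exp(-0.5 * (t / sigma) ** 2)
    phase = 2 * np.pi * (freq * t + 0.5 * chirp_rate * t**2)
    return window * np.exp(1j * phase)

def harmonic_amplitudes(x, t, pitch, frac_chirp_rate, n_harmonics, sigma):
    """Least-squares fit of one complex amplitude per harmonic chirplet."""
    # Column k-1 models harmonic k: frequency k * pitch,
    # chirp rate = fractional chirp rate * frequency.
    C = np.stack(
        [chirplet(t, k * pitch, frac_chirp_rate * k * pitch, sigma)
         for k in range(1, n_harmonics + 1)],
        axis=1,
    )
    amps, *_ = np.linalg.lstsq(C, x, rcond=None)
    return amps

fs = 8000.0
t = (np.arange(400) - 200) / fs            # 50 ms frame centered on t = 0
true_amps = np.array([1.0, 0.5 + 0.5j, 0.25])
x = sum(a * chirplet(t, (k + 1) * 200.0, 1.5 * (k + 1) * 200.0, 0.01)
        for k, a in enumerate(true_amps))
est = harmonic_amplitudes(x, t, pitch=200.0, frac_chirp_rate=1.5,
                          n_harmonics=3, sigma=0.01)
```

The estimates are complex; a magnitude, magnitude squared, real part, or imaginary part can be taken afterwards with `np.abs(est)`, `np.abs(est) ** 2`, `est.real`, or `est.imag`.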
  • the amplitudes may have been computed in previous steps and need not be explicitly computed again.
  • the amplitudes may be computed in computing the LLR spectrum.
  • the LLR spectrum is computed by fitting Gaussians to a spectrum, and one fitting parameter of the Gaussian is the amplitude of the Gaussian.
  • the amplitudes of the Gaussians may be saved during the process of computing the LLR spectrum, and these amplitudes may be recalled instead of being recomputed.
  • the amplitudes determined from the LLR spectrum may be a starting point, and the amplitudes may be refined, for example, by using iterative techniques.
  • the above techniques may be carried out for successive portions of a signal to be processed, such as for a frame of the signal every 10 milliseconds. For each portion of the signal that is processed, a fractional chirp rate, pitch, and harmonic amplitudes may be determined. Some or all of the fractional chirp rate, pitch, and harmonic amplitudes may be referred to as HAM (harmonic amplitude matrix) features and a feature vector may be created that comprises the HAM features.
  • the feature vector of HAM features may be used in addition to or in place of any other features that are used for processing harmonic signals.
  • the HAM features may be used in addition to or in place of mel-frequency cepstral coefficients, perceptual linear prediction features, or neural network features.
  • the HAM features may be applied to any application of harmonic signals, including but not limited to performing speech recognition, word spotting, speaker recognition, speaker verification, noise reduction, or signal reconstruction.
  • Figs. 11-14 are flowcharts illustrating example implementations of the processes described above. Note that, for the flowcharts described below, the ordering of the steps is exemplary and that other orders are possible, not all steps are required and, in some implementations, some steps may be omitted or other steps may be added.
  • the processes of the flowcharts may be implemented, for example, by one or more computers, such as the computers described below.
  • Fig. 11 is a flowchart showing an example implementation of computing features for a first portion of a signal.
  • a portion of a signal is obtained.
  • the signal may be any signal for which it may be useful to estimate features, including but not limited to speech signals or music signals.
  • the portion may be any relevant portion of the signal, and the portion may be, for example, a frame of the signal that is extracted on regular intervals, such as every 10 milliseconds.
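Extracting a frame at regular intervals could be done along these lines. The 10 millisecond hop comes from the text, while the 50 millisecond frame length, the Hann window, and the function name `frames` are assumptions for the sake of the example.

```python
import numpy as np

def frames(signal, sample_rate, frame_ms=50.0, hop_ms=10.0):
    """Return a 2-D array with one windowed frame per row, one row every hop_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len).
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hanning(frame_len)

sample_rate = 8000
signal = np.sin(2 * np.pi * 200.0 * np.arange(sample_rate) / sample_rate)  # 1 s tone
portions = frames(signal, sample_rate)
```

Each row of `portions` can then be passed through the chirp rate, pitch, and amplitude estimation steps to produce one feature vector per frame.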
  • a fractional chirp rate of the portion of the signal is estimated.
  • the fractional chirp rate may be estimated using any of the techniques described above. For example, a plurality of possible fractional chirp rates may be identified and a score may be computed for each of the possible fractional chirp rates. A score may be computed using a function, such as any of the functions g() described above.
  • the estimate of the fractional chirp rate may be determined by selecting a fractional chirp rate corresponding to a highest score. In some implementations, a more precise estimate of fractional chirp rate may be determined using iterative procedures, such as by selecting additional possible fractional chirp rates and iterating with a golden section search or a gradient descent.
  • the function g() may take as input any frequency representation of the first portion described above, including but not limited to a spectrum of the first portion, an LLR spectrum of the first portion, a generalized spectrum of the first portion, a frequency chirp distribution of the first portion, or a PVT of the first portion.
  • a frequency representation of the portion of the signal is computed using the estimated fractional chirp rate.
  • the frequency representation may be any representation of the portion of the signal as a function of frequency.
  • the frequency representation may be, for example, a stationary spectrum, a generalized spectrum, an LLR spectrum, or a row of a PVT.
  • the frequency representation may be computed during the processing of step 1120 and need not be a separate step.
  • the frequency representation may be computed during other processing that determines an estimate of the fractional chirp rate.
  • a coarse pitch estimate is computed from the portion of the signal using the frequency representation.
  • the coarse pitch estimate may be determined using any of the techniques described above. For example, peak-to-peak distances may be determined for any of the types of spectra described above and for a variety of parameters, such as different thresholds, different smoothing kernels, and from other portions of the signal. The coarse pitch estimate may then be computed from the peak-to-peak distances using a histogram or any of the other techniques described above.
  • a precise pitch estimate is computed from the portion of the signal using the frequency representation and the coarse pitch estimate. The precise pitch estimate may be initialized with the coarse pitch estimate and then refined with an iterative procedure.
  • For each candidate value of the precise pitch estimate, a score, such as a likelihood or a log likelihood, may be computed, and the precise pitch estimate may be determined by maximizing the score.
  • the score may be determined using combinations of correlations as described above.
  • the score may be maximized using any appropriate procedure, such as a golden section search or a gradient descent.
  • harmonic amplitudes are computed using the estimated fractional chirp rate and the estimated pitch.
  • the harmonic amplitudes may be computed by modeling each harmonic as a chirplet and performing maximum likelihood estimation.
  • Fig. 11 may be repeated for successive portions or time intervals of the signal.
  • a fractional chirp rate, pitch, and harmonic amplitudes may be computed every 10 milliseconds.
  • the fractional chirp rate, pitch, and harmonic amplitudes may be used for a wide variety of applications, including but not limited to pitch tracking, signal reconstruction, speech recognition, and speaker verification or recognition.
  • Fig. 12 is a flowchart showing an example implementation of computing fractional chirp rate of a portion of a signal.
  • a portion of a signal is obtained, as described above.
  • a plurality of frequency representations of the portion of the signal are computed, and the frequency representations may be computed using any of the techniques described above.
  • Each of the frequency representations may correspond to a fractional chirp rate.
  • the frequency representations may be computed (i) from the rows of a PVT, (ii) from radial slices of a frequency-chirp distribution, or (iii) using inner products of the portion of the signal with chirplets where the chirp rate of the chirplet increases with frequency.
  • a score is computed for each of the frequency representations and each score corresponds to a fractional chirp rate.
  • the score may indicate a match between the fractional chirp rate corresponding to the score and the fractional chirp rate of the portion of the signal.
  • the scores may be computed using any of the techniques described above. In some implementations, the scores may be computed using an autocorrelation of the frequency representations, such as an autocorrelation of the magnitude squared of a frequency representation.
  • the score may be computed from the autocorrelation using any of Fisher information, entropy, Kullback-Leibler divergence, sum of squared (or magnitude squared) values of the autocorrelation, or a sum of squared second derivatives of the autocorrelation.
  • a fractional chirp rate of the portion of the signal is estimated.
  • the fractional chirp rate is estimated by selecting a fractional chirp rate corresponding to a highest score.
  • the estimate of the fractional chirp rate may be refined using iterative techniques, such as golden section search or gradient descent. The estimated fractional chirp rate may then be used for further processing of the signal as described above, such as speech recognition or speaker recognition.
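The flow of Fig. 12 can be sketched with option (iii), inner products with chirplets whose chirp rate grows with frequency, scored by the sum of squared autocorrelation values. Everything below is an illustrative assumption: the Hann window, the 10 Hz frequency grid, the candidate list, the synthetic test signal, and the function names.

```python
import numpy as np

def generalized_spectrum(x, t, freqs, chi):
    """Inner products of x with chirplets whose chirp rate is chi times frequency."""
    warped = t + 0.5 * chi * t**2              # chirplet phase is 2*pi*f*warped
    kernel = np.exp(-2j * np.pi * np.outer(freqs, warped))
    return kernel @ (x * np.hanning(len(x)))

def autocorr_score(spectrum):
    """Sum of squared autocorrelation values of the normalized power spectrum."""
    power = np.abs(spectrum) ** 2
    power = power / power.sum()
    acf = np.correlate(power, power, mode="full")
    return float(np.sum(acf**2))

def estimate_fractional_chirp_rate(x, t, freqs, candidates):
    """Score every candidate and return the one with the highest score."""
    scores = [autocorr_score(generalized_spectrum(x, t, freqs, chi))
              for chi in candidates]
    return candidates[int(np.argmax(scores))], scores

# Synthetic harmonic signal whose pitch rises with fractional chirp rate 2.0 per second.
fs = 8000.0
t = (np.arange(400) - 200) / fs
true_chi, pitch = 2.0, 200.0
x = sum(np.cos(2 * np.pi * k * pitch * (t + 0.5 * true_chi * t**2))
        for k in range(1, 16))

freqs = np.arange(0.0, 4000.0, 10.0)
candidates = [-4.0, -2.0, 0.0, 2.0, 4.0]
chi_hat, scores = estimate_fractional_chirp_rate(x, t, freqs, candidates)
```

When the candidate matches the signal, every harmonic collapses to a narrow peak, which concentrates the autocorrelation and raises the score; a finer estimate could follow by searching between the neighbors of the best candidate.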
  • Fig. 13 is a flowchart showing an example implementation of computing a pitch estimate of a portion of a signal.
  • a first portion of a signal is obtained, as described above, and at step 1320, a frequency representation of the first portion of the signal is computed, using any of the techniques described above.
  • a threshold is selected using any of the techniques described above.
  • a threshold may be selected using a signal to noise ratio or may be selected using a height of a peak in the frequency representation of the first portion of the signal.
  • a plurality of peaks in the frequency representation of the first portion of the signal are identified.
  • the peaks may be identified using any appropriate techniques. For example, the values of the frequency representation may be compared to the threshold to identify a continuous portion of the frequency representation (each a frequency portion) that is always above the threshold.
  • the peak may be identified, for example, by selecting a highest point of the frequency portion, selecting the mid-point between the beginning of the portion and the end of the frequency portion, or fitting a curve (such as a Gaussian) to the frequency portion and selecting the peak using the fit.
  • the frequency representation may accordingly be processed to identify frequency portions that are above the threshold and identify a peak for each frequency portion.
  • a plurality of peak-to-peak distances in the frequency representation of the first portion of the signal are computed.
  • Each of the peaks may be associated with a frequency value that corresponds to the peak.
  • the peak-to-peak distances may be computed as the difference in frequency values of adjacent peaks. For example, if peaks are present at 230 Hz, 690 Hz, 920 Hz, and 1840 Hz (e.g., similar to 931, 932, 933, and 934 of Fig. 9B), then the peak-to-peak distances may be 460 Hz, 230 Hz, and 920 Hz.
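The thresholding, peak selection, and distance computation can be sketched as below, using the highest point of each above-threshold run as the peak. The synthetic spectrum, the threshold value, and the function name `find_peaks_above` are assumptions for illustration.

```python
import numpy as np

def find_peaks_above(freqs, values, threshold):
    """Return the frequency of the highest point of each contiguous run above threshold."""
    above = values > threshold
    peaks, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                          # a new frequency portion begins
        if start is not None and (not flag or i == len(above) - 1):
            end = i if not flag else i + 1     # portion covers [start, end)
            peaks.append(freqs[start + int(np.argmax(values[start:end]))])
            start = None
    return np.array(peaks)

# Synthetic frequency representation with bumps at the frequencies used in the text.
freqs = np.arange(0.0, 2000.0, 5.0)
centers = [230.0, 690.0, 920.0, 1840.0]
values = sum(np.exp(-0.5 * ((freqs - c) / 15.0) ** 2) for c in centers)

peaks = find_peaks_above(freqs, values, threshold=0.5)
distances = np.diff(peaks)   # differences between adjacent peak frequencies
```

For these four peaks the distances come out as 460 Hz, 230 Hz, and 920 Hz, matching the worked example in the text.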
  • Steps 1330, 1340, and 1350 may be repeated for other thresholds, changes to other settings with the same threshold, or changes to other settings with other thresholds.
  • multiple thresholds may be selected using the heights of multiple peaks in the frequency representation
  • the same threshold or other thresholds may be used with a second frequency representation corresponding to a second portion of the signal (e.g., where the second portion is immediately before or immediately after the first portion), and the same or other thresholds may be used with different smoothing kernels.
  • a histogram of peak-to-peak distances is computed.
  • the histogram may use some or all of the peak-to-peak distances described above. Any appropriate bin width may be used, such as a bin width of 2-5 Hz.
  • a pitch estimate is determined using the histogram of peak-to-peak distances.
  • the pitch estimate may correspond to the mode of the histogram.
  • multiple histograms may be used to determine the pitch estimate. For example, a plurality of histograms may be computed for a plurality of thresholds (or a plurality of thresholds in combination with other parameters, such as time instances or smoothing kernels), and a preliminary pitch estimate may be determined for each of the plurality of histograms. The final pitch estimate may be determined from the plurality of preliminary pitch estimates, for example, by selecting the most common preliminary pitch estimate.
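A histogram-based pitch estimate, combined over several thresholds by majority vote, might be sketched like this. The distance values are made-up inputs standing in for the output of the peak-finding step, and the 5 Hz bin width is one value from the range given in the text.

```python
from collections import Counter
import numpy as np

def pitch_from_histogram(distances, bin_width=5.0):
    """Return the center of the most populated histogram bin (the mode)."""
    distances = np.asarray(distances, dtype=float)
    edges = np.arange(0.0, distances.max() + 2 * bin_width, bin_width)
    counts, edges = np.histogram(distances, bins=edges)
    best = int(np.argmax(counts))
    return 0.5 * (edges[best] + edges[best + 1])

# Hypothetical peak-to-peak distances (Hz) gathered with three different thresholds.
distance_sets = [
    [231.0, 229.0, 462.0, 230.0],
    [230.0, 460.0, 232.0, 231.0],
    [460.0, 462.0, 231.0],
]
preliminary = [pitch_from_histogram(d) for d in distance_sets]
final_pitch = Counter(preliminary).most_common(1)[0][0]   # most common preliminary estimate
```

In this example the third threshold misses alternate harmonics and votes for roughly twice the pitch, but the majority vote across thresholds still selects the bin around 230 Hz.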
  • Fig. 14 is a flowchart showing an example implementation of computing a pitch estimate of a portion of a signal.
  • a frequency representation of a portion of a signal is obtained, as described above.
  • a pitch estimate of the portion of the signal is obtained.
  • the obtained pitch estimate may have been computed using any technique for estimating pitch, including but not limited to the coarse pitch estimation techniques described above.
  • the obtained pitch estimate may be considered an initial pitch estimate to be updated or may be considered a running pitch estimate that is updated through an iterative procedure.
  • a plurality of frequency portions of the frequency representation is obtained.
  • Each of the frequency portions may be centered at a multiple of the pitch estimate. For example, a first frequency portion may be centered at the pitch estimate, a second frequency portion may be centered at twice the pitch estimate, and so forth. Any appropriate widths may be used for the frequency portions.
  • the frequency portions may partition the frequency representation, may overlap, or have spaces between them.
  • a plurality of correlations is computed using the plurality of frequency portions of the frequency representation.
  • the frequency portions may be further processed before computing the correlations.
  • each frequency portion may be extracted from the frequency representation and stored in a vector of length N, where the beginning of the vector corresponds to the beginning of the frequency portion and the end of the vector corresponds to the end of the frequency portion.
  • the frequency portions may be shifted by sub-sample amounts so that the frequency portions line up accurately.
  • the pitch estimate may lie between frequency bins of the frequency representation (e.g., a pitch estimate of 230 Hz may lie between frequency bin 37 and frequency bin 38 with an approximate location of 37.3).
  • the beginning, center, and end of the frequency portions may be defined by fractional sample values.
  • the frequency portions may be shifted by subsample amounts so that one or more of the beginning, center, and end of the frequency portions corresponds to an integer sample of the frequency representation.
  • the frequency portions may also be normalized by subtracting a mean and dividing by a standard deviation of the frequency portion.
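Extracting portions centered on fractional bin positions can be approximated with interpolation, as in this sketch. Linear interpolation via `np.interp` is an assumed choice (the source does not name an interpolation method), and the bin spacing of 6.17 Hz is a hypothetical value picked so that 230 Hz falls near bin 37.3.

```python
import numpy as np

def extract_portion(spectrum, center_bin, half_width, n_samples=21):
    """Sample a portion centered on a fractional bin, then z-score it."""
    positions = center_bin + np.linspace(-half_width, half_width, n_samples)
    portion = np.interp(positions, np.arange(len(spectrum)), spectrum)
    return (portion - portion.mean()) / portion.std(ddof=1)

bin_hz = 6.17                       # assumed bin spacing
pitch_hz = 230.0
bins = np.arange(400)
# Synthetic magnitude spectrum with eight harmonics of the pitch.
spectrum = sum(np.exp(-0.5 * ((bins * bin_hz - k * pitch_hz) / 20.0) ** 2)
               for k in range(1, 9))

portions = [extract_portion(spectrum, k * pitch_hz / bin_hz, half_width=10.0)
            for k in range(1, 9)]
# Correlation of each portion with its own reversal (inner product / (N - 1)).
r_self = [float(np.dot(p, p[::-1]) / (len(p) - 1)) for p in portions]
```

Because every portion is resampled on a grid centered exactly on its harmonic multiple, the portions line up sample for sample even though the harmonic centers fall between bins.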
  • the correlations may include any of a correlation between a first frequency portion and a second frequency portion, a correlation between a first frequency portion and a reversed second frequency portion, and a correlation between a first frequency portion and a reversed first frequency portion.
  • the correlations may be computed using any appropriate techniques.
  • the frequency portions may be extracted from the frequency representation and stored in a vector, as described above, and the correlations may be computed by performing inner products of the vectors (or an inner product of a vector with a reversed version of another vector).
  • the correlations are combined to obtain a score for the pitch estimate. Any appropriate techniques may be used to generate a score, including, for example, computing a product of the correlations, a sum of the correlations, a combination of the Fisher transformations of the correlations, or a combination of likelihoods or log-likelihoods of the correlations or of the Fisher transformations of the correlations, as described above.
  • the pitch estimate is updated. For example, a first score for a first pitch estimate may be compared to a second score for a second pitch estimate, and the pitch estimate may be determined by selecting the pitch estimate with the highest score. Steps 1420 to 1460 may be repeated to continuously update a pitch estimate using techniques such as golden section search or gradient descent. Steps 1420 to 1460 may be repeated until some appropriate stop condition has been reached, such as a maximum number of iterations or the improvement in the pitch estimate from a previous estimate falling below a threshold.
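The update loop with its stop conditions can be illustrated by a simple neighbor comparison. This sketch is not golden section search or gradient descent; it is a plain hill climb with step halving, chosen for brevity, and it measures improvement in the score rather than in the pitch value. The function name, step size, and toy score are assumptions.

```python
def refine_by_neighbors(score, pitch, step=1.0, max_iter=50, min_improvement=1e-6):
    """Hill-climb on the score, halving the step when no neighbor improves it."""
    best = score(pitch)
    for _ in range(max_iter):                  # stop condition 1: iteration cap
        candidates = [pitch - step, pitch + step]
        scores = [score(c) for c in candidates]
        top = max(scores)
        if top > best:
            improvement = top - best
            pitch, best = candidates[scores.index(top)], top
            if improvement < min_improvement:  # stop condition 2: negligible gain
                break
        else:
            step /= 2.0                        # look closer to the current estimate
    return pitch

# Toy score with its maximum at 231.25 Hz (assumed for illustration).
toy_score = lambda p: -abs(p - 231.25)
refined = refine_by_neighbors(toy_score, pitch=230.0)
```

In a full implementation, `score` would recompute the frequency portions and correlations for each candidate pitch before combining them.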
  • Fig. 15 illustrates components of one implementation of a computing device 1510 for implementing any of the techniques described above.
  • the components are shown as being on a single computing device 1510, but the components may be distributed among multiple computing devices, such as a system of computing devices, including, for example, an end-user computing device (e.g., a smart phone or a tablet) and/or a server computing device (e.g., cloud computing).
  • the collection of audio data and pre-processing of the audio data may be performed by an end-user computing device and other operations may be performed by a server.
  • Computing device 1510 may include any components typical of a computing device, such as volatile or nonvolatile memory 1520, one or more processors 1521, and one or more network interfaces 1522. Computing device 1510 may also include any input and output components, such as displays, keyboards, and touch screens. Computing device 1510 may also include a variety of components or modules providing specific functionality, and these components or modules may be implemented in software, hardware, or a combination thereof. Below, several examples of components are described for one example implementation, and other implementations may include additional components or exclude some of the components described below. Computing device 1510 may have a signal processing component
  • Computing device 1510 may have a fractional chirp rate estimation component 1531 that estimates fractional chirp rate of a signal using any of the techniques described above.
  • Computing device 1510 may have a coarse pitch estimation component 1532 that estimates the pitch of a signal using peak-to-peak distances as described above.
  • Computing device 1510 may have a precise pitch estimation component 1533 that estimates the pitch of a signal using correlations as described above.
  • Computing device 1510 may have a HAM feature generation component 1534 that determines amplitudes of harmonics as described above.
  • Computing device 1510 may also have components for applying the above techniques to particular applications.
  • computing device 1510 may have any of a speech recognition component 1540, a speaker verification component 1541, a speaker recognition component 1542, a signal reconstruction component 1543, and a word spotting component 1544.
  • any of an estimated fractional chirp rate, an estimated pitch, and estimated harmonic amplitudes may be used as input to any of the applications and used in addition to or in place of other features or parameters used for these applications.
  • steps of any of the techniques described above may be performed in a different sequence, may be combined, may be split into multiple steps, or may not be performed at all.
  • the steps may be performed by a general purpose computer, may be performed by a computer specialized for a particular application, may be performed by a single computer or processor, may be performed by multiple computers or processors, may be performed sequentially, or may be performed simultaneously.

Abstract

Features that may be computed from a harmonic signal include a fractional chirp rate, a pitch, and amplitudes of the harmonics. A fractional chirp rate may be estimated, for example, by computing scores corresponding to different fractional chirp rates and selecting the highest score. A first pitch may be computed from a frequency representation that is computed using the estimated fractional chirp rate, for example, by using peak-to-peak distances in the frequency representation. A second pitch may be computed using the first pitch and a frequency representation of the signal, for example, by using correlations of portions of the frequency representation. Amplitudes of harmonics of the signal may be determined using the estimated fractional chirp rate and the second pitch. Any of the estimated fractional chirp rate, the second pitch, and the harmonic amplitudes may be used for further processing, such as speech recognition, speaker verification, speaker identification, or signal reconstruction.
PCT/US2016/016261 2015-02-06 2016-02-03 Determining features of harmonic signals WO2016126753A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16706703.2A EP3254282A1 (fr) 2015-02-06 2016-02-03 Determining features of harmonic signals
CN201680017664.6A CN107430850A (zh) 2015-02-06 2016-02-03 Determining features of harmonic signals

Applications Claiming Priority (16)

Application Number Priority Date Filing Date Title
US201562112832P 2015-02-06 2015-02-06
US201562112836P 2015-02-06 2015-02-06
US201562112850P 2015-02-06 2015-02-06
US201562112796P 2015-02-06 2015-02-06
US62/112,836 2015-02-06
US62/112,796 2015-02-06
US62/112,832 2015-02-06
US62/112,850 2015-02-06
US14/969,029 2015-12-15
US14/969,038 2015-12-15
US14/969,022 2015-12-15
US14/969,038 US9842611B2 (en) 2015-02-06 2015-12-15 Estimating pitch using peak-to-peak distances
US14/969,036 2015-12-15
US14/969,029 US9870785B2 (en) 2015-02-06 2015-12-15 Determining features of harmonic signals
US14/969,022 US9548067B2 (en) 2014-09-30 2015-12-15 Estimating pitch using symmetry characteristics
US14/969,036 US9922668B2 (en) 2015-02-06 2015-12-15 Estimating fractional chirp rate with multiple frequency representations

Publications (1)

Publication Number Publication Date
WO2016126753A1 true WO2016126753A1 (fr) 2016-08-11

Family

ID=55442867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/016261 WO2016126753A1 (fr) 2015-02-06 2016-02-03 Determining features of harmonic signals

Country Status (1)

Country Link
WO (1) WO2016126753A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8548803B2 (en) * 2011-08-08 2013-10-01 The Intellisis Corporation System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUAN G VARGAS-RUBIO ET AL: "An Improved Spectrogram Using the Multiangle Centered Discrete Fractional Fourier Transform", 2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING - 18-23 MARCH 2005 - PHILADELPHIA, PA, USA, IEEE, PISCATAWAY, NJ, vol. 4, 18 March 2005 (2005-03-18), pages 505 - 508, XP010792593, ISBN: 978-0-7803-8874-1 *
JULIUS O. III SMITH: "Spectral Audio Signal Processing", 1 January 2011 (2011-01-01), XP055267579, ISBN: 978-0-9745607-3-1, Retrieved from the Internet <URL:N/A> [retrieved on 20160421] *
KEPESI M ET AL: "Adaptive chirp-based time-frequency analysis of speech signals", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 48, no. 5, 1 May 2006 (2006-05-01), pages 474 - 492, XP027926235, ISSN: 0167-6393, [retrieved on 20060501] *
STEPHEN A ZAHORIAN ET AL: "A Spectral-Temporal Method for Pitch Tracking", 17 September 2006 (2006-09-17), XP055191851, Retrieved from the Internet <URL:http://www.ws.binghamton.edu/zahorian/pdf/icslp2006_pitch_v16.pdf> [retrieved on 20150528] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16706703

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2016706703

Country of ref document: EP