AU2020227065A1 - Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system - Google Patents

Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Info

Publication number
AU2020227065A1
Authority
AU
Australia
Prior art keywords
glottal
signal
speech
excitation signal
excitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
AU2020227065A
Other versions
AU2020227065B2 (en)
Inventor
Rajesh DACHIRAJU
Aravind GANAPATHIRAJU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Interactive Intelligence Inc
Original Assignee
Interactive Intelligence Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Interactive Intelligence Inc filed Critical Interactive Intelligence Inc
Priority to AU2020227065A priority Critical patent/AU2020227065B2/en
Publication of AU2020227065A1 publication Critical patent/AU2020227065A1/en
Application granted granted Critical
Publication of AU2020227065B2 publication Critical patent/AU2020227065B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 - Pitch determination of speech signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A method is presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. In one embodiment, fundamental frequency values are used to form the excitation signal. The excitation is modeled using a voice source pulse selected from a database of a given speaker. The voice source signal is segmented into glottal segments, which are used in vector representation to identify the glottal pulse used for formation of the excitation signal. Use of a novel distance metric and preservation of the original signals extracted from the speaker's voice samples help capture low frequency information of the excitation signal. In addition, segment edge artifacts are removed by applying a unique segment joining method to improve the quality of synthetic speech while creating a true representation of the voice quality of a speaker.

Description

TITLE
METHOD FOR FORMING THE EXCITATION SIGNAL FOR A GLOTTAL PULSE MODEL BASED PARAMETRIC SPEECH SYNTHESIS SYSTEM
BACKGROUND
[0001] The present invention generally relates to telecommunications systems and methods, as well as
speech synthesis. More particularly, the present invention pertains to the formation of the excitation
signal in a Hidden Markov Model based statistical parametric speech synthesis system.
SUMMARY
[0002] A method is presented for forming the excitation signal for a glottal pulse model based
parametric speech synthesis system. In one embodiment, fundamental frequency values are used to
form the excitation signal. The excitation is modeled using a voice source pulse selected from a
database of a given speaker. The voice source signal is segmented into glottal segments, which are
used in vector representation to identify the glottal pulse used for formation of the excitation signal.
Use of a novel distance metric and preservation of the original signals extracted from the speaker's voice
samples help capture low frequency information of the excitation signal. In addition, segment edge
artifacts are removed by applying a unique segment joining method to improve the quality of synthetic
speech while creating a true representation of the voice quality of a speaker.
[0003] In one embodiment, a method is presented to create a glottal pulse database from a speech
signal, comprising the steps of: performing pre-filtering on the speech signal to obtain a pre-filtered
signal; analyzing the pre-filtered signal to obtain inverse filtering parameters; performing inverse
filtering of the speech signal using the inverse filtering parameters; computing an integrated linear
prediction residual signal using the inversely filtered speech signal; identifying glottal segment
boundaries in the speech signal; segmenting the integrated linear prediction residual signal into glottal pulses using the identified glottal segment boundaries from the speech signal; performing normalization of the glottal pulses; and forming the glottal pulse database by collecting all normalized glottal pulses obtained for the speech signal.
[0004] In another embodiment, a method is presented to form parametric models, comprising the
steps of: computing a glottal pulse distance metric between a number of glottal pulses; clustering the
glottal pulse database into a number of clusters to determine centroid glottal pulses; forming a
corresponding vector database by associating a vector with each glottal pulse in the glottal pulse
database, wherein the centroid glottal pulses and the distance metric are defined mathematically to
determine association; determining Eigenvectors of the vector database; and forming parametric
models by associating a glottal pulse from the glottal pulse database to each determined Eigenvector.
[0005] In yet another embodiment, a method is presented to synthesize speech using input text,
comprising the steps of: a) converting the input text into context dependent phone labels; b) processing
the phone labels created in step (a) using trained parametric models to predict fundamental frequency
values, duration of the speech synthesized, and spectral features of the phone labels; c) creating an
excitation signal using an Eigen glottal pulse and said predicted one or more of: fundamental frequency
values, spectral features of phone labels, and duration of the speech synthesized; and d) combining the
excitation signal with the spectral features of the phone labels using a filter to create synthetic speech
output.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] Figure 1 is a diagram illustrating an embodiment of a Hidden Markov Model based Text to
Speech system.
[0007] Figure 2 is a diagram illustrating an embodiment of a signal.
[0008] Figure 3 is a diagram illustrating an embodiment of excitation signal creation.
[0009] Figure 4 is a diagram illustrating an embodiment of excitation signal creation.
[0010] Figure 5 is a diagram illustrating an embodiment of overlap boundaries.
[0011] Figure 6 is a diagram illustrating an embodiment of excitation signal creation.
[0012] Figure 7 is a diagram illustrating an embodiment of glottal pulse identification.
[0013] Figure 8 is a diagram illustrating an embodiment of glottal pulse database creation.
DETAILED DESCRIPTION
[0014] For the purposes of promoting an understanding of the principles of the invention, reference
will now be made to the embodiment illustrated in the drawings and specific language will be used to
describe the same. It will nevertheless be understood that no limitation of the scope of the invention is
thereby intended. Any alterations and further modifications in the described embodiments, and any
further applications of the principles of the invention as described herein are contemplated as would
normally occur to one skilled in the art to which the invention relates.
[0015] Excitation is generally assumed to be a quasi-periodic sequence of impulses for voiced regions.
Each sequence is separated from the previous sequence by some duration, such as T0 = 1/F0, where
T0 represents the pitch period and F0 represents the fundamental frequency. The excitation, in unvoiced
regions, is modeled as white noise. In voiced regions, the excitation is not actually impulse sequences.
The excitation is instead a sequence of voice source pulses which occur due to vibration of the vocal
folds. The pulses' shapes may vary depending on various factors such as the speaker, the mood of the
speaker, the linguistic context, emotions, etc.
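By way of a non-limiting illustration, this classical excitation model may be sketched in Python with NumPy. The function name, the 5 ms frame length, and the noise gain below are illustrative assumptions, not part of any disclosed embodiment.

    import numpy as np

    def classical_excitation(f0, voiced, fs=16000, frame_len=80):
        """Impulse train for voiced frames, white noise for unvoiced frames.
        f0: per-frame fundamental frequency (Hz); voiced: per-frame flags."""
        out = np.zeros(len(f0) * frame_len)
        next_pulse = 0
        for i, (hz, v) in enumerate(zip(f0, voiced)):
            start, end = i * frame_len, (i + 1) * frame_len
            if v and hz > 0:
                t0 = max(1, int(round(fs / hz)))       # pitch period T0 = fs / F0, in samples
                while next_pulse < end:
                    if next_pulse >= start:
                        out[next_pulse] = 1.0          # one unit impulse per pitch period
                    next_pulse += t0
            else:
                out[start:end] = 0.1 * np.random.randn(frame_len)  # white noise
                next_pulse = end
        return out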
[0016] Source pulses have been treated mathematically as vectors by length normalization (through
resampling) and impulse alignment, as described in European Patent EP 2242045 (granted June 27,
2012, inventors Thomas Drugman, et al.). The length-normalized source pulse signal is finally resampled
to meet the target pitch. The source pulse is not chosen from a database, but obtained over a series of
calculations which compromise the pulse characteristics in the frequency domain. In addition, the
approximate excitation signal used for creating a pulse database does not capture low frequency source content as there is no pre-filtering done while determining the Linear Prediction (LP) coefficients, which are used for inverse filtering.
[0017] In statistical parametric speech synthesis, speech unit signals are represented by a set of
parameters which can be used to synthesize speech. The parameters may be learned by statistical
models, such as HMMs, for example. In an embodiment, speech may be represented as a source-filter
model, wherein source/excitation is a signal which when passed through an appropriate filter produces
a given sound. Figure 1 is a diagram illustrating an embodiment of a Hidden Markov Model (HMM)
based Text to Speech (TTS) system. An embodiment of an exemplary system may contain two phases,
for example, the training phase and the synthesis phase.
[0018] The Speech Database 105 may contain an amount of speech data for use in speech synthesis.
During the training phase, a speech signal 106 is converted into parameters. The parameters may be
comprised of excitation parameters and spectral parameters. Excitation Parameter Extraction 110 and
Spectral Parameter Extraction 115 occurs from the speech signal 106 which travels from the Speech
Database 105. A Hidden Markov Model 120 may be trained using these extracted parameters and the
Labels 107 from the Speech Database 105. Any number of HMM models may result from the training
and these context dependent HMMs are stored in a database 125.
[0019] The synthesis phase begins as the context dependent HMMs 125 are used to generate
parameters 140. The parameter generation 140 may utilize input from a corpus of text 130 from which
speech is to be synthesized. The text 130 may undergo analysis 135 and the extracted labels 136
are used in the generation of parameters 140. In one embodiment, excitation and spectral parameters
may be generated in 140.
[0020] The excitation parameters may be used to generate the excitation signal 145, which is input,
along with the spectral parameters, into a synthesis filter 150. Filter parameters are generally Mel
frequency cepstral coefficients (MFCC) and are often modeled by a statistical time series by using
HMMs. The predicted values of the filter and of the fundamental frequency, as time series, may be
used to synthesize speech by creating an excitation signal from the fundamental frequency values and
by forming the filter from the MFCC values.
[0021] Synthesized speech 155 is produced when the excitation signal passes through the filter. The
formation of the excitation signal 145 is integral to the quality of the output, or synthesized, speech 155.
In conventional approaches, low frequency information of the excitation is not captured. It will thus be appreciated that an
approach is needed to capture the low frequency source content of the excitation signal and to improve
the quality of synthetic speech.
[0022] Figure 2 is a graphical illustration of an embodiment of the signal regions of a speech segment,
indicated generally at 200. The signal has been broken down into segments based on fundamental
frequency values for categories such as voiced, unvoiced, and pause segments. The vertical axis 205
illustrates fundamental frequency in Hertz (Hz) while the horizontal axis 210 represents the passage of
milliseconds (ms). The time series, F0, 215 represents the fundamental frequency. The voiced region
220 can be seen as a series of peaks and may be referred to as a non-zero segment. The non-zero
segments 220 may be concatenated to form an excitation signal for the entire speech, as described in
further detail below. The unvoiced region 225 is seen as having no peaks in the graphical illustration 200
and may be referred to as zero segments. The zero segments may represent a pause or an unvoiced
segment given by the phone labels.
[0023] Figure 3 is a diagram illustrating an embodiment of excitation signal creation indicated generally
at 300. Figure 3 illustrates the creation of the excitation signal for both unvoiced and pause segments.
The fundamental frequency time series values, represented as F0, represent signal regions 305 that are
broken down into voiced, unvoiced, and pause segments based on the F0 values.
[0024] An excitation signal 320 is created for unvoiced and pause segments. Where pauses occur,
zeroes (0) are placed in the excitation signal. In unvoiced regions, white noise of appropriate energy (in
one embodiment, this may be determined empirically by listening tests) is used as the excitation signal.
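A minimal sketch of this handling follows; the noise_gain value merely stands in for the empirically determined energy and is not a disclosed figure.

    import numpy as np

    def unvoiced_excitation(n_samples, is_pause, noise_gain=0.05):
        """Zeros for pause segments; energy-scaled white noise for unvoiced segments."""
        if is_pause:
            return np.zeros(n_samples)                  # pauses: zeros in the excitation
        return noise_gain * np.random.randn(n_samples)  # unvoiced: white noise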
[0025] The signal regions, 305, along with the Glottal Pulse 310 are used for excitation generation 315
and subsequent generation of the excitation signal 320. The Glottal Pulse 310 comprises an Eigen glottal
pulse that has been identified from the glottal pulse database, the creation of which is described in
further detail in Figure 8 below.
[0026] Figure 4 is a diagram illustrating an embodiment of excitation signal creation for a voiced
segment, indicated generally at 400. It is assumed that an Eigen glottal pulse has been identified from
the glottal pulse database (described in further detail in Figure 7 below). The signal region 405
comprises F0 values, which may be predicted by models, from the voiced segment. The lengths of the F0
segments, which may be represented by Nf, are used to determine the length of the excitation signal, N,
using the mathematical equation:
[0027] N = fs * Nf * 5/1000
[0028] Where fs represents the sampling frequency of the signal. In a non-limiting example, the value
of 5/1000 represents the interval of 5 ms durations that the F0 values are determined for. It should be
noted that any interval of a designated duration of a unit time may be used. Another array, designated
as F0'(n), is obtained by linearly interpolating the F0 array.
[0029] From the F0 values, glottal boundaries are created, 410, which mark the pitch boundaries of the
excitation signal of the voiced segments in the signal region 405. The pitch period array may be
computed using the following mathematical equation:
[0030] T0(n) = fs / F0'(n)
[0031] Pitch boundaries may then be computed using the determined pitch period array as follows:
[0032] P0(i) = P0(i-1) + T0(P0(i-1))
[0033] Where P0(0) = 1, i = 1, 2, 3, ... K, and where P0(K+1) just crosses the length of the array T0(n).
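The computations of paragraphs [0027] through [0033] may be sketched as follows, assuming T0(n) is expressed in samples and that F0 is strictly positive over the voiced segment; the function name and the 16 kHz default are illustrative.

    import numpy as np

    def pitch_boundaries(f0_5ms, fs=16000):
        """f0_5ms: predicted F0 values at 5 ms intervals for one voiced segment.
        Returns the pitch boundary array P0 and the excitation length N."""
        n = int(fs * len(f0_5ms) * 5 / 1000)             # N = fs * Nf * 5/1000
        # F0'(n): linear interpolation of the frame-rate F0 track to sample rate
        xp = np.linspace(0, n - 1, num=len(f0_5ms))
        f0_interp = np.interp(np.arange(n), xp, f0_5ms)
        t0 = fs / f0_interp                              # T0(n) = fs / F0'(n), in samples
        p = [1]                                          # P0(0) = 1
        while p[-1] < n:                                 # stop once P0(K+1) crosses len(T0)
            p.append(p[-1] + int(round(t0[min(p[-1], n - 1)])))
        return np.array(p), n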
[0034] The glottal pulse 415 is used along with the identified glottal boundaries 410 in the overlap
adding 420 of a glottal pulse beginning at each glottal boundary. The excitation signal 425 is then
created through the process of "stitching", or segment joining, to avoid boundary effects which are
further described in Figures 5 and 6.
[0035] Figure 5 is a diagram illustrating an embodiment of overlap boundaries, indicated generally at
500. The illustration 500 represents a series of glottal pulses 515 and overlapping glottal pulses 520 in
the segment. The vertical axis 505 represents the amplitude of excitation. The horizontal axis 510 may
represent the frame number.
[0036] Figure 6 is a diagram illustrating an embodiment of excitation signal creation for a voiced
segment, indicated generally at 600. "Stitching" may be used to form the final excitation signal of
voiced segments (from Figure 4), which is ideally devoid of boundary effects. In an embodiment, any
number of different excitation signals may have been formed through the overlap add method
illustrated in Figure 4 and in the diagram 500 (Figure 5). The different excitation signals may have a
constantly increasing amount of shifts in glottal boundaries 605 and an equal amount of circular left
shift 630 for the glottal pulse signal. In one embodiment, if the glottal pulse signal 615 is of a length less
than the corresponding pitch period, then the glottal pulse may be zero extended 625 to the length of
the pitch period before circular left shifting 630 is performed. Different arrays of pitch boundaries
(represented as Pm(i), m = 1, 2, ... M - 1) are formed, each of the same length as P0. The arrays
are computed using the following mathematical equation:
[0037] Pm(i) = P0(i) + m * w
[0038] Where w is generally taken as 1 ms or, in terms of samples, fs/1000. For a sampling frequency of
fs = 16,000, w = 16, for example. The highest pitch period present in the given voiced segment is
represented as M * w. Glottal pulses are created and associated with each pitch boundary array Pm.
The glottal pulses 620 may be obtained from the glottal pulse signal of some length N by first zero
extending it to the pitch period and then circularly left shifting it by m * w samples.
[0039] For each set of frame boundaries, an excitation signal 635 is formed by initializing the glottal
pulses to zero (0). Overlap add 610 is used to add the glottal pulse 620 to the first N samples of the
excitation, starting from each pitch boundary value of the array Pm(i), i = 1, 2, ... K. The formed signal is
regarded as a single stitched excitation corresponding to the shift m.
[0040] In an embodiment, the arithmetic mean of all of the single stitched excitation signals is then
computed 640, which represents the final excitation signal for the voiced segment 645.
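A compact sketch of this stitching procedure follows; the shift count M and the helper names are assumptions, and np.roll supplies the circular left shift.

    import numpy as np

    def stitched_excitation(pulse, p0, n_total, pitch_period, fs=16000, M=4):
        """Overlap-add the pulse at M sets of shifted boundaries and average.
        pulse: glottal pulse; p0: base pitch boundary array; n_total: length N."""
        w = fs // 1000                                   # w = 1 ms in samples (16 at 16 kHz)
        padded = np.zeros(max(len(pulse), pitch_period))
        padded[:len(pulse)] = pulse                      # zero-extend to the pitch period
        stitched = []
        for m in range(M):
            g = np.roll(padded, -m * w)                  # circular left shift by m * w
            e = np.zeros(n_total + len(g))
            for b in np.asarray(p0) + m * w:             # Pm(i) = P0(i) + m * w
                if b < n_total:
                    e[b:b + len(g)] += g                 # overlap-add at each boundary
            stitched.append(e[:n_total])                 # one single stitched excitation
        return np.mean(stitched, axis=0)                 # arithmetic mean = final excitation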
[0041] Figure 7 is a diagram illustrating an embodiment of glottal pulse identification, indicated
generally at 700. In an embodiment, any two given glottal pulses may be used to compute the distance
metric/dissimilarity between them. These are taken from the glottal pulse database 840 created in
process 800 (further described in Figure 8 below). The computation may be performed by decomposing
the two given glottal pulses xi, yi into sub-band components xi(1), xi(2), xi(3) and yi(1), yi(2), yi(3). The
given glottal pulse may be transformed into the frequency domain by using a method such as Discrete
Cosine Transform (DCT), for example. The frequency band may be split into a number of bands, which
are demodulated and converted into time domain. In this example, three bands are used for illustrative
purposes.
[0042] The sub-band distance metric is then computed between corresponding sub-band components
of the two glottal pulses, denoted as ds(xi(1), yi(1)). The sub-band metric, which may be represented as
ds(f, g), where ds represents the distance between the two sub-band components f and g, may be
computed as described in the following paragraphs.
[0043] The normalized circular cross correlation function between f and g is computed. In one
embodiment, this may be denoted as Rf,g(n) = f * g, where '*' denotes the normalized circular cross
correlation operation between two signals. The period for circular cross correlation is taken to be the longer of the lengths of the two signals f and g; the shorter signal is zero extended. The Discrete Hilbert
Transform of the normalized circular cross correlation is computed and denoted as RHf,g(n). Using the
normalized circular cross correlation and the Discrete Hilbert Transform of the normalized circular cross
correlation, the signal may be determined as:
[0044] Hf,g(n) = sqrt( Rf,g(n)^2 + RHf,g(n)^2 )
[0045] The cosine of the angle between the two signals f and g may be determined using the
mathematical equation:
[0046] cos θ(f, g) = maximum value of the signal Hf,g(n) over all n.
[0047] The sub-band metric, ds(f, g), between the two sub-band components f and g may be
determined as:
[0048] ds(f, g) = sqrt( 2 (1 - cos θ(f, g)) )
[0049] The distance metric between the glottal pulses is finally determined mathematically as:
[0050] d(xi, yi) = sqrt( ds^2(xi(1), yi(1)) + ds^2(xi(2), yi(2)) + ds^2(xi(3), yi(3)) )
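The distance computation of paragraphs [0041] through [0050] may be sketched as below. Banding the DCT spectrum and inverting each band is one plausible reading of the demodulation step described above; the three-band split, the epsilon guard, and the function names are assumptions.

    import numpy as np
    from scipy.fft import dct, idct
    from scipy.signal import hilbert

    def subbands(x, n_bands=3):
        """Split a pulse into sub-band time signals by banding its DCT spectrum."""
        c = dct(x, norm='ortho')
        edges = np.linspace(0, len(c), n_bands + 1, dtype=int)
        out = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            cb = np.zeros_like(c)
            cb[lo:hi] = c[lo:hi]
            out.append(idct(cb, norm='ortho'))           # band back to the time domain
        return out

    def subband_distance(f, g):
        n = max(len(f), len(g))                          # period = longer of the two lengths
        f = np.pad(f, (0, n - len(f)))                   # zero-extend the shorter signal
        g = np.pad(g, (0, n - len(g)))
        r = np.fft.ifft(np.fft.fft(f) * np.conj(np.fft.fft(g))).real
        r /= np.linalg.norm(f) * np.linalg.norm(g) + 1e-12   # normalized circular xcorr
        h = np.sqrt(r ** 2 + np.imag(hilbert(r)) ** 2)   # Hf,g(n), via the Hilbert transform
        cos_theta = min(h.max(), 1.0)                    # cos θ(f, g) = max of Hf,g(n)
        return np.sqrt(2.0 * (1.0 - cos_theta))          # ds(f, g)

    def pulse_distance(x, y):
        return np.sqrt(sum(subband_distance(fb, gb) ** 2
                           for fb, gb in zip(subbands(x), subbands(y))))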
[0051] The glottal pulse database 840 may be clustered into a number of clusters, for example 256 (or
M), using a modified k-means algorithm 705. Instead of using the Euclidean distance metric, the
distance metric defined above is used. The centroids of a cluster are then updated with that element of
the cluster whose sum of squares of distances from all other elements of that cluster is minimum such
that:
[0052] Dg = Σi d^2(xi, g) is minimum for g = c, the cluster centroid.
[0053] In an embodiment, the clustering iterations are terminated when there is no shift in any of the
centroids of the k clusters.
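A sketch of this modified k-means follows, reusing the pulse_distance sketch above; a stable assignment is used as a proxy for "no shift in any centroid", and k, the iteration cap, and the seeding are illustrative.

    import numpy as np

    def cluster_pulses(pulses, k=256, iters=50, dist=pulse_distance, seed=0):
        """k-means-style clustering under the glottal pulse metric; each centroid
        is the cluster member minimizing the sum of squared distances (a medoid)."""
        rng = np.random.default_rng(seed)
        centroids = [pulses[i] for i in rng.choice(len(pulses), size=k, replace=False)]
        assign = None
        for _ in range(iters):
            d = np.array([[dist(x, c) for c in centroids] for x in pulses])
            new_assign = d.argmin(axis=1)                # nearest centroid per pulse
            if assign is not None and np.array_equal(new_assign, assign):
                break                                    # centroids no longer shift
            assign = new_assign
            for m in range(k):
                members = [x for x, a in zip(pulses, assign) if a == m]
                if members:
                    cost = [sum(dist(x, y) ** 2 for y in members) for x in members]
                    centroids[m] = members[int(np.argmin(cost))]  # Dg is minimum at the centroid
        return centroids, assign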
[0054] A vector, a set of N real numbers, for example 256, is associated with every glottal pulse 710 in
the glottal pulse database 840 to form a corresponding vector database 715. In one embodiment, the associating is performed for a given glottal pulse xi by forming a vector
Vi = [w1(xi), w2(xi), w3(xi), ..., wj(xi), ..., w256(xi)], where wj(xi) = d^2(xi, cj) - d^2(xi, x0) -
d^2(cj, x0), x0 is a fixed glottal pulse picked from the database, d^2(xi, cj) represents the square
of the distance metric defined above between the two glottal pulses xi and cj, and it is assumed that
c1, c2, ..., cj, ..., c256 are the centroid glottal pulses determined by clustering.
[0055] Thus, the vector associated with the given glottal pulse xi may be computed with the
mathematical equation:
[0056] Vi = [w1(xi), w2(xi), w3(xi), ..., wj(xi), ..., w256(xi)]
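This association reduces to a few lines; the reconstruction of wj below follows the equation above, with x0 a fixed reference pulse chosen by the caller and pulse_distance taken from the sketch above.

    import numpy as np

    def pulse_vector(x, centroids, x0, dist=pulse_distance):
        """wj(x) = d^2(x, cj) - d^2(x, x0) - d^2(cj, x0) for each centroid cj."""
        d_x0 = dist(x, x0) ** 2
        return np.array([dist(x, c) ** 2 - d_x0 - dist(c, x0) ** 2 for c in centroids])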
[0057] In step 720, Principal Component Analysis (PCA) is performed to compute Eigenvectors of the
vector database 715. In one embodiment, any one Eigenvector may be chosen 725. The closest
matching vector 730 to the chosen Eigenvector from the vector database 715 is then determined in the
sense of Euclidean distance. The glottal pulse from the pulse database 840 which corresponds to the
closest matching vector 730 is regarded as the resulting Eigen glottal pulse 735 associated with an
Eigenvector.
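A sketch of this step follows; whether the database vectors are mean-centered before the Euclidean comparison is not specified above, so the centering here is an assumption.

    import numpy as np

    def eigen_glottal_pulse(vectors, pulses, component=0):
        """PCA over the vector database; return the pulse whose vector lies
        closest (in Euclidean distance) to the chosen Eigenvector."""
        V = np.asarray(vectors, dtype=float)
        Vc = V - V.mean(axis=0)                          # center before PCA (assumption)
        eigvals, eigvecs = np.linalg.eigh(np.cov(Vc, rowvar=False))
        e = eigvecs[:, np.argsort(eigvals)[::-1][component]]   # chosen Eigenvector
        idx = int(np.argmin(np.linalg.norm(Vc - e, axis=1)))   # closest matching vector
        return pulses[idx]                               # the Eigen glottal pulse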
[0058] Figure 8 is a diagram illustrating an embodiment of glottal pulse database creation indicated
generally at 800. A speech signal, 805, undergoes pre-filtering, such as pre-emphasis 810. Linear
Prediction (LP) Analysis, 815, is performed using the pre-filtered signal to obtain the LP coefficients.
Thus, low frequency information of the excitation may be captured. Once the coefficients are
determined, they are used to inverse filter, 820, the original speech signal, 805, which is not pre-filtered,
to compute the Integrated Linear Prediction Residual (ILPR) signal 825. The ILPR signal 825 may be used
as an approximation to the excitation signal, or voice source signal. The ILPR signal 825 is segmented
835 into glottal pulses using the glottal segment/cycle boundaries that have been determined from the
speech signal 805. The segmentation 835 may be performed using the Zero Frequency Filtering (ZFF)
technique. The resulting glottal pulses may then be energy normalized. All of the glottal pulses for the
entire speech training data are combined in order to form the glottal pulse database 840.
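The Figure 8 pipeline may be sketched end to end as follows; epoch extraction by ZFF is left to the caller (glottal cycle boundaries are passed in), and the 0.97 pre-emphasis coefficient and the LP-order rule of thumb are illustrative assumptions.

    import numpy as np
    from scipy.signal import lfilter

    def lpc(x, order):
        """LP coefficients by the autocorrelation (Levinson-Durbin) method."""
        r = np.correlate(x, x, 'full')[len(x) - 1:]
        a = np.zeros(order + 1)
        a[0], e = 1.0, r[0]
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
            a[1:i + 1] += k * a[i - 1::-1][:i]           # Levinson order update
            e *= 1.0 - k * k
        return a

    def ilpr(speech, fs=16000, order=None):
        """Integrated Linear Prediction Residual: LP analysis on the pre-filtered
        signal, inverse filtering applied to the original (non-pre-filtered) signal."""
        order = order or int(fs / 1000) + 2              # rule-of-thumb LP order
        pre = lfilter([1.0, -0.97], [1.0], speech)       # pre-emphasis (pre-filtering)
        a = lpc(pre, order)                              # inverse filtering parameters
        return lfilter(a, [1.0], speech)                 # ILPR approximates the voice source

    def glottal_pulse_database(speech, boundaries):
        """Segment the ILPR at glottal cycle boundaries (e.g. ZFF epochs) and
        energy-normalize each segment to build the pulse database."""
        residual = ilpr(np.asarray(speech, dtype=float))
        db = []
        for b0, b1 in zip(boundaries[:-1], boundaries[1:]):
            seg = residual[b0:b1]
            db.append(seg / (np.linalg.norm(seg) + 1e-12))   # energy normalization
        return db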
[0059] While the invention has been illustrated and described in detail in the drawings and foregoing
description, the same is to be considered as illustrative and not restrictive in character, it being
understood that only the preferred embodiment has been shown and described and that all equivalents,
changes, and modifications that come within the spirit of the invention as described herein and/or by
the following claims are desired to be protected.
[0060] Hence, the proper scope of the present invention should be determined only by the broadest
interpretation of the appended claims so as to encompass all such modifications as well as all
relationships equivalent to those illustrated in the drawings and described in the specification.
[0061] In the specification and the claims the term "comprising" shall be understood to have a broad
meaning similar to the term "including" and will be understood to imply the inclusion of a stated integer
or step or group of integers or steps but not the exclusion of any other integer or step or group of
integers or steps. This definition also applies to variations on the term "comprising" such as "comprise"
and "comprises".
[0062] The reference to any prior art in this specification is not, and should not be taken as an
acknowledgement or any form of suggestion that the referenced prior art forms part of the common
general knowledge in Australia.

Claims (3)

1. A method to create a glottal pulse database from a speech signal, in a speech synthesis system,
wherein the system comprises at least a glottal pulse database, comprising the steps of:
a. pre-emphasizing the speech signal to obtain a pre-filtered signal;
b. analyzing the pre-filtered signal, using linear prediction, to obtain inverse filtering
parameters;
c. performing inverse filtering of the speech signal using the inverse filtering parameters;
d. determining an integrated linear prediction residual signal using the inversely filtered
speech signal;
e. identifying glottal segment boundaries in the speech signal;
f. segmenting the integrated linear prediction residual signal into glottal pulses using the
identified glottal segment boundaries from the speech signal;
g. normalizing the glottal pulses;
h. forming the glottal pulse database by collecting all normalized glottal pulses obtained
for the speech signal; and
i. applying the formed glottal pulse database to form an excitation signal, wherein the
excitation signal is applied in the speech synthesis system to synthesize speech.
2. The method of claim 1, wherein the inverse filtering parameters in step (b) comprise linear
prediction coefficients.
3. The method of claim 1, wherein the identifying of step (e) is performed using Zero Frequency
Filtering technique.
AU2020227065A 2014-05-28 2020-09-03 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system Active AU2020227065B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020227065A AU2020227065B2 (en) 2014-05-28 2020-09-03 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AU2014395554A AU2014395554B2 (en) 2014-05-28 2014-05-28 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
AU2014395554 2014-05-28
PCT/US2014/039722 WO2015183254A1 (en) 2014-05-28 2014-05-28 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
AU2020227065A AU2020227065B2 (en) 2014-05-28 2020-09-03 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
AU2014395554A Division AU2014395554B2 (en) 2014-05-28 2014-05-28 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Publications (2)

Publication Number Publication Date
AU2020227065A1 true AU2020227065A1 (en) 2020-09-24
AU2020227065B2 AU2020227065B2 (en) 2021-11-18

Family

ID=54699420

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2014395554A Active AU2014395554B2 (en) 2014-05-28 2014-05-28 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
AU2020227065A Active AU2020227065B2 (en) 2014-05-28 2020-09-03 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
AU2014395554A Active AU2014395554B2 (en) 2014-05-28 2014-05-28 Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Country Status (8)

Country Link
EP (1) EP3149727B1 (en)
JP (1) JP6449331B2 (en)
AU (2) AU2014395554B2 (en)
BR (1) BR112016027537B1 (en)
CA (2) CA3178027A1 (en)
NZ (1) NZ725925A (en)
WO (1) WO2015183254A1 (en)
ZA (1) ZA201607696B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255903B2 (en) 2014-05-28 2019-04-09 Interactive Intelligence Group, Inc. Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US10014007B2 (en) 2014-05-28 2018-07-03 Interactive Intelligence, Inc. Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
NZ749370A (en) 2016-06-02 2020-03-27 Genesys Telecommunications Laboratories Inc Technologies for authenticating a speaker using voice biometrics
JP2018040838A (en) * 2016-09-05 2018-03-15 国立研究開発法人情報通信研究機構 Method for extracting intonation structure of voice and computer program therefor

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5400434A (en) * 1990-09-04 1995-03-21 Matsushita Electric Industrial Co., Ltd. Voice source for synthetic speech system
US6795807B1 (en) * 1999-08-17 2004-09-21 David R. Baraff Method and means for creating prosody in speech regeneration for laryngectomees
JP2002244689A (en) * 2001-02-22 2002-08-30 Rikogaku Shinkokai Synthesizing method for averaged voice and method for synthesizing arbitrary-speaker's voice from averaged voice
KR101214402B1 (en) * 2008-05-30 2012-12-21 노키아 코포레이션 Method, apparatus and computer program product for providing improved speech synthesis
JP5075865B2 (en) * 2009-03-25 2012-11-21 株式会社東芝 Audio processing apparatus, method, and program
DK2242045T3 (en) * 2009-04-16 2012-09-24 Univ Mons Speech synthesis and coding methods
JP5085700B2 (en) * 2010-08-30 2012-11-28 株式会社東芝 Speech synthesis apparatus, speech synthesis method and program
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation

Also Published As

Publication number Publication date
CA2947957A1 (en) 2015-12-03
AU2020227065B2 (en) 2021-11-18
EP3149727A1 (en) 2017-04-05
ZA201607696B (en) 2019-03-27
JP2017520016A (en) 2017-07-20
BR112016027537B1 (en) 2022-05-10
AU2014395554A1 (en) 2016-11-24
WO2015183254A1 (en) 2015-12-03
NZ725925A (en) 2020-04-24
CA2947957C (en) 2023-01-03
EP3149727A4 (en) 2018-01-24
EP3149727B1 (en) 2021-01-27
AU2014395554B2 (en) 2020-09-24
CA3178027A1 (en) 2015-12-03
BR112016027537A2 (en) 2017-08-15
JP6449331B2 (en) 2019-01-09

Similar Documents

Publication Publication Date Title
AU2020227065B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US10621969B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US10014007B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN108369803B (en) Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model
Pannala et al. Robust Estimation of Fundamental Frequency Using Single Frequency Filtering Approach.
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
US20130262099A1 (en) Apparatus and method for applying pitch features in automatic speech recognition
JP2017520016A5 (en) Excitation signal formation method of glottal pulse model based on parametric speech synthesis system
US20220335925A1 (en) Systems and methods for adapting human speaker embeddings in speech synthesis
EP3113180B1 (en) Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal
Degottex et al. Joint estimate of shape and time-synchronization of a glottal source model by phase flatness
Thirumuru et al. Improved vowel region detection from a continuous speech using post processing of vowel onset points and vowel end-points
Thirumuru et al. Application of non-negative frequency-weighted energy operator for vowel region detection
Tantisatirapong et al. Comparison of feature extraction for accent dependent Thai speech recognition system
Van Huy et al. Vietnamese recognition using tonal phoneme based on multi space distribution
Alam et al. Response of different window methods in speech recognition by using dynamic programming
KR100488121B1 (en) Speaker verification apparatus and method applied personal weighting function for better inter-speaker variation
Ajgou et al. An efficient approach for MFCC feature extraction for text independent speaker identification system
Ashouri et al. Automatic and accurate pitch marking of speech signal using an expert system based on logical combinations of different algorithms outputs
CN116741156A (en) Speech recognition method, device, equipment and storage medium based on semantic scene
CN117935789A (en) Speech recognition method, system, equipment and storage medium
Apte Innovative wavelet based speech model using optimal mother wavelet generated from pitch synchronous LPC trajectory
Lamkadam et al. Comparative study and improvement of acoustic vectors extractors: Multiple streams applied to the recognition of Arabic numerals
Nar et al. Effective Implementation of Static Voice Alteration

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)