NZ719961B2 - Method and system for real-time keyword spotting for speech analytics - Google Patents
Abstract
A system and method are presented for real-time speech analytics in the speech analytics field. Real-time audio is fed, along with a keyword model, into a recognition engine. The recognition engine computes the probability of the audio stream data matching keywords in the keyword model. The probability is compared to a threshold where the system determines if the probability is indicative of whether or not the keyword has been spotted. Empirical metrics are computed and any false alarms are identified and rejected. The keyword may be reported as found when it is deemed not to be a false alarm and passes the threshold for detection.
Description
TITLE
METHOD AND SYSTEM FOR REAL-TIME KEYWORD SPOTTING FOR SPEECH ANALYTICS
BACKGROUND
The present invention generally relates to telecommunication systems and methods, as well as automatic speech recognition systems. More particularly, the present invention pertains to keyword spotting within automatic speech recognition systems.
Keyword spotting systems that are currently in use may include: phonetic search, garbage models, and large vocabulary continuous speech recognition (LVCSR). Each of these techniques has inherent drawbacks which affect the accuracy and performance of the system.
In phonetic search systems, a "phonetic decoder" is relied upon which converts an audio stream into one or many possible sequences of phonemes which can be used to identify words. "John says", for example, can be broken down into the phoneme string "jh aa n s eh s". The phonetic decoder hypothesizes a phoneme stream for the audio. This phoneme sequence is compared to the expected phoneme sequence for a keyword and a match is found. Some systems developed with this concept have shown reasonable performance; however, there are many disadvantages for use in a real-time application. Use of a phonetic decoder prior to keyword search needs to be done in two stages, which adds considerable complexity. Such a system would work well in retrieval from stored audio, where real-time processing is not required. Another disadvantage is the rate of error in phoneme recognition. The state-of-the-art speech recognizers, which incorporate complex language models, still produce imperfect phoneme accuracies, and the accuracy decreases further for conversational speech. These errors are further compounded by the phonetic search errors, producing degradation in keyword spotting accuracy.
Another common technique used for keyword spotting is the use of garbage models that match any audio data other than the keyword. A phoneme network is commonly used to decode non-keyword audio into a sequence of phonemes. One simple approach to implement this method is to use speech recognizers conforming to the Speech Recognition Grammar Specification (SRGS) and write a grammar as follows:
$root = $GARBAGE ("keyword1" | "keyword2") $GARBAGE;
Since most speech recognizers use phonetic decoding to implement a $GARBAGE rule, these methods have the same disadvantages as the phonetic search, especially from a resource usage standpoint. Another approach to implementation of a garbage model is to treat it as a single hidden Markov model (HMM) state, with its emitting probability either computed as a function of all triphone models in the acoustic model or estimated iteratively. Both approaches hinder real-time requirements, as they need computation of a large number of probabilities or go through the data in multiple passes.
LVCSR systems rely completely on an LVCSR speech recognition engine to provide a word-level transcription of the audio and later perform a text-based search on the transcriptions for the keyword. Considering the high computational cost of LVCSR engines, this option is typically infeasible for real-time keyword spotting. Furthermore, the accuracy of LVCSR systems is usually tied closely with domain knowledge. The system's vocabulary needs to either be rich enough to contain all possible keywords of interest or be very domain specific. Spotting keywords from multiple languages would mean running multiple recognizers in parallel. A more effective means to increase the efficiency of these methods is desired to make keyword spotters more pervasive in real-time speech analytics systems.
SUMMARY
A system and method are presented for real-time speech analytics in the speech analytics field. Real-time audio is fed, along with a keyword model, into a recognition engine. The recognition engine computes the probability of the audio stream data matching keywords in the keyword model. The probability is compared to a threshold where the system determines if the probability is indicative of whether or not the keyword has been spotted. Empirical metrics are computed and any false alarms are identified and rejected. The keyword may be reported as found when it is deemed not to be a false alarm and passes the threshold for detection.
In one embodiment, a computer-implemented method for spotting predetermined keywords in an audio stream is disclosed, comprising the steps of: a) developing a keyword model for the predetermined keywords; b) comparing the keyword model and the audio stream to spot probable ones of the predetermined keywords; c) computing a probability that a portion of the audio stream matches one of the predetermined keywords from the keyword model; d) comparing the computed probability to a predetermined threshold; e) declaring a potential spotted word if the computed probability is greater than the predetermined threshold; f) computing further data to aid in determination of mismatches; g) using the further data to determine if the potential spotted word is a false alarm; and h) reporting a spotted keyword if a false alarm is not identified at step (g).
In another embodiment, a computer-implemented method for spotting predetermined keywords in an audio stream is disclosed, comprising the steps of: a) developing a keyword model for the predetermined keywords; b) dividing the audio stream into a series of points in an acoustic space that spans all possible sounds created in a particular language; c) computing a posterior probability that a first trajectory of each keyword model for the predetermined keywords in the acoustic space matches a second trajectory of a portion of the series of points in the acoustic space; d) comparing the posterior probability to a predetermined threshold; and e) reporting a spotted keyword if the posterior probability is greater than the predetermined threshold.
In another embodiment, a system for spotting predetermined keywords in an audio stream is disclosed, comprising: means for developing a keyword model for the predetermined keywords; means for comparing the keyword model and the audio stream to spot probable ones of the predetermined keywords; means for computing a probability that a portion of the audio stream matches one of the predetermined keywords from the keyword model; means for comparing the computed probability to a predetermined threshold; means for declaring a potential spotted word if the computed probability is greater than the predetermined threshold; means for computing further data to aid in determination of mismatches; means for using the further data to determine if the potential spotted word is a false alarm; and means for reporting a spotted keyword if a false alarm is not identified.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a diagram illustrating the basic components in a keyword spotter.
Figure 2 is a diagram illustrating a concatenated HMM model.
Figure 3a is a diagram illustrating an abstract visualization of the audio feature space and the
triphone models which span this space.
Figure 3b is a diagram illustrating monophone models which completely span the same audio feature space.
Figure 4 is a diagram illustrating a speech signal showing a spoken keyword surrounded by garbage models.
Figure 5 is a table illustrating phoneme level probabilities.
Figure 6 is a diagram illustrating the relation between the internal match "Score" and external "Confidence" values.
Figure 7 is a diagram illustrating the system behavior with varied confidence settings.
Figure 8 is a flowchart illustrating the keyword spotting algorithm utilized in the system.
DETAILED DESCRIPTION
For the purposes of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein, are contemplated as would normally occur to one skilled in the art to which the invention relates.
Automatic speech recognition (ASR) systems analyze human speech and translate it into text or words. Performance of these systems is commonly evaluated based on accuracy, reliability, language support, and the speed with which speech can be recognized. The performance of the system is expected to be very high. Superior performance is often quantified by a high detection rate and a low false alarm rate. The industry standard is considered to be around a 70% detection rate at 5 false alarms per keyword per hour of speech, or 5 FA/KW/hr. Factors such as accent, articulation, speech rate, pronunciation, background noise, etc., can have a negative effect on the accuracy of the system. Processing speed is necessary to analyze several hundred telephone conversations at once and in real-time. The system is also expected to perform consistently and reliably irrespective of channel conditions and the various artifacts introduced by modern telephony channels, especially voice over IP. Keywords from multiple languages also need to be spotted on the same audio source.
Those skilled in the art will recognize from the present disclosure that the various methodologies disclosed herein may be computer implemented using a great many different forms of data processing equipment, such as general-purpose microprocessors and associated memory executing appropriate software program(s), to name just one non-limiting example. The specific form of the hardware, firmware and software used to implement the presently disclosed embodiments is not critical to the present invention.
In the present invention, posterior probability calculations for speech recognition systems may be used to increase system effectiveness. Prior systems designed to perform keyword spotting use the log-likelihood measure to match presented audio to the phonemes in a keyword. Phonemes are sub-word units that typically are modeled in ASR systems. Additionally, phonemes can be modeled in isolation or in the context of other phonemes. The former are called monophones and the latter are called triphones when the phoneme depends on its previous and next phonetic context. Posterior probability, as used in this invention, may be a measure of how well the audio matches a model when compared to the same audio as it is matched to all other models for a given speech pattern.
Use of posterior probabilities in speech recognition has been attempted in the past, primarily by training a neural network. While this method returns an approximation to the posterior probability, it tends to be extremely computationally expensive and requires special training procedures.
An alternative approach to posterior probability computation for speech recognition may be developed as follows:
By definition, the posterior probability P of a model T_i, given an observation vector x, may be written as:
P(T_i | x) = P(x | T_i) P(T_i) / Σ_j P(x | T_j) P(T_j)
where P(x | T_i) is the probability of model T_i generating the acoustics x, and j is a variable that spans the indices of all models. In the above equation, the term P(T_j) is held constant for all models, and the formula can be re-written as:
P(T_i | x) = P(x | T_i) / Σ_j P(x | T_j)
This equation is still prohibitively expensive to calculate. The expense may be attributed to the fact that the denominator term is a summation over all models, which can be very large for a context-dependent triphone based system (typically tens of thousands of models). To study the impact of the denominator terms, an intuitive and graphical approach may be taken. The denominator as a whole signifies the total probability of models spanning the entire audio space. Therefore, the above equation can be rewritten as:
P(T_i | x) = P(x | T_i) / Σ_{M_l ∈ VM} P(x | M_l)
where M_l represents a model and VM represents the set of all models, M_l, in the entire audio space.
The above formula does not lose generality. The denominator term is now a summation over any set of models that completely spans the audio feature space.
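As a concrete illustration of this computation, the following Python sketch (not taken from the patent) computes the posterior of one model against a spanning set of models; the toy Gaussian `likelihood` merely stands in for the acoustic model's P(x | M):

```python
import numpy as np

def likelihood(model, x):
    """Toy stand-in for P(x | model): a spherical Gaussian density."""
    mean, var = model
    d = x - mean
    return float(np.exp(-0.5 * np.dot(d, d) / var) / np.sqrt(2 * np.pi * var) ** len(x))

def posterior(target, spanning_models, x):
    """P(target | x) = P(x | target) / sum over all spanning models of P(x | M)."""
    num = likelihood(target, x)
    den = sum(likelihood(m, x) for m in spanning_models)
    return num / den if den > 0 else 0.0

# Tiny usage example with hypothetical 2-D "audio space" models.
models = [(np.zeros(2), 1.0), (np.ones(2) * 3, 1.0), (np.ones(2) * -3, 1.0)]
frame = np.array([0.2, -0.1])
print(posterior(models[0], models, frame))  # near 1.0: frame lies close to model 0
```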
Figure 1 is a diagram illustrating the basic components in a keyword spotter, 100. The basic components of a keyword spotter 100 may include User Data/Keywords 105, a Keyword Model 110, Knowledge Sources 115 which include an Acoustic Model 120 and a Pronunciation Dictionary/Predictor 125, an Audio Stream 130, a Front End Feature Calculator 135, a Recognition Engine (Pattern Matching) 140, and the Reporting of Found Keywords in Real-Time 145.
Keywords may be defined, 105, by the user of the system according to user preference. The keyword model 110 may be formed by concatenating phoneme HMMs. This is further described in the description of Figure 2. The Keyword Model, 110, may be composed based on the keywords that are defined by the user and the input to the keyword model based on the Knowledge Sources, 115. Such knowledge sources may include an Acoustic Model, 120, and a Pronunciation Dictionary/Predictor, 125.
The Knowledge Sources 115 may store probabilistic models of the relations between pronunciations and acoustic events. The Knowledge Sources 115 may be developed by analyzing large quantities of audio data. The acoustic model and the pronunciation dictionary/predictor are made, for example, by looking at a word like "hello" and examining the phonemes that comprise the word. Every keyword in the system is represented by a statistical model of its constituent sub-word units called the phonemes. The phonemes for "hello" as defined in a standard phoneme dictionary are: "hh", "eh", "l", and "ow". Models of the four phonemes are then strung together into one composite model which then becomes the keyword model for the word "hello". These models are language dependent. In order to also provide multi-lingual support, multiple knowledge sources may be employed.
The acoustic model 120 may be formed by statistically modeling the various sounds that occur in a particular language. A phoneme is assumed to be the basic unit of sound. A predefined set of such phonemes is assumed to completely describe all sounds of a particular language. An HMM, which encodes the relationship of the observed audio signal and the unobserved phonemes, forms the fundamental theory for most modern speech recognition systems. A phoneme is considered to be composed of three states, representing the beginning, central, and trailing portions of the sound. An HMM is constructed by concatenating these three states. A training process learns the statistical properties of each of these states for all of the phonemes over a large collection of transcribed audio. A relation between the textual properties and the spoken properties is thus formed. Typically, the statistics of states may be encoded using a Gaussian mixture model (GMM). A set of these GMMs is termed an acoustic model. Specifically, the one described in this application is referred to as a context-independent, or monophone, model. Many other model types may also be used. For example, many modern speech recognition systems may utilize a more advanced acoustic model, which may be context-dependent and capture the complex variations created due to the position of phonemes in conversational speech. Each state of a phoneme is specialized to its left and right neighboring phonemes. Clearly such a scheme would result in a very large number of GMMs in the acoustic model. One example of a context-dependent phoneme is a triphone.
The pronunciation dictionary, 125, in Figure 1 may be responsible for decomposing a word into a sequence of phonemes. Keywords presented by the user may be in human readable form, such as graphemes/alphabets of a particular language. However, the pattern matching algorithm may rely on a sequence of phonemes which represent the pronunciation of the keyword. The present invention utilizes a pronunciation dictionary, which may store a mapping between commonly spoken words and their pronunciations. Once the sequence of phonemes is obtained, the corresponding statistical model for each of the phonemes in the acoustic model may be examined. A concatenation of these statistical models may be used to perform keyword spotting for the word of interest. For words that are not present in the dictionary, a predictor, which is based on linguistic rules, may be used to generate the pronunciations.
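A minimal Python sketch of this dictionary-plus-predictor lookup is shown below; the dictionary entries beyond the examples in the text and the crude letter-to-sound fallback are hypothetical illustrations, not the patent's actual data or rules:

```python
# Hypothetical dictionary; "hello" and "rise" follow the examples in the text.
PRONUNCIATIONS = {
    "hello": ["hh", "eh", "l", "ow"],
    "rise":  ["r", "ay", "z"],
}

def predict_phonemes(word):
    """Crude letter-to-sound fallback, standing in for the rule-based
    predictor the text mentions; real predictors use linguistic rules."""
    letter_map = {"a": "ae", "e": "eh", "i": "ih", "o": "ow", "u": "uw"}
    return [letter_map.get(ch, ch) for ch in word.lower() if ch.isalpha()]

def lookup(word):
    """Return the phoneme sequence: dictionary first, predictor otherwise."""
    return PRONUNCIATIONS.get(word.lower()) or predict_phonemes(word)

print(lookup("hello"))   # ['hh', 'eh', 'l', 'ow']
print(lookup("zorp"))    # predictor fallback for an out-of-dictionary word
```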
The audio stream (i.e., what is spoken into the system by the user), 130, may be fed into the front end feature calculator, 135, which may convert the audio stream into a representation of the audio stream, or a sequence of spectral features. Audio analysis may be performed by segmenting the audio signal as a sequence of short (typically 10 ms) windows and extracting spectral domain features. For each window, the feature calculator may calculate a set of 13 Mel Frequency Cepstral Coefficients (MFCC) and their first and second order derivatives. The resulting calculations represent each of these windows as a point in a 39-dimensional space M. This space completely spans all possible sounds created in a particular language.
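One plausible realization of such a front end, sketched here with the librosa library; the sample rate and hop length are assumptions consistent with the ~10 ms windows mentioned above, not values given in the patent:

```python
import numpy as np
import librosa

def front_end_features(audio, sr=8000):
    # 13 static MFCCs per frame; hop of 10 ms (sr // 100 samples).
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                hop_length=sr // 100)
    delta = librosa.feature.delta(mfcc)            # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T      # shape: (frames, 39)

audio = np.random.randn(8000).astype(np.float32)   # 1 s of dummy audio
print(front_end_features(audio).shape)             # e.g. (101, 39)
```

Each row is one window's point in the 39-dimensional space M described above.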
The keyword model, 110, which may be formed by concatenating phoneme hidden Markov models (HMMs), and the signal from the audio stream, 135, may both then be fed into a recognition engine for pattern matching, 140. The task of the recognition engine may be to take a set of keyword models and search through the presented audio stream to find if the words were spoken. In the multi-dimensional space constructed by the feature calculator, a spoken word may become a sequence of MFCC vectors forming a trajectory in the acoustic space M. Keyword spotting may now simply become a problem of computing the probability of generating the trajectory given the keyword model. This operation may be achieved by using the well-known principle of dynamic programming, specifically the Viterbi algorithm, which aligns the keyword model to the best segment of the audio signal and results in a match score. If the match score is significant, the keyword spotting algorithm infers that the keyword was spoken and reports a keyword spotted event.
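A bare-bones sketch of such a Viterbi alignment follows, assuming a strictly left-to-right HMM and precomputed per-frame emission log-probabilities; a real engine would derive these from the acoustic model's GMMs, and the transition probabilities here are arbitrary placeholders:

```python
import numpy as np

NEG_INF = -np.inf

def viterbi_score(log_emissions, self_loop=np.log(0.5), advance=np.log(0.5)):
    """Best log-probability of the feature sequence under a strictly
    left-to-right HMM that starts in the first state and ends in the last."""
    T, S = log_emissions.shape
    score = np.full(S, NEG_INF)
    score[0] = log_emissions[0, 0]          # must start in the first state
    for t in range(1, T):
        prev = score.copy()
        for s in range(S):
            stay = prev[s] + self_loop
            move = prev[s - 1] + advance if s > 0 else NEG_INF
            score[s] = max(stay, move) + log_emissions[t, s]
    return score[-1]                         # must end in the last state

# Usage with random emission scores for a 9-state keyword model
# (3 phonemes x 3 states each, as in Figure 2) over 50 frames.
log_em = np.log(np.random.rand(50, 9))
print(viterbi_score(log_em))
```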
The resulting keywords may then be reported in real-time, 145. The report may be presented as a start and end time of the keyword in the audio stream with a confidence value that the keyword was found. The primary confidence value may be a function of how the keyword is spoken. For example, in the case of multiple pronunciations of a single word, the keyword "tomato" may be spoken as "te-mah-toh" and "te-may-toh". The primary confidence value may be lower when the word is spoken in a less
common pronunciation or when the word is not well enunciated. The specific variant of the
pronunciation that is part of a particular recognition is also displayed in the report.
Figure 2 is a diagram illustrating a concatenated HMM model. A keyword model may be formed by concatenating phoneme HMMs. For example, the keyword model 200 for the word "rise" is constructed from the monophone models of the phonemes that comprise its pronunciation. The phonemes comprising the pronunciation of "rise" are "r", "ay", and "z". Each phoneme has three states, consisting of a beginning portion of sound 210, a central portion of sound 211, and a trailing portion of sound 212. For example, the phoneme "r" has a beginning portion of sound 210 shown as "r1" in the model. The central portion of sound 211 is exhibited by "r2" and the trailing portion of sound 212 is exhibited by "r3". The phoneme "ay" has a beginning portion of sound 210 illustrated as "ay1" in the model. The central portion of sound 211 is illustrated by "ay2" and the trailing portion of sound 212 is illustrated by "ay3". The phoneme "z" has a beginning portion of sound 210 illustrated as "z1" in the model. The central portion of sound 211 is exhibited by "z2" and the trailing portion of sound 212 is illustrated by "z3". Each portion of sound has a transition 213 either within the portion itself or between portions. In a similar fashion, a context dependent keyword model may be constructed by concatenating its triphone models.
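The state-concatenation idea in Figure 2 can be sketched in a few lines of Python; the naming scheme follows the figure's "r1"/"r2"/"r3" convention, while the data structure itself is only an illustration, not the patent's internal representation:

```python
def keyword_model(phonemes):
    """Return the ordered state names for a concatenated monophone HMM."""
    states = []
    for p in phonemes:
        # Three states per phoneme: beginning, central, trailing.
        states.extend([f"{p}1", f"{p}2", f"{p}3"])
    return states

print(keyword_model(["r", "ay", "z"]))
# ['r1', 'r2', 'r3', 'ay1', 'ay2', 'ay3', 'z1', 'z2', 'z3']
```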
Figure 3a is a diagram illustrating an abstract visualization of the audio feature space and the triphone models which span this space. In reality, the audio space is 39-dimensional, but for illustration purposes, a 2-dimensional space is shown. Figure 3b is a diagram illustrating monophone models which completely span the same audio feature space. In light of the illustrations in Figures 3a and 3b, the keyword spotting algorithm as presented above becomes
P(T_i | x) = P(x | T_i) / Σ_k P(x | M_k)
when VM is assumed to be the set of monophone models in the first equation, and where M_k represents the monophone models in the second equation. It will be appreciated from the present disclosure that the T_j and the M_k both span the entire audio space, M, completely. Since the number of GMMs present in the monophone model (Figure 3b) is significantly smaller compared to the triphone model (Figure 3a), computation of posterior probabilities is extremely fast, yet a close representation of the correct value.
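The speedup can be sketched as follows in Python; a toy `loglik` stands in for the GMM log-likelihoods, and the point is only that the denominator is a log-sum-exp over tens of monophone models rather than tens of thousands of triphones:

```python
import numpy as np

def loglik(model, x):
    """Toy log P(x | model), up to a constant; `model` is just a mean vector."""
    d = x - model
    return -0.5 * float(np.dot(d, d))

def fast_log_posterior(target_model, monophones, x):
    """log P(T_i | x) ~= loglik(T_i, x) - logsumexp over monophones M_k."""
    logs = np.array([loglik(m, x) for m in monophones])
    m = logs.max()
    log_den = m + np.log(np.exp(logs - m).sum())   # numerically stable log-sum-exp
    return loglik(target_model, x) - log_den

monophones = [np.random.randn(39) for _ in range(40)]  # ~40 monophone models
x = np.random.randn(39)                                 # one 39-D feature frame
print(fast_log_posterior(monophones[0], monophones, x))
```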
Figure 4 is a diagram illustrating a speech signal 400 showing a spoken keyword 410 surrounded by garbage models 405, 415. A keyword is spoken as a part of a continuous speech stream. In the segment of audio between t0 and ts, the garbage model 405 takes precedence, as it matches non-keyword audio portions. The accumulated score during this period is represented by S1 in the following equations. Similarly, in the audio segment te to tN, the garbage match score is represented by S2. Here, the garbage model 415 takes precedence. Instead of explicitly computing the garbage probabilities, S1 and S2, a constant value ε is chosen such that
S1 ≈ ε · (ts − t0) and S2 ≈ ε · (tN − te)
The constant ε is validated on a large test dataset to realize no significant reduction in performance when compared to explicitly computing the garbage probability. This approximation of using a constant garbage value makes the system significantly faster as compared to traditional keyword spotting algorithms.
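In code form, the idea reduces to charging a fixed per-frame score outside the keyword segment. The following Python sketch uses a placeholder constant; the text says the real value is validated on a large test dataset:

```python
EPSILON = -4.0   # hypothetical per-frame log garbage score

def utterance_score(frame_scores, start, end):
    """Total log score: constant garbage before/after, keyword inside.
    frame_scores[t] is the keyword model's log score for frame t;
    [start, end) is the segment the keyword model aligned to."""
    n = len(frame_scores)
    garbage = EPSILON * (start + (n - end))   # approximates S1 + S2
    keyword = sum(frame_scores[start:end])
    return garbage + keyword

# 100 frames; the keyword model aligns to frames 40..69.
print(utterance_score([-2.0] * 100, start=40, end=70))
```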
Figure 5 is a table illustrating phoneme level probabilities 500, comparing the phoneme match probabilities of the spoken words "December" and "discover" as compared to the keyword model for "December". A high rate of false alarms may be counted as one of the main problems in a keyword spotting algorithm. Unlike LVCSR engines, keyword spotters have no access to word-level contextual information. For example, when searching for the keyword "rise", the acoustic signal for "rise" is very similar to that of "price", "rice", "prize", "notarize", etc. These words would thus be declared as a match by the system. This is a similar problem as in substring searches in text, where subwords match to the keystring.
In order to constrain false alarms, the following are a few non-limiting examples of approaches that may be used as a secondary check on keyword matches found by the main Viterbi algorithm. Anti-words are a set of words that are commonly confused with keywords within the system. In the presented example with the words "price", "rice", "prize", "notarize", etc., as mentioned above, these words comprise the anti-word set of the keyword "rise". The system searches for these anti-words in parallel to the keyword and reports a keyword found event only when the keyword match score supersedes the anti-word match score. This feature is an effective method to curb spurious false alarms. The method, however, still requires user intervention and the creation of large anti-word sets. Other techniques may be purely data driven and thus sometimes more desirable.
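The anti-word check can be expressed compactly. In this Python sketch the anti-word list comes from the example above, while the scores are made-up stand-ins for the Viterbi match scores described earlier:

```python
ANTI_WORDS = {"rise": ["price", "rice", "prize", "notarize"]}

def passes_anti_word_check(keyword, scores):
    """scores: dict mapping each word to its match score (higher is better).
    Report the keyword only if it beats every anti-word's score."""
    rivals = ANTI_WORDS.get(keyword, [])
    return all(scores[keyword] > scores.get(a, float("-inf")) for a in rivals)

scores = {"rise": -0.8, "price": -1.4, "rice": -1.1,
          "prize": -1.6, "notarize": -2.0}
print(passes_anti_word_check("rise", scores))  # True: the keyword wins
```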
Mismatch phoneme percentage determines the number of phonemes of the keyword that mismatch the audio signal, even though the overall keyword probability from the Viterbi search was found as a match. For example, the word "December", as shown in Figure 5, may be found to wrongly match instances of "discover" by the keyword spotter. Phoneme level probabilities are identified in Figure 5. The score represents how well the phoneme matches the audio stream. Using the instant example, the more positive the number, the better the match. A score value of "0" would indicate a perfect match. These scores are always negative or zero. For the phoneme "d", the probability for "December" is -0.37, while it is -1.18 for "discover". It can be noted that all of the phonemes yield lower probabilities when the spoken utterance was "discover" as compared to the spoken utterance "December". This metric computes the percentage of such misfit phonemes and performs an additional check before reporting keyword found events.
Analogous to the mismatch phoneme percentage, the match phoneme percentage measure computes the percentage of phonemes that match the audio signal. The percentage of fit phonemes may be expected to be above a preset threshold for the keyword found event to be reported.
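Both percentage metrics can be sketched in a few lines; the per-phoneme fit threshold and the example score lists below are assumptions for illustration only (Figure 5 fixes just the convention that zero is a perfect match and scores grow more negative as the fit worsens):

```python
PHONE_FIT_THRESHOLD = -1.0   # hypothetical: scores below this are "misfit"

def mismatch_percentage(phone_scores):
    misfit = sum(1 for s in phone_scores if s < PHONE_FIT_THRESHOLD)
    return 100.0 * misfit / len(phone_scores)

def match_percentage(phone_scores):
    return 100.0 - mismatch_percentage(phone_scores)

# Made-up per-phoneme scores in the style of Figure 5.
december = [-0.37, -0.20, -0.15, -0.42, -0.31]
discover = [-1.18, -0.95, -1.40, -1.10, -0.60]
print(mismatch_percentage(december), match_percentage(december))  # 0.0 100.0
print(mismatch_percentage(discover), match_percentage(discover))  # 60.0 40.0
```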
The duration penalized probability emphasizes durational mismatches of a keyword with the audio stream. For example, consonants such as "t", "d", and "b" have a lower expected duration compared to vowels such as "aa", "ae", and "uw". In the event these consonants match for a longer than expected duration, the keyword match is most likely a false alarm. These events can be the result of a poor acoustic model or the presence of noise in the signal being analyzed. To capture such a scenario, the duration penalized probability is computed as
p̂_i = p_i if d_i ≤ D, and p̂_i = 2 · p_i if d_i > D
where p_i represents the probability of phoneme i, d_i represents the duration of phoneme i, and D represents a duration threshold determined based upon tests performed on large datasets. The duration penalized score for a keyword may be represented by the average of all its phoneme scores. By doubling the scores for long phonemes, this metric emphasizes mismatches created by spurious phonemes and thus lowers false alarms.
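A direct Python transcription of this metric follows; the duration threshold D is an arbitrary placeholder (the text says the real value is determined from tests on large datasets), and the scores are negative values in the style of Figure 5, so doubling a long phoneme's score makes it worse:

```python
D = 20   # hypothetical duration threshold, in frames

def duration_penalized_score(phonemes):
    """phonemes: list of (log_prob, duration_in_frames) pairs.
    Doubles the (negative) score of any phoneme held longer than D,
    then averages over all phonemes of the keyword."""
    penalized = [2 * p if d > D else p for p, d in phonemes]
    return sum(penalized) / len(penalized)

# A long-held consonant (duration 35 > D) gets its score doubled.
print(duration_penalized_score([(-0.3, 8), (-0.5, 35), (-0.2, 12)]))  # -0.5
```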
Figure 6 is a diagram illustrating the relation between the internal match "Score" and external "Confidence" values. Spotability is a measure of expected accuracy from the system. The primary use of this measure is to guide users in determining a good set of keywords. Other uses include feedback to the recognition engine and controlling the false alarm rate. The diagram in Figure 6 shows the relationship between the match probability, or the "Score", as determined by the recognition engine, and the confidence values as reported by the system. By default, the solid curve 605 is used if no information about the keyword is known. If spotability is known, the relationship may be modified by changing the operating score range of the keyword, as shown by the dashed and dotted lines. The dashed line 610 exhibits a low spotability keyword while the dotted line 615 exhibits a high spotability keyword. As the value of confidence increases, so does the likelihood of a match, where 0.0 is indicative of no match and 1.0 is a match. As the Score becomes more negative, the likelihood of a mismatch increases. As the Score approaches 0.0, there is a greater likelihood of a match. Thus, a Score of 0 and a Confidence of 1.0 would indicate a perfect match.
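One way to realize such a mapping is a linear squash of the keyword's operating score range onto [0, 1], as in the Python sketch below; the linear shape and the example range values are assumptions, since Figure 6 fixes only the endpoints (a Score of 0 corresponds to a Confidence of 1.0):

```python
def confidence(score, worst_score=-10.0):
    """Map a non-positive match score onto [0, 1]; score 0 -> confidence 1.0.
    `worst_score` defines the keyword's operating range and would be
    adjusted per keyword based on its spotability."""
    c = 1.0 - (score / worst_score)   # score == worst_score -> 0.0
    return max(0.0, min(1.0, c))

print(confidence(0.0))                       # 1.0: perfect match
print(confidence(-5.0))                      # 0.5 with the default range
print(confidence(-5.0, worst_score=-20.0))   # 0.75: wider range, same score
```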
Figure 7 is a diagram illustrating the system behavior with varied confidence settings. The result of changing the operating range based on spotability is a more controlled behavior of the system. When a user registers a keyword to be spotted, an associated spotability measure is presented, such as 70. By definition, this means the system results in 70% accuracy with a false alarm rate of 5 per hour. To obtain this behavior from the system, the internal score range is modified as shown in Figure 7, such that at the default confidence setting (0.5) the system produces 5 false alarms per hour and a detection rate of 70%. If the user wishes higher accuracy, the confidence setting is lowered, which in turn could possibly create a higher false alarm rate. If the user wishes a lower false alarm rate, the confidence setting is increased, possibly resulting in a lower detection rate.
The diagram 700 illustrates the behavior of the system as the confidence settings are altered. As the Confidence setting approaches 1.0, the rate of detection decreases until it reaches 0.0 at a Confidence setting of 1.0. The rate of false alarms also decreases and approaches 0.0 as the Confidence setting approaches 1.0. Conversely, as the Confidence setting approaches 0.0, the rate of detection increases and the rate of False Alarms (FA/Hr) increases.
As illustrated in Figure 8, a process 800 for utilizing the keyword spotting algorithm is provided. The process 800 may be operative on any or all elements of the system 100 (Figure 1).
Data is contained within both the Keyword Model 805 and the Audio Stream 810. While the Keyword Model 805 may just be needed once during the data flow process, the Audio Stream 810 is a continuous input of data into the system. For example, the Audio Stream may be a person speaking into the system in real-time via a digital telephone. The Keyword Model 805, which is formed by concatenating phoneme HMMs, contains the keywords that are user defined according to user preference. For example, a user may define keywords that are industry specific such as "terms", "conditions", "premium", and "endorsement" for the insurance industry. These keywords in the Keyword Model 805 are used for pattern matching with words that are continuously input into the system via the Audio Stream 810. Control is passed to operation 815 and the process 800 continues.
In operation 815, probability is computed in the Recognition Engine, 140 (Figure 1). As previously described, probability scores are used by the system to determine matched phonemes. The percentage of these phonemes is expected to be above the preset threshold for the keyword found event to be reported. Control is passed to operation 820 and the process 800 continues.
In operation 820, it is determined whether or not the computed probability is greater than the threshold. If it is determined that the probability is greater than the threshold, then control is passed to step 825 and process 800 continues. If it is determined that the probability is not greater than the threshold, then control is passed to step 815 and process 800 continues.
The determination in operation 820 may be made based on any suitable criteria. For example, the threshold may be user set or left at a system default value. As the value of the threshold, or confidence setting, approaches 0.0, the frequency of false alarms increases, while the rate of detection of the keyword may not be much higher than if the confidence setting were slightly higher with a lower frequency of false alarms.
In the event that control is passed back to step 815, probability is then computed again using a different piece of the audio stream and the process proceeds.
In operation 825, the system computes empirical metrics, such as comparison to anti-word scores, mismatch phoneme percentage, match phoneme percentage, and/or duration penalized probability, to name just a few non-limiting examples. The metrics are used to provide secondary data and may serve as an additional check before reporting keyword found events. Control is passed to operation 830 and the process 800 continues.
In operation 830, it is determined whether or not the possible matches are identified as false
alarms. If it is determined that the possible matches are false alarms, then control is passed to step 815
and process 800 continues. If it is determined that the possible matches are not false alarms, then
control is passed to step 835 and process 800 continues.
Once the process returns to step 815, probability is computed again using a different piece of
the audio stream and the process proceeds.
The determination in operation 830 may be made based on any suitable criteria. In some
embodiments, the criteria are based on the probabilities and the empirical metrics that have been
calculated by the system.
In operation 835, the system reports the keyword as found and the process ends.
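Putting the flowchart together, a skeletal Python version of process 800 might look as follows; every helper is a trivial stand-in for the corresponding component described above (recognition engine, empirical-metric checks), so the function bodies are placeholders rather than the patent's logic:

```python
import random

def compute_probability(model, chunk):
    return random.random()            # stand-in for the Viterbi match score

def is_false_alarm(model, chunk):
    return random.random() < 0.3      # stand-in for the empirical-metric checks

def spot_keywords(keyword_model, audio_chunks, threshold=0.5):
    for chunk in audio_chunks:        # continuous stream, processed chunk by chunk
        p = compute_probability(keyword_model, chunk)
        if p <= threshold:
            continue                  # operation 820: below threshold, keep scanning
        if is_false_alarm(keyword_model, chunk):
            continue                  # operation 830: rejected as a false alarm
        yield chunk, p                # operation 835: keyword reported as found

for hit in spot_keywords("premium", range(10)):
    print("keyword found:", hit)
```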
While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes, and modifications that come within the spirit of the inventions as described herein and/or by the following claims are desired to be protected.
Hence, the proper scope of the present invention should be determined only by the broadest
interpretation of the appended claims so as to encompass all such modifications as well as all
relationships equivalent to those illustrated in the drawings and described in the specification.
Claims (13)
1. A method of speech recognition, in a speech recognition system wherein said system comprises at least a speech recognition engine, for real-time spotting of predetermined keywords in an audio stream, comprising the steps of: a) developing a keyword model for the predetermined keywords; b) dividing, by the speech recognition engine, the audio stream into a series of points in an acoustic space that spans all possible sounds created in a particular language; c) determining, by the speech recognition engine, a posterior probability that a first trajectory of each keyword model for the predetermined keywords in the acoustic space matches a second trajectory of a portion of the series of points in the acoustic space, wherein the posterior probability is determined utilizing the mathematical equation: P(T_i | x) = P(x | T_i) / Σ_k P(x | M_k); d) comparing, by the speech recognition engine in real-time, the posterior probability to a predetermined threshold; and e) reporting a spotted keyword if the posterior probability is greater than the predetermined threshold.
2. The method of claim 1, wherein step (e) comprises: e.1) declaring a potential spotted word if the posterior probability is greater than the predetermined threshold; e.2) computing further data to aid in determination of mismatches; e.3) using the further data to determine if the potential spotted word is a false alarm; and e.4) reporting a spotted keyword if a false alarm is not identified at step (e.3).
3. The method of claim 1, wherein step (a) comprises concatenating phoneme hidden Markov models of predetermined keywords.
4. The method of claim 1, wherein step (a) comprises: a.1) creating a pronunciation dictionary that defines a sequence of phonemes for each of the predetermined keywords; a.2) creating an acoustic model that statistically models a relation between textual properties of the phonemes for each of the predetermined keywords and spoken properties of the phonemes for each of the predetermined keywords; and a.3) concatenating acoustic models for the sequence of phonemes for each of the predetermined keywords.
5. The method of claim 4, wherein step (a.2) comprises creating a set of Gaussian mixture models.
6. The method of claim 4, wherein step (a.2) comprises creating the acoustic model selected from the group consisting of: context-independent model, context-dependent model, and triphone model.
7. The method of claim 1, wherein step (b) comprises: b.1) converting the audio stream into a sequence of windows; and b.2) calculating a set of 13 Mel Frequency Cepstral Coefficients and their first and second order derivatives for each window.
8. The method of claim 1, wherein step (c) comprises executing a Viterbi algorithm.
9. The method of claim 1, wherein step (c) comprises: c.1) assigning a constant predetermined probability to the portions of the audio stream that do not match the keyword.
10. The method of claim 2, wherein step (e.2) comprises computing further data selected from the group consisting of: anti-word match scores, mismatch phoneme percentage, match phoneme percentage, duration penalized probability, and a predetermined Confidence value.
11. The method of claim 10, wherein the predetermined Confidence value is chosen for each of the predetermined keywords so as to achieve a desired false alarm rate and accuracy.
12. The method of claim 1, wherein the audio stream comprises a continuous spoken speech stream.
13. The method of claim 1, wherein the acoustic space comprises a 39-dimensional space.