WO2018029071A1 - Audio signature for speech command spotting - Google Patents

Audio signature for speech command spotting

Info

Publication number
WO2018029071A1
WO2018029071A1 (PCT/EP2017/069649)
Authority: WO (WIPO, PCT)
Prior art keywords: speech signal, speech, HFD, command, UBM
Application number: PCT/EP2017/069649
Other languages: French (fr)
Inventor: Sacha Vrazic
Original Assignee: Imra Europe S.A.S
Application filed by Imra Europe S.A.S
Publication of WO2018029071A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/088: Word spotting
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase


Abstract

From a speech signal uttered by a user, for each of a number of time frames T of the speech signal, N Higuchi fractal dimension (HFD) parameters are extracted as a feature vector, using multi-scale HFD, and a feature space is formed from the feature vector and the number of time frames T for each scale of the multi-scale HFD (30). Feature spaces formed for each of a plurality of speech signals are concatenated, a universal background model (UBM) is estimated from the concatenated feature spaces (40), and a user and command dependent Gaussian mixture model (GMM) is estimated for each of the plurality of speech signals using the estimated UBM, thereby estimating GMMs each corresponding to one of the plurality of speech signals (50).

Description

AUDIO SIGNATURE FOR SPEECH COMMAND SPOTTING
DESCRIPTION BACKGROUND OF THE INVENTION Field of the invention
The present invention relates to detecting an audio signature in speech utterances for speech command spotting.
Related background Art
Voice communication is a natural and simple way of communicating between people. However, despite considerable improvement of speech recognition engines, making a machine understand spoken instructions is still challenging. Speech recognition engines work well only in the absence of noise and reverberation. Furthermore, they are language and vocabulary dependent, since the vocabulary is trained (or pre-trained) on large numbers of occurrences of the same phonemes.
One application of speech recognition, but not limited thereto, is speech command spotting for vehicles. Speech commands can be given inside the vehicle to control equipment such as windows, air conditioning, turn indicators (winkers), wipers, etc.
Speech commands can also be given from outside the vehicle, for example when the user reaches his car in the parking space with his hands carrying shopping bags; then, by simply uttering "open", the door on the user's side opens.
Most prior art systems implementing speech recognition or speech spotting use MFCCs (Mel Frequency Cepstral Coefficients), or extensions thereof, as features, together with different types of models based on HMMs (Hidden Markov Models), GMMs (Gaussian Mixture Models), etc.
The problem with these systems is that they require training on words (in reality, units smaller than a syllable) that are repeated many times by numerous speakers. Therefore, the systems are language and vocabulary dependent.
As an example, in vehicles, it is already possible to give voice commands to control the navigation or multimedia system. However, the list of
commands is pre-defined by the manufacturer and cannot be chosen by the vehicle user.
There are also some possibilities to enter a speech reference that is not pre-defined, for example when assigning a voice label to a phone directory entry. However, in general, the performance of such systems is poor. More advanced systems, even commercial ones, require a given sentence to be repeated several times and still do not provide a high recognition rate. The following meanings apply for the abbreviations used in this specification:
GMM Gaussian Mixture Model
HFD Higuchi Fractal Dimension
HMM Hidden Markov Model
MAP Maximum A Posteriori
MFCC Mel Frequency Cepstral Coefficient
UBM Universal Background Model
VAD Voice Activity Detector
SUMMARY OF THE INVENTION At least one embodiment of the present invention aims at overcoming the above drawbacks and has the object of providing a speech spotting system that enables identification of an uttered speech command and of the speaker without any previous training on a large database, in which the speech command can be language independent and does not have to be part of an existing vocabulary.
According to aspects of the present invention, this is achieved by methods, apparatuses and a computer program product as defined in the appended claims.
According to at least one embodiment of the invention, it is possible for a given speaker to define a voice command that is language and vocabulary independent. The command may comprise speech, humming, singing, etc. The command can be registered with only one utterance.
According to an embodiment of the invention, the Higuchi fractal dimension is used followed by probabilistic discrimination. According to an embodiment of the invention, the Higuchi fractal dimension is applied in a multi-scale way in combination with a probabilistic modeling that enables assigning, as a signature, the couple speaker (i.e. user) and command, as well as identifying the command and the user robustly. In the following the invention will be described by way of embodiments thereof with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 shows a schematic block diagram illustrating processing in a registration mode according to an embodiment of the invention. Fig. 2 shows a schematic block diagram illustrating feature computation processing in a registration mode according to embodiments of the invention. Fig. 3 shows a flowchart illustrating a probabilistic modeling processing according to an embodiment of the invention.
Fig. 4 shows a diagram illustrating an example of user and command dependent GMM models according to an embodiment of the invention.
Fig. 5 shows a schematic block diagram illustrating a command and user detection processing in an action mode according to an embodiment of the invention. Fig. 6 shows a diagram illustrating results of the command and user detection processing according to an embodiment of the invention.
Figs. 7A and 7B show diagrams illustrating results of a command and user detection processing according to comparative examples.
Fig. 8 shows a schematic block diagram illustrating a configuration of a control unit in which examples of embodiments of the invention are implementable. DESCRIPTION OF THE EMBODIMENTS
Embodiments of the invention relate to functions that are in the digital domain. However, there is an analog part to condition (amplify and low-pass filter) microphone signals and convert them to digital signals. This part is out of the scope of this application. A speech spotting system according to at least one embodiment of the invention comprises two operation modes, i.e. a "registration" mode and an "action" mode. First, the registration mode will be described. Registration Mode
In the registration mode, a speech signal representing a command uttered by a user as a label to a defined action is registered in the speech spotting system. Referring to Fig. 1, first a speech utterance of the user is acquired by a microphone or microphone array 10 (for example, a one microphone or multi-microphone in-vehicle setting, which is out of the scope of this application). The speech utterance is amplified, low-pass filtered and digitized. Then, in a pre-processing block 20, which is out of scope of this application, noise and interferences for each situation (in-vehicle or out-of-vehicle application) are removed, and a digital audio signal is output from the pre-processing block 20.
A feature extraction block 30 of an embodiment of the invention, which receives the digital audio signal, comprises an estimation according to Higuchi Fractal Dimension (HFD) in a multi-scale way. "Multi-scale" means that the fractal dimension is computed for different (multiple) scales and all these scale dependent fractal dimensions (i.e. HFD parameters) are gathered. The HFD can be used alone or in combination with other features such as Mel-Frequency Cepstral Coefficients (MFCC).
Fig. 2 illustrates details of the feature extraction block 30. First, the digital audio signal is subjected to framing in a framing block 31, in which frames of, for example, 32 ms are overlapped by 50%. A voice activity detector (VAD) 32 applies an algorithm to the framed digital audio signal; the algorithm detects speech presence in the digital audio signal and segments a speech signal corresponding to a command, i.e. finds the start and end of the speech signal. As a command can last several seconds, the speech signal after segmentation is a matrix of time samples, corresponding to speech frames contained in the command. The speech frames are also referred to as time frames of the command. In other words, each column of the matrix contains time samples
corresponding to a given time frame of the command. This matrix is also referred to as speech command matrix. The speech signal, i.e. the speech command matrix, is output from the VAD 32.
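The following Python sketch illustrates one possible implementation of the framing block 31 and a simple energy-based VAD standing in for block 32. The 32 ms frame length and 50% overlap come from the description above; the energy-threshold rule and all function names are illustrative assumptions rather than the patented algorithm.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=32, overlap=0.5):
    """Split a digital audio signal into overlapping frames (block 31).

    Returns a [W x T] matrix whose columns are time frames,
    with W = frame length in samples and T = number of frames.
    """
    W = int(fs * frame_ms / 1000)
    hop = int(W * (1.0 - overlap))
    n_frames = 1 + max(0, (len(x) - W) // hop)
    return np.stack([x[i * hop:i * hop + W] for i in range(n_frames)], axis=1)

def simple_vad(frames, energy_factor=2.0):
    """Very simple energy-based VAD (stand-in for block 32).

    Keeps the contiguous run of frames whose energy exceeds a threshold
    derived from the quietest frames; the result is the 'speech command
    matrix' described in the text.
    """
    energy = np.mean(frames ** 2, axis=0)
    noise_floor = np.percentile(energy, 10)   # assumed noise estimate
    active = energy > energy_factor * noise_floor
    if not active.any():
        return frames[:, :0]
    start = np.argmax(active)
    end = len(active) - np.argmax(active[::-1])
    return frames[:, start:end]

# Usage with a hypothetical 16 kHz signal:
# fs, x = 16000, np.random.randn(16000)
# command_matrix = simple_vad(frame_signal(x, fs))
```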
Then, from the speech signal, a feature space is computed. As mentioned above, according to an embodiment of the invention, it is possible to compute the feature space using only Higuchi fractal dimension block 34 as illustrated in the upper branch of Fig. 2. Alternatively, according to another embodiment of the invention, for computing the feature space the Higuchi fractal dimension block 34 is used together with Mel-frequency cepstral coefficients block 33 as illustrated in the lower branch of Fig. 2.
In the following, processing performed in HFD block 34 will be described.
First, from the speech signal output from the VAD 32, each column of the speech command matrix is processed independently, and from each column, a vector $X_k^m$ of samples (a time-series) is created as given by equation (1).
$$X_k^m = \left\{ x(m),\; x(m+k),\; x(m+2k),\; \dots,\; x\!\left(m + \left\lfloor \frac{W-m}{k} \right\rfloor k \right) \right\} \qquad (1)$$
where k is the time interval, m is the initial time in the dimension computation, and W is the frame size in samples. The adjustment of these parameters defines the number of time-series that are obtained.
Then, the length $L_{m,k}$ of each time-series is computed as given by equation (2):
$$L_{m,k} = \frac{1}{k}\left[\left(\sum_{i=1}^{\lfloor (W-m)/k \rfloor} \left| x(m+ik) - x\bigl(m+(i-1)k\bigr) \right|\right) \frac{W-1}{\left\lfloor \frac{W-m}{k} \right\rfloor k}\right] \qquad (2)$$
The average $L_k$ of the lengths is computed as given by equation (3):
$$L_k = \frac{1}{k} \sum_{m=1}^{k} L_{m,k} \qquad (3)$$
Then, the slope of the line passing through the points $\{\log(1), \log(1/2), \dots, \log(1/m)\}$ on the x-axis and the corresponding points $\log(L_k)$ on the y-axis is computed. The slope is the HFD parameter.
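As a concrete illustration of equations (1) to (3), the following Python sketch computes the HFD of one time frame (one column of the speech command matrix). It follows Higuchi's standard algorithm; treating each scale value as the largest time interval considered in the fit, the function and parameter names, and the least-squares fit via numpy.polyfit are implementation assumptions rather than details taken from the patent.

```python
import numpy as np

def higuchi_fd(frame, k_max):
    """Higuchi fractal dimension of one speech frame (equations (1)-(3)).

    frame : 1-D array of W time samples (one column of the command matrix)
    k_max : largest time interval k used in the log-log fit
    """
    W = len(frame)
    log_inv_k, log_L = [], []
    for k in range(1, k_max + 1):
        L_mk = []
        for m in range(1, k + 1):
            n = (W - m) // k                       # number of steps in X_k^m
            if n < 1:
                continue
            idx = m - 1 + np.arange(n + 1) * k     # samples x(m), x(m+k), ..., eq. (1)
            diffs = np.abs(np.diff(frame[idx]))
            # length of the series, eq. (2), with Higuchi's normalisation
            L_mk.append((diffs.sum() * (W - 1) / (n * k)) / k)
        log_L.append(np.log(np.mean(L_mk)))        # average over m, eq. (3)
        log_inv_k.append(np.log(1.0 / k))
    # HFD = slope of log(L_k) versus log(1/k)
    slope, _ = np.polyfit(log_inv_k, log_L, 1)
    return slope

def multiscale_hfd(frame, scales=(3, 10, 50)):
    """N HFD parameters for one frame, one per chosen scale (assumed scales)."""
    return np.array([higuchi_fd(frame, k) for k in scales])
```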
With the above processing, and for all chosen scales, N HFD parameters are computed for each time frame as a feature vector of length N, which can also be referred to as a "command feature vector". The dimension of the command feature space matrix is [N x T] in the upper branch of Fig. 2, or [(N + M) x T] in the lower branch of Fig. 2, in which, in addition to the N HFD parameters, M parameters according to the MFCC block 33 are computed. T corresponds to the number of time frames of the command. For achieving multi-scale HFD, different values of m are used in the above equations, for example m=3, m=10, and m=50. In case three different values are applied for m, three feature spaces are calculated for the command.
As shown in Fig. 1, the feature space computed in block 30 is input into a universal background model (UBM) estimation block 40, which defines a kind of boundary for the GMM models. According to an embodiment of the invention, the UBM is a user and command independent GMM model. The UBM acts as a prior model and there are many ways to compute it; the most efficient (in terms of model quality) is the Expectation-Maximization approach. The UBM estimated in block 40 is input into block 50, in which a user and command dependent GMM is computed from the UBM using e.g. the Maximum A Posteriori (MAP) approach. For example, the number of Gaussian mixtures is 16, which is the same as for the UBM estimation. The models estimated in blocks 40 and 50 are stored in a user/command model database 60. The database 60 further stores the calculated feature spaces.
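A minimal sketch of the UBM estimation (block 40) and the MAP adaptation to a user/command dependent GMM (block 50), assuming scikit-learn's GaussianMixture for the EM training and a relevance-factor MAP adaptation of the component means only. The relevance factor, the diagonal covariances and the re-use of the UBM covariances are assumptions; the patent does not specify which GMM parameters are adapted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(concatenated_features, n_mix=16):
    """Block 40: user/command independent GMM trained with EM.

    concatenated_features : array [n_frames_total, feature_dim], i.e. all
    stored feature spaces stacked frame-wise.
    """
    ubm = GaussianMixture(n_components=n_mix, covariance_type='diag', max_iter=200)
    ubm.fit(concatenated_features)
    return ubm

def map_adapt_means(ubm, features, relevance=16.0):
    """Block 50: user/command dependent GMM via MAP adaptation of the UBM means."""
    post = ubm.predict_proba(features)            # responsibilities [n_frames, n_mix]
    n_k = post.sum(axis=0)                        # soft counts per mixture
    f_k = post.T @ features                       # first-order statistics [n_mix, dim]
    e_k = f_k / np.maximum(n_k[:, None], 1e-10)   # data-dependent mean estimates
    alpha = (n_k / (n_k + relevance))[:, None]    # adaptation coefficients
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    gmm.weights_ = ubm.weights_                   # weights and covariances kept from the UBM
    gmm.covariances_ = ubm.covariances_
    gmm.precisions_cholesky_ = ubm.precisions_cholesky_
    gmm.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
    return gmm
```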
It is to be noted that every time a new command is registered by a user, i.e. a speech utterance is input by the user using the microphone or microphone array 10 shown in Fig. 1, both the UBM and the GMMs have to be re-estimated. The UBM is estimated on all feature spaces, calculated from each of the plurality of speech signals uttered by the plurality of users, that are stored in the database 60.
Fig. 3 shows a procedure for user and command model estimation according to an embodiment of the invention. In case the registration mode is operated for the first time, the database 60 of user/command models and user/command feature spaces is empty (YES in step S20). Then, from the currently computed feature space extracted from the first speech signal uttered by a user, a UBM is estimated in step S22 and a GMM for the first speech signal (first user/command) is computed in step S23.
In case a second speech signal (a second command) has to be registered, a feature space calculated from this second speech signal and the feature space calculated from the first speech signal (the first command) are used together to estimate the UBM. In other words, in step S21 the feature spaces are concatenated, and in step S22 the UBM is calculated using the concatenated feature spaces. Then, using the UBM, by repeating step S23, a GMM for the first speech signal is re-estimated and a GMM for the second speech signal is estimated. As the second speech signal represents the last user/command (last feature space) in the database 60, in step S24 the process ends after the estimation of the GMM for the second speech signal. Assuming that the number of users/commands (i.e. commands uttered by users) already registered is S, then when registering user/command S+1, all S feature spaces and the current one are used to estimate the UBM in step S22. Then, the S+1 user/command GMMs are (re-)estimated in step S23.
It is to be noted that every time a new command is registered in the speech spotting system, all final user/command models must be re-estimated. In simple terms, this is because the boundaries between models are re-estimated, as a consequence of the UBM-GMM approach.
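The registration procedure of Fig. 3 (steps S20 to S24) can then be sketched as follows, representing the database 60 as a plain list of feature spaces and re-using the helper functions sketched above; all names are illustrative assumptions.

```python
import numpy as np

def register_command(new_feature_space, stored_feature_spaces, n_mix=16):
    """Steps S20-S24: add a feature space and re-estimate the UBM and all GMMs.

    new_feature_space     : [n_frames, dim] features of the new user/command
    stored_feature_spaces : list of previously registered feature spaces
    Returns the updated list, the re-estimated UBM and one GMM per command.
    """
    stored_feature_spaces = stored_feature_spaces + [new_feature_space]  # database 60
    concatenated = np.vstack(stored_feature_spaces)                      # step S21
    ubm = train_ubm(concatenated, n_mix=n_mix)                           # step S22
    gmms = [map_adapt_means(ubm, fs) for fs in stored_feature_spaces]    # step S23 (re-)estimation
    return stored_feature_spaces, ubm, gmms
```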
Fig. 4 shows a two-dimensional representation of three user/command GMMs estimated according to an embodiment of the invention. For graphical representation purposes, only two dimensions of the GMMs are represented; the GMMs in fact have many more dimensions.
The straight lines in Fig. 4 represent the boundaries between models, which are important in the discrimination (decision) of which speech signal was uttered (i.e. which command was uttered by which user). Therefore, each model sits in a kind of cluster.
According to an embodiment of the invention, the computed user/command dependent GMMs, the UBM and all feature spaces are kept in the database 60. As explained earlier, the feature spaces of all registered commands (and not only their GMMs) must also be kept, because they are needed in the re-estimation procedure when a command is added or removed. It is noted that if a command is removed, the same re-estimation procedure as performed for adding a new command applies, to estimate new GMMs on all remaining commands.
Action Mode In the following, the action mode of the speech spotting system according to an embodiment of the invention will be described. In the action mode, an uttered speech signal is evaluated in order to find whether a command (i.e. a couple of user and command) registered in the speech spotting system in the registration mode matches the uttered speech signal.
According to an embodiment of the invention, the registered commands are detected in a speech flow (continuous speech). According to another embodiment of the invention, the registered commands are detected from a short-time speech segment.
Fig. 5 illustrates processing in the action mode according to an embodiment of the invention. The uttered speech signal (also referred to as trial uttered command) is input via a microphone or microphone array 41 which may be the same as the microphone or microphone array 10 of Fig. 1.
In Fig. 5, the pre-processing block 20 and the feature extraction block 36 are similar to blocks 20 and 30 used in the registration mode, except for the VAD in block 36, which is slightly different in order to segment the commands in the speech flow, rather than in a time-limited recording.
In blocks 44 and 45, the log-likelihood is computed for both the UBM and the GMMs using the feature space from the trial uttered command. The final log-likelihood LL is given by the average difference between the UBM and GMM log-likelihoods.
If the final LL is below a predetermined threshold, then no command (none of the registered commands uttered by a given user) is detected. In other words, in block 46 it is decided that the trial uttered command is not a registered command and user. Otherwise, the highest final LL provides the most probable detected couple of command and user, which is the output information from block 46. It may happen that the same command is uttered by multiple users. Such a case is not a problem, as the user will be discriminated in block 46.
According to an embodiment of the invention, in block 46, final log-likelihoods are calculated by computing an average difference between the log-likelihood for the UBM and the log-likelihoods for the GMMs. Further, in block 46, a registered command uttered by a registered user is detected based on a final log-likelihood of the calculated final log-likelihoods if the final log-likelihood exceeds a predetermined threshold. Finally, in block 46, the registered command and the registered user are decided based on the maximum log-likelihood of the final log-likelihoods exceeding the predetermined threshold.
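The scoring and decision of blocks 44 to 46 could look as follows, again a sketch under the assumption that each model exposes per-frame log-likelihoods via score_samples (as in the sketches above); the threshold value is an arbitrary placeholder to be tuned.

```python
import numpy as np

def detect_command(trial_features, ubm, gmms, threshold=0.5):
    """Blocks 44-46: score a trial utterance against all registered couples.

    trial_features : [n_frames, dim] feature space of the trial uttered command
    gmms           : list of user/command dependent GMMs (one per registered couple)
    Returns the index of the detected user/command couple, or None.
    """
    ubm_ll = ubm.score_samples(trial_features)                 # per-frame UBM log-likelihood
    final_ll = [np.mean(g.score_samples(trial_features) - ubm_ll) for g in gmms]
    best = int(np.argmax(final_ll))
    if final_ll[best] < threshold:                             # no registered command detected
        return None
    return best                                                # most probable couple (user, command)
```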
Fig. 6 shows a confusion matrix illustrating the result obtained in block 46 for five different registered users (i.e. speakers) and three registered commands for each registered user. Hence, there are 15 registered couples of user and command.
Each registered user utters each registered command 24 times. The x-axis represents the target, i.e. what must be detected, and the y-axis is the output from block 46. The number of correct detections is given on the diagonal of the confusion matrix. On the x-axis, indices 1 to 3 correspond to the three commands uttered by user 1, indices 4 to 6 correspond to the three commands uttered by user 2, indices 7 to 9 correspond to the three commands uttered by user 3, indices 10 to 12 correspond to the three commands uttered by user 4, and indices 13 to 15 correspond to the three commands uttered by user 5. The same applies for the y-axis.
When the number on the diagonal is equal to 24, it means that every time the command was uttered, the user and command were correctly recognized. When the number is below 24, it means that there are some errors, and it is possible to derive information about them. For example, in the case shown in Fig. 6, when user 2 uttered command 3, one misdetection out of 24 trials occurred (number 23 on the diagonal), and by checking the column it can be seen that this one misdetection was detected as user 4/command 2.
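The recognition rate reported in the result table can be recomputed from such a confusion matrix as the fraction of trials falling on the diagonal; a minimal numpy sketch (the matrix itself is hypothetical):

```python
import numpy as np

def recognition_rate(confusion):
    """Fraction of correctly detected (user, command) couples.

    confusion[i, j] counts trials whose target was couple j and whose
    output from block 46 was couple i.
    """
    return np.trace(confusion) / confusion.sum()

# Example: 15 couples with 24 utterances each; a perfect system would give
# a diagonal of 24s and a recognition rate of 1.0.
```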
The result table shown at the bottom right corner of Fig. 6 indicates an excellent recognition rate of 98.1% for the couples of user and command.
According to embodiments of the invention, Higuchi's Fractal Dimension is applied as a key feature element in a multi-scale approach, combined with the UBM/GMM estimation procedure, for uniquely modeling the user/command as an audio signature that can be used alone or in combination with other features. In the following, the results illustrated in Fig. 6 are compared with results achieved by a first conventional speech spotting system using features extracted from a speech signal using a fractal dimension (different from Higuchi's Fractal Dimension) followed by a simple discrimination, and by a second conventional speech spotting system using the fractal dimension features together with features derived from the entropy of the speech signal.
Fig. 7A shows the results obtained from the first conventional speech spotting system, and Fig. 7B shows the results obtained from the second conventional speech spotting system, for five different registered users (i.e. speakers) and three registered commands for each registered user, applying the same conditions and data as in the embodiment of the invention whose result is illustrated in Fig. 6. Hence, there are 15 couples of user and command. Each registered user utters each registered command 24 times. The x-axis represents the target, i.e. what must be detected, and the y-axis is the output from block 46. The number of correct detections is given on the diagonal of the confusion matrix. On the x-axis, indices 1 to 3 correspond to the three commands uttered by user 1, indices 4 to 6 correspond to the three commands uttered by user 2, indices 7 to 9 correspond to the three commands uttered by user 3, indices 10 to 12 correspond to the three commands uttered by user 4, and indices 13 to 15 correspond to the three commands uttered by user 5. The same applies for the y-axis.
The number of correct detections is given on the diagonal of the confusion matrices, and it should be equal to 24, as there are 24 repetitions of each command.
By using only the fractal dimension features, the recognition rate is low at 10.6%, as illustrated at the bottom right corner in Fig. 7A. When adding the second features (entropy), the results are improved but remain low at 14.2%, as illustrated at the bottom right corner in Fig. 7B.
Fig. 8 shows a schematic block diagram illustrating a configuration of a control unit in which at least some of the above described embodiments of the invention are implementable. The control unit comprises processing resources (processing circuitry), memory resources (memory circuitry) and interfaces. The microphone or microphone array 10, 41 may be
implemented by the interfaces, and at least some of the processing in blocks 20, 30, 36, 40, 44, 45, 46, 50 and 60 and steps S20 to S24 may be realized by the processing resources (processing circuitry) and memory resources (memory circuitry) of the control unit.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software (computer readable instructions embodied on a computer readable medium), logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
It is to be understood that the above description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications and applications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.

Claims

CLAIMS :
1. A method of registering commands uttered by users, the method comprising:
acquiring a plurality of speech signals, each of the plurality of speech signals corresponding to a command of a plurality of commands, which is uttered by a user of a plurality of users;
for each of the plurality of speech signals, extracting, for each of a number of time frames T of the speech signal, N Higuchi fractal dimension (HFD) parameters, as a feature vector, from the speech signal using multi-scale HFD, and forming a feature space from the feature vector and the number of time frames T of the speech signal for each scale of the multi-scale HFD, N and T being integers equal to or greater than one, thereby forming feature spaces each corresponding to one of the plurality of speech signals;
concatenating the feature spaces;
estimating a universal background model (UBM) from the concatenated feature spaces; and
estimating a user and command dependent Gaussian mixture model (GMM) for each of the plurality of speech signals using the estimated UBM, thereby estimating GMMs each corresponding to one of the plurality of speech signals.
2. The method of claim 1, comprising:
holding the estimated GMMs, the UBM and the feature spaces in a database.
3. The method of claim 1 or 2, comprising:
extracting the speech signal from a digital audio signal.
4. The method of any one of claims 1 to 3, comprising: for each of the plurality of speech signals, extracting, for each time frame of the speech signal, M Mel-frequency cepstral coefficients (MFCCs) from the speech signal, M being an integer equal to or greater than one, wherein the feature vector comprises the M MFCCs and the N HFD parameters.
5. A method of detecting registered commands uttered by registered users, the method comprising:
acquiring a speech signal;
extracting, for each of a number of time frames T of the speech signal, N Higuchi fractal dimension (HFD) parameters, as a feature vector, from the speech signal using multi-scale HFD, and forming a feature space from the feature vector and the number of time frames T of the speech signal for each scale of the multi-scale HFD, N and T being integers equal to or greater than one;
acquiring a universal background model (UBM) and at least one user and command dependent Gaussian mixture model (GMM);
calculating, using the feature space, a log-likelihood for the UBM and a log-likelihood for the at least one GMM;
calculating at least one final log-likelihood by computing an average difference between the log-likelihood for the UBM and the log-likelihood for the at least one GMM;
detecting, in the speech signal, a registered command uttered by a registered user if the at least one final log-likelihood exceeds a predetermined threshold; and
deciding the registered command and the registered user based on the maximum log-likelihood out of the at least one final log-likelihood exceeding the predetermined threshold.
6. The method of claim 5, wherein the UBM and the at least one GMM are estimated by: acquiring a plurality of speech signals for registration, each of the plurality of speech signals for registration corresponding to a command of a plurality of commands, which is uttered by a user of a plurality of users, for each of the plurality of speech signals for registration, extracting, for each of a number of time frames T of the speech signal for registration, N Higuchi fractal dimension (HFD) parameters, as a feature vector for registration, from the speech signal for registration using multi-scale HFD, and forming a feature space for registration from the feature vector for registration and the number of time frames T of the speech signal for registration for each scale of the multi-scale HFD, N and T being integers equal to or greater than one, thereby forming feature spaces for registration each corresponding to one of the plurality of speech signals for registration; concatenating the feature spaces for registration;
estimating the universal background model (UBM) from the concatenated feature spaces for registration; and
estimating a user and command dependent Gaussian mixture model (GMM) for each of the plurality of speech signals for registration using the estimated UBM, thereby estimating the at least one GMM.
7. The method of claim 5 or 6, comprising:
acquiring the speech signal from a digital audio signal representing continuous speech.
8. The method of any one of claims 5 to 7, comprising:
extracting, for each time frame of the speech signal, M Mel-frequency cepstral coefficients (MFCCs) from the speech signal, M being an integer equal to or greater than one,
wherein the feature vector comprises the M MFCCs and the N HFD parameters.
9. The method of claim 6, comprising: extracting, for each time frame of the speech signal, M Mel-frequency cepstral coefficients (MFCCs) from the speech signal, M being an integer equal to or greater than one,
wherein the feature vector comprises the M MFCCs and the N HFD parameters,
wherein the UBM and the at least one GMM are further estimated by: for each of the plurality of speech signals for registration, extracting, for each time frame of the speech signal for registration, M Mel-frequency cepstral coefficients (MFCCs) from the speech signal for registration, M being an integer equal to or greater than one,
wherein the feature vector for registration comprises the M MFCCs and the N HFD parameters.
10. A computer program product including a program for a processing device, comprising software code portions for performing the steps of any one of claims 1 to 9 when the program is run on the processing device.
11. The computer program product according to claim 10, wherein the computer program product comprises a computer-readable medium on which the software code portions are stored.
12. The computer program product according to claim 10, wherein the program is directly loadable into an internal memory of the processing device.
13. An apparatus for registering commands uttered by users, the apparatus comprising:
an extracting unit (30) configured to:
acquire a plurality of speech signals, each of the plurality of speech signals corresponding to a command of a plurality of commands, which is uttered by a user of a plurality of users, and
for each of the plurality of speech signals, extract, for each of a number of time frames T of the speech signal, N Higuchi fractal dimension (HFD) parameters, as a feature vector, from the speech signal using multi-scale HFD, and form a feature space from the feature vector and the number of time frames T of the speech signal for each scale of the multi-scale HFD, N and T being integers equal to or greater than one, thereby forming feature spaces each corresponding to one of the plurality of speech signals; and
an estimating unit (40, 50) configured to:
concatenate the feature spaces,
estimate a universal background model (UBM) from the concatenated feature spaces, and
estimate a user and command dependent Gaussian mixture model (GMM) for each of the plurality of speech signals using the estimated UBM, thereby estimating GMMs each corresponding to one of the plurality of speech signals.
14. The apparatus of claim 13, wherein the extracting unit is configured to, for each of the plurality of speech signals, extract, for each time frame of the speech signal, M Mel-frequency cepstral coefficients (MFCCs) from the speech signal, M being an integer equal to or greater than one,
wherein the feature vector comprises the M MFCCs and the N HFD parameters.
15. An apparatus for detecting registered commands uttered by registered users, the apparatus comprising:
an extraction unit (36) configured to:
acquire a speech signal, and
extract, for each of a number of time frames T of the speech signal, N Higuchi fractal dimension (HFD) parameters, as a feature vector, from the speech signal using multi-scale HFD, and form a feature space from the feature vector and the number of time frames T of the speech signal for each scale of the multi-scale HFD, N and T being integers equal to or greater than one;
a calculating unit (44, 45) configured to: acquire a universal background model (UBM) and at least one user and command dependent Gaussian mixture model (GMM), and
calculate, using the feature space, a log-likelihood for the UBM and a log-likelihood for the at least one GMM; and
a deciding unit (46) configured to:
calculate at least one final log-likelihood by computing an average difference between the log-likelihood for the UBM and the log-likelihood for the at least one GMM,
detect, in the speech signal, a registered command uttered by a registered user if the at least one final log-likelihood exceeds a predetermined threshold, and
decide the registered command and the registered user based on the maximum log-likelihood out of the at least one final log-likelihood exceeding the predetermined threshold.
PCT/EP2017/069649 2016-08-12 2017-08-03 Audio signature for speech command spotting WO2018029071A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102016115018.5A DE102016115018B4 (en) 2016-08-12 2016-08-12 Audio signature for voice command observation
DE102016115018.5 2016-08-12

Publications (1)

Publication Number Publication Date
WO2018029071A1 true WO2018029071A1 (en) 2018-02-15

Family

ID=59520913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/069649 WO2018029071A1 (en) 2016-08-12 2017-08-03 Audio signature for speech command spotting

Country Status (2)

Country Link
DE (1) DE102016115018B4 (en)
WO (1) WO2018029071A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766465A (en) * 2018-06-06 2018-11-06 华中师范大学 A kind of digital audio based on ENF universal background models distorts blind checking method
WO2019232826A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method and apparatus, device, and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140200890A1 (en) * 2012-11-30 2014-07-17 Stmicroelectronics Asia Pacific Pte Ltd. Methods, systems, and circuits for speaker dependent voice recognition with a single lexicon

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140200890A1 (en) * 2012-11-30 2014-07-17 Stmicroelectronics Asia Pacific Pte Ltd. Methods, systems, and circuits for speaker dependent voice recognition with a single lexicon

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DOUGLAS A. REYNOLDS ET AL: "Speaker Verification Using Adapted Gaussian Mixture Models", DIGITAL SIGNAL PROCESSING., vol. 10, no. 1-3, 1 January 2000 (2000-01-01), US, pages 19 - 41, XP055282688, ISSN: 1051-2004, DOI: 10.1006/dspr.1999.0361 *
FULUFHELO V NELWAMONDO ET AL: "Multi-scale Fractal Dimension for Speaker Identification System", PROCEEDINGS OF THE 8TH WSEAS INT. CONF. ON AUTOMATIC CONTROL, MODELING AND SIMULATION, 14 March 2006 (2006-03-14), Prague, Czech Republic, pages 81 - 86, XP055418472, Retrieved from the Internet <URL:https://s3.amazonaws.com/academia.edu.documents/39478723/Multi-scale_Fractal_Dimension_for_Speake20151027-13450-22nlgh.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1508847939&Signature=RTzFrFR8NJRlVSJQHeBMroowves=&response-content-disposition=inline; filename=Multi-scale_fractal_dimension_for_spe> [retrieved on 20171024] *
ZAKI MOHAMMADI ET AL: "Effectiveness of fractal dimension for ASR in low resource language", THE 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, IEEE, 12 September 2014 (2014-09-12), pages 464 - 468, XP032669148, DOI: 10.1109/ISCSLP.2014.6936645 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766465A (en) * 2018-06-06 2018-11-06 华中师范大学 A kind of digital audio based on ENF universal background models distorts blind checking method
WO2019232826A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN108766465B (en) * 2018-06-06 2020-07-28 华中师范大学 Digital audio tampering blind detection method based on ENF general background model

Also Published As

Publication number Publication date
DE102016115018A1 (en) 2018-02-15
DE102016115018B4 (en) 2018-10-11

Similar Documents

Publication Publication Date Title
KR101988222B1 (en) Apparatus and method for large vocabulary continuous speech recognition
CN105529026B (en) Speech recognition apparatus and speech recognition method
EP2189976B1 (en) Method for adapting a codebook for speech recognition
GB2580856A (en) International Patent Application For Method, apparatus and system for speaker verification
US10733986B2 (en) Apparatus, method for voice recognition, and non-transitory computer-readable storage medium
US20190279644A1 (en) Speech processing device, speech processing method, and recording medium
JP6464005B2 (en) Noise suppression speech recognition apparatus and program thereof
KR101893789B1 (en) Method for speech endpoint detection using normalizaion and apparatus thereof
JP2006171750A (en) Feature vector extracting method for speech recognition
JP4897040B2 (en) Acoustic model registration device, speaker recognition device, acoustic model registration method, and acoustic model registration processing program
JP3298858B2 (en) Partition-based similarity method for low-complexity speech recognizers
WO2018029071A1 (en) Audio signature for speech command spotting
JP4074543B2 (en) Audio processing apparatus, audio processing method, audio processing program, and program recording medium
JP6481939B2 (en) Speech recognition apparatus and speech recognition program
JP5342629B2 (en) Male and female voice identification method, male and female voice identification device, and program
TWI578307B (en) Acoustic mode learning device, acoustic mode learning method, sound recognition device, and sound recognition method
KR101023211B1 (en) Microphone array based speech recognition system and target speech extraction method of the system
JP3493849B2 (en) Voice recognition device
JP2003271190A (en) Method and device for eliminating noise, and voice recognizing device using the same
JP4325044B2 (en) Speech recognition system
Morales-Cordovilla et al. On the use of asymmetric windows for robust speech recognition
JP4244524B2 (en) Voice authentication apparatus, voice authentication method, and program
Rehr et al. Cepstral noise subtraction for robust automatic speech recognition
KR20100056859A (en) Voice recognition apparatus and method
CN107039046B (en) Voice sound effect mode detection method based on feature fusion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17748480; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 17748480; Country of ref document: EP; Kind code of ref document: A1)