CN110767223A - Real-time detection method for single-channel robust voice keywords

Info

Publication number
CN110767223A
CN110767223A (application CN201910945315.6A)
Authority
CN
China
Prior art keywords
neural network
keyword
frame
mel
signal
Prior art date
Legal status
Granted
Application number
CN201910945315.6A
Other languages
Chinese (zh)
Other versions
CN110767223B (en)
Inventor
胡鹏
闫永杰
Current Assignee
Elephant Acoustical (shenzhen) Technology Co Ltd
Original Assignee
Elephant Acoustical (shenzhen) Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Elephant Acoustical (shenzhen) Technology Co Ltd
Priority to CN201910945315.6A
Publication of CN110767223A
Application granted
Publication of CN110767223B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 ... using orthogonal transformation
    • G10L19/04 ... using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L19/265 Pre-filtering, e.g. high frequency emphasis prior to encoding
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 ... characterised by the type of extracted parameters
    • G10L25/24 ... the extracted parameters being the cepstrum
    • G10L25/27 ... characterised by the analysis technique
    • G10L25/30 ... using neural networks
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a single-channel robust voice keyword real-time detection method, which comprises the following steps: receiving noisy speech in electronic format; converting the time-domain speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform; processing the frequency-domain signal with a mel filter to obtain mel features as acoustic features; passing the mel features frame by frame through a neural network and then through a normalized exponential function to obtain confidence information for each keyword; when the confidence of a keyword exceeds a predefined threshold, splicing the current frame with several preceding frames as the output of the neural network; and, when the resulting confidence value also exceeds a predefined threshold, deciding that the keyword is detected, and otherwise that it is not. The invention performs excellently: it maintains a high wake-up rate in noisy environments, has wide practicability, and greatly reduces the false alarm rate of the neural network, improving overall performance.

Description

Real-time detection method for single-channel robust voice keywords
Technical Field
The invention relates to the technical field of electronic communication noise reduction, and in particular to a single-channel robust voice keyword real-time detection method.
Background
With the rise of applications such as intelligent assistants and smart speakers, voice keyword detection has drawn increasing attention from industry as an important link in human-computer interaction. Filler models based on hidden Markov models were the first to be applied to keyword detection. Among the variants of the task, robust single-channel keyword detection is particularly challenging, because it relies only on the sound signal recorded by one microphone and cannot use the spatial information commonly exploited by microphone arrays. On the other hand, robust single-microphone keyword detection covers broader acoustic application scenarios than keyword detection with a beamformed microphone array (spatial filtering through an appropriately configured sensor array), and because only one microphone is used, it is cheaper and more convenient. A recent breakthrough in keyword detection replaces the hidden Markov model with a deep neural network, which occupies less memory, needs no decoding search, and achieves high accuracy. The most advanced prior method uses a feedforward deep neural network (DNN) trained on a large amount of data with frame-level labels. Although this method can detect keywords, its robustness to noise is poor; it can be improved by adding noise of different types and different signal-to-noise ratios to the input speech during training, but the false alarm rate remains high.
The existing solutions have the following disadvantages:
1. although keyword detection can be achieved, robustness to noise is poor;
2. the false alarm rate is high.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the poor noise robustness and high false alarm rate of the prior-art methods by providing a single-channel robust voice keyword real-time detection method.
The technical solution adopted by the invention to solve this problem is as follows: compared with keyword detection using a beamformed microphone array, the method maintains a high wake-up rate without using spatial position information, and, using only one microphone, it covers wider application scenarios. The method detects keywords by supervised learning and achieves robustness in noisy scenes by combining two training targets: noise reduction and keyword detection. It performs excellently, maintains a high wake-up rate in noisy environments, has wide practicability, and greatly reduces the false alarm rate of the neural network.
The single-channel robust voice keyword real-time detection method of the invention comprises the following steps:
S1, receiving a noisy speech signal in electronic format, the signal containing human voice and non-human background noise;
S2, converting the time-domain noisy speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform;
S3, processing the frequency-domain signal with a mel filter to obtain mel features, which serve as the acoustic features;
S4, the neural network comprises: a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network;
the mel features pass frame by frame through the convolutional neural network, the unidirectional long short-term memory recurrent neural network and the feedforward deep neural network, and are then processed by a normalized exponential function to obtain confidence information for each keyword;
S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network;
S6, the output signal of the neural network passes through an attention mechanism and the feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when this confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
In the single-channel robust voice keyword real-time detection method, the mel feature is formed by splicing the mel features of the current frame and several future frames.
In the single-channel robust voice keyword real-time detection method, the unidirectional long short-term memory recurrent neural network comprises a plurality of stacked unidirectional layers, each unidirectional layer having sixty-four neurons.
In the single-channel robust voice keyword real-time detection method, the neural network is trained on a large noisy data set, wherein the noisy speech is formed by mixing multiple kinds of noise with the voices of multiple speakers;
the noisy speech mixes thousands of different types of noise with the speech of more than five hundred speakers.
In the single-channel robust voice keyword real-time detection method, the convolutional neural network is formed by stacking a plurality of single convolutional layers;
the single convolutional layers of the convolutional neural network are connected by activation function layers.
In the single-channel robust voice keyword real-time detection method, the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers;
the linear layers of the feedforward deep neural network are connected with each other through activation function layers.
In the single-channel robust voice keyword real-time detection method, the attention mechanism is a soft attention mechanism.
In the single-channel robust voice keyword real-time detection method, the input of the attention mechanism comes from the output of the convolutional neural network layer.
In the single-channel robust voice keyword real-time detection method, the input of the attention mechanism is obtained by combining the convolutional-layer output signals of the current frame and several past frames.
In the single-channel robust voice keyword real-time detection method, the vector size output by the neural network is the number of keywords participating in training plus one.
The single-channel robust voice keyword real-time detection method of the invention has the beneficial effects of excellent performance: it effectively reduces noise in close-range conversation scenes, has stronger practicability than the prior art, and does not depend on the specific noise or speaker.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a functional block diagram of the single-channel robust voice keyword real-time detection method of the present invention;
FIG. 2 is a table of sample keywords and their sequence numbers for the method;
FIG. 3 shows the decreasing trend of the cross-entropy loss during training of the method;
FIG. 4 shows the trend of the mean square error during the actual training process of the method;
FIG. 5 is a schematic diagram of the soft attention mechanism of the method;
FIG. 6 is a flow chart of the method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the single-channel robust voice keyword real-time detection method comprises the following steps: S1, receiving a noisy speech signal in electronic format, the signal containing human voice and non-human background noise; S2, converting the time-domain noisy speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform; S3, processing the frequency-domain signal with a mel filter to obtain mel features, which serve as the acoustic features; S4, the neural network comprises a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network; the mel features pass frame by frame through these three networks and are then processed by a normalized exponential function (Softmax) to obtain confidence information for each keyword; S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network; S6, the output signal of the neural network passes through an attention mechanism and a feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when this confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
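For orientation, the following is a minimal Python sketch of this two-stage decision flow. The names frame_net and sentence_net stand for the trained first-stage network (the CNN, LSTM and DNN of step S4) and the second-stage network (the attention mechanism plus DNN of step S6); the 180-frame context length and the 0.5 thresholds are illustrative assumptions, not values fixed by this description.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def detect_keyword(mel_frames, frame_net, sentence_net, t1=0.5, t2=0.5):
        # mel_frames: (T, F) acoustic features from steps S2-S3.
        # frame_net returns per-frame hidden features (T, H) and logits
        # (T, K+1), with class 0 being the non-keyword class (step S4).
        hidden, logits = frame_net(mel_frames)
        probs = softmax(logits)                          # frame-level confidences
        for t in range(probs.shape[0]):
            kw = 1 + int(np.argmax(probs[t, 1:]))        # best keyword class this frame
            if probs[t, kw] > t1:                        # S5: first-stage trigger
                ctx = hidden[max(0, t - 179): t + 1]     # splice current + preceding frames
                if softmax(sentence_net(ctx))[kw] > t2:  # S6: sentence-level confirmation
                    return kw                            # keyword index detected
        return None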
Further, confidence is also called reliability, confidence level, or confidence coefficient. When a population parameter is estimated from a sample, the conclusion is always uncertain because of the randomness of sampling. A probabilistic statement is therefore used, namely interval estimation in mathematical statistics: the probability that the estimate and the population parameter lie within a certain allowed error range, and this probability is called the confidence.
Further, the short-time Fourier transform (STFT) is a Fourier-related mathematical transform used to determine the frequency and phase of local sections of a time-varying signal.
Further, the normalized exponential function, or Softmax function, is a generalization of the logistic function in mathematics, especially in probability theory and related fields. It "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z), such that each element lies in the range (0,1) and all elements sum to 1. This function is mostly used in multi-class classification problems.
Further, in step S2, the time-domain digital signal is converted into a frequency-domain signal by the fast Fourier transform (FFT). Once the signal is in the frequency domain, its frequency components can be conveniently analyzed and processed there, enabling many signal-processing algorithms that cannot be accomplished in the time domain.
Furthermore, in step S1, the training process uses noisy speech obtained by mixing clean speech and noise at different signal-to-noise ratios, while the inference process uses real recorded speech. In step S2, feature extraction frames and windows the noisy speech; each frame is twenty milliseconds long, with a ten-millisecond overlap between adjacent frames. A spectral magnitude vector is extracted on each frame using the fast Fourier transform and filtered with a mel filter to obtain the acoustic features of each frame. Because the speech signal has strong correlation in the time dimension, and this correlation is very helpful for the keyword detection task, the method splices the current frame with several neighboring frames into a higher-dimensional vector as the input feature, improving keyword detection performance. The method is executed by a computer program: acoustic features are extracted from the noisy speech, which is then processed by the deep neural network to decide whether the original noisy speech contains a keyword. The method includes one or more program modules executed by any system or hardware device with executable computer program instructions.
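A minimal sketch of this feature-extraction front end, assuming a 16 kHz sampling rate, 40 mel bands and log compression (all illustrative; the text fixes only the 20 ms frame length and 10 ms overlap) and using librosa for the STFT and mel filterbank:
    import numpy as np
    import librosa  # assumed available; any STFT + mel filterbank implementation would do

    def mel_features(noisy_wave, sr=16000, n_mels=40):
        # 20 ms frames with a 10 ms hop, i.e. a 10 ms overlap between
        # adjacent frames, as described above.
        n_fft = int(0.020 * sr)   # 20 ms window: 320 samples at 16 kHz
        hop = int(0.010 * sr)     # 10 ms hop
        mag = np.abs(librosa.stft(noisy_wave, n_fft=n_fft, hop_length=hop))  # magnitude spectrum
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # mel filterbank
        return np.log(mel_fb @ mag + 1e-8).T  # (frames, n_mels) acoustic features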
Further, for applications requiring real-time processing, such as mobile hearing-aid noise reduction, using information from future frames is unacceptable because it introduces delay. For keyword detection, delay within a certain range is acceptable, so some real-time performance can be sacrificed for better accuracy. Specifically, the current frame can be spliced with the next 10 frames as the input of the present invention; this adds only 100 milliseconds of delay while increasing keyword detection accuracy, and the number of future frames can be increased to 20 for further gains. Note that historical frames are not spliced into the input: the long short-term memory recurrent neural network used as a component of the network already retains a portion of the important historical input information, so only future frames are needed, which reduces the computation of the neural network and the power consumption of the hardware it runs on.
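The frame splicing described above could look like the following sketch; the repeat-padding at the end of the utterance is an assumption, as the text does not specify how edge frames are handled:
    import numpy as np

    def splice_future(feats, n_future=10):
        # Concatenate each frame with its next n_future frames (100 ms of
        # lookahead at a 10 ms hop). Historical frames are not spliced,
        # since the LSTM already retains past information.
        T, F = feats.shape
        padded = np.vstack([feats, np.repeat(feats[-1:], n_future, axis=0)])
        return np.hstack([padded[i:i + T] for i in range(n_future + 1)])  # (T, F * (n_future + 1))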
More specifically, the present invention supports one or more keywords. Different numbers of wake words correspond to different network output dimensions; the network output includes a first-level frame-level output and a second-level sentence-level output, and the output dimension equals the number of keywords plus one: with one keyword the output dimension is 2, with two keywords it is 3, and so on. Before training, each command word is numbered; for example, the six keywords "light on", "turn on tv", "turn on air conditioner", "turn off light", "turn off tv" and "turn off air conditioner" may be numbered as shown in fig. 2. Besides the keywords, the non-keyword class must also be numbered: the non-keyword class is usually numbered 0, and the keywords are numbered sequentially from 1 as the labels of the training speech. During training, the labels are one-hot encoded, and the encoded result is combined with the neural network output to compute the cross entropy loss.
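As an illustration of this numbering and loss computation, a sketch using the six example command words (translated) mentioned above; the layout beyond "non-keyword = 0, keywords from 1" follows fig. 2:
    import numpy as np

    labels = {"non-keyword": 0, "light on": 1, "turn on tv": 2,
              "turn on air conditioner": 3, "turn off light": 4,
              "turn off tv": 5, "turn off air conditioner": 6}

    def one_hot(idx, n_classes=len(labels)):
        v = np.zeros(n_classes)
        v[idx] = 1.0
        return v

    def cross_entropy(probs, target_idx):
        # Cross entropy between the softmax output and a one-hot label
        # reduces to the negative log-probability of the true class.
        return -np.log(probs[target_idx] + 1e-12)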
Further, cross entropy is an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is typically measured by cross entropy and perplexity. Cross entropy expresses the difficulty of recognizing text with the model or, from a compression viewpoint, how many bits on average are needed to encode each word. Perplexity expresses the average number of branching choices per word under the model; its reciprocal can be regarded as the average probability of each word.
Further, generalization ability is very important for any supervised learning approach; it refers to the method's performance in scenarios that did not participate in training. For generalization, the method mixes clean human voice with about 10000 kinds of noise collected from different scenes, at different signal-to-noise ratios (SNR) and loudness levels, and then solves the generalization problem through large-scale training. Because the recurrent neural network can model long-term dependencies in the signal, the proposed model generalizes well to new noise and speaker scenarios, which is very important for practical applications. Preferably, to achieve better performance, the present invention uses an RNN model that depends on future frames, so the network can obtain information from both the past and the future.
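A sketch of mixing clean speech with noise at a target SNR, as used to build the noisy training set; the power-based scaling shown here is one common choice, and the loudness randomization mentioned above is omitted:
    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Scale the noise so that the clean-to-noise power ratio equals
        # snr_db, then mix.
        noise = np.resize(noise, clean.shape)  # loop or crop the noise to length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise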
As shown in fig. 6, which illustrates the whole process of the invention in detail, a noisy speech signal is input and a keyword number is output; "1" in the figure marks steps used only during training, "2" marks steps of the inference or prediction phase, and "3" marks steps shared by training and prediction. To keep the invention robust in strong-noise environments, the Ideal Ratio Mask (IRM) is used as an auxiliary training target in the training stage. The IRM is obtained by comparing the mel features of the noisy speech with the mel features of the clean speech. As shown in fig. 1, the invention employs a 7-layer deconvolutional neural network (De-CNN), corresponding to the 7-layer convolutional neural network (CNN), to estimate an ideal ratio mask for each input noisy utterance, and then computes the mean square error (MSE) between the ideal ratio mask and the estimated mask. The loss function of the neural network therefore contains both the cross entropy loss and the mean square error, and the network minimizes the loss over the whole training set through repeated iterations. After the training phase ends, the prediction phase begins, in which the deconvolution part is not used at all. This way of training multiple targets simultaneously is generally called joint training. The output of the deconvolutional part is a predicted ideal ratio mask; applying it to the mel features of the noisy speech can recover the mel features of the clean speech. Adding the predicted ideal ratio mask as a training target therefore makes the convolutional part keep more clean-speech features and filter out noise features irrelevant to keyword detection, so the whole neural network has better noise-reduction behavior, remains robust in strong-noise environments, and can still achieve high accuracy there.
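A sketch of the auxiliary target and the joint loss follows. The text says only that the IRM compares noisy and clean mel features, so the exact mask definition below, and the loss weighting alpha, are assumptions:
    import numpy as np

    def ideal_ratio_mask(mel_clean, mel_noise):
        # One common IRM definition on mel energies: the fraction of each
        # mel bin's energy that belongs to the clean speech.
        return mel_clean / (mel_clean + mel_noise + 1e-12)  # values in [0, 1]

    def joint_loss(cross_entropy_loss, estimated_mask, irm, alpha=1.0):
        # Joint training objective: keyword cross entropy plus the mean
        # square error between the estimated and ideal ratio masks.
        mse = np.mean((estimated_mask - irm) ** 2)
        return cross_entropy_loss + alpha * mse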
Further, the mean square error (MSE) is a measure reflecting the degree of difference between an estimator and the quantity being estimated.
Still further, robustness means that the wake-up system can still be woken up correctly, and keep the false wake-up rate low, under noise interference or changes in the target voice.
As shown in fig. 3 and fig. 4, the mean square error (MSE) drops rapidly between roughly iteration 1200 and iteration 1400, and the cross-entropy loss drops rapidly at the same time. The performance of the keyword detection task thus keeps improving as the performance of the noise-reduction task improves, indicating that the noise-reduction task has a synergistic effect on the keyword detection task.
Further, the training process of the invention is divided into two stages. The first stage uses the mean square error (MSE) and cross entropy losses mentioned above; the second stage mainly trains sentence-level keyword detection. After the first stage, the trained 7-layer convolutional neural network already has a good noise-reduction effect, so in the second stage the weight parameters of the convolutional neural network are frozen and not updated. During this stage, the convolutional-network outputs of the current frame and 179 historical frames, 180 frames in total, are spliced as the input of the attention mechanism. These 180 frames correspond to 1.8 seconds of speech; since most keyword utterances are shorter than 1.8 seconds, these 180 frames of features contain almost the entire wake-word utterance. The attention mechanism derives from studies of human vision: in cognitive science, because of information-processing bottlenecks, humans selectively attend to part of the available information while ignoring the rest. Here it is used to extract keyword-relevant information from a long time series while ignoring other useless information. Using the second-stage wake-up with attention effectively alleviates the poor performance caused by the excessive false alarm rate of using only the first-stage wake-up.
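The second-stage mechanics described in this paragraph might be sketched as follows; the zero-padding of early frames is an assumption:
    import torch

    def freeze(module: torch.nn.Module):
        # Stage two: freeze the stage-one CNN so that only the attention
        # and DNN layers are updated.
        for p in module.parameters():
            p.requires_grad = False

    def attention_context(cnn_outputs: torch.Tensor, t: int, n_hist: int = 179):
        # Splice the CNN outputs of the current frame and 179 historical
        # frames (180 frames = 1.8 s at a 10 ms hop) as the attention input.
        start = t - n_hist
        if start < 0:
            pad = cnn_outputs.new_zeros(-start, cnn_outputs.shape[1])
            return torch.cat([pad, cnn_outputs[:t + 1]], dim=0)  # (180, H)
        return cnn_outputs[start:t + 1]  # (180, H)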
Further, the attention mechanism used within the present invention is a soft attention mechanism.
As shown in fig. 5, the principle of soft attention can be described specifically by the following formula:
$e_t = v^{\top}\tanh(W h_t + b)$
$\alpha_t = \dfrac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t$
As the formulas show, the attention mechanism first computes a score e_t from the hidden state h_t of each of the T frames, then normalizes these scores with a normalized exponential function to obtain the weight α_t of each frame, and finally outputs the weighted sum c of the hidden states over the whole input sequence.
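A sketch of the soft attention module implementing the two formulas above in PyTorch; the attention dimension of 64 is illustrative:
    import torch

    class SoftAttention(torch.nn.Module):
        # e_t = v^T tanh(W h_t + b); alpha_t = softmax(e_t) over the T
        # frames; output c = sum_t alpha_t h_t.
        def __init__(self, hidden_dim, attn_dim=64):
            super().__init__()
            self.proj = torch.nn.Linear(hidden_dim, attn_dim)  # W and b
            self.v = torch.nn.Linear(attn_dim, 1, bias=False)  # v^T

        def forward(self, h):                      # h: (T, hidden_dim)
            e = self.v(torch.tanh(self.proj(h)))   # (T, 1) per-frame scores
            alpha = torch.softmax(e, dim=0)        # weights summing to 1 over frames
            return (alpha * h).sum(dim=0)          # (hidden_dim,) attention output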
Furthermore, the Mel feature is formed by splicing the Mel feature of the current frame and a plurality of frames in the future.
Furthermore, a fast fourier transform is used to extract a spectral magnitude vector on each frame, and then a mel filter is used to perform filtering to obtain an acoustic feature, i.e. a mel feature, of each frame.
Further, the unidirectional long short-term memory recurrent neural network includes a plurality of stacked unidirectional layers, each unidirectional layer having sixty-four neurons.
Further, the neural network is trained by adopting a noisy big data set, wherein the noisy speech is formed by mixing a plurality of noises and a plurality of speaker voices;
noisy speech is a mixture of thousands of different types of noise and over five hundred speakers' speech.
Furthermore, the convolutional neural network is formed by stacking a plurality of single convolutional layers; each single convolutional layer of the convolutional neural network is connected by an activation function layer.
Further, the convolutional neural network is a deep feedforward artificial neural network whose artificial neurons respond to surrounding units within their receptive field; it includes convolutional layers and pooling layers.
Further, the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers; each linear layer of the feedforward deep neural network is connected with each other through an activation function layer.
Furthermore, the feedforward deep neural network (DNN) is a feedforward neural network with at least one hidden layer; its layers are made non-linear by activation functions, its loss is the cross entropy, and it is trained (the weights between neurons adjusted and updated) by back-propagation optimization algorithms such as stochastic gradient descent and batch gradient descent.
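Putting the architectural statements together, a sketch of the first-stage network. The 7 convolutional layers and the 64 LSTM units per layer come from this description; the use of 1-D convolutions, the kernel size, channel width, two LSTM layers and the DNN sizes are assumptions:
    import torch

    class FrameLevelNet(torch.nn.Module):
        # Stacked convolutional layers joined by activation layers, a
        # unidirectional LSTM with 64 units per layer, and a feedforward
        # DNN of linear layers ending in (keywords + 1) logits.
        def __init__(self, n_mels=40, n_keywords=6, conv_ch=64, n_conv=7):
            super().__init__()
            convs, ch = [], n_mels
            for _ in range(n_conv):
                convs += [torch.nn.Conv1d(ch, conv_ch, kernel_size=3, padding=1),
                          torch.nn.ReLU()]  # activation layer joins the conv layers
                ch = conv_ch
            self.cnn = torch.nn.Sequential(*convs)
            self.lstm = torch.nn.LSTM(conv_ch, 64, num_layers=2, batch_first=True)
            self.dnn = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                                           torch.nn.Linear(64, n_keywords + 1))

        def forward(self, mel):                   # mel: (batch, frames, n_mels)
            x = self.cnn(mel.transpose(1, 2))     # convolve over time: (batch, ch, frames)
            x, _ = self.lstm(x.transpose(1, 2))   # (batch, frames, 64)
            return self.dnn(x)                    # per-frame logits: (batch, frames, K+1)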
Further, the attention mechanism is a Soft-attention mechanism (Soft-attention).
Further, the input to the attention mechanism comes from the output of the convolutional neural network layer.
Further, the input of the attention mechanism is obtained by combining the convolutional-layer output signals of the current frame and several past frames.
Furthermore, the attention mechanism expands the capability of a neural network: it allows more complex functions to be approximated, attends to specific parts of the input in a more intuitive way, helps improve benchmark performance in natural language processing, and enables new capabilities such as image captioning, memory-network addressing and neural programmers.
Further, the vector size of the output of the neural network is the number of keywords involved in the training plus one.
Furthermore, the single-channel robust keyword detection of the present invention operates on the signal collected by a single microphone and therefore has wider practicability than keyword detection with a beamformed microphone array.
Furthermore, the invention detects keywords by supervised learning, using a convolutional neural network, a long short-term memory recurrent neural network and a feedforward deep neural network. It adopts a sentence-level attention mechanism and a feedforward deep neural network as the second-level network: when the output confidence of the first-level deep neural network (built from the convolutional neural network, the long short-term memory recurrent neural network and the feedforward deep neural network) is greater than the threshold, the second-level network is used for confirmation; if the output confidence of the second-level network is again greater than the threshold, the keyword is considered detected, otherwise not. Confirmation by the second-level network can greatly reduce the false alarm rate of the neural network at the cost of little impact on performance, improving the performance of the method.
Further, the long short-term memory network (LSTM) is a recurrent neural network specially designed to solve the long-term dependency problem of general RNNs (recurrent neural networks); all RNNs have the form of a chain of repeating neural-network modules.
Furthermore, regarding the robustness of single-channel robust wake-word detection: a high wake-up rate can still be kept in noisy environments, with a correct wake-up rate above 90% maintained under noisy wake-up conditions at a signal-to-noise ratio of 0 dB.
Further, the signal-to-noise ratio (SNR, also written S/N) is the ratio of signal to noise in an electronic device or system. The signal is the electronic signal from outside the device that the device is to process; the noise is an irregular extra signal that does not exist in the original signal and is generated as the signal passes through the device, and it does not change as the original signal changes.
Furthermore, robust single-channel keyword detection refers to keyword detection on electronic-format speech collected by a single microphone. Compared with keyword detection using a beamformed microphone array, the method keeps a high wake-up rate without using spatial position information, and using only one microphone gives it wider application scenarios. The invention detects keywords by supervised learning and achieves robustness in noisy scenes by combining two training targets: noise reduction and keyword detection.
Furthermore, the invention introduces a second-level network to address the excessively high false alarm rate of keyword detection; the attention mechanism in the second-level network can extract keyword-related information from a long time series. In the inference stage, the second-level network executes its logic only after the output of the first-level network is greater than the threshold, which saves part of the computation cost.
Further, the false alarm rate, also called the false alarm probability, is the probability that, in threshold-based radar detection, a target is judged to be present when none exists, owing to the ubiquitous and fluctuating noise.
The invention provides a single-channel robust voice keyword real-time detection method. It performs keyword detection on the signal collected by a single microphone, has wider practicability than keyword detection with a beamformed microphone array, and uses a second-level network for confirmation, which greatly reduces the false alarm rate of the neural network with little impact on performance, improves the overall performance, and keeps a high wake-up rate in noisy environments.
Although the present invention has been described with reference to the above embodiments, the scope of the invention is not limited thereto; modifications, substitutions and the like to the above elements, made without departing from the spirit of the invention, are intended to fall within the scope of the claims of the invention.

Claims (10)

1. A single-channel robust voice keyword real-time detection method, characterized by comprising the following steps:
s1 receiving the noisy speech signal in electronic format, which contains human voice and background noise of non-human voice;
s2, converting the voice signal with noise in the time domain into a frequency domain signal by short-time Fourier transform frame by frame;
s3, processing the frequency domain signal by using a Mel filter to obtain Mel characteristics and taking the Mel characteristics as acoustic characteristics;
the S4 neural network comprises: a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network;
the mel features pass frame by frame through the convolutional neural network, the unidirectional long short-term memory recurrent neural network and the feedforward deep neural network, and are then processed by a normalized exponential function to obtain confidence information for each keyword;
S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network;
S6, the output signal of the neural network passes through an attention mechanism and the feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when the confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
2. The method as claimed in claim 1, wherein the mel feature is formed by concatenating the mel feature of the current frame and a plurality of frames in the future.
3. The method as claimed in claim 1, wherein the uni-directional long-short term memory recurrent neural network comprises a plurality of stacked uni-directional layers, each of the uni-directional layers having sixty-four neurons.
4. The method of claim 1, wherein the neural network is trained using a noisy large data set, wherein the noisy speech is a mixture of multiple noises and multiple speaker voices;
the noisy speech is a mixture of thousands of different types of noise and over five hundred speakers' speech.
5. The method of claim 1, wherein the convolutional neural network is formed by stacking a plurality of single convolutional layers;
each of the single convolutional layers of the convolutional neural network is connected by an activation function layer.
6. The method as claimed in claim 1, wherein the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers;
and each linear layer of the feedforward type deep neural network is connected with each other through an activation function layer.
7. The method of claim 1, wherein the attention mechanism is a soft attention mechanism.
8. The method of claim 1, wherein the input of the attention mechanism is from an output of a convolutional neural network layer.
9. The method of claim 1, wherein the attention mechanism is input by mixing output signals of a current frame and convolution layers of a plurality of past frames.
10. The method of claim 1, wherein a vector size of an output of the neural network is one more than a number of the keywords involved in the training.
CN201910945315.6A 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords Active CN110767223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910945315.6A CN110767223B (en) 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910945315.6A CN110767223B (en) 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords

Publications (2)

Publication Number Publication Date
CN110767223A (en) 2020-02-07
CN110767223B CN110767223B (en) 2022-04-12

Family

ID=69329184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910945315.6A Active CN110767223B (en) Real-time detection method for single-channel robust voice keywords

Country Status (1)

Country Link
CN (1) CN110767223B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106104674A (en) * 2014-03-24 2016-11-09 微软技术许可有限责任公司 Mixing voice identification
US20160358602A1 (en) * 2015-06-05 2016-12-08 Apple Inc. Robust speech recognition in the presence of echo and noise using multiple signals for discrimination
US20170148429A1 (en) * 2015-11-24 2017-05-25 Fujitsu Limited Keyword detector and keyword detection method
US20180261213A1 (en) * 2017-03-13 2018-09-13 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN108630193A (en) * 2017-03-21 2018-10-09 北京嘀嘀无限科技发展有限公司 Audio recognition method and device
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
WO2019116604A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Corporation Speech recognition system
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN109671433A (en) * 2019-01-10 2019-04-23 腾讯科技(深圳)有限公司 A kind of detection method and relevant apparatus of keyword
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
X. HAO: "An Attention-based Neural Network Approach for Single Channel Speech Enhancement", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
X. WANG: "Adversarial Examples for Improving End-to-end Attention-based Small-footprint Keyword Spotting", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
Y. HUANG: "Supervised Noise Reduction for Multichannel Keyword Spotting", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHANG Yu: "Far-field speech recognition based on attention LSTM and multi-task learning", Journal of Tsinghua University (Science and Technology) *
TU Zhiqiang: "Simulation research on speech command word recognition in a vehicle noise environment", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111261148A (en) * 2020-03-13 2020-06-09 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111261148B (en) * 2020-03-13 2022-03-25 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111755002A (en) * 2020-06-19 2020-10-09 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN111883181A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Audio detection method and device, storage medium and electronic device
CN111862973A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Voice awakening method and system based on multi-command words
CN111862957A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Single track voice keyword low-power consumption real-time detection method
US20230128588A1 (en) * 2020-08-24 2023-04-27 Unlikely Artificial Intelligence Limited Computer implemented method for the automated analysis or use of data
CN112163405A (en) * 2020-09-08 2021-01-01 北京百度网讯科技有限公司 Question generation method and device
CN112466280A (en) * 2020-12-01 2021-03-09 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN112466280B (en) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device

Also Published As

Publication number Publication date
CN110767223B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN110767223B (en) Real-time detection method for single-channel robust voice keywords
CN111971743B (en) Systems, methods, and computer readable media for improved real-time audio processing
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
EP3926623B1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN107452389B (en) Universal single-track real-time noise reduction method
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
US10504539B2 (en) Voice activity detection systems and methods
Skowronski et al. Automatic speech recognition using a predictive echo state network classifier
US20180025721A1 (en) Automatic speech recognition using multi-dimensional models
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
Myer et al. Efficient keyword spotting using time delay neural networks
US20120239403A1 (en) Downsampling Schemes in a Hierarchical Neural Network Structure for Phoneme Recognition
KR20180038219A (en) Apparatus and Method for detecting voice based on correlation between time and frequency using deep neural network
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN109346062A (en) Sound end detecting method and device
CN113205820A (en) Method for generating voice coder for voice event detection
Huang et al. Improving audio anomalies recognition using temporal convolutional attention networks
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
WO2021062705A1 (en) Single-sound channel robustness speech keyword real-time detection method
US20180108345A1 (en) Device and method for audio frame processing
Wang et al. Robust speech recognition from ratio masks
Hadi et al. An efficient real-time voice activity detection algorithm using teager energy to energy ratio
CN114333884B (en) Voice noise reduction method based on combination of microphone array and wake-up word
CN113823311B (en) Voice recognition method and device based on audio enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000
Applicant after: ELEVOC TECHNOLOGY Co.,Ltd.
Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000
Applicant before: ELEVOC TECHNOLOGY Co.,Ltd.
GR01 Patent grant