CN110767223A - Real-time detection method for single-channel robust voice keywords - Google Patents
Real-time detection method for single-channel robust voice keywords
- Publication number
- CN110767223A (application CN201910945315.6A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- keyword
- frame
- mel
- signal
- Prior art date
- Legal status: Granted (the status is an assumption, not a legal conclusion)
Classifications
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L19/0212 — Speech or audio analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform or subband vocoders, using orthogonal transformation
- G10L19/265 — Pre-filtering, e.g. high-frequency emphasis prior to encoding
- G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L2015/223 — Execution procedure of a spoken command
- G10L2015/225 — Feedback of the input speech
Abstract
The invention relates to a method for real-time detection of single-channel robust voice keywords, comprising the following steps: receiving a noisy speech signal in electronic format; converting the time-domain speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform; processing the frequency-domain signal with a Mel filter bank to obtain Mel features, used as acoustic features; passing the Mel features frame by frame through a neural network and then through a normalized exponential (softmax) function to obtain confidence information for each keyword; when the confidence of some keyword exceeds a predefined threshold, splicing the current frame with several preceding frames to form the output of the first-stage neural network; and when the resulting confidence value exceeds a predefined threshold, judging that the keyword is detected, and otherwise that it is not. The invention performs excellently: it maintains a high wake-up rate even in noisy environments, is broadly applicable, and greatly reduces the false-alarm rate of the neural network.
Description
Technical Field
The invention relates to the technical field of noise reduction in electronic communication, in particular to a method for real-time detection of single-channel robust voice keywords.
Background
With the rise of applications such as intelligent assistants and smart speakers, voice keyword detection is increasingly valued by industry as an important link in human-computer interaction. Filler models based on hidden Markov models were the first approach applied to keyword detection. Among keyword-detection problems, robust single-channel detection is especially challenging, because it relies only on the sound signal recorded by a single microphone and cannot exploit the spatial information commonly used by microphone arrays. On the other hand, single-microphone robust keyword detection applies to a wider range of acoustic scenarios than microphone-array detection based on beamforming (spatial filtering via an appropriately configured sensor array), and because only one microphone is used, it is cheaper and more convenient. A recent major breakthrough in keyword detection is the replacement of hidden Markov models by deep neural networks, which occupy less memory, require no decoding search, and achieve high accuracy. The state of the art has been a feedforward deep neural network (DNN) trained on a large amount of data with frame-level labels. Although this approach achieves keyword detection, its robustness to noise is poor; it can be improved by adding noise of different types and signal-to-noise ratios to the input speech during training, but a high false-alarm rate remains.
The existing solutions have the following disadvantages:
1. although they can detect keywords, their robustness to noise is poor;
2. they suffer from a high false-alarm rate.
Disclosure of Invention
The technical problem to be solved by the invention is how to overcome the poor noise robustness and high false-alarm rate of prior-art methods by means of a real-time detection method for single-channel robust voice keywords.
The technical scheme adopted by the invention to solve this problem is as follows. Compared with keyword detection using a beamformed microphone array, the method maintains a high wake-up rate without using spatial position information, and since only one microphone is needed, it suits a wider range of application scenarios. The method detects keywords by supervised learning, and achieves robustness in noisy scenes by combining two training targets: noise reduction and keyword detection. The method performs excellently, maintains a high wake-up rate in noisy environments, is broadly practical, and greatly reduces the false-alarm rate of the neural network.
The method for real-time detection of single-channel robust voice keywords comprises the following steps:
S1: receive a noisy speech signal in electronic format, containing human voice and non-speech background noise;
S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform;
S3: process the frequency-domain signal with a Mel filter bank to obtain Mel features, used as acoustic features;
S4: the neural network comprises a convolutional neural network, a unidirectional long short-term memory (LSTM) recurrent neural network, and a feedforward deep neural network; the Mel features are passed frame by frame through the convolutional network, the unidirectional LSTM recurrent network, and the feedforward deep network, then through a normalized exponential (softmax) function, to obtain confidence information for each keyword;
S5: when the confidence of some keyword exceeds a predefined threshold, splice the current frame with several preceding frames to form the output signal of the first-stage neural network;
S6: pass this output signal through an attention mechanism and a feedforward deep neural network in turn, then through a softmax function, to obtain sentence-level confidence information for each keyword; when this confidence value exceeds a predefined threshold, the keyword is judged detected, otherwise it is judged not detected.
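The frame-level gating and splicing of steps S5 and S6 can be sketched in Python as follows. This is a minimal illustration: the threshold values and all function names are assumptions for demonstration, while the 180-frame window follows the embodiment given later in the description.

```python
import numpy as np

FRAME_THRESHOLD = 0.5      # assumed frame-level confidence threshold
SENTENCE_THRESHOLD = 0.5   # assumed sentence-level confidence threshold
T_SPLICE = 180             # current frame + 179 preceding frames (per the description)

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by max for numerical stability
    return e / e.sum()

def first_stage_gate(frame_logits):
    """True when any keyword's frame-level confidence exceeds the threshold.

    frame_logits has length (num_keywords + 1); index 0 is the non-keyword
    class, so only indices 1.. are compared against the threshold.
    """
    conf = softmax(frame_logits)
    return bool(np.any(conf[1:] > FRAME_THRESHOLD))

def splice_for_second_stage(hidden_states, t):
    """Splice frame t with the preceding frames (up to T_SPLICE frames total)."""
    start = max(0, t - T_SPLICE + 1)
    return np.concatenate(hidden_states[start:t + 1], axis=0)
```

The spliced vector then feeds the attention mechanism and the second-stage feedforward network; the gate ensures the expensive second stage runs only when the cheap first stage has fired.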
Further, in the method, the Mel feature is formed by splicing the Mel feature of the current frame with those of several future frames.
Further, the unidirectional long short-term memory recurrent neural network comprises a plurality of stacked unidirectional layers, each with sixty-four neurons.
Further, the neural network is trained on a large noisy data set, in which the noisy speech is formed by mixing many kinds of noise with the speech of many speakers — specifically, thousands of different types of noise with the speech of over five hundred speakers.
Further, the convolutional neural network is formed by stacking several single convolutional layers, each pair of which is connected through an activation-function layer.
Further, the feedforward deep neural network is formed by stacking several single linear layers, each pair of which is connected through an activation-function layer.
Further, the attention mechanism is a soft attention mechanism.
Further, the input of the attention mechanism comes from the output of the convolutional layers; specifically, it is formed by combining the convolutional-layer output signals of the current frame and several past frames.
Further, the size of the vector output by the neural network equals the number of keywords participating in training plus one.
The method for real-time detection of single-channel robust voice keywords has the beneficial effects of excellent performance: it effectively suppresses noise in close-range conversation scenarios, is more practical than the prior art, and does not depend on the particular noise type or speaker.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a functional block diagram of the method for real-time detection of single-channel robust voice keywords according to the present invention;
FIG. 2 is a table of sample keywords and their sequence numbers for the method;
FIG. 3 shows the decreasing trend of the cross-entropy loss during training;
FIG. 4 shows the trend of the mean-square error during training;
FIG. 5 is a schematic diagram of the soft attention mechanism used by the method;
FIG. 6 is a flow chart of the method for real-time detection of single-channel robust voice keywords according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the method for real-time detection of single-channel robust voice keywords comprises the following steps. S1: receive a noisy speech signal in electronic format, containing human voice and non-speech background noise. S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform. S3: process the frequency-domain signal with a Mel filter bank to obtain Mel features, used as acoustic features. S4: the neural network comprises a convolutional neural network, a unidirectional long short-term memory recurrent neural network, and a feedforward deep neural network; the Mel features are passed frame by frame through these three networks and then through a normalized exponential function (softmax) to obtain confidence information for each keyword. S5: when the confidence of some keyword exceeds a predefined threshold, splice the current frame with several preceding frames to form the output signal of the first-stage neural network. S6: pass this output signal through an attention mechanism and a feedforward deep neural network in turn, then through a softmax function, to obtain sentence-level confidence information for each keyword; when this confidence value exceeds a predefined threshold, the keyword is judged detected, otherwise it is judged not detected.
Further, confidence is also called reliability, confidence level, or confidence coefficient. When an overall parameter is estimated from a sample, the conclusion is always uncertain because of the randomness of the sample. A probabilistic statement — interval estimation in mathematical statistics — is therefore used: the probability that the estimate lies within a certain allowable error range of the true parameter is called the confidence.
Further, the short-time Fourier transform (STFT) is a Fourier-related transform used to determine the frequency and phase content of local sections of a signal as it changes over time.
Further, the normalized exponential function, or softmax function, is a generalization of the logistic function in mathematics, particularly in probability theory and related fields. It "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that each element lies in (0,1) and all elements sum to 1. This function is widely used in multi-class classification problems.
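As a concrete sketch, the normalized exponential function can be written in a few lines (a standard numerically stable formulation, not code taken from the patent):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: maps a K-dim real vector to a probability vector."""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()
```

Each output element lies in (0,1) and the outputs sum to 1, so they can be read directly as per-keyword confidences.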
Further, in step S2 the time-domain digital signal is converted into a frequency-domain signal by the fast Fourier transform (FFT). Once in the frequency domain, the frequency components of the signal can be conveniently analyzed and processed, enabling many signal-processing algorithms that cannot be carried out in the time domain.
Furthermore, in step S1 the noisy speech used during training is a mixture of clean speech and noise at various signal-to-noise ratios, while inference uses speech collected in the field. In step S2, feature extraction frames and windows the noisy speech: each frame is twenty milliseconds long, with a ten-millisecond overlap between adjacent frames. A spectral magnitude vector is extracted from each frame with the fast Fourier transform and filtered by a Mel filter bank to obtain the acoustic features of each frame. Because the speech signal is strongly correlated in the time dimension, and this correlation greatly helps the keyword-detection task, the method splices the current frame with several neighboring frames into a higher-dimensional vector used as the input feature. The method is executed by a computer program: acoustic features are extracted from the noisy speech, which is then processed by a deep neural network to decide whether the original noisy speech contains a keyword. The method may be realized as one or more program modules executed by any system or hardware device with executable programming instructions.
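The front end described above (20 ms frames, 10 ms overlap, FFT magnitude, Mel filtering) can be sketched as follows. The sample rate, FFT size, and number of Mel filters are assumptions for illustration, since the patent does not fix them; the triangular-filter construction is a common one, not quoted from the patent.

```python
import numpy as np

SR = 16000                     # assumed sample rate
FRAME_LEN = int(0.020 * SR)    # 20 ms frames
HOP = int(0.010 * SR)          # 10 ms hop -> 10 ms overlap between adjacent frames

def frame_signal(x):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - FRAME_LEN) // HOP)
    return np.stack([x[i * HOP : i * HOP + FRAME_LEN] for i in range(n)])

def mel_filterbank(n_mels=40, n_fft=512):
    """Triangular Mel filters spanning 0..SR/2 (a common construction)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / SR).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            if c > l:
                fb[m - 1, k] = (k - l) / (c - l)   # rising slope
        for k in range(c, r):
            if r > c:
                fb[m - 1, k] = (r - k) / (r - c)   # falling slope
    return fb

def mel_features(x, n_fft=512, n_mels=40):
    """Per-frame spectral magnitudes filtered by the Mel filter bank."""
    frames = frame_signal(x) * np.hanning(FRAME_LEN)
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # spectral magnitude vector
    return mag @ mel_filterbank(n_mels, n_fft).T
```

For one second of 16 kHz audio this yields 99 frames of 40 Mel features each.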
Further, for applications that require strict real-time processing, such as hearing-aid noise reduction on mobile devices, using information from future frames is unacceptable because it introduces delay. For keyword detection, however, delays within a certain range are acceptable, so some real-time performance can be sacrificed for better accuracy. Specifically, the current frame can be spliced with the next 10 frames as input, which adds only 100 milliseconds of delay while increasing detection accuracy; the number of future frames can even be raised to 20 for a further gain. Note that historical frames are not spliced into the input: the long short-term memory recurrent network used as a component of the model already retains the important parts of past inputs, so only future frames are needed, which reduces the computation of the neural network and the power consumption of the hardware it runs on.
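The future-frame splicing can be sketched as below. Padding the end of an utterance by repeating the last frame is an assumption for illustration; the patent does not specify edge handling.

```python
import numpy as np

N_FUTURE = 10  # 10 future frames -> 100 ms of extra latency at a 10 ms hop

def splice_future(feats, t):
    """Concatenate frame t with its N_FUTURE successors into one input vector.

    feats: (num_frames, feat_dim) array of per-frame Mel features.
    Indices past the end repeat the last frame (an illustrative choice).
    """
    idx = np.minimum(np.arange(t, t + N_FUTURE + 1), len(feats) - 1)
    return feats[idx].reshape(-1)
```

Only future frames are spliced; past context is left to the LSTM's internal state, as the description explains.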
More specifically, the invention supports one or more keywords; different numbers of wake words correspond to different network output dimensions. The network produces a first-stage frame-level output and a second-stage sentence-level output, each of dimension equal to the number of keywords plus one: with one keyword the output dimension is 2, with two keywords it is 3, and so on. Before training, each command word is numbered; for example, the six keywords "light on", "turn on tv", "turn on air conditioner", "turn off light", "turn off tv", and "turn off air conditioner" may be numbered as shown in fig. 2. Besides the keywords, the non-keyword class must also be numbered: it is usually assigned 0, and the keywords are numbered from 1 upward as the labels of the training utterances. During training, the labels are one-hot encoded, and the encoded result is combined with the neural network output to compute the cross-entropy loss.
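The labeling scheme (non-keyword = 0, keywords numbered from 1), the one-hot encoding, and the cross-entropy loss described above can be sketched as:

```python
import numpy as np

def one_hot(label, num_classes):
    """One-hot encode an integer label (0 = non-keyword, 1.. = keywords)."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def cross_entropy(probs, target_one_hot, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return float(-np.sum(target_one_hot * np.log(probs + eps)))

# With the six example keywords of fig. 2 plus the non-keyword class,
# the output dimension is 6 + 1 = 7.
NUM_CLASSES = 7
```

A uniform prediction over 7 classes gives a loss of ln(7); a perfect prediction gives a loss of (essentially) zero.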
Further, cross entropy is an important concept in Shannon's information theory, used mainly to measure the difference between two probability distributions. The performance of a language model is typically measured by cross entropy and perplexity: cross entropy expresses the difficulty of recognizing text with the model — or, from a compression viewpoint, the average number of bits needed to encode each word — while perplexity expresses the average number of branches the model assigns to the text, whose reciprocal can be regarded as the average probability of each word.
Further, generalization is vital for any supervised learning method: it refers to performance in scenarios that did not appear during training. Here, generalization is achieved mainly by mixing clean human speech with about 10,000 kinds of noise collected in different scenarios, at various signal-to-noise ratios (SNR) and loudness levels, and then training at large scale. Because the recurrent neural network can model long-term dependencies in the signal, the proposed model generalizes well to new noise types and speaker scenarios, which is essential in practice. Preferably, to obtain better performance, the invention uses an RNN model that also depends on future frames, so the network can draw information from both the past and the future.
As shown in fig. 6, which illustrates the whole procedure of the invention in detail, keyword detection proceeds as follows: a noisy speech signal is input and a keyword number is output. In the figure, "1" marks steps used only during training, "2" marks steps of the inference (prediction) phase, and "3" marks steps shared by training and prediction. To preserve robustness in strongly noisy environments, the invention uses the Ideal Ratio Mask (IRM) as an auxiliary training target during the training phase. The IRM is obtained by comparing the Mel features of the noisy speech with those of the clean speech. As shown in fig. 1, a 7-layer deconvolutional neural network (De-CNN), mirroring the 7-layer convolutional neural network (CNN), estimates an ideal ratio mask for each noisy input, and the mean-square error (MSE) between the ideal and estimated masks is computed. The loss function of the neural network thus contains both the cross-entropy loss and the mean-square error, and the network minimizes the loss over the whole training set through repeated iterations. After training ends, the prediction phase begins, in which the deconvolutional part is not used at all. Training several targets simultaneously in this way is generally called joint training.
The output of the deconvolutional part is a predicted ideal ratio mask; applying this mask to the Mel features of the noisy speech recovers the Mel features of the clean speech. Adding the predicted mask as a training target therefore drives the convolutional part to preserve more clean-speech features and to filter out noise features irrelevant to keyword detection, giving the whole neural network better noise-reduction behavior and allowing it to retain high accuracy in strongly noisy environments.
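A sketch of the auxiliary target and the joint loss follows. The description says only that the IRM compares clean and noisy Mel features; the exact ratio used here is a common Mel-domain definition, and the loss weighting `alpha` is an assumption, so both should be read as illustrative.

```python
import numpy as np

def ideal_ratio_mask(clean_mel, noise_mel, eps=1e-12):
    """One common Mel-domain IRM definition (the patent's exact formula is not given)."""
    return clean_mel / (clean_mel + noise_mel + eps)

def joint_loss(probs, target_one_hot, est_mask, irm, alpha=1.0):
    """Cross entropy for keyword detection plus MSE for mask estimation.

    alpha weights the auxiliary noise-reduction term against the detection term.
    """
    ce = -np.sum(target_one_hot * np.log(probs + 1e-12))
    mse = np.mean((est_mask - irm) ** 2)
    return float(ce + alpha * mse)
```

Minimizing the joint loss trains detection and noise reduction together; at inference only the detection path is kept, matching the joint-training scheme in the description.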
Further, the mean-square error (MSE) is a metric reflecting the degree of difference between an estimator and the estimated quantity.
Still further, robustness means that the wake-up system can still be woken correctly, and keep the false wake-up rate low, under noise interference or changes in the target voice.
As shown in fig. 3 and fig. 4, the mean-square error (MSE) drops rapidly between roughly iteration 1200 and iteration 1400, and the cross-entropy loss drops rapidly at the same time. The performance of the keyword-detection task thus improves together with that of the noise-reduction task, indicating that the noise-reduction objective has a synergistic effect on keyword detection.
Further, training is divided into two stages. The first stage optimizes the mean-square error and cross-entropy losses described above; the second stage trains sentence-level keyword detection. Because the 7-layer convolutional network trained in the first stage already provides good noise reduction, its weight parameters are frozen during the second stage and are not updated. In this stage, the convolutional-network outputs of the current frame and the 179 preceding frames — 180 frames in total — are spliced as the input of the attention mechanism. These 180 frames correspond to 1.8 seconds of speech; since most keyword utterances are shorter than 1.8 seconds, the 180-frame window almost always contains the entire wake-word utterance. The attention mechanism originates in the study of human vision: in cognitive science, because of information-processing bottlenecks, humans selectively attend to part of the available information while ignoring the rest. Here, attention is used to extract keyword-relevant information from a long time series while ignoring useless information. Using a second, attention-based wake-up stage effectively avoids the high false-alarm rate, and the resulting poor performance, of using the first stage alone.
Further, the attention mechanism used within the present invention is a soft attention mechanism.
As shown in fig. 5, the principle of soft attention can be described by the following formulas:

e_t = vᵀ tanh(W h_t + b)

α_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k)

c = Σ_{t=1}^{T} α_t h_t

As shown above, the attention mechanism first computes a score e_t from the hidden state h_t of each of the previous T frames, then normalizes these scores with the normalized exponential (softmax) function to obtain a weight α_t for each frame, and finally outputs the weighted sum c of the hidden states over the whole time series.
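The soft-attention computation described above can be written compactly in numpy. The matrix sizes below (T = 180 frames, hidden dimension 64, attention dimension 32) are hypothetical choices for illustration, not values specified by the patent:

```python
import numpy as np

def soft_attention(H, W, b, v):
    """Soft attention over T hidden states H (T x d):
    e_t = v^T tanh(W h_t + b);  alpha = softmax(e);  c = sum_t alpha_t h_t."""
    scores = np.tanh(H @ W.T + b) @ v              # e_t for every frame, shape (T,)
    scores = scores - scores.max()                 # numerical stability for exp
    alpha = np.exp(scores) / np.exp(scores).sum()  # normalized exponential weights
    return alpha @ H, alpha                        # context vector c, weights

rng = np.random.default_rng(0)
T, d, a = 180, 64, 32                              # hypothetical sizes
H = rng.standard_normal((T, d))                    # stand-in hidden states
W = rng.standard_normal((a, d))
b = rng.standard_normal(a)
v = rng.standard_normal(a)
c, alpha = soft_attention(H, W, b, v)
```

The weights α sum to one, so the context vector c stays on the same scale as the per-frame hidden states regardless of T.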
Furthermore, the Mel feature is formed by splicing the Mel feature of the current frame with those of several future frames.
Furthermore, a fast Fourier transform is used to extract a spectral magnitude vector from each frame, which is then filtered by a Mel filter bank to obtain the acoustic feature of each frame, i.e. the Mel feature.
Further, the unidirectional long short-term memory recurrent neural network includes a plurality of stacked unidirectional layers, each unidirectional layer having sixty-four neurons.
Further, the neural network is trained on a large noisy data set, in which the noisy speech is formed by mixing multiple noises with the voices of multiple speakers;
the noisy speech is a mixture of thousands of different types of noise and the speech of over five hundred speakers.
Furthermore, the convolutional neural network is formed by stacking a plurality of single convolutional layers; adjacent convolutional layers of the convolutional neural network are connected by an activation function layer.
Further, the convolutional neural network is a deep feedforward artificial neural network whose artificial neurons respond to units within their receptive fields; the convolutional neural network includes convolutional layers and pooling layers.
Further, the feedforward deep neural network is formed by stacking a plurality of single linear layers; adjacent linear layers of the feedforward deep neural network are connected by an activation function layer.
Furthermore, a feedforward deep neural network (DNN) is a feedforward neural network with at least one hidden layer, made nonlinear by activation functions, trained with a cross-entropy loss function, and learning-trained by a back-propagation optimization algorithm (stochastic gradient descent or batch gradient descent), which adjusts and updates the weights between neurons.
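As a minimal sketch of the training loop just described (one hidden layer, cross-entropy loss, back-propagation with gradient descent), the following uses toy random data; all layer sizes and the learning rate are hypothetical, not the patent's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical sizes: 8 inputs, 16 hidden units, 3 output classes.
n_in, n_hid, n_out, lr = 8, 16, 3, 0.1
W1 = rng.standard_normal((n_in, n_hid)) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_out)) * 0.1
b2 = np.zeros(n_out)

X = rng.standard_normal((64, n_in))             # toy input features
Y = np.eye(n_out)[rng.integers(0, n_out, 64)]   # one-hot targets

losses = []
for _ in range(200):
    h = np.tanh(X @ W1 + b1)                    # hidden layer with activation
    p = softmax(h @ W2 + b2)
    losses.append(-np.mean(np.sum(Y * np.log(p + 1e-12), axis=1)))  # cross entropy
    g2 = (p - Y) / len(X)                       # softmax + cross-entropy gradient
    g1 = (g2 @ W2.T) * (1.0 - h ** 2)           # back-propagate through tanh
    W2 -= lr * (h.T @ g2); b2 -= lr * g2.sum(0) # gradient-descent weight updates
    W1 -= lr * (X.T @ g1); b1 -= lr * g1.sum(0)
```

The loss list shows the cross entropy falling as back-propagation adjusts the weights, which is the behavior the paragraph above describes.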
Further, the attention mechanism is a Soft-attention mechanism (Soft-attention).
Further, the input to the attention mechanism comes from the output of the convolutional neural network layer.
Further, the input to the attention mechanism is formed by combining the convolutional-layer output signals of the current frame and several past frames.
Furthermore, the attention mechanism extends the capability of the neural network: it allows more complex functions to be approximated, lets the network attend to specific parts of the input in a more intuitive way, and has helped improve baseline performance in natural language processing while enabling new capabilities such as image captioning, memory-network addressing, and neural programmers.
Further, the vector size of the output of the neural network is the number of keywords involved in the training plus one.
Furthermore, the monaural robust keyword detection of the present invention performs keyword detection on the signal collected by a single microphone, and therefore has wider applicability than keyword detection based on a beamformed microphone array.
Furthermore, the invention detects keywords by supervised learning, using a convolutional neural network, a long short-term memory recurrent neural network, and a feedforward deep neural network. A sentence-level attention mechanism and a feedforward deep neural network form the second-stage network. When the output confidence of the first-stage deep neural network (based on the convolutional neural network, the long short-term memory recurrent neural network, and the feedforward deep neural network) exceeds a threshold, the second-stage network is used for confirmation: if the output confidence of the second-stage network also exceeds its threshold, the keyword is considered detected; otherwise it is not. This second-stage confirmation greatly reduces the false-alarm rate of the neural network at little cost in performance, improving the performance of the method.
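The two-stage decision rule just described can be sketched in a few lines. The threshold values and the callable interface for the second stage are hypothetical; only the control flow (run the expensive second stage solely when the first stage already fires) comes from the description:

```python
def detect_keyword(first_stage_conf, run_second_stage, t1=0.5, t2=0.5):
    """Two-stage confirmation: the second-stage network is only evaluated
    when the first stage exceeds its threshold, which lowers false alarms
    at little extra computational cost. `run_second_stage` is a callable
    returning the sentence-level confidence; t1/t2 are hypothetical thresholds."""
    if first_stage_conf <= t1:
        return False                 # cheap early reject; stage 2 is skipped
    return run_second_stage() > t2   # confirmed only if stage 2 also fires

# Usage: the second stage runs only when the first stage fires.
calls = []
def stage2():
    calls.append(1)
    return 0.9

assert detect_keyword(0.3, stage2) is False and not calls   # stage 2 never ran
assert detect_keyword(0.8, stage2) is True and len(calls) == 1
```

Skipping the second stage on low first-stage confidence is what lets the method save part of the inference cost, as noted later in the description.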
Further, a long short-term memory network (LSTM) is a recurrent neural network specially designed to solve the long-term dependency problem of ordinary RNNs (recurrent neural networks); like all RNNs, it takes the form of a chain of repeating neural network modules.
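One step of the repeating LSTM module can be written out in numpy. This is an illustrative textbook LSTM cell, not the patent's implementation; the input dimension is hypothetical, while the 64 hidden neurons per layer match the figure stated earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    """One step of the repeating LSTM module: input/forget/output gates and a
    candidate value, computed from the input x and the previous hidden state h.
    The cell state c carries long-term information down the chain."""
    z = x @ Wx + h @ Wh + b                       # (4*hidden,) pre-activations
    H = h.shape[0]
    i = sigmoid(z[:H])                            # input gate
    f = sigmoid(z[H:2 * H])                       # forget gate
    o = sigmoid(z[2 * H:3 * H])                   # output gate
    g = np.tanh(z[3 * H:])                        # candidate cell value
    c_new = f * c + i * g                         # gated cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(2)
n_in, n_hid = 40, 64                              # 64 neurons per layer, as stated
Wx = rng.standard_normal((n_in, 4 * n_hid)) * 0.1
Wh = rng.standard_normal((n_hid, 4 * n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(10):                               # unidirectional: past frames only
    h, c = lstm_step(rng.standard_normal(n_in), h, c, Wx, Wh, b)
```

Because the loop only ever consumes past frames, this matches the unidirectional (streaming) operation required for real-time detection.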
Furthermore, the robustness in single-channel robust wake-word detection means that a high wake-up rate is maintained even in noisy environments: a correct wake-up rate of over 90% is maintained under noisy wake-up conditions with a signal-to-noise ratio of 0 dB.
Further, the signal-to-noise ratio (SNR, or S/N) refers to the ratio of signal to noise in an electronic device or system. The signal is the electronic signal from outside the device that the device is intended to process; the noise is any irregular extra signal (or information), not present in the original signal, that is generated after passing through the device, and it does not change with the original signal.
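The 0 dB condition cited above corresponds to signal and noise of equal power. A small sketch of the standard power-ratio definition (the test tones are arbitrary illustrative signals):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_s = np.mean(np.asarray(signal, dtype=float) ** 2)   # mean signal power
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)    # mean noise power
    return 10.0 * np.log10(p_s / p_n)

# Equal-amplitude sinusoids have equal power, giving the 0 dB case above.
t = np.arange(16000) / 16000.0
s = np.sin(2 * np.pi * 440 * t)      # stand-in "speech" tone
n = np.sin(2 * np.pi * 1000 * t)     # stand-in "noise" tone, same amplitude
print(round(snr_db(s, n), 1))        # ≈ 0.0 dB
```

Doubling the signal amplitude quadruples its power and raises the SNR by about 6 dB, which is the usual sanity check for this definition.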
Furthermore, robust single-channel keyword detection refers to keyword detection on electronic-format speech collected by a single microphone. Compared with keyword detection using a beamformed microphone array, the method still maintains a high wake-up rate without using spatial position information, and because only one microphone is needed it has wider application scenarios. The invention detects keywords by supervised learning and achieves keyword detection that is robust in noisy scenes by combining the two training targets of noise reduction and keyword detection.
Furthermore, the invention addresses the excessive false-alarm rate of keyword detection by introducing a second-stage network, whose attention mechanism can extract keyword-relevant information from a longer time series. At the inference stage, the second-stage network executes its logic only after the output of the first-stage network exceeds a threshold, which saves part of the computation cost.
Further, the false-alarm rate, also called the false-alarm probability, refers to the probability that a target is judged present when none exists, caused by ubiquitous, fluctuating noise when a threshold detection method is used, as in radar detection.
The invention provides a real-time, single-channel robust voice keyword detection method. By detecting keywords in the signal collected by a single microphone, the method has wider applicability than keyword detection based on a beamformed microphone array; by using a second-stage network for confirmation, it greatly reduces the false-alarm rate of the neural network at little cost in performance; and it maintains a high wake-up rate even in noisy environments.
Although the present invention has been described with reference to the above embodiments, the scope of the present invention is not limited thereto; modifications, substitutions, and the like of the above components that do not depart from the spirit of the present invention are intended to fall within the scope of the claims of the present invention.
Claims (10)
1. A monophonic robustness voice keyword real-time detection method is characterized by comprising the following steps:
s1 receiving the noisy speech signal in electronic format, which contains human voice and background noise of non-human voice;
s2, converting the voice signal with noise in the time domain into a frequency domain signal by short-time Fourier transform frame by frame;
s3, processing the frequency domain signal by using a Mel filter to obtain Mel characteristics and taking the Mel characteristics as acoustic characteristics;
the S4 neural network includes: a convolutional neural network, a one-way long-short term memory recurrent neural network and a feedforward type deep neural network;
the Mel features pass frame by frame through the convolutional neural network, the one-way long-short term memory recurrent neural network and the feedforward type deep neural network, and are then processed by a normalized exponential function to obtain confidence information for each keyword;
s5, when the confidence of a certain keyword is greater than a predefined threshold, splicing the current frame with a plurality of preceding frames, and using the spliced frames as the output signal of the neural network;
s6, the output signal of the neural network passes through an attention mechanism and the feedforward deep neural network in sequence, and after processing by the normalized exponential function, sentence-level confidence information for each keyword is obtained; when the confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise the keyword is considered not detected.
2. The method as claimed in claim 1, wherein the mel feature is formed by concatenating the mel feature of the current frame and a plurality of frames in the future.
3. The method as claimed in claim 1, wherein the uni-directional long-short term memory recurrent neural network comprises a plurality of stacked uni-directional layers, each of the uni-directional layers having sixty-four neurons.
4. The method of claim 1, wherein the neural network is trained using a noisy large data set, wherein the noisy speech is a mixture of multiple noises and multiple speaker voices;
the noisy speech is a mixture of thousands of different types of noise and over five hundred speakers' speech.
5. The method of claim 1, wherein the convolutional neural network is formed by stacking a plurality of single convolutional layers;
each of the single convolutional layers of the convolutional neural network is connected by an activation function layer.
6. The method as claimed in claim 1, wherein the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers;
and each linear layer of the feedforward type deep neural network is connected with each other through an activation function layer.
7. The method of claim 1, wherein the attention mechanism is a soft attention mechanism.
8. The method of claim 1, wherein the input of the attention mechanism is from an output of a convolutional neural network layer.
9. The method of claim 1, wherein the attention mechanism is input by mixing output signals of a current frame and convolution layers of a plurality of past frames.
10. The method of claim 1, wherein a vector size of an output of the neural network is one more than a number of the keywords involved in the training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910945315.6A CN110767223B (en) | 2019-09-30 | 2019-09-30 | Voice keyword real-time detection method of single sound track robustness |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110767223A true CN110767223A (en) | 2020-02-07 |
CN110767223B CN110767223B (en) | 2022-04-12 |
Family
ID=69329184
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910945315.6A Active CN110767223B (en) | 2019-09-30 | 2019-09-30 | Voice keyword real-time detection method of single sound track robustness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110767223B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111261148A (en) * | 2020-03-13 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Training method of voice model, voice enhancement processing method and related equipment |
CN111755002A (en) * | 2020-06-19 | 2020-10-09 | 北京百度网讯科技有限公司 | Speech recognition device, electronic apparatus, and speech recognition method |
CN111862973A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Voice awakening method and system based on multi-command words |
CN111862957A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Single track voice keyword low-power consumption real-time detection method |
CN111883181A (en) * | 2020-06-30 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Audio detection method and device, storage medium and electronic device |
CN112163405A (en) * | 2020-09-08 | 2021-01-01 | 北京百度网讯科技有限公司 | Question generation method and device |
CN112466280A (en) * | 2020-12-01 | 2021-03-09 | 北京百度网讯科技有限公司 | Voice interaction method and device, electronic equipment and readable storage medium |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
US20230128588A1 (en) * | 2020-08-24 | 2023-04-27 | Unlikely Artificial Intelligence Limited | Computer implemented method for the automated analysis or use of data |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106104674A (en) * | 2014-03-24 | 2016-11-09 | 微软技术许可有限责任公司 | Mixing voice identification |
US20160358602A1 (en) * | 2015-06-05 | 2016-12-08 | Apple Inc. | Robust speech recognition in the presence of echo and noise using multiple signals for discrimination |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
US20180261213A1 (en) * | 2017-03-13 | 2018-09-13 | Baidu Usa Llc | Convolutional recurrent neural networks for small-footprint keyword spotting |
CN108630193A (en) * | 2017-03-21 | 2018-10-09 | 北京嘀嘀无限科技发展有限公司 | Audio recognition method and device |
CN109671433A (en) * | 2019-01-10 | 2019-04-23 | 腾讯科技(深圳)有限公司 | A kind of detection method and relevant apparatus of keyword |
CN109841226A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network |
WO2019116604A1 (en) * | 2017-12-15 | 2019-06-20 | Mitsubishi Electric Corporation | Speech recognition system |
CN110246490A (en) * | 2019-06-26 | 2019-09-17 | 合肥讯飞数码科技有限公司 | Voice keyword detection method and relevant apparatus |
Non-Patent Citations (5)
Title |
---|
X. HAO: "An Attention-based Neural Network Approach for Single Channel Speech Enhancement", 《2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
X. WANG: "Adversarial Examples for Improving End-to-end Attention-based Small-footprint Keyword Spotting", 《2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
Y. HUANG: "Supervised Noise Reduction for Multichannel Keyword Spotting", 《2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
ZHANG Yu: "Far-field speech recognition based on attention LSTM and multi-task learning", Journal of Tsinghua University (Science and Technology) *
TU Zhiqiang: "Simulation research on speech command word recognition in a vehicle noise environment", China Masters' Theses Full-text Database, Information Science and Technology *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111261148A (en) * | 2020-03-13 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Training method of voice model, voice enhancement processing method and related equipment |
CN111261148B (en) * | 2020-03-13 | 2022-03-25 | 腾讯科技(深圳)有限公司 | Training method of voice model, voice enhancement processing method and related equipment |
CN111755002A (en) * | 2020-06-19 | 2020-10-09 | 北京百度网讯科技有限公司 | Speech recognition device, electronic apparatus, and speech recognition method |
CN111883181A (en) * | 2020-06-30 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Audio detection method and device, storage medium and electronic device |
CN111862973A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Voice awakening method and system based on multi-command words |
CN111862957A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Single track voice keyword low-power consumption real-time detection method |
US20230128588A1 (en) * | 2020-08-24 | 2023-04-27 | Unlikely Artificial Intelligence Limited | Computer implemented method for the automated analysis or use of data |
CN112163405A (en) * | 2020-09-08 | 2021-01-01 | 北京百度网讯科技有限公司 | Question generation method and device |
CN112466280A (en) * | 2020-12-01 | 2021-03-09 | 北京百度网讯科技有限公司 | Voice interaction method and device, electronic equipment and readable storage medium |
CN112466280B (en) * | 2020-12-01 | 2021-12-24 | 北京百度网讯科技有限公司 | Voice interaction method and device, electronic equipment and readable storage medium |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113035231B (en) * | 2021-03-18 | 2024-01-09 | 三星(中国)半导体有限公司 | Keyword detection method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110767223B (en) | 2022-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110767223B (en) | Voice keyword real-time detection method of single sound track robustness | |
CN111971743B (en) | Systems, methods, and computer readable media for improved real-time audio processing | |
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
EP3926623B1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
CN107452389B (en) | Universal single-track real-time noise reduction method | |
CN107393526B (en) | Voice silence detection method, device, computer equipment and storage medium | |
CN109841226A (en) | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network | |
US10504539B2 (en) | Voice activity detection systems and methods | |
Skowronski et al. | Automatic speech recognition using a predictive echo state network classifier | |
US20180025721A1 (en) | Automatic speech recognition using multi-dimensional models | |
CN110556103A (en) | Audio signal processing method, apparatus, system, device and storage medium | |
Myer et al. | Efficient keyword spotting using time delay neural networks | |
US20120239403A1 (en) | Downsampling Schemes in a Hierarchical Neural Network Structure for Phoneme Recognition | |
KR20180038219A (en) | Apparatus and Method for detecting voice based on correlation between time and frequency using deep neural network | |
CN113628612A (en) | Voice recognition method and device, electronic equipment and computer readable storage medium | |
CN109346062A (en) | Sound end detecting method and device | |
CN113205820A (en) | Method for generating voice coder for voice event detection | |
Huang et al. | Improving audio anomalies recognition using temporal convolutional attention networks | |
Soni et al. | State-of-the-art analysis of deep learning-based monaural speech source separation techniques | |
WO2021062705A1 (en) | Single-sound channel robustness speech keyword real-time detection method | |
US20180108345A1 (en) | Device and method for audio frame processing | |
Wang et al. | Robust speech recognition from ratio masks | |
Hadi et al. | An efficient real-time voice activity detection algorithm using teager energy to energy ratio | |
CN114333884B (en) | Voice noise reduction method based on combination of microphone array and wake-up word | |
CN113823311B (en) | Voice recognition method and device based on audio enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000 Applicant after: ELEVOC TECHNOLOGY Co.,Ltd. Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000 Applicant before: ELEVOC TECHNOLOGY Co.,Ltd. |
|
GR01 | Patent grant | ||