CN110767223A - Real-time detection method for single-channel robust voice keywords

Info

Publication number
CN110767223A
CN110767223A (application CN201910945315.6A)
Authority
CN
China
Prior art keywords
neural network
keyword
frame
mel
signal
Prior art date
Legal status
Granted
Application number
CN201910945315.6A
Other languages
Chinese (zh)
Other versions
CN110767223B (en)
Inventor
胡鹏
闫永杰
Current Assignee
Elephant Acoustical (shenzhen) Technology Co Ltd
Original Assignee
Elephant Acoustical (shenzhen) Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Elephant Acoustical (shenzhen) Technology Co Ltd
Priority to CN201910945315.6A
Publication of CN110767223A
Application granted
Publication of CN110767223B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 ... using orthogonal transformation
    • G10L19/04 ... using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L19/265 Pre-filtering, e.g. high frequency emphasis prior to encoding
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 ... characterised by the type of extracted parameters
    • G10L25/24 ... the extracted parameters being the cepstrum
    • G10L25/27 ... characterised by the analysis technique
    • G10L25/30 ... using neural networks
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a single-channel robust voice keyword real-time detection method, which comprises the following steps: receiving noisy speech in electronic format; converting the time-domain speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform; processing the frequency-domain signal with a mel filter to obtain mel features as acoustic features; passing the mel features frame by frame through a neural network and then through a normalized exponential function to obtain confidence information for each keyword; when the confidence of a keyword exceeds a predefined threshold, splicing the current frame with several preceding frames as the output of the neural network; and, when the resulting confidence value also exceeds a predefined threshold, deciding that the keyword is detected, and otherwise that it is not. The invention performs excellently: it maintains a high wake-up rate in noisy environments, has wide practicability, and greatly reduces the false alarm rate of the neural network, improving overall performance.

Description

Real-time detection method for single-channel robust voice keywords
Technical Field
The invention relates to the technical field of electronic communication noise reduction, and in particular to a single-channel robust voice keyword real-time detection method.
Background
With the rise of applications such as intelligent assistants and smart speakers, voice keyword detection has drawn increasing attention from industry as an important link in human-computer interaction. Filler models based on hidden Markov models were the first to be applied to keyword detection. Among the variants of the task, robust single-channel keyword detection is particularly challenging, because it relies only on the sound signal recorded by one microphone and cannot use the spatial information commonly exploited by microphone arrays. On the other hand, robust single-microphone keyword detection covers broader acoustic application scenarios than keyword detection with a beamformed microphone array (spatial filtering through an appropriately configured sensor array), and because only one microphone is used, it is cheaper and more convenient. A recent breakthrough in keyword detection replaces the hidden Markov model with a deep neural network, which occupies less memory, needs no decoding search, and achieves high accuracy. The most advanced prior method uses a feedforward deep neural network (DNN) trained on a large amount of data with frame-level labels. Although this method can detect keywords, its robustness to noise is poor; it can be improved by adding noise of different types and different signal-to-noise ratios to the input speech during training, but the false alarm rate remains high.
The existing solutions have the following disadvantages:
1. although keyword detection can be achieved, robustness to noise is poor;
2. the false alarm rate is high.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the poor noise robustness and high false alarm rate of the prior-art methods by providing a single-channel robust voice keyword real-time detection method.
The technical solution adopted by the invention to solve this problem is as follows: compared with keyword detection using a beamformed microphone array, the method maintains a high wake-up rate without using spatial position information, and, using only one microphone, it covers wider application scenarios. The method detects keywords by supervised learning and achieves robustness in noisy scenes by combining two training targets: noise reduction and keyword detection. It performs excellently, maintains a high wake-up rate in noisy environments, has wide practicability, and greatly reduces the false alarm rate of the neural network.
The single-channel robust voice keyword real-time detection method of the invention comprises the following steps:
S1, receiving a noisy speech signal in electronic format, the signal containing human voice and non-human background noise;
S2, converting the time-domain noisy speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform;
S3, processing the frequency-domain signal with a mel filter to obtain mel features, which serve as the acoustic features;
S4, the neural network comprises: a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network;
the mel features pass frame by frame through the convolutional neural network, the unidirectional long short-term memory recurrent neural network and the feedforward deep neural network, and are then processed by a normalized exponential function to obtain confidence information for each keyword;
S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network;
S6, the output signal of the neural network passes through an attention mechanism and the feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when this confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
In the single-channel robust voice keyword real-time detection method, the mel feature is formed by splicing the mel features of the current frame and several future frames.
In the single-channel robust voice keyword real-time detection method, the unidirectional long short-term memory recurrent neural network comprises a plurality of stacked unidirectional layers, each unidirectional layer having sixty-four neurons.
In the single-channel robust voice keyword real-time detection method, the neural network is trained on a large noisy data set, wherein the noisy speech is formed by mixing multiple kinds of noise with the voices of multiple speakers;
the noisy speech mixes thousands of different types of noise with the speech of more than five hundred speakers.
In the single-channel robust voice keyword real-time detection method, the convolutional neural network is formed by stacking a plurality of single convolutional layers;
the single convolutional layers of the convolutional neural network are connected by activation function layers.
In the single-channel robust voice keyword real-time detection method, the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers;
the linear layers of the feedforward deep neural network are connected with each other through activation function layers.
In the single-channel robust voice keyword real-time detection method, the attention mechanism is a soft attention mechanism.
In the single-channel robust voice keyword real-time detection method, the input of the attention mechanism comes from the output of the convolutional neural network layer.
In the single-channel robust voice keyword real-time detection method, the input of the attention mechanism is obtained by combining the convolutional-layer output signals of the current frame and several past frames.
In the single-channel robust voice keyword real-time detection method, the vector size output by the neural network is the number of keywords participating in training plus one.
The single-channel robust voice keyword real-time detection method of the invention has the beneficial effects of excellent performance: it effectively reduces noise in close-range conversation scenes, has stronger practicability than the prior art, and does not depend on the specific noise or speaker.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a functional block diagram of the single-channel robust voice keyword real-time detection method of the present invention;
FIG. 2 is a table of sample keywords and their sequence numbers for the method;
FIG. 3 shows the decreasing trend of the cross-entropy loss during training of the method;
FIG. 4 shows the trend of the mean square error during the actual training process of the method;
FIG. 5 is a schematic diagram of the soft attention mechanism of the method;
FIG. 6 is a flow chart of the method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the single-channel robust voice keyword real-time detection method comprises the following steps: S1, receiving a noisy speech signal in electronic format, the signal containing human voice and non-human background noise; S2, converting the time-domain noisy speech signal into a frequency-domain signal frame by frame using a short-time Fourier transform; S3, processing the frequency-domain signal with a mel filter to obtain mel features, which serve as the acoustic features; S4, the neural network comprises a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network; the mel features pass frame by frame through these three networks and are then processed by a normalized exponential function (Softmax) to obtain confidence information for each keyword; S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network; S6, the output signal of the neural network passes through an attention mechanism and a feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when this confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
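For orientation, the following is a minimal Python sketch of this two-stage decision flow. The names frame_net and sentence_net stand for the trained first-stage network (the CNN, LSTM and DNN of step S4) and the second-stage network (the attention mechanism plus DNN of step S6); the 180-frame context length and the 0.5 thresholds are illustrative assumptions, not values fixed by this description.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def detect_keyword(mel_frames, frame_net, sentence_net, t1=0.5, t2=0.5):
        # mel_frames: (T, F) acoustic features from steps S2-S3.
        # frame_net returns per-frame hidden features (T, H) and logits
        # (T, K+1), with class 0 being the non-keyword class (step S4).
        hidden, logits = frame_net(mel_frames)
        probs = softmax(logits)                          # frame-level confidences
        for t in range(probs.shape[0]):
            kw = 1 + int(np.argmax(probs[t, 1:]))        # best keyword class this frame
            if probs[t, kw] > t1:                        # S5: first-stage trigger
                ctx = hidden[max(0, t - 179): t + 1]     # splice current + preceding frames
                if softmax(sentence_net(ctx))[kw] > t2:  # S6: sentence-level confirmation
                    return kw                            # keyword index detected
        return None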
Further, confidence is also called reliability, confidence level, or confidence coefficient. When a population parameter is estimated from a sample, the conclusion is always uncertain because of the randomness of sampling. A probabilistic statement is therefore used, namely interval estimation in mathematical statistics: the probability that the estimate and the population parameter lie within a certain allowed error range, and this probability is called the confidence.
Further, the short-time Fourier transform (STFT) is a Fourier-related mathematical transform used to determine the frequency and phase of local sections of a time-varying signal.
Further, the normalized exponential function, or Softmax function, is a generalization of the logistic function in mathematics, especially in probability theory and related fields. It "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z), such that each element lies in the range (0,1) and all elements sum to 1. This function is mostly used in multi-class classification problems.
Further, in step S2, the time-domain digital signal is converted into a frequency-domain signal by the fast Fourier transform (FFT). Once the signal is in the frequency domain, its frequency components can be conveniently analyzed and processed there, enabling many signal-processing algorithms that cannot be accomplished in the time domain.
Furthermore, in step S1, the training process uses noisy speech obtained by mixing clean speech and noise at different signal-to-noise ratios, while the inference process uses real recorded speech. In step S2, feature extraction frames and windows the noisy speech; each frame is twenty milliseconds long, with a ten-millisecond overlap between adjacent frames. A spectral magnitude vector is extracted on each frame using the fast Fourier transform and filtered with a mel filter to obtain the acoustic features of each frame. Because the speech signal has strong correlation in the time dimension, and this correlation is very helpful for the keyword detection task, the method splices the current frame with several neighboring frames into a higher-dimensional vector as the input feature, improving keyword detection performance. The method is executed by a computer program: acoustic features are extracted from the noisy speech, which is then processed by the deep neural network to decide whether the original noisy speech contains a keyword. The method includes one or more program modules executed by any system or hardware device with executable computer program instructions.
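A minimal sketch of this feature-extraction front end, assuming a 16 kHz sampling rate, 40 mel bands and log compression (all illustrative; the text fixes only the 20 ms frame length and 10 ms overlap) and using librosa for the STFT and mel filterbank:
    import numpy as np
    import librosa  # assumed available; any STFT + mel filterbank implementation would do

    def mel_features(noisy_wave, sr=16000, n_mels=40):
        # 20 ms frames with a 10 ms hop, i.e. a 10 ms overlap between
        # adjacent frames, as described above.
        n_fft = int(0.020 * sr)   # 20 ms window: 320 samples at 16 kHz
        hop = int(0.010 * sr)     # 10 ms hop
        mag = np.abs(librosa.stft(noisy_wave, n_fft=n_fft, hop_length=hop))  # magnitude spectrum
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # mel filterbank
        return np.log(mel_fb @ mag + 1e-8).T  # (frames, n_mels) acoustic features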
Further, for applications requiring real-time processing, such as mobile hearing-aid noise reduction, using information from future frames is unacceptable because it introduces delay. For keyword detection, delay within a certain range is acceptable, so some real-time performance can be sacrificed for better accuracy. Specifically, the current frame can be spliced with the next 10 frames as the input of the present invention; this adds only 100 milliseconds of delay while increasing keyword detection accuracy, and the number of future frames can be increased to 20 for further gains. Note that historical frames are not spliced into the input: the long short-term memory recurrent neural network used as a component of the network already retains a portion of the important historical input information, so only future frames are needed, which reduces the computation of the neural network and the power consumption of the hardware it runs on.
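The frame splicing described above could look like the following sketch; the repeat-padding at the end of the utterance is an assumption, as the text does not specify how edge frames are handled:
    import numpy as np

    def splice_future(feats, n_future=10):
        # Concatenate each frame with its next n_future frames (100 ms of
        # lookahead at a 10 ms hop). Historical frames are not spliced,
        # since the LSTM already retains past information.
        T, F = feats.shape
        padded = np.vstack([feats, np.repeat(feats[-1:], n_future, axis=0)])
        return np.hstack([padded[i:i + T] for i in range(n_future + 1)])  # (T, F * (n_future + 1))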
More specifically, the present invention supports one or more keywords. Different numbers of wake words correspond to different network output dimensions; the network output includes a first-level frame-level output and a second-level sentence-level output, and the output dimension equals the number of keywords plus one: with one keyword the output dimension is 2, with two keywords it is 3, and so on. Before training, each command word is numbered; for example, the six keywords "light on", "turn on tv", "turn on air conditioner", "turn off light", "turn off tv" and "turn off air conditioner" may be numbered as shown in fig. 2. Besides the keywords, the non-keyword class must also be numbered: the non-keyword class is usually numbered 0, and the keywords are numbered sequentially from 1 as the labels of the training speech. During training, the labels are one-hot encoded, and the encoded result is combined with the neural network output to compute the cross entropy loss.
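As an illustration of this numbering and loss computation, a sketch using the six example command words (translated) mentioned above; the layout beyond "non-keyword = 0, keywords from 1" follows fig. 2:
    import numpy as np

    labels = {"non-keyword": 0, "light on": 1, "turn on tv": 2,
              "turn on air conditioner": 3, "turn off light": 4,
              "turn off tv": 5, "turn off air conditioner": 6}

    def one_hot(idx, n_classes=len(labels)):
        v = np.zeros(n_classes)
        v[idx] = 1.0
        return v

    def cross_entropy(probs, target_idx):
        # Cross entropy between the softmax output and a one-hot label
        # reduces to the negative log-probability of the true class.
        return -np.log(probs[target_idx] + 1e-12)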
Further, cross entropy is an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is typically measured by cross entropy and perplexity. Cross entropy expresses the difficulty of recognizing text with the model or, from a compression viewpoint, how many bits on average are needed to encode each word. Perplexity expresses the average number of branching choices per word under the model; its reciprocal can be regarded as the average probability of each word.
Further, generalization ability is very important for any supervised learning approach; it refers to the method's performance in scenarios that did not participate in training. For generalization, the method mixes clean human voice with about 10000 kinds of noise collected from different scenes, at different signal-to-noise ratios (SNR) and loudness levels, and then solves the generalization problem through large-scale training. Because the recurrent neural network can model long-term dependencies in the signal, the proposed model generalizes well to new noise and speaker scenarios, which is very important for practical applications. Preferably, to achieve better performance, the present invention uses an RNN model that depends on future frames, so the network can obtain information from both the past and the future.
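A sketch of mixing clean speech with noise at a target SNR, as used to build the noisy training set; the power-based scaling shown here is one common choice, and the loudness randomization mentioned above is omitted:
    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Scale the noise so that the clean-to-noise power ratio equals
        # snr_db, then mix.
        noise = np.resize(noise, clean.shape)  # loop or crop the noise to length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise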
As shown in fig. 6, which illustrates the whole process of the invention in detail, a noisy speech signal is input and a keyword number is output; "1" in the figure marks steps used only during training, "2" marks steps of the inference or prediction phase, and "3" marks steps shared by training and prediction. To keep the invention robust in strong-noise environments, the Ideal Ratio Mask (IRM) is used as an auxiliary training target in the training stage. The IRM is obtained by comparing the mel features of the noisy speech with the mel features of the clean speech. As shown in fig. 1, the invention employs a 7-layer deconvolutional neural network (De-CNN), corresponding to the 7-layer convolutional neural network (CNN), to estimate an ideal ratio mask for each input noisy utterance, and then computes the mean square error (MSE) between the ideal ratio mask and the estimated mask. The loss function of the neural network therefore contains both the cross entropy loss and the mean square error, and the network minimizes the loss over the whole training set through repeated iterations. After the training phase ends, the prediction phase begins, in which the deconvolution part is not used at all. This way of training multiple targets simultaneously is generally called joint training. The output of the deconvolutional part is a predicted ideal ratio mask; applying it to the mel features of the noisy speech can recover the mel features of the clean speech. Adding the predicted ideal ratio mask as a training target therefore makes the convolutional part keep more clean-speech features and filter out noise features irrelevant to keyword detection, so the whole neural network has better noise-reduction behavior, remains robust in strong-noise environments, and can still achieve high accuracy there.
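A sketch of the auxiliary target and the joint loss follows. The text says only that the IRM compares noisy and clean mel features, so the exact mask definition below, and the loss weighting alpha, are assumptions:
    import numpy as np

    def ideal_ratio_mask(mel_clean, mel_noise):
        # One common IRM definition on mel energies: the fraction of each
        # mel bin's energy that belongs to the clean speech.
        return mel_clean / (mel_clean + mel_noise + 1e-12)  # values in [0, 1]

    def joint_loss(cross_entropy_loss, estimated_mask, irm, alpha=1.0):
        # Joint training objective: keyword cross entropy plus the mean
        # square error between the estimated and ideal ratio masks.
        mse = np.mean((estimated_mask - irm) ** 2)
        return cross_entropy_loss + alpha * mse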
Further, the mean square error (MSE) is a measure reflecting the degree of difference between an estimator and the quantity being estimated.
Still further, robustness means that the wake-up system can still be woken up correctly, and keep the false wake-up rate low, under noise interference or changes in the target voice.
As shown in fig. 3 and fig. 4, the mean square error (MSE) drops rapidly between roughly iteration 1200 and iteration 1400, and the cross-entropy loss drops rapidly at the same time. The performance of the keyword detection task thus keeps improving as the performance of the noise-reduction task improves, indicating that the noise-reduction task has a synergistic effect on the keyword detection task.
Further, the training process of the invention is divided into two stages. The first stage uses the mean square error (MSE) and cross entropy losses mentioned above; the second stage mainly trains sentence-level keyword detection. After the first stage, the trained 7-layer convolutional neural network already has a good noise-reduction effect, so in the second stage the weight parameters of the convolutional neural network are frozen and not updated. During this stage, the convolutional-network outputs of the current frame and 179 historical frames, 180 frames in total, are spliced as the input of the attention mechanism. These 180 frames correspond to 1.8 seconds of speech; since most keyword utterances are shorter than 1.8 seconds, these 180 frames of features contain almost the entire wake-word utterance. The attention mechanism derives from studies of human vision: in cognitive science, because of information-processing bottlenecks, humans selectively attend to part of the available information while ignoring the rest. Here it is used to extract keyword-relevant information from a long time series while ignoring other useless information. Using the second-stage wake-up with attention effectively alleviates the poor performance caused by the excessive false alarm rate of using only the first-stage wake-up.
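The second-stage mechanics described in this paragraph might be sketched as follows; the zero-padding of early frames is an assumption:
    import torch

    def freeze(module: torch.nn.Module):
        # Stage two: freeze the stage-one CNN so that only the attention
        # and DNN layers are updated.
        for p in module.parameters():
            p.requires_grad = False

    def attention_context(cnn_outputs: torch.Tensor, t: int, n_hist: int = 179):
        # Splice the CNN outputs of the current frame and 179 historical
        # frames (180 frames = 1.8 s at a 10 ms hop) as the attention input.
        start = t - n_hist
        if start < 0:
            pad = cnn_outputs.new_zeros(-start, cnn_outputs.shape[1])
            return torch.cat([pad, cnn_outputs[:t + 1]], dim=0)  # (180, H)
        return cnn_outputs[start:t + 1]  # (180, H)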
Further, the attention mechanism used within the present invention is a soft attention mechanism.
As shown in fig. 5, the principle of soft attention can be described specifically by the following formula:
$e_t = v^{\top}\tanh(W h_t + b)$
$\alpha_t = \dfrac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t$
As the formulas show, the attention mechanism first computes a score e_t from the hidden state h_t of each of the T frames, then normalizes these scores with a normalized exponential function to obtain the weight α_t of each frame, and finally outputs the weighted sum c of the hidden states over the whole input sequence.
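A sketch of the soft attention module implementing the two formulas above in PyTorch; the attention dimension of 64 is illustrative:
    import torch

    class SoftAttention(torch.nn.Module):
        # e_t = v^T tanh(W h_t + b); alpha_t = softmax(e_t) over the T
        # frames; output c = sum_t alpha_t h_t.
        def __init__(self, hidden_dim, attn_dim=64):
            super().__init__()
            self.proj = torch.nn.Linear(hidden_dim, attn_dim)  # W and b
            self.v = torch.nn.Linear(attn_dim, 1, bias=False)  # v^T

        def forward(self, h):                      # h: (T, hidden_dim)
            e = self.v(torch.tanh(self.proj(h)))   # (T, 1) per-frame scores
            alpha = torch.softmax(e, dim=0)        # weights summing to 1 over frames
            return (alpha * h).sum(dim=0)          # (hidden_dim,) attention output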
Furthermore, the Mel feature is formed by splicing the Mel feature of the current frame and a plurality of frames in the future.
Furthermore, a fast fourier transform is used to extract a spectral magnitude vector on each frame, and then a mel filter is used to perform filtering to obtain an acoustic feature, i.e. a mel feature, of each frame.
Further, the unidirectional long short-term memory recurrent neural network includes a plurality of stacked unidirectional layers, each unidirectional layer having sixty-four neurons.
Further, the neural network is trained by adopting a noisy big data set, wherein the noisy speech is formed by mixing a plurality of noises and a plurality of speaker voices;
noisy speech is a mixture of thousands of different types of noise and over five hundred speakers' speech.
Furthermore, the convolutional neural network is formed by stacking a plurality of single convolutional layers; each single convolutional layer of the convolutional neural network is connected by an activation function layer.
Further, the convolutional neural network is a deep feedforward artificial neural network whose artificial neurons respond to surrounding units within their receptive field; it includes convolutional layers and pooling layers.
Further, the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers; each linear layer of the feedforward deep neural network is connected with each other through an activation function layer.
Furthermore, the feedforward deep neural network (DNN) is a feedforward neural network with at least one hidden layer; its layers are made non-linear by activation functions, its loss is the cross entropy, and it is trained (the weights between neurons adjusted and updated) by back-propagation optimization algorithms such as stochastic gradient descent and batch gradient descent.
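Putting the architectural statements together, a sketch of the first-stage network. The 7 convolutional layers and the 64 LSTM units per layer come from this description; the use of 1-D convolutions, the kernel size, channel width, two LSTM layers and the DNN sizes are assumptions:
    import torch

    class FrameLevelNet(torch.nn.Module):
        # Stacked convolutional layers joined by activation layers, a
        # unidirectional LSTM with 64 units per layer, and a feedforward
        # DNN of linear layers ending in (keywords + 1) logits.
        def __init__(self, n_mels=40, n_keywords=6, conv_ch=64, n_conv=7):
            super().__init__()
            convs, ch = [], n_mels
            for _ in range(n_conv):
                convs += [torch.nn.Conv1d(ch, conv_ch, kernel_size=3, padding=1),
                          torch.nn.ReLU()]  # activation layer joins the conv layers
                ch = conv_ch
            self.cnn = torch.nn.Sequential(*convs)
            self.lstm = torch.nn.LSTM(conv_ch, 64, num_layers=2, batch_first=True)
            self.dnn = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                                           torch.nn.Linear(64, n_keywords + 1))

        def forward(self, mel):                   # mel: (batch, frames, n_mels)
            x = self.cnn(mel.transpose(1, 2))     # convolve over time: (batch, ch, frames)
            x, _ = self.lstm(x.transpose(1, 2))   # (batch, frames, 64)
            return self.dnn(x)                    # per-frame logits: (batch, frames, K+1)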
Further, the attention mechanism is a Soft-attention mechanism (Soft-attention).
Further, the input to the attention mechanism comes from the output of the convolutional neural network layer.
Further, the input of the attention mechanism is obtained by combining the convolutional-layer output signals of the current frame and several past frames.
Furthermore, the attention mechanism expands the capability of a neural network: it allows more complex functions to be approximated, attends to specific parts of the input in a more intuitive way, helps improve benchmark performance in natural language processing, and enables new capabilities such as image captioning, memory-network addressing and neural programmers.
Further, the vector size of the output of the neural network is the number of keywords involved in the training plus one.
Furthermore, the single-channel robust keyword detection of the present invention operates on the signal collected by a single microphone and therefore has wider practicability than keyword detection with a beamformed microphone array.
Furthermore, the invention detects keywords by supervised learning, using a convolutional neural network, a long short-term memory recurrent neural network and a feedforward deep neural network. It adopts a sentence-level attention mechanism and a feedforward deep neural network as the second-level network: when the output confidence of the first-level deep neural network (built from the convolutional neural network, the long short-term memory recurrent neural network and the feedforward deep neural network) is greater than the threshold, the second-level network is used for confirmation; if the output confidence of the second-level network is again greater than the threshold, the keyword is considered detected, otherwise not. Confirmation by the second-level network can greatly reduce the false alarm rate of the neural network at the cost of little impact on performance, improving the performance of the method.
Further, the long short-term memory network (LSTM) is a recurrent neural network specially designed to solve the long-term dependency problem of general RNNs (recurrent neural networks); all RNNs have the form of a chain of repeating neural-network modules.
Furthermore, regarding the robustness of single-channel robust wake-word detection: a high wake-up rate can still be kept in noisy environments, with a correct wake-up rate above 90% maintained under noisy wake-up conditions at a signal-to-noise ratio of 0 dB.
Further, the signal-to-noise ratio (SNR, also written S/N) is the ratio of signal to noise in an electronic device or system. The signal is the electronic signal from outside the device that the device is to process; the noise is an irregular extra signal that does not exist in the original signal and is generated as the signal passes through the device, and it does not change as the original signal changes.
Furthermore, robust single-channel keyword detection refers to keyword detection on electronic-format speech collected by a single microphone. Compared with keyword detection using a beamformed microphone array, the method keeps a high wake-up rate without using spatial position information, and using only one microphone gives it wider application scenarios. The invention detects keywords by supervised learning and achieves robustness in noisy scenes by combining two training targets: noise reduction and keyword detection.
Furthermore, the invention introduces a second-level network to address the excessively high false alarm rate of keyword detection; the attention mechanism in the second-level network can extract keyword-related information from a long time series. In the inference stage, the second-level network executes its logic only after the output of the first-level network is greater than the threshold, which saves part of the computation cost.
Further, the false alarm rate, also called the false alarm probability, is the probability that, in threshold-based radar detection, a target is judged to be present when none exists, owing to the ubiquitous and fluctuating noise.
The invention provides a single-channel robust voice keyword real-time detection method. It performs keyword detection on the signal collected by a single microphone, has wider practicability than keyword detection with a beamformed microphone array, and uses a second-level network for confirmation, which greatly reduces the false alarm rate of the neural network with little impact on performance, improves the overall performance, and keeps a high wake-up rate in noisy environments.
Although the present invention has been described with reference to the above embodiments, the scope of the invention is not limited thereto; modifications, substitutions and the like to the above elements, made without departing from the spirit of the invention, are intended to fall within the scope of the claims of the invention.

Claims (10)

1. A single-channel robust voice keyword real-time detection method, characterized by comprising the following steps:
s1 receiving the noisy speech signal in electronic format, which contains human voice and background noise of non-human voice;
s2, converting the voice signal with noise in the time domain into a frequency domain signal by short-time Fourier transform frame by frame;
s3, processing the frequency domain signal by using a Mel filter to obtain Mel characteristics and taking the Mel characteristics as acoustic characteristics;
the S4 neural network comprises: a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network;
the mel features pass frame by frame through the convolutional neural network, the unidirectional long short-term memory recurrent neural network and the feedforward deep neural network, and are then processed by a normalized exponential function to obtain confidence information for each keyword;
S5, when the confidence of a certain keyword is greater than a predefined threshold, the current frame and several preceding frames are spliced and used as the output signal of the neural network;
S6, the output signal of the neural network passes through an attention mechanism and the feedforward deep neural network in sequence, and sentence-level confidence information for each keyword is obtained after normalized exponential function processing; when the confidence value is greater than a predefined threshold, the keyword is considered detected, otherwise it is considered not detected.
2. The method as claimed in claim 1, wherein the mel feature is formed by concatenating the mel feature of the current frame and a plurality of frames in the future.
3. The method as claimed in claim 1, wherein the uni-directional long-short term memory recurrent neural network comprises a plurality of stacked uni-directional layers, each of the uni-directional layers having sixty-four neurons.
4. The method of claim 1, wherein the neural network is trained using a noisy large data set, wherein the noisy speech is a mixture of multiple noises and multiple speaker voices;
the noisy speech is a mixture of thousands of different types of noise and over five hundred speakers' speech.
5. The method of claim 1, wherein the convolutional neural network is formed by stacking a plurality of single convolutional layers;
each of the single convolutional layers of the convolutional neural network is connected by an activation function layer.
6. The method as claimed in claim 1, wherein the feedforward deep neural network is formed by stacking a plurality of single-layer linear layers;
and each linear layer of the feedforward type deep neural network is connected with each other through an activation function layer.
7. The method of claim 1, wherein the attention mechanism is a soft attention mechanism.
8. The method of claim 1, wherein the input of the attention mechanism is from an output of a convolutional neural network layer.
9. The method of claim 1, wherein the attention mechanism is input by mixing output signals of a current frame and convolution layers of a plurality of past frames.
10. The method of claim 1, wherein a vector size of an output of the neural network is one more than a number of the keywords involved in the training.
CN201910945315.6A 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords Active CN110767223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910945315.6A CN110767223B (en) 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910945315.6A CN110767223B (en) 2019-09-30 2019-09-30 Real-time detection method for single-channel robust voice keywords

Publications (2)

Publication Number Publication Date
CN110767223A (en) 2020-02-07
CN110767223B CN110767223B (en) 2022-04-12

Family

ID=69329184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910945315.6A Active CN110767223B (en) Real-time detection method for single-channel robust voice keywords

Country Status (1)

Country Link
CN (1) CN110767223B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106104674A (en) * 2014-03-24 2016-11-09 微软技术许可有限责任公司 Mixing voice identification
US20160358602A1 (en) * 2015-06-05 2016-12-08 Apple Inc. Robust speech recognition in the presence of echo and noise using multiple signals for discrimination
US20170148429A1 (en) * 2015-11-24 2017-05-25 Fujitsu Limited Keyword detector and keyword detection method
US20180261213A1 (en) * 2017-03-13 2018-09-13 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN108630193A (en) * 2017-03-21 2018-10-09 北京嘀嘀无限科技发展有限公司 Audio recognition method and device
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
WO2019116604A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Corporation Speech recognition system
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN109671433A (en) * 2019-01-10 2019-04-23 腾讯科技(深圳)有限公司 A kind of detection method and relevant apparatus of keyword
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
X. HAO: "An Attention-based Neural Network Approach for Single Channel Speech Enhancement", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
X. WANG: "Adversarial Examples for Improving End-to-end Attention-based Small-footprint Keyword Spotting", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
Y. HUANG: "Supervised Noise Reduction for Multichannel Keyword Spotting", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHANG Yu: "Far-field speech recognition based on attention LSTM and multi-task learning", Journal of Tsinghua University (Science and Technology) *
TU Zhiqiang: "Simulation research on speech command word recognition in a vehicle noise environment", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111261148A (en) * 2020-03-13 2020-06-09 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111261148B (en) * 2020-03-13 2022-03-25 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111755002A (en) * 2020-06-19 2020-10-09 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN111883181A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Audio detection method and device, storage medium and electronic device
CN111862973A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Voice awakening method and system based on multi-command words
CN111862957A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Single track voice keyword low-power consumption real-time detection method
US20230128588A1 (en) * 2020-08-24 2023-04-27 Unlikely Artificial Intelligence Limited Computer implemented method for the automated analysis or use of data
CN112163405A (en) * 2020-09-08 2021-01-01 北京百度网讯科技有限公司 Question generation method and device
CN112466280A (en) * 2020-12-01 2021-03-09 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN112466280B (en) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device

Also Published As

Publication number Publication date
CN110767223B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN110767223B (en) Real-time detection method for single-channel robust voice keywords
CN111971743B (en) Systems, methods, and computer readable media for improved real-time audio processing
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
EP3926623B1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN107452389B (en) Universal single-track real-time noise reduction method
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
US10504539B2 (en) Voice activity detection systems and methods
Skowronski et al. Automatic speech recognition using a predictive echo state network classifier
US20180025721A1 (en) Automatic speech recognition using multi-dimensional models
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
Myer et al. Efficient keyword spotting using time delay neural networks
US20120239403A1 (en) Downsampling Schemes in a Hierarchical Neural Network Structure for Phoneme Recognition
KR20180038219A (en) Apparatus and Method for detecting voice based on correlation between time and frequency using deep neural network
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN109346062A (en) Sound end detecting method and device
CN113205820A (en) Method for generating voice coder for voice event detection
Huang et al. Improving audio anomalies recognition using temporal convolutional attention networks
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
WO2021062705A1 (en) Single-sound channel robustness speech keyword real-time detection method
US20180108345A1 (en) Device and method for audio frame processing
Wang et al. Robust speech recognition from ratio masks
Hadi et al. An efficient real-time voice activity detection algorithm using teager energy to energy ratio
CN114333884B (en) Voice noise reduction method based on combination of microphone array and wake-up word
CN113823311B (en) Voice recognition method and device based on audio enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000
Applicant after: ELEVOC TECHNOLOGY Co.,Ltd.
Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000
Applicant before: ELEVOC TECHNOLOGY Co.,Ltd.
GR01 Patent grant