CN112599123B - Lightweight speech keyword recognition network, method, device and storage medium - Google Patents

Info

Publication number
CN112599123B
Authority
CN
China
Prior art keywords
layer
tdnn
features
keyword recognition
downsampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110228328.9A
Other languages
Chinese (zh)
Other versions
CN112599123A (en)
Inventor
殷绪成
张硕
杨春
陈峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN202110228328.9A priority Critical patent/CN112599123B/en
Publication of CN112599123A publication Critical patent/CN112599123A/en
Application granted granted Critical
Publication of CN112599123B publication Critical patent/CN112599123B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention is applicable to the technical field of speech recognition, and provides a lightweight speech keyword recognition network, method, device and storage medium. The lightweight speech keyword recognition network comprises a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected; the SE Block performs a squeeze-excitation operation and a reweighting operation on the downsampled features to obtain reweighted features, sequentially applies activation and normalization to the reweighted features, and outputs the normalized features to the first TDNN layer; and the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers. A lightweight speech keyword recognition network is thus constructed from TDNN layers and an SE Block, reducing hardware resource consumption.

Description

Lightweight speech keyword recognition network, method, device and storage medium
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a lightweight speech keyword recognition network, method, device and storage medium.
Background
Speech keyword recognition, also known as keyword spotting (KWS), is a task that aims at detecting predefined keywords in an audio stream. In recent years, with the rise of keyword recognition technology, wake-word detection has become popular; it is generally used to initiate interaction with voice assistants (e.g., "Hey Siri", "Alexa", and "Okay Google") or to distinguish simple common commands (e.g., "yes" or "no"). Because such tasks typically run on low-resource devices, continuously listening for specific keywords, it remains challenging to implement high-accuracy, low-latency, small-footprint KWS systems.
One common approach to KWS is Large Vocabulary Continuous Speech Recognition (LVCSR), which occupies a large amount of memory and has high latency, and is therefore generally used for keyword search over large databases. Another approach is based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) of keywords and filler words: the HMM first creates a special decoding graph containing keywords and fillers, and a Viterbi decoder then determines its optimal path and outputs the result with the highest probability. This requires a high computational cost and is difficult to deploy in on-device applications. In recent years, methods based on Deep Neural Networks (DNNs) have significantly reduced the memory footprint compared with conventional methods. Deep KWS (G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014, pp. 4087-4091) treats keyword recognition as a classification problem and trains a DNN to directly predict sub-word units of keywords. Since a DNN does not take into account the local temporal and spectral correlations of speech, Sainath and Parada (T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015) proposed replacing the DNN with a Convolutional Neural Network (CNN), which achieves better performance while occupying less memory. However, the receptive field of a CNN is usually limited and does not capture the temporal correlations of speech well. To overcome this problem, Tang and Lin (R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5484-5488) proposed a KWS system based on a Residual Network (ResNet), in which dilated convolutions double the size of the receptive field with the depth of the network. However, ResNet-based approaches still require hundreds of thousands of parameters to achieve state-of-the-art performance. To further reduce memory usage, some recent studies have applied the Time-Delay Neural Network (TDNN) (M. Sun, D. Snyder, Y. Gao, V. K. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, "Compressed time delay neural network for small-footprint keyword spotting," in Interspeech, 2017, pp. 3607-3611), attention mechanisms (C. Shan, J. Zhang, Y. Wang, and L. Xie, "Attention-based end-to-end models for small-footprint keyword spotting," in Interspeech, 2018), and MobileNet-style architectures, where MobileNet reduces the number of parameters and the computational cost through depthwise separable convolutions. However, such methods use a large number of ReLU activation functions after the convolution operations, which may compromise the expressive power of the model, and they are not efficient at propagating gradients across layers. In summary, although many new architectures have been proposed, they still require a large number of parameters and do not fully meet the requirements for running on modern low-resource devices.
Disclosure of Invention
The invention aims to provide a lightweight speech keyword recognition network, method, device and storage medium, so as to solve the problem that, owing to the large hardware resource consumption of existing speech keyword recognition, the prior art cannot fully meet the requirement of running a speech keyword recognition network on low-resource devices.
In one aspect, the present invention provides a lightweight speech keyword recognition network, which is an improved time-delay neural network, and includes a TDNN downsampling layer, a SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer, and a Softmax layer, which are connected in sequence,
the TDNN downsampling layer is used for downsampling the acoustic characteristics of the input audio to be detected and outputting the downsampled characteristics obtained after downsampling to the SE Block;
the SE Block is used for performing a squeeze-excitation operation and a re-weighting operation on the input down-sampling features to obtain re-weighted features, sequentially performing activation processing and normalization processing on the re-weighted features, and outputting the normalized features to the first TDNN layer;
the global average pooling layer is used for performing a global average pooling operation on the features processed by the two TDNN layers and outputting the global dimension reduction features obtained after the global average pooling operation to the Softmax layer;
and the Softmax layer is used for classifying the input global dimension reduction features and outputting a keyword identification result.
Preferably, the activation process is performed using a Swish activation function and the normalization process is performed using LayerNorm.
Preferably, the SE Block performs the excitation operation on the down-sampling features with a compression ratio of 16.
On the other hand, the invention provides a lightweight speech keyword recognition method based on the network, which comprises the following steps:
the TDNN downsampling layer downsamples the extracted acoustic features of the audio to be detected and outputs the downsampled features obtained after downsampling to the SE Block;
the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input down-sampling features to obtain re-weighted features, sequentially performs activation processing and normalization processing on the re-weighted features, and outputs the normalized features to the first TDNN layer;
the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers, and outputs the global dimension reduction features obtained after the global average pooling operation to the Softmax layer;
and the Softmax layer classifies the input global dimension reduction features and outputs a keyword identification result.
Preferably, before the TDNN downsampling layer performs downsampling processing on the extracted acoustic features of the audio to be detected, the method includes:
and extracting the Mel frequency cepstrum coefficient of the audio to be detected, taking the extracted Mel frequency cepstrum coefficient as the acoustic characteristic of the audio to be detected, and inputting the acoustic characteristic into the TDNN down-sampling layer.
Preferably, before the TDNN downsampling layer performs downsampling processing on the extracted acoustic features of the audio to be detected, the method further includes:
acquiring a training data set, and performing data amplification on each training data in the training data set to obtain a training data set after data amplification;
and training the lightweight speech keyword recognition network by using the training data set after data augmentation to obtain the trained lightweight speech keyword recognition network.
Preferably, the step of data augmenting each training data in the set of training data comprises:
for each training data, audio signals with a preset proportion are randomly replaced by Gaussian white noise, and a preset number of corresponding mask data are generated;
and forming a training data set after the data augmentation according to the mask data and the training data.
On the other hand, the invention provides a lightweight speech keyword recognition system, which comprises a feature extraction module and the lightweight speech keyword recognition network, wherein the feature extraction module is used for extracting the mel frequency cepstrum coefficient of the audio to be detected, and inputting the extracted mel frequency cepstrum coefficient into the lightweight speech keyword recognition network as the acoustic feature of the audio to be detected for speech keyword recognition.
In another aspect, the present invention further provides a speech keyword recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
The invention provides a lightweight speech keyword recognition network comprising a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected and outputs the downsampled features to the SE Block; the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation and normalization, and outputs the normalized features to the first TDNN layer; the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers and outputs the resulting global dimension reduction features to the Softmax layer; and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result. A lightweight speech keyword recognition network is thus constructed from TDNN layers and an SE Block, improving accuracy while greatly reducing hardware resource consumption, so that the network can run stably and smoothly on low-power devices.
Drawings
Fig. 1 is a schematic structural diagram of a lightweight speech keyword recognition network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a feature extraction implementation provided in the first embodiment of the present invention;
FIG. 3 is a flow chart of a downsampling implementation provided by an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a SE Block according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an implementation of a lightweight speech keyword recognition method according to a second embodiment of the present invention;
FIG. 6 is a data expansion diagram used in an experimental example provided in the third embodiment of the present invention;
fig. 7 is an ROC curve comparing performance of the lightweight speech keyword recognition network used in the experimental example provided in the third embodiment of the present invention with that of other networks;
fig. 8 is a schematic structural diagram of a lightweight speech keyword recognition system according to a fourth embodiment of the present invention; and
fig. 9 is a schematic structural diagram of a speech keyword recognition device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
fig. 1 shows a structure of a lightweight speech keyword recognition network according to an embodiment of the present invention, and for convenience of description, only a part related to the embodiment of the present invention is shown.
The lightweight speech keyword recognition network is an improved Time-Delay Neural Network (TDNN), referred to as TDNN-SE, and comprises a TDNN downsampling layer (TDNN-SUB), an SE Block (Squeeze-and-Excitation Block), a first TDNN layer, a second TDNN layer, a Global Average Pooling layer and a Softmax layer which are connected in sequence and correspond to Layer 1 through Layer 6 in Fig. 1; the first TDNN layer and the second TDNN layer have the same structure. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected and outputs the downsampled features to the SE Block. The SE Block performs a squeeze-excitation operation and a reweighting operation on the downsampled features to obtain reweighted features, sequentially applies activation processing and normalization processing to the reweighted features, and outputs the normalized features to the first TDNN layer. The global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers (the first TDNN layer and the second TDNN layer) and outputs the resulting global dimension reduction features to the Softmax layer, and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result.
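For illustration, the following PyTorch sketch assembles the six layers described above into one model. All sizes (40-dimensional MFCC input, 64 channels, the kernel widths, 12 output classes) are assumptions chosen for the example rather than values fixed by this embodiment; the SE Block is condensed to its essentials here (it is detailed later in this embodiment), and SiLU is used as the Swish activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SE(nn.Module):
    """Condensed squeeze-excitation over the channels of (B, C, T) features,
    followed by Swish (SiLU) activation and LayerNorm."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # squeeze C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)   # restore C/r -> C
        self.norm = nn.LayerNorm(channels)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(u.mean(dim=2)))))
        x = F.silu(u * s.unsqueeze(-1))                 # reweight, then Swish
        return self.norm(x.transpose(1, 2)).transpose(1, 2)

class TDNNSE(nn.Module):
    """Layers 1-6: TDNN downsampling, SE Block, two TDNN layers,
    global average pooling, and Softmax classification."""
    def __init__(self, n_mfcc: int = 40, channels: int = 64, num_classes: int = 12):
        super().__init__()
        self.tdnn_sub = nn.Conv1d(n_mfcc, channels, kernel_size=5, stride=2)  # Layer 1
        self.se = SE(channels)                                                # Layer 2
        self.tdnn1 = nn.Conv1d(channels, channels, kernel_size=3)             # Layer 3
        self.tdnn2 = nn.Conv1d(channels, channels, kernel_size=3)             # Layer 4
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mfcc, T) spliced MFCC frames
        x = self.se(self.tdnn_sub(x))
        x = F.relu(self.tdnn1(x))
        x = F.relu(self.tdnn2(x))
        x = x.mean(dim=2)                          # Layer 5: global average pooling over time
        return F.log_softmax(self.fc(x), dim=-1)   # Layer 6: Softmax classifier

model = TDNNSE()
log_probs = model(torch.randn(8, 40, 99))     # e.g. a batch of 8 utterances of 99 frames
print(log_probs.shape)                        # torch.Size([8, 12])
```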
In the embodiment of the present invention, the speech to be detected is usually continuous speech. The acoustic features input to the TDNN downsampling layer may be Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF), Discrete Wavelet Transform (DWT) features, Perceptual Linear Prediction (PLP) features, and the like; preferably, the input acoustic features are the Mel-Frequency Cepstral Coefficients (MFCC) extracted from the audio to be detected.
Fig. 2 is an exemplary diagram of the MFCC extraction process. The speech signal to be detected (input speech signal) is first pre-emphasized, flattening its spectrum through a high-pass filter; the signal is then framed and windowed, dividing the whole signal into, for example, 25 ms frames with a 10 ms shift, and the time-domain signal is converted into a frequency-domain signal using the fast Fourier transform (FFT):

$$X(k)=\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N},\quad 0\le k\le N-1$$

where $x(n)$ denotes the speech signal to be detected and $N$ denotes the number of Fourier transform points. After passing through a Mel filter bank, a logarithm operation and a discrete cosine transform (DCT) are applied to obtain the MFCC coefficients.
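As a concrete sketch of this pipeline, the following Python snippet extracts 40-dimensional MFCCs with 25 ms frames and a 10 ms shift (the configuration used in the experiments of embodiment three) using the librosa library; the file name, sampling rate and pre-emphasis coefficient of 0.97 are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """MFCC extraction: pre-emphasis, framing/windowing, FFT, Mel filter bank,
    logarithm and DCT (librosa performs everything after the pre-emphasis)."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])    # pre-emphasis high-pass filter
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                    # 25 ms frame length
        hop_length=int(0.010 * sr),               # 10 ms frame shift
        window="hamming",
    )
    return mfcc.T                                 # (num_frames, n_mfcc)

# e.g. features = extract_mfcc("keyword.wav")    # hypothetical file; shape (T, 40)
```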
Fig. 3 is an exemplary diagram of the TDNN downsampling process. $T_{in}$ and $D_{in}$ denote the length and dimension of the input, and $w$ denotes the window length of the TDNN. During downsampling, $w$ consecutive features are spliced together and fed into the TDNN, which outputs one vector; the TDNN then moves $k$ steps (typically $k$ is set to 1, i.e., frame by frame) to obtain a representation at each position, so that the output length is $T_{out}=\lceil (T_{in}-w+1)/k \rceil$, and $D_{out}$ denotes the output dimension. Since the context information contained in adjacent time points largely overlaps, keeping only part of the connections by downsampling approximates the effect of the original model while greatly reducing its computational cost.
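Since a TDNN over spliced frames is equivalent to a one-dimensional convolution along time, the downsampling can be sketched as a strided Conv1d; the sizes below, including the stride k = 2 used to show an actual downsampling effect, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# TDNN downsampling as a strided 1-D convolution over time: w consecutive frames
# are spliced and mapped to one output vector, then the window moves k steps,
# giving T_out = ceil((T_in - w + 1) / k) output positions.
d_in, d_out, w, k = 40, 64, 5, 2           # illustrative sizes; k = 1 would be frame by frame
tdnn_sub = nn.Conv1d(d_in, d_out, kernel_size=w, stride=k)

x = torch.randn(1, d_in, 99)               # (batch, D_in, T_in): 99 spliced MFCC frames
y = tdnn_sub(x)
print(y.shape)                             # torch.Size([1, 64, 48]) = (batch, D_out, T_out)
```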
Fig. 4 is a schematic structural diagram of the SE Block. Let the original input feature map of the SE Block be X; the input is first transformed (an Inception-style transformation in the original SE formulation) to obtain a feature map U of size H × W × C, where H, W and C denote the height, width and number of channels of U. A global average pooling operation, i.e., the squeeze operation, is then applied to U to obtain a 1 × 1 × C feature map Z; that is, channel-level statistics are computed by global average pooling:

$$z_c=F_{sq}(u_c)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_c(i,j)$$

where $F_{sq}$ denotes the global average pooling operation, $u_c(i,j)$ denotes the element at row $i$, column $j$ of the $c$-th channel of the feature map U, and $z_c$ denotes the $c$-th element of the feature map Z.
To exploit the information obtained by the squeeze operation, a simple sigmoid activation function is chosen as the gating mechanism. To limit the computation of the network and aid generalization, the nonlinear part realizes the excitation operation through two fully connected layers: the output of the squeeze operation passes through a first fully connected layer, a ReLU function, a second fully connected layer and a sigmoid function, connected in sequence, to obtain the weight of each feature channel. The first fully connected layer compresses the C channels into C/r channels and the second restores the C channels, where r is the compression ratio, used mainly to reduce the dimensionality and thus the computation; preferably, r = 16. The excitation operation is formulated as:

$$s=F_{ex}(z,W)=\sigma(g(z,W))=\sigma(W_2\,\delta(W_1 z))$$

where $F_{ex}$ denotes the weight computation operation; $W_1$ and $W_2$ denote the fully connected layers and $W$ the group they form together; $W_1 z$ denotes the first fully connected operation, $\delta$ denotes the ReLU activation function, $W_2\delta(W_1 z)$ denotes the second fully connected operation, $\sigma$ denotes the sigmoid activation function, and $s$ denotes the weight vector. Through the $\sigma$ operation, the weight vector $s$, with one weight per channel of the feature map Z, is finally obtained.
After the weight vector s corresponding to the channels of the feature map Z is obtained, a reweighting (weight assignment) operation is performed to obtain the reweighted features: the weight obtained by the excitation operation is multiplied onto each channel of the corresponding input feature U, and the result is output. The formula is:

$$\tilde{x}_c=F_{scale}(u_c,s_c)=s_c\cdot u_c$$

where $F_{scale}$ denotes the reweighting (weight assignment) operation, $u_c$ denotes the $c$-th channel of the feature map U, and $s_c$ denotes the element at the $c$-th position of the weight vector $s$; the multiplication scales every element of the channel $u_c$ by $s_c$. By controlling this scale, important features are enhanced and unimportant ones are weakened, giving the extracted features stronger directivity.
After the reweighted feature map is obtained, it is preferably activated and normalized by the nonlinear activation function Swish and by LayerNorm, respectively, to improve the expressive power of the model. Finally, the normalized feature map is output to the first TDNN layer.
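The squeeze, excitation and reweighting equations above, followed by the Swish and LayerNorm steps, can be sketched directly in PyTorch. The module below follows the H × W × C formulation of Fig. 4 (for the speech network itself, the time dimension plays the role of H × W, as in the condensed SE module shown earlier); the class name and dimensions are illustrative assumptions, with the compression ratio r = 16 as preferred above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-excitation exactly as in the equations above, for an
    H x W x C feature map U stored as a (B, C, H, W) tensor."""
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(c, c // r)    # first fully connected layer: C -> C/r
        self.fc2 = nn.Linear(c // r, c)    # second fully connected layer: C/r -> C

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                  # squeeze: z_c = mean of u_c
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: sigma(W2 d(W1 z))
        return u * s.view(b, c, 1, 1)                           # reweight: x_c = s_c * u_c

se = SEBlock(c=64)
u = torch.randn(2, 64, 8, 8)
x = se(u)                                  # reweighted features, same shape as u
x = F.silu(x)                              # Swish activation
x = nn.LayerNorm(x.shape[1:])(x)           # LayerNorm before the first TDNN layer
print(x.shape)                             # torch.Size([2, 64, 8, 8])
```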
The last layer of the lightweight speech keyword recognition network is the Softmax layer, which converts the multi-class output values into corresponding probabilities and selects the class with the highest probability as the prediction, outputting the keyword recognition result. The cross-entropy loss used with Softmax is:

$$L=-\sum_{i} t_i \ln y_i$$

where $t_i$ denotes the ground-truth value and $y_i$ the probability output by Softmax for class $i$. For an input sample, only the neuron corresponding to its correct class has $t_i=1$; by the formula above, the higher the probability that neuron outputs, the smaller the resulting loss, and vice versa.
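As a quick sanity check of this formula, the following sketch shows that, with one-hot targets, the hand-written loss coincides with PyTorch's built-in cross entropy; the batch and class sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 12)              # raw network outputs for 8 samples, 12 classes
targets = torch.randint(0, 12, (8,))     # ground-truth class indices

# L = -sum_i t_i * ln(y_i): with one-hot t, only the correct class's log-probability
# contributes, so a confident correct prediction yields a small loss.
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(8), targets].mean()
loss_builtin = F.cross_entropy(logits, targets)   # the same, fused and numerically stable
assert torch.allclose(loss_manual, loss_builtin)
```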
The lightweight speech keyword recognition network provided by the embodiment of the invention comprises a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected and outputs the downsampled features to the SE Block; the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation and normalization, and outputs the normalized features to the first TDNN layer; the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers and outputs the resulting global dimension reduction features to the Softmax layer; and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result. A lightweight speech keyword recognition network is thus constructed from TDNN layers and an SE Block, improving accuracy while greatly reducing hardware resource consumption, so that the network can run stably and smoothly on low-power devices.
Example two:
fig. 5 shows an implementation flow of the lightweight speech keyword recognition method according to the second embodiment of the present invention, and for convenience of description, only the relevant parts according to the second embodiment of the present invention are shown, which are detailed as follows:
In step S501, the TDNN downsampling layer downsamples the extracted acoustic features of the audio to be detected and outputs the downsampled features to the SE Block.
The method is implemented based on a trained lightweight speech keyword recognition network. The lightweight speech keyword recognition network is an improved time-delay neural network comprising a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer connected in sequence. For the specific implementation of the lightweight speech keyword recognition network, refer to the description of the first embodiment.
Before the lightweight speech keyword recognition network is used for speech keyword recognition, it must be trained. Preferably, a training data set is obtained and each training sample in it is augmented to obtain an augmented training data set, which is then used to train the lightweight speech keyword recognition network for a better training effect. When augmenting each training sample, it is further preferable to randomly replace a preset proportion of the audio signal with Gaussian white noise, generating a preset number of corresponding masked samples, and to form the augmented training data set from the masked samples together with the original ones, so as to enlarge the training set and further improve the training effect of the network. The preset proportion and the preset number can be set by the user; for example, the preset proportion is a random value within 10%-20% and the preset number is 5.
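A minimal sketch of this masking augmentation on waveform tensors follows; the 10%-20% proportion and the five copies per sample match the preferred values above, while the contiguous placement of the masked segment and the noise scale are illustrative assumptions.

```python
import torch

def mask_augment(wave: torch.Tensor, n_copies: int = 5,
                 lo: float = 0.10, hi: float = 0.20) -> list:
    """Create masked copies of one training utterance by replacing a random
    10%-20% segment of the signal with Gaussian white noise."""
    copies = []
    for _ in range(n_copies):
        x = wave.clone()
        seg = int(len(x) * (lo + (hi - lo) * torch.rand(1).item()))
        start = torch.randint(0, len(x) - seg + 1, (1,)).item()
        x[start:start + seg] = torch.randn(seg) * x.std()   # white noise at the signal's scale
        copies.append(x)
    return copies

# e.g. the augmented training set is the originals plus mask_augment(w) for each waveform w
wave = torch.randn(16000)                 # a 1-second utterance at 16 kHz (placeholder)
print(len(mask_augment(wave)))            # 5 masked copies
```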
In the embodiment of the present invention, the speech to be detected is usually continuous speech. The acoustic features input to the TDNN downsampling layer may be Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF), Discrete Wavelet Transform (DWT) features, Perceptual Linear Prediction (PLP) features, and the like; preferably, the input acoustic features are the Mel-Frequency Cepstral Coefficients (MFCC) extracted from the audio to be detected. For the MFCC extraction process, refer to the description in the first embodiment; it is not repeated here.
In step S502, the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation processing and normalization processing to the re-weighted features, and outputs the normalized features to the first TDNN layer.
In step S503, the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers, and outputs the global dimension reduction features obtained after the global average pooling operation to the Softmax layer.
In step S504, the Softmax layer classifies the input global dimension reduction features and outputs a keyword recognition result.
In the embodiment of the present invention, the specific implementation manner of steps S502-S504 can refer to the related description in the first embodiment, which is not repeated herein.
In the embodiment of the invention, the TDNN downsampling layer of the lightweight speech keyword recognition network downsamples the extracted acoustic features of the audio to be detected and outputs the downsampled features to the SE Block; the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation and normalization, and outputs the normalized features to the first TDNN layer; the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers and outputs the resulting global dimension reduction features to the Softmax layer; and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result. Speech keyword recognition is thus performed through a lightweight network constructed from TDNN layers and an SE Block, improving accuracy while greatly reducing hardware resource consumption, so that the network can run stably and smoothly on low-power devices.
Example three:
this example combines an experimental example to further verify the method described in example three:
(1) data set used in this example
The Google Speech Commands dataset (P. Warden, "Launching the Speech Commands dataset," Google Research Blog, 2017) is an English speech keyword dataset containing 64,752 recordings of 30 words; each recording is 1 second long and contains one word. Ten words ("yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go") are selected as keywords, and the rest are used as fillers and labeled "unknown". In this example, the dataset was divided into a training set, a validation set, and a test set at a ratio of 8:1:1.
The Biaobei Chinese Command Dataset is a Chinese speech keyword dataset containing recordings of 113 keywords (including commands such as "turn on the TV" and "start navigation") spoken by 150 speakers. Each utterance is about 3-4 seconds long, for 33,900 utterances in total; the audio is single-channel with a sampling rate of 44.1 kHz. In this experimental example, the utterances of 30 words are randomly divided into a training set, a validation set and a test set at a ratio of 8:1:1.
(2) Description of the experiments
Both the experiments and the tests in this experimental example use the PyTorch deep learning framework with the Adam optimization strategy, with momentum parameters set to β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 0.001, and the learning rate is halved if there is no significant improvement (10%) in performance on the validation set. The weight decay parameter defaults to 0, the momentum parameter is 0.9, the training batch size is 32, and the total number of iterations is 50. This experimental example uses 40-dimensional MFCC coefficients, extracted every 10 ms with a frame length of 25 ms, as acoustic features, and 99 consecutive frames are spliced together and fed into the network. All experiments were performed on a machine with 4 NVIDIA Titan XP GPUs.
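The optimization setup just described can be sketched as follows; the model is a placeholder, and ReduceLROnPlateau stands in for the rule of halving the learning rate when validation performance stops improving.

```python
import torch

model = torch.nn.Linear(40, 12)    # placeholder for the TDNN-SE model of embodiment one
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), weight_decay=0)
# Halve the learning rate when the monitored validation metric stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

batch_size, num_epochs = 32, 50
for epoch in range(num_epochs):
    # ... one training pass over the augmented data in batches of 32 ...
    val_accuracy = 0.0                 # placeholder: evaluate on the validation set here
    scheduler.step(val_accuracy)       # adjust the learning rate from validation performance
```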
The experiments train and test on the Google Speech Commands and Biaobei Chinese Command Dataset respectively. The training data are first augmented as shown in Fig. 6: for each training sample (real sample spectrogram), 5 corresponding masked samples (masked sample spectrograms) are generated by randomly replacing 10%-20% of the audio signal with Gaussian white noise, and these are trained together with the original data. In this experimental example, classification accuracy is used as the main performance index, computed as:
$$\text{Acc}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(\hat{y}_i=y_i)$$

where $\hat{y}_i$ is the predicted label and $y_i$ is the ground-truth label; $\mathbb{1}(\hat{y}_i=y_i)$ is 1 if the two are equal and 0 otherwise, and $N$ is the total number of samples.
In addition, this experimental example also plots Receiver Operating Characteristic (ROC) curves, in which the x and y axes show the False Alarm Rate (FAR) and the False Rejection Rate (FRR), respectively. For a given sensitivity threshold (defined as the minimum probability at which a class is considered a positive sample during evaluation), FAR and FRR represent the probabilities of obtaining false positives and false negatives, respectively. By scanning the sensitivity interval [0.0, 1.0], the curve for each keyword can be computed; the curves are then averaged vertically to generate an overall curve for the particular model. The smaller the area under the curve (AUC), the better.
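A minimal sketch of how one keyword's FAR/FRR pairs can be computed over the sensitivity interval [0.0, 1.0] follows; the scores and labels arrays (per-sample keyword probabilities and binary ground truth) are illustrative assumptions.

```python
import numpy as np

def far_frr_curve(scores: np.ndarray, labels: np.ndarray, n_points: int = 101):
    """Sweep the sensitivity threshold over [0.0, 1.0] and return (FAR, FRR) pairs.
    FAR: fraction of negative samples scored at or above the threshold (false alarms).
    FRR: fraction of positive samples scored below the threshold (false rejections)."""
    thresholds = np.linspace(0.0, 1.0, n_points)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    return far, frr

# e.g. per-keyword curves computed this way are averaged vertically to form the
# overall ROC curve; a smaller area under the curve (AUC) is better.
scores = np.random.rand(100)               # placeholder keyword probabilities
labels = np.random.randint(0, 2, 100)      # placeholder binary ground truth
far, frr = far_frr_curve(scores, labels)
```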
(3) Results of the experiment
To evaluate the effectiveness of the present invention, this experimental example conducted keyword recognition experiments on each of the above datasets, comparing against currently mainstream keyword recognition methods, including trad-fpool3 and tpool2 (T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Interspeech, 2015, pp. 1478-1482), res15 (R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5484-5488), and the TDNN-SWSA method of Bai et al. (Y. Bai, J. Yi, J. Tao, et al., "A time delay neural network with shared weight self-attention for small-footprint keyword spotting," in Interspeech, 2019). The experimental results are shown in Table 1: the accuracy of the method on the test sets is clearly superior to the other methods. In particular, on the Chinese keyword test set, the method of this experimental example reaches 97.2% accuracy with only 13k parameters, a 5.8% improvement over the TDNN-SWSA model with a comparable parameter count; on the Google Speech Commands English keyword set, the model of this experimental example (TDNN-SE) improves accuracy by 0.6% over res15, the best-performing model of the ResNet series, with 17 times fewer parameters and a computational cost of 0.41M in the experiment, indicating that the method of this experimental example is better suited to running on low-power devices.
Table 1: comparison of keyword recognition accuracy and parameter counts for the above methods (the table is rendered as an image in the original document).
Furthermore, this experimental example plots the ROC curves shown in Fig. 7: at a FAR of 0.005, the false rejection rate (FRR) of TDNN-SWSA is 0.08, while that of this experimental example (TDNN-SE + mask) is 0.027, a 66% relative reduction compared with TDNN-SWSA, indicating that the performance of the method of this experimental example is more stable.
This experimental example can detect specific keywords in a continuous speech stream on low-resource devices; through the TDNN-SE structure and the Swish activation function, it greatly reduces hardware resource consumption while improving accuracy, so that the model can run stably and smoothly on low-power devices.
Example four:
fig. 8 shows a structure of a lightweight speech keyword recognition system according to a fourth embodiment of the present invention, and for convenience of description, only the relevant portions according to the fourth embodiment of the present invention are shown, which include:
the lightweight speech keyword recognition system includes a feature extraction module 81 and a lightweight speech keyword recognition network 82. The feature extraction module 81 is configured to extract a mel-frequency cepstrum coefficient of the audio to be detected, input the extracted mel-frequency cepstrum coefficient into a lightweight speech keyword recognition network as an acoustic feature of the audio to be detected for speech keyword recognition, and the lightweight speech keyword recognition network 82 is configured to recognize a keyword of the speech to be detected according to the input acoustic feature of the audio to be detected, and output a keyword recognition result.
The specific implementation manner of the feature extraction module 81 for extracting the mel-frequency cepstrum coefficient of the audio to be detected and the specific implementation manner of the lightweight speech keyword recognition network 82 may refer to the description of the first embodiment, and are not described herein again.
In the embodiment of the present invention, each module of the lightweight speech keyword recognition system may be implemented by corresponding hardware or software units, and each module may be an independent software module or an independent hardware module, or may be integrated into a software module or a hardware module, which is not limited herein.
Example five:
fig. 9 shows a structure of a speech keyword recognition apparatus according to a fifth embodiment of the present invention, and for convenience of description, only a part related to the fifth embodiment of the present invention is shown.
The speech keyword recognition apparatus 9 of the embodiment of the present invention includes a processor 90, a memory 91, and a computer program 92 stored in the memory 91 and executable on the processor 90. The processor 90, when executing the computer program 92, implements the steps of the above-described method embodiments, such as the steps S501 to S504 shown in fig. 5. Alternatively, processor 90, when executing computer program 92, implements the functionality of the various layers in the lightweight speech keyword recognition network described above, such as the functionality of layer 1 through layer 6 shown in fig. 1.
In the embodiment of the invention, the provided lightweight speech keyword recognition network comprises a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected and outputs the downsampled features to the SE Block; the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation and normalization, and outputs the normalized features to the first TDNN layer; the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers and outputs the resulting global dimension reduction features to the Softmax layer; and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result. A lightweight speech keyword recognition network is thus constructed from TDNN layers and an SE Block, improving accuracy while greatly reducing hardware resource consumption, so that the network can run stably and smoothly on low-power devices.
Example six:
in an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiments, such as steps S501 to S504 shown in fig. 5. Alternatively, the computer program, when executed by the processor, implements the functions of the layers in the lightweight speech keyword recognition network, such as the functions of layer 1 to layer 6 shown in fig. 1.
In the embodiment of the invention, the provided lightweight speech keyword recognition network comprises a TDNN downsampling layer, an SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence. The TDNN downsampling layer downsamples the acoustic features of the input audio to be detected and outputs the downsampled features to the SE Block; the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input downsampled features to obtain re-weighted features, sequentially applies activation and normalization, and outputs the normalized features to the first TDNN layer; the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers and outputs the resulting global dimension reduction features to the Softmax layer; and the Softmax layer classifies the input global dimension reduction features and outputs the keyword recognition result. A lightweight speech keyword recognition network is thus constructed from TDNN layers and an SE Block, improving accuracy while greatly reducing hardware resource consumption, so that the network can run stably and smoothly on low-power devices.
The computer readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, a recording medium, such as a ROM/RAM, a magnetic disk, an optical disk, a flash memory, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A lightweight speech keyword recognition network is characterized in that the lightweight speech keyword recognition network is an improved time-delay neural network, and comprises a TDNN down-sampling layer, a SE Block, a first TDNN layer, a second TDNN layer, a global average pooling layer and a Softmax layer which are connected in sequence,
the TDNN downsampling layer is used for downsampling the acoustic characteristics of the input audio to be detected and outputting the downsampled characteristics obtained after downsampling to the SE Block;
the SE Block is used for performing a squeeze-excitation operation and a re-weighting operation on the input down-sampling features to obtain re-weighted features, sequentially performing activation processing and normalization processing on the re-weighted features, and outputting the normalized features to the first TDNN layer;
the global average pooling layer is used for performing a global average pooling operation on the features processed by the two TDNN layers and outputting the global dimension reduction features obtained after the global average pooling operation to the Softmax layer;
and the Softmax layer is used for classifying the input global dimension reduction features and outputting a keyword identification result.
2. The network of claim 1, wherein the activation process is performed using a Swish activation function and the normalization process is performed using LayerNorm.
3. The network of claim 1, wherein the SE Block performs the excitation operation on the down-sampling features with a compression ratio of 16.
4. A lightweight speech keyword recognition method for a lightweight speech keyword recognition network according to any one of claims 1 to 3, wherein the method comprises:
the TDNN downsampling layer downsamples the extracted acoustic features of the audio to be detected and outputs the downsampled features obtained after downsampling to the SE Block;
the SE Block performs a squeeze-excitation operation and a re-weighting operation on the input down-sampling features to obtain re-weighted features, sequentially performs activation processing and normalization processing on the re-weighted features, and outputs the normalized features to the first TDNN layer;
the global average pooling layer performs a global average pooling operation on the features processed by the two TDNN layers, and outputs the global dimension reduction features obtained after the global average pooling operation to the Softmax layer;
and the Softmax layer classifies the input global dimension reduction features and outputs a keyword identification result.
5. The method of claim 4, wherein the TDNN downsampling layer, before downsampling the extracted acoustic features of the audio to be detected, comprises:
and extracting the Mel frequency cepstrum coefficient of the audio to be detected, taking the extracted Mel frequency cepstrum coefficient as the acoustic characteristic of the audio to be detected, and inputting the acoustic characteristic into the TDNN down-sampling layer.
6. The method of claim 4, wherein before the TDNN downsampling layer downsamples the extracted acoustic features of the audio to be detected, the method further comprises:
acquiring a training data set, and performing data amplification on each training data in the training data set to obtain a training data set after data amplification;
and training the lightweight speech keyword recognition network by using the training data set after data augmentation to obtain the trained lightweight speech keyword recognition network.
7. The method of claim 6, wherein the step of data augmenting each training data in the set of training data comprises:
for each training data, audio signals with a preset proportion are randomly replaced by Gaussian white noise, and a preset number of corresponding mask data are generated;
and forming a training data set after the data augmentation according to the mask data and the training data.
8. A light-weight speech keyword recognition system is characterized by comprising a feature extraction module and the light-weight speech keyword recognition network according to any one of claims 1 to 3, wherein the feature extraction module is used for extracting a Mel frequency cepstrum coefficient of an audio to be detected, and inputting the extracted Mel frequency cepstrum coefficient into the light-weight speech keyword recognition network as an acoustic feature of the audio to be detected for speech keyword recognition.
9. A speech keyword recognition device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of the claims 4 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 4 to 7.
CN202110228328.9A 2021-03-01 2021-03-01 Lightweight speech keyword recognition network, method, device and storage medium Active CN112599123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110228328.9A CN112599123B (en) 2021-03-01 2021-03-01 Lightweight speech keyword recognition network, method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110228328.9A CN112599123B (en) 2021-03-01 2021-03-01 Lightweight speech keyword recognition network, method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112599123A CN112599123A (en) 2021-04-02
CN112599123B true CN112599123B (en) 2021-06-22

Family

ID=75207844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110228328.9A Active CN112599123B (en) 2021-03-01 2021-03-01 Lightweight speech keyword recognition network, method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112599123B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127438B1 (en) * 2017-08-07 2018-11-13 Standard Cognition, Corp Predicting inventory events using semantic diffing
CN110310666B (en) * 2019-06-27 2021-07-23 成都潜在人工智能科技有限公司 Musical instrument identification method and system based on SE convolutional network
CN110544485A (en) * 2019-09-27 2019-12-06 慧言科技(天津)有限公司 method for performing far-field speech dereverberation by using SE-ED network of CNN
CN111383628B (en) * 2020-03-09 2023-08-25 第四范式(北京)技术有限公司 Training method and device of acoustic model, electronic equipment and storage medium
CN112232144A (en) * 2020-09-27 2021-01-15 西北工业大学 Personnel overboard detection and identification method based on improved residual error neural network
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network

Also Published As

Publication number Publication date
CN112599123A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
An et al. Deep CNNs with self-attention for speaker identification
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
Cai et al. Utterance-level end-to-end language identification using attention-based CNN-BLSTM
Cutajar et al. Comparative study of automatic speech recognition techniques
Myer et al. Efficient keyword spotting using time delay neural networks
Xu et al. Depthwise separable convolutional resnet with squeeze-and-excitation blocks for small-footprint keyword spotting
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111653267A (en) Rapid language identification method based on time delay neural network
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Benelli et al. A low power keyword spotting algorithm for memory constrained embedded systems
Liu Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition
Ramesh et al. Self-Supervised Phonotactic Representations for Language Identification.
CN112599123B (en) Lightweight speech keyword recognition network, method, device and storage medium
Tawaqal et al. Recognizing five major dialects in Indonesia based on MFCC and DRNN
Passricha et al. End-to-end acoustic modeling using convolutional neural networks
JP2938866B1 (en) Statistical language model generation device and speech recognition device
Sen et al. A novel bangla spoken numerals recognition system using convolutional neural network
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN114298019A (en) Emotion recognition method, emotion recognition apparatus, emotion recognition device, storage medium, and program product
Wazir et al. Deep learning-based detection of inappropriate speech content for film censorship
CN112233659A (en) Quick speech recognition method based on double-layer acoustic model
Dokuz et al. A Review on Deep Learning Architectures for Speech Recognition
Xu et al. Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection
Zhang et al. An Efficient Temporal Model for Small-Footprint Keyword Spotting

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant