CN115457953A - Neural network multi-command word recognition method and system based on wearable device - Google Patents
- Publication number
- CN115457953A (application CN202210888530.9A)
- Authority
- CN
- China
- Prior art keywords
- command word
- layer
- voice
- gru
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/225—Feedback of the input speech
Abstract
The invention provides a neural network multi-command word recognition method and system based on a wearable device, relating to the technical field of audio processing. Neural network technology is used and various noises are mixed into the training data, improving recognition accuracy and robustness. The MFCC features of the speech serve as the network input; in the first layer of the network, a CNN performs feature extraction, and because the CNN shares weights the number of network parameters can be greatly reduced. A GRU layer is then added, which makes full use of the information between preceding frames in the speech segment; the inter-frame features obtained in this step improve the overall recognition rate and recognition efficiency of the system. A VAD voice detection module performs voice detection, and the multi-command word detection algorithm does not run when no speech is present, reducing system power consumption. Resetting the GRU state keeps it consistent with the training condition, ensuring the recognition accuracy and robustness of the algorithm.
Description
Technical Field
The invention relates to the technical field of audio processing, in particular to a neural network multi-command word recognition method and system based on wearable equipment.
Background
The multi-command word recognition algorithm is one of the algorithms commonly used in intelligent voice applications and is widely applied in intelligent voice human-computer interaction. In voice-based human-computer interaction, a voice instruction issued by a person is transmitted into the machine through a microphone; inside the machine, the multi-command word recognition algorithm recognizes specific command words and, when one is recognized, feeds a signal back so that the machine can make the corresponding interactive response.
In wearable-device-based multi-command word recognition, the device communicates with a mobile phone through a Bluetooth module. With the algorithm integrated on the wearable device, real-time and accurate multi-command word recognition, and thus human-computer interaction, can be achieved without a network connection.
However, existing multi-command word recognition schemes suffer from poor robustness and low detection accuracy, recognize human voice signals poorly in the presence of noise, and remain in standby at all times, resulting in high system energy consumption.
Therefore, it is necessary to provide a neural network multi-command word recognition method and system based on a wearable device to solve the above technical problems.
Disclosure of Invention
In order to solve one of the above technical problems, the present invention provides a neural network multi-command word recognition method based on a wearable device. Microphone signals are collected by the wearable device and converted into a digital input signal stream by an analog-to-digital converter. The digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered. After the multi-command word recognition algorithm has been reset, speech recognition starts.
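The VAD-gated control flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; `vad_is_speech`, `recognize_frame`, and `reset_gru_state` are placeholder names for the VAD decision, the recognizer, and the GRU state reset.

```python
def process_stream(frames, vad_is_speech, recognize_frame, reset_gru_state):
    """Run the command-word recognizer only while the VAD flag is active."""
    vad_flag = False
    results = []
    for frame in frames:
        if vad_is_speech(frame):
            if not vad_flag:       # first speech frame after silence:
                reset_gru_state()  # reset GRU state to match the training condition
                vad_flag = True
            results.append(recognize_frame(frame))
        else:
            vad_flag = False       # recognizer idle while no speech -> lower power
    return results
```

Note that the GRU state is reset exactly once per detected utterance, at the first active frame, as the patent describes.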
Specifically, the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step.
Specifically, the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input feature, and performing MFCC feature extraction to obtain an MFCC feature corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation.
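The six MFCC stages listed above (pre-emphasis, framing and windowing, FFT, Mel filtering, logarithm, DCT) can be sketched in NumPy/SciPy as below. This is a minimal sketch under assumptions the patent does not state: a 16 kHz sample rate and 40 Mel filters spanning 0 Hz to the Nyquist frequency. Without padding, a 1.1-second clip yields 67 frames here; the [68, 40] dimension quoted later presumably pads the final frame.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=512, frame_shift=256,
         n_fft=512, n_mels=40, n_ceps=40, preemph=0.97):
    # 1) pre-emphasis with coefficient 0.97
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) framing (32 ms frames, 16 ms shift at 16 kHz) + Hamming window
    n_frames = 1 + max(0, (len(sig) - frame_len) // frame_shift)
    frames = np.stack([sig[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 3) FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) triangular Mel filter bank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 5) log of filter-bank energies, 6) DCT -> cepstral coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is the 40-dimensional MFCC vector of one frame, matching the per-frame feature width used by the CNN layer.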
Specifically, the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order.
Specifically, the GRU layer sequence frame information extraction step: inter-frame information is extracted from the sequence frames through the GRU layer to obtain inter-frame information features.
Specifically, the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
As a further solution, the pre-emphasis coefficient of the voice MFCC feature extraction step is chosen as 0.97.
As a further solution, the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed using a Hamming window.
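With these framing parameters, the frame count of a 1.1-second clip can be checked arithmetically. This small sketch assumes a 16 kHz sample rate (not stated in the patent) and that the final partial frame is padded, which reproduces the 68-frame time axis of the CNN input dimension quoted below.

```python
import math

sr = 16000                       # assumed sample rate
frame_len   = int(0.032 * sr)    # 32 ms frame  -> 512 samples
frame_shift = int(0.016 * sr)    # 16 ms shift  -> 256 samples
n_samples   = int(1.1 * sr)      # 1.1 s clip   -> 17600 samples

# With the last partial frame padded, a 1.1 s clip yields 68 frames.
n_frames = math.ceil((n_samples - frame_len) / frame_shift) + 1
print(n_frames)  # 68
```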
As a further solution, the voice MFCC feature extraction step performs fast fourier transform by FFT processing; filtering the sub-band by Mel filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
As a further solution, the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]. The input of the CNN layer is a feature map of dimension [68, 40], where 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame. After the convolution operation, the feature map size is [49, 18, 16].
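The quoted output size follows from the standard valid-convolution formula, assuming no padding (the patent does not state the padding mode, but these numbers are only consistent with valid convolution):

```python
def conv_out(size, kernel, stride):
    # output length of a 'valid' (unpadded) convolution along one axis
    return (size - kernel) // stride + 1

t = conv_out(68, 20, 1)   # time axis:      kernel 20, stride 1
f = conv_out(40,  5, 2)   # frequency axis: kernel 5,  stride 2
print(t, f)               # 49 18
```

With 16 kernels, stacking the 16 output channels gives the stated feature map size [49, 18, 16].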
As a further solution, the reset of the multi-command word recognition algorithm is a reset of the GRU layer state. The GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
As a further solution, the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
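A single GRU step following these formulas can be sketched in NumPy. Note the final blend follows the patent's convention, H_t = H_{t-1}·Z_t + H_tilda·(1 − Z_t), which swaps the roles of Z_t and (1 − Z_t) relative to the common textbook GRU; the weights below are random placeholders, not trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU step per the patent's formulas.
    W maps 'xz','hz','xr','hr','xh','hh' to weight matrices;
    b maps 'z','r','h' to bias vectors."""
    z = sigmoid(x_t @ W['xz'] + h_prev @ W['hz'] + b['z'])            # update gate Z_t
    r = sigmoid(x_t @ W['xr'] + h_prev @ W['hr'] + b['r'])            # reset gate R_t
    h_tilda = np.tanh(x_t @ W['xh'] + (h_prev * r) @ W['hh'] + b['h'])
    return h_prev * z + h_tilda * (1.0 - z)                           # H_t

# [49, 288] CNN output -> unidirectional GRU with 44 units
rng = np.random.default_rng(0)
d_in, d_h = 288, 44
W = {k: rng.standard_normal((d_in if k[0] == 'x' else d_h, d_h)) * 0.01
     for k in ('xz', 'hz', 'xr', 'hr', 'xh', 'hh')}
b = {k: np.zeros(d_h) for k in ('z', 'r', 'h')}

h = np.zeros(d_h)                          # state reset before each utterance
for x_t in rng.standard_normal((49, d_in)):
    h = gru_step(x_t, h, W, b)
print(h.shape)  # (44,)
```

Initializing `h` to zeros before the loop corresponds to the first-frame state reset described elsewhere in the document.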
As a further solution, in the DENSE layer command word classification step, the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], where the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
As a further solution, the network training framework of the DENSE layer is based on the TensorFlow framework, with a batch size of 1024 and 50 training epochs. The data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly. The network output of the DENSE layer is the probability of the corresponding category; a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
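The 0.9-threshold decision rule can be sketched as follows. This is an illustrative sketch, not the patent's code; the command labels are placeholders, and the softmax over raw scores is an assumption about how the class probabilities are produced.

```python
import numpy as np

# Placeholder labels: 9 command words + 1 negative-sample class (index 9)
COMMANDS = [f"command_{i}" for i in range(9)] + ["negative"]

def classify(logits, threshold=0.9):
    """Accept a command word only if its probability exceeds the threshold;
    otherwise default to the negative-sample class."""
    probs = np.exp(logits - np.max(logits))   # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if best < 9 and probs[best] > threshold:
        return COMMANDS[best]
    return "negative"
```

A confident output maps to its command word; an ambiguous one (e.g. near-uniform probabilities) falls back to the negative class, as the 0.9 rule requires.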
As a further solution, a system collects microphone signals through the wearable device and realizes detection of the human voice signal and recognition of the corresponding command words by the wearable-device-based neural network multi-command word recognition method described in any of the above.
Compared with the related art, the neural network multi-command word recognition method based on the wearable device has the following beneficial effects:
1. the invention uses neural network technology and mixes various noises into the training data, improving recognition accuracy and robustness;
2. the invention uses the MFCC features of the speech as the network input; in the first layer of the network, a CNN performs feature extraction, and because the CNN shares weights the number of network parameters is greatly reduced; a GRU layer is then added, which makes full use of the information between preceding frames in the speech segment and enriches the extracted speech features; finally a fully connected layer classifies the result into 10 categories; the inter-frame features obtained in this way improve the overall recognition rate and recognition efficiency of the system;
3. a VAD voice detection module performs voice detection: when the microphone receives speech, the VAD module gives an active state, and when the multi-command word recognition algorithm receives the active state, the first frame resets the GRU initial state and command-word detection starts; when there is no speech, the multi-command word detection algorithm does not run, reducing system power consumption; resetting the GRU state keeps it consistent with the training condition, ensuring the recognition accuracy and robustness of the algorithm.
Drawings
Fig. 1 is a flowchart illustrating a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention;
fig. 2 is a schematic diagram of MFCC feature extraction of a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention;
fig. 3 is a schematic diagram of feature extraction at a GRU layer of a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1 and fig. 3, the present embodiment provides a neural network multi-command word recognition method based on a wearable device. Microphone signals are collected by the wearable device and converted into a digital input signal stream by an analog-to-digital converter. The digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered. After the multi-command word recognition algorithm has been reset, speech recognition starts.
Specifically, the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step.
As shown in fig. 2, specifically, the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input feature, and performing MFCC feature extraction to obtain an MFCC feature corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation.
Specifically, the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order.
Specifically, the GRU layer sequence frame information extraction step: inter-frame information is extracted from the sequence frames through the GRU layer to obtain inter-frame information features.
Specifically, the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
It should be noted that this embodiment uses a VAD voice detection algorithm to detect speech. When the microphone signal passes through the VAD algorithm, the VAD gives a flag-bit state. When no speech is detected, the multi-command word recognition algorithm performs no computation; when speech is detected, the initial state H_{t-1} of the first frame is set to 0, so that the state at inference time matches the training condition, improving the recognition accuracy and robustness of the algorithm.
The multi-command word recognition algorithm extracts MFCC features from the received digital signal and uses them as input to the neural network. In the first layer of the network, a CNN convolution layer extracts features; after this preliminary extraction, the sequence features are input to the subsequent GRU layer, which fully extracts the temporal features of the speech segment and feeds the subsequent DENSE classification layer. The classification layer yields 10 categories: 9 command word categories and 1 negative-sample category.
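The CNN stage of this pipeline, from MFCC map to the sequence fed into the GRU, can be sketched with explicit loops (random placeholder kernels; a naive valid convolution rather than an optimized library call):

```python
import numpy as np

rng = np.random.default_rng(1)

mfcc_feat = rng.standard_normal((68, 40))          # [frames, MFCC coefficients]
kernels   = rng.standard_normal((16, 20, 5)) * 0.1 # 16 kernels of size [20, 5]

# CNN layer: valid convolution with stride [1, 2] -> output [49, 18, 16]
out = np.zeros((49, 18, 16))
for c in range(16):
    for i in range(49):                            # time axis, stride 1
        for j in range(18):                        # frequency axis, stride 2
            patch = mfcc_feat[i:i + 20, j * 2:j * 2 + 5]
            out[i, j, c] = np.sum(patch * kernels[c])

# Flatten the channel and frequency axes so each of the 49 time steps
# becomes one 288-dimensional GRU input vector
seq = out.reshape(49, 18 * 16)
print(seq.shape)  # (49, 288)
```

The 49 rows of `seq` are then consumed one per time step by the 44-unit GRU, whose final state feeds the 10-way DENSE classifier.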
As a further solution, the pre-emphasis coefficient of the voice MFCC feature extraction step is chosen as 0.97.
As a further solution, the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed using a Hamming window.
As a further solution, the voice MFCC feature extraction step performs fast fourier transform by FFT processing; filtering the sub-band by Mel-filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
It should be noted that: mel-Frequency Cepstral Coefficients (MFCC) was chosen as the input feature for the model. The extraction process includes pre-emphasis, framing and windowing, FFT, mel filter, logarithm calculation, DCT transformation, etc., and the process sequence and processing procedure are shown in the following figure. The lowest frequency and the highest frequency of the filter bank can be selected according to the frequency range of the actually recorded voice. Thereby reducing the impact of extraneous frequency bands.
As a further solution, the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]. The input of the CNN layer is a feature map of dimension [68, 40], where 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame. After the convolution operation, the feature map size is [49, 18, 16].
The reset of the multi-command word recognition algorithm is a reset of the GRU layer state. The GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
As a further solution, as shown in fig. 3, the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
As a further solution, in the DENSE layer command word classification step, the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], where the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
As a further solution, the network training framework of the DENSE layer is based on the TensorFlow framework, with a batch size of 1024 and 50 training epochs. The data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly. The network output of the DENSE layer is the probability of the corresponding category; a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
As a further solution, a system collects microphone signals through the wearable device and realizes detection of the human voice signal and recognition of the corresponding command words by the wearable-device-based neural network multi-command word recognition method described in any of the above.
The above description is only an embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.
Claims (10)
1. A neural network multi-command word recognition method based on a wearable device, characterized in that microphone signals are collected through the wearable device and converted into a digital input signal stream by an analog-to-digital converter; the digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered; after the multi-command word recognition algorithm has been reset, speech recognition starts;
the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step;
the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input characteristic, and performing MFCC characteristic extraction to obtain an MFCC characteristic corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation;
the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order;
the GRU layer extracts information between sequence frames: extracting interframe information of the sequence frames through a GRU layer to obtain interframe information characteristics;
the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
2. The method as claimed in claim 1, wherein the pre-emphasis coefficient of the voice MFCC feature extraction step is selected to be 0.97.
3. The method for recognizing the multi-command word in the neural network based on the wearable device as claimed in claim 1, wherein the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed by using a Hamming window.
4. The neural network multi-command word recognition method based on wearable equipment as claimed in claim 1, wherein the voice MFCC feature extraction step is fast Fourier transformed by FFT processing; filtering the sub-band by Mel-filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
5. The wearable device-based neural network multi-command word recognition method of claim 1, wherein the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]; the input of the CNN layer is a feature map of dimension [68, 40], wherein 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame; after the convolution operation, the feature map size is [49, 18, 16].
6. The neural network multi-command word recognition method based on the wearable device according to claim 1, wherein the reset of the multi-command word recognition algorithm is a reset of the GRU layer state; the GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
7. The neural network multi-command word recognition method based on the wearable device of claim 1, wherein the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
8. The neural network multi-command word recognition method based on the wearable device according to claim 1, wherein in the DENSE layer command word classification step the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], wherein the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
9. The neural network multi-command word recognition method based on the wearable device of claim 8, wherein the network training framework of the DENSE layer is based on the TensorFlow framework, the batch size adopted in training is 1024 and the number of iterations is 50 epochs; the data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly; the network output of the DENSE layer is the probability of the corresponding category, a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
10. A wearable device-based neural network multi-command word recognition system, which is operated on a hardware device and realizes the detection of human voice signals of microphone signals collected by a wearable device and the recognition of corresponding command words through the wearable device-based neural network multi-command word recognition method according to any one of claims 1 to 9.
Priority Applications (1)
- CN202210888530.9A (CN115457953A), priority date 2022-07-27, filing date 2022-07-27: Neural network multi-command word recognition method and system based on wearable device
Publications (1)
- CN115457953A, publication date 2022-12-09
Family
ID=84295896
Family Applications (1)
- CN202210888530.9A, priority date 2022-07-27, filing date 2022-07-27: Neural network multi-command word recognition method and system based on wearable device
Country Status (1)
- CN: CN115457953A
Cited By (1)
- WO2023141701A1 (Blumind Inc.), priority date 2022-01-25, publication date 2023-08-03: Analog systems and methods for audio feature extraction and natural language processing
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination