CN115457953A - Neural network multi-command word recognition method and system based on wearable device - Google Patents
- Publication number
- CN115457953A (application CN202210888530.9A)
- Authority
- CN
- China
- Prior art keywords
- command word
- layer
- voice
- gru
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/225—Feedback of the input speech
Abstract
The invention provides a neural network multi-command word recognition method and system based on a wearable device, relating to the technical field of audio processing. Neural network technology is used and various noises are mixed into the training data, improving recognition accuracy and robustness. The MFCC features of the speech serve as the network input; in the first layer of the network, a CNN performs feature extraction, and because the CNN shares weights the number of network parameters can be greatly reduced. A GRU layer is then added, which makes full use of the information between preceding frames in the speech segment; the inter-frame features obtained in this step improve the overall recognition rate and recognition efficiency of the system. A VAD voice detection module performs voice detection, and the multi-command word detection algorithm does not run when no speech is present, reducing system power consumption. Resetting the GRU state keeps it consistent with the training condition, ensuring the recognition accuracy and robustness of the algorithm.
Description
Technical Field
The invention relates to the technical field of audio processing, in particular to a neural network multi-command word recognition method and system based on wearable equipment.
Background
The multi-command word recognition algorithm is one of the algorithms commonly used in intelligent voice applications and is widely applied in intelligent voice human-computer interaction. In voice-based human-computer interaction, a voice instruction issued by a person is transmitted into the machine through a microphone; inside the machine, the multi-command word recognition algorithm recognizes specific command words and, when one is recognized, feeds a signal back so that the machine can make the corresponding interactive response.
In wearable-device-based multi-command word recognition, the device communicates with a mobile phone through a Bluetooth module. With the algorithm integrated on the wearable device, real-time and accurate multi-command word recognition, and thus human-computer interaction, can be achieved without a network connection.
However, existing multi-command word recognition schemes suffer from poor robustness and low detection accuracy, recognize human voice signals poorly in the presence of noise, and remain in standby at all times, resulting in high system energy consumption.
Therefore, it is necessary to provide a neural network multi-command word recognition method and system based on a wearable device to solve the above technical problems.
Disclosure of Invention
In order to solve one of the above technical problems, the present invention provides a neural network multi-command word recognition method based on a wearable device. Microphone signals are collected by the wearable device and converted into a digital input signal stream by an analog-to-digital converter. The digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered. After the multi-command word recognition algorithm has been reset, speech recognition starts.
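The VAD-gated control flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; `vad_is_speech`, `recognize_frame`, and `reset_gru_state` are placeholder names for the VAD decision, the recognizer, and the GRU state reset.

```python
def process_stream(frames, vad_is_speech, recognize_frame, reset_gru_state):
    """Run the command-word recognizer only while the VAD flag is active."""
    vad_flag = False
    results = []
    for frame in frames:
        if vad_is_speech(frame):
            if not vad_flag:       # first speech frame after silence:
                reset_gru_state()  # reset GRU state to match the training condition
                vad_flag = True
            results.append(recognize_frame(frame))
        else:
            vad_flag = False       # recognizer idle while no speech -> lower power
    return results
```

Note that the GRU state is reset exactly once per detected utterance, at the first active frame, as the patent describes.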
Specifically, the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step.
Specifically, the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input feature, and performing MFCC feature extraction to obtain an MFCC feature corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation.
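The six MFCC stages listed above (pre-emphasis, framing and windowing, FFT, Mel filtering, logarithm, DCT) can be sketched in NumPy/SciPy as below. This is a minimal sketch under assumptions the patent does not state: a 16 kHz sample rate and 40 Mel filters spanning 0 Hz to the Nyquist frequency. Without padding, a 1.1-second clip yields 67 frames here; the [68, 40] dimension quoted later presumably pads the final frame.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=512, frame_shift=256,
         n_fft=512, n_mels=40, n_ceps=40, preemph=0.97):
    # 1) pre-emphasis with coefficient 0.97
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) framing (32 ms frames, 16 ms shift at 16 kHz) + Hamming window
    n_frames = 1 + max(0, (len(sig) - frame_len) // frame_shift)
    frames = np.stack([sig[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 3) FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) triangular Mel filter bank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 5) log of filter-bank energies, 6) DCT -> cepstral coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is the 40-dimensional MFCC vector of one frame, matching the per-frame feature width used by the CNN layer.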
Specifically, the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order.
Specifically, the GRU layer sequence frame information extraction step: inter-frame information is extracted from the sequence frames through the GRU layer to obtain inter-frame information features.
Specifically, the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
As a further solution, the pre-emphasis coefficient of the voice MFCC feature extraction step is chosen as 0.97.
As a further solution, the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed using a Hamming window.
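With these framing parameters, the frame count of a 1.1-second clip can be checked arithmetically. This small sketch assumes a 16 kHz sample rate (not stated in the patent) and that the final partial frame is padded, which reproduces the 68-frame time axis of the CNN input dimension quoted below.

```python
import math

sr = 16000                       # assumed sample rate
frame_len   = int(0.032 * sr)    # 32 ms frame  -> 512 samples
frame_shift = int(0.016 * sr)    # 16 ms shift  -> 256 samples
n_samples   = int(1.1 * sr)      # 1.1 s clip   -> 17600 samples

# With the last partial frame padded, a 1.1 s clip yields 68 frames.
n_frames = math.ceil((n_samples - frame_len) / frame_shift) + 1
print(n_frames)  # 68
```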
As a further solution, the voice MFCC feature extraction step performs fast fourier transform by FFT processing; filtering the sub-band by Mel filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
As a further solution, the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]. The input of the CNN layer is a feature map of dimension [68, 40], where 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame. After the convolution operation, the feature map size is [49, 18, 16].
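The quoted output size follows from the standard valid-convolution formula, assuming no padding (the patent does not state the padding mode, but these numbers are only consistent with valid convolution):

```python
def conv_out(size, kernel, stride):
    # output length of a 'valid' (unpadded) convolution along one axis
    return (size - kernel) // stride + 1

t = conv_out(68, 20, 1)   # time axis:      kernel 20, stride 1
f = conv_out(40,  5, 2)   # frequency axis: kernel 5,  stride 2
print(t, f)               # 49 18
```

With 16 kernels, stacking the 16 output channels gives the stated feature map size [49, 18, 16].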
As a further solution, the reset of the multi-command word recognition algorithm is a reset of the GRU layer state. The GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
As a further solution, the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
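A single GRU step following these formulas can be sketched in NumPy. Note the final blend follows the patent's convention, H_t = H_{t-1}·Z_t + H_tilda·(1 − Z_t), which swaps the roles of Z_t and (1 − Z_t) relative to the common textbook GRU; the weights below are random placeholders, not trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU step per the patent's formulas.
    W maps 'xz','hz','xr','hr','xh','hh' to weight matrices;
    b maps 'z','r','h' to bias vectors."""
    z = sigmoid(x_t @ W['xz'] + h_prev @ W['hz'] + b['z'])            # update gate Z_t
    r = sigmoid(x_t @ W['xr'] + h_prev @ W['hr'] + b['r'])            # reset gate R_t
    h_tilda = np.tanh(x_t @ W['xh'] + (h_prev * r) @ W['hh'] + b['h'])
    return h_prev * z + h_tilda * (1.0 - z)                           # H_t

# [49, 288] CNN output -> unidirectional GRU with 44 units
rng = np.random.default_rng(0)
d_in, d_h = 288, 44
W = {k: rng.standard_normal((d_in if k[0] == 'x' else d_h, d_h)) * 0.01
     for k in ('xz', 'hz', 'xr', 'hr', 'xh', 'hh')}
b = {k: np.zeros(d_h) for k in ('z', 'r', 'h')}

h = np.zeros(d_h)                          # state reset before each utterance
for x_t in rng.standard_normal((49, d_in)):
    h = gru_step(x_t, h, W, b)
print(h.shape)  # (44,)
```

Initializing `h` to zeros before the loop corresponds to the first-frame state reset described elsewhere in the document.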
As a further solution, in the DENSE layer command word classification step, the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], where the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
As a further solution, the network training framework of the DENSE layer is based on the TensorFlow framework, with a batch size of 1024 and 50 training epochs. The data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly. The network output of the DENSE layer is the probability of the corresponding category; a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
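The 0.9-threshold decision rule can be sketched as follows. This is an illustrative sketch, not the patent's code; the command labels are placeholders, and the softmax over raw scores is an assumption about how the class probabilities are produced.

```python
import numpy as np

# Placeholder labels: 9 command words + 1 negative-sample class (index 9)
COMMANDS = [f"command_{i}" for i in range(9)] + ["negative"]

def classify(logits, threshold=0.9):
    """Accept a command word only if its probability exceeds the threshold;
    otherwise default to the negative-sample class."""
    probs = np.exp(logits - np.max(logits))   # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if best < 9 and probs[best] > threshold:
        return COMMANDS[best]
    return "negative"
```

A confident output maps to its command word; an ambiguous one (e.g. near-uniform probabilities) falls back to the negative class, as the 0.9 rule requires.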
As a further solution, a system collects microphone signals through the wearable device and realizes detection of the human voice signal and recognition of the corresponding command words by the wearable-device-based neural network multi-command word recognition method described in any of the above.
Compared with the related art, the neural network multi-command word recognition method based on the wearable device has the following beneficial effects:
1. the invention uses neural network technology and mixes various noises into the training data, improving recognition accuracy and robustness;
2. the invention uses the MFCC features of the speech as the network input; in the first layer of the network, a CNN performs feature extraction, and because the CNN shares weights the number of network parameters is greatly reduced; a GRU layer is then added, which makes full use of the information between preceding frames in the speech segment and enriches the extracted speech features; finally a fully connected layer classifies the result into 10 categories; the inter-frame features obtained in this way improve the overall recognition rate and recognition efficiency of the system;
3. a VAD voice detection module performs voice detection: when the microphone receives speech, the VAD module gives an active state, and when the multi-command word recognition algorithm receives the active state, the first frame resets the GRU initial state and command-word detection starts; when there is no speech, the multi-command word detection algorithm does not run, reducing system power consumption; resetting the GRU state keeps it consistent with the training condition, ensuring the recognition accuracy and robustness of the algorithm.
Drawings
Fig. 1 is a flowchart illustrating a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention;
fig. 2 is a schematic diagram of MFCC feature extraction of a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention;
fig. 3 is a schematic diagram of feature extraction at a GRU layer of a neural network multi-command word recognition method based on a wearable device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1 and fig. 3, the present embodiment provides a neural network multi-command word recognition method based on a wearable device. Microphone signals are collected by the wearable device and converted into a digital input signal stream by an analog-to-digital converter. The digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered. After the multi-command word recognition algorithm has been reset, speech recognition starts.
Specifically, the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step.
As shown in fig. 2, specifically, the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input feature, and performing MFCC feature extraction to obtain an MFCC feature corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation.
Specifically, the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order.
Specifically, the GRU layer sequence frame information extraction step: inter-frame information is extracted from the sequence frames through the GRU layer to obtain inter-frame information features.
Specifically, the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
It should be noted that this embodiment uses a VAD voice detection algorithm to detect speech. When the microphone signal passes through the VAD algorithm, the VAD gives a flag-bit state. When no speech is detected, the multi-command word recognition algorithm performs no computation; when speech is detected, the initial state H_{t-1} of the first frame is set to 0, so that the state at inference time matches the training condition, improving the recognition accuracy and robustness of the algorithm.
The multi-command word recognition algorithm extracts MFCC features from the received digital signal and uses them as input to the neural network. In the first layer of the network, a CNN convolution layer extracts features; after this preliminary extraction, the sequence features are input to the subsequent GRU layer, which fully extracts the temporal features of the speech segment and feeds the subsequent DENSE classification layer. The classification layer yields 10 categories: 9 command word categories and 1 negative-sample category.
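The CNN stage of this pipeline, from MFCC map to the sequence fed into the GRU, can be sketched with explicit loops (random placeholder kernels; a naive valid convolution rather than an optimized library call):

```python
import numpy as np

rng = np.random.default_rng(1)

mfcc_feat = rng.standard_normal((68, 40))          # [frames, MFCC coefficients]
kernels   = rng.standard_normal((16, 20, 5)) * 0.1 # 16 kernels of size [20, 5]

# CNN layer: valid convolution with stride [1, 2] -> output [49, 18, 16]
out = np.zeros((49, 18, 16))
for c in range(16):
    for i in range(49):                            # time axis, stride 1
        for j in range(18):                        # frequency axis, stride 2
            patch = mfcc_feat[i:i + 20, j * 2:j * 2 + 5]
            out[i, j, c] = np.sum(patch * kernels[c])

# Flatten the channel and frequency axes so each of the 49 time steps
# becomes one 288-dimensional GRU input vector
seq = out.reshape(49, 18 * 16)
print(seq.shape)  # (49, 288)
```

The 49 rows of `seq` are then consumed one per time step by the 44-unit GRU, whose final state feeds the 10-way DENSE classifier.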
As a further solution, the pre-emphasis coefficient of the voice MFCC feature extraction step is chosen as 0.97.
As a further solution, the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed using a Hamming window.
As a further solution, the voice MFCC feature extraction step performs fast fourier transform by FFT processing; filtering the sub-band by Mel-filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
It should be noted that: mel-Frequency Cepstral Coefficients (MFCC) was chosen as the input feature for the model. The extraction process includes pre-emphasis, framing and windowing, FFT, mel filter, logarithm calculation, DCT transformation, etc., and the process sequence and processing procedure are shown in the following figure. The lowest frequency and the highest frequency of the filter bank can be selected according to the frequency range of the actually recorded voice. Thereby reducing the impact of extraneous frequency bands.
As a further solution, the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]. The input of the CNN layer is a feature map of dimension [68, 40], where 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame. After the convolution operation, the feature map size is [49, 18, 16].
The reset of the multi-command word recognition algorithm is a reset of the GRU layer state. The GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
As a further solution, as shown in fig. 3, the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
As a further solution, in the DENSE layer command word classification step, the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], where the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
As a further solution, the network training framework of the DENSE layer is based on the TensorFlow framework, with a batch size of 1024 and 50 training epochs. The data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly. The network output of the DENSE layer is the probability of the corresponding category; a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
As a further solution, a system collects microphone signals through the wearable device and realizes detection of the human voice signal and recognition of the corresponding command words by the wearable-device-based neural network multi-command word recognition method described in any of the above.
The above description is only an embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.
Claims (10)
1. A neural network multi-command word recognition method based on a wearable device, characterized in that microphone signals are collected through the wearable device and converted into a digital input signal stream by an analog-to-digital converter; the digital input signal stream undergoes voice detection in a VAD voice detection module: when only noise is detected, the VAD voice detection module does not activate the VAD flag bit and the multi-command word recognition algorithm performs no computation; when a voice signal is detected, the VAD voice detection module activates the VAD flag bit and the multi-command word recognition algorithm is entered; after the multi-command word recognition algorithm has been reset, speech recognition starts;
the multi-command word recognition algorithm comprises a voice MFCC feature extraction step, a CNN layer feature extraction step, a GRU layer sequence frame information extraction step and a DENSE layer command word classification step;
the voice MFCC feature extraction step: selecting a Mel frequency cepstrum coefficient of a digital input signal stream as an input characteristic, and performing MFCC characteristic extraction to obtain an MFCC characteristic corresponding to the digital input signal stream; the MFCC feature extraction step comprises pre-emphasis, framing and windowing, FFT processing, mel filter processing, logarithmic operation and DCT transformation;
the CNN layer feature extraction step: the MFCC features are input and convolved to obtain multiple frames of CNN feature maps, which are arranged into sequence frames in output order;
the GRU layer extracts information between sequence frames: extracting interframe information of the sequence frames through a GRU layer to obtain interframe information characteristics;
the DENSE layer command word classification step: the inter-frame information features are input into a DENSE layer obtained through network training, which outputs the classification probability of each command word for the voice signal; the command word conveyed by the voice signal is determined from these classification probabilities.
2. The method as claimed in claim 1, wherein the pre-emphasis coefficient of the voice MFCC feature extraction step is selected to be 0.97.
3. The method for recognizing the multi-command word in the neural network based on the wearable device as claimed in claim 1, wherein the frame length of the frame windowing of the voice MFCC feature extraction step is 32ms, the frame shift is 16ms, and each frame is windowed by using a Hamming window.
4. The neural network multi-command word recognition method based on wearable equipment as claimed in claim 1, wherein the voice MFCC feature extraction step is fast Fourier transformed by FFT processing; filtering the sub-band by Mel-filter processing; processing the output of the Mel filter by a logarithmic operation; the MFCC features are obtained by discrete cosine transform via DCT transform.
5. The wearable device-based neural network multi-command word recognition method of claim 1, wherein the CNN layer feature extraction step processes the MFCC features with 16 convolution kernels of size [20, 5] and a stride of [1, 2]; the input of the CNN layer is a feature map of dimension [68, 40], wherein 68 indicates that the 1.1 seconds of voice data is divided into 68 frames and 40 indicates that 40 MFCC features are extracted from each frame; after the convolution operation, the feature map size is [49, 18, 16].
6. The neural network multi-command word recognition method based on the wearable device according to claim 1, wherein the reset of the multi-command word recognition algorithm is a reset of the GRU layer state; the GRU layer in the sequence frame information extraction step is a unidirectional GRU with 44 neurons; the output of the CNN layer is reshaped before being input to the GRU layer, with the reshaped dimension being [49, 288] and the GRU layer output dimension being [44].
7. The neural network multi-command word recognition method based on the wearable device of claim 1, wherein the GRU layer is deployed by the following formula:
Z_t = σ(X_t·W_xz + H_{t-1}·W_hz + b_z)
R_t = σ(X_t·W_xr + H_{t-1}·W_hr + b_r)
H_tilda = tanh(X_t·W_xh + (H_{t-1} ⊙ R_t)·W_hh + b_h)
H_t = H_{t-1} ⊙ Z_t + H_tilda ⊙ (1 − Z_t)
where X_t denotes the input of the GRU layer at time t; H_{t-1} denotes the hidden-layer state at the previous time step; H_t denotes the hidden-layer state output at time t; W_xr, W_hr, W_xz, W_hz, W_xh and W_hh denote weight matrices; b_r, b_z and b_h denote biases; R_t denotes the reset gate; Z_t denotes the update gate; H_tilda denotes the information to be updated; tanh(·) denotes the Tanh activation function; σ(·) denotes the Sigmoid activation function; and ⊙ denotes element-wise multiplication.
8. The neural network multi-command word recognition method based on the wearable device according to claim 1, wherein in the DENSE layer command word classification step the input of the DENSE layer is the output of the GRU layer; the output size of the DENSE layer is 10 and the output dimension is [10], wherein the dimensions represent the probabilities of the 9 command words and the 1 negative-sample class respectively.
9. The neural network multi-command word recognition method based on the wearable device of claim 8, wherein the network training framework of the DENSE layer is based on the TensorFlow framework, the batch size adopted in training is 1024 and the number of iterations is 50 epochs; the data used for network training are clean voice data and voice data mixed with noise; training samples are unified to 1.1 seconds, and several different noises are mixed in randomly; the network output of the DENSE layer is the probability of the corresponding category, a probability above 0.9 is classified into the corresponding command word category, and otherwise the sample defaults to the negative-sample class.
10. A wearable device-based neural network multi-command word recognition system, which is operated on a hardware device and realizes the detection of human voice signals of microphone signals collected by a wearable device and the recognition of corresponding command words through the wearable device-based neural network multi-command word recognition method according to any one of claims 1 to 9.
Priority Applications (1)
- CN202210888530.9A (CN115457953A), priority date 2022-07-27, filing date 2022-07-27: Neural network multi-command word recognition method and system based on wearable device
Publications (1)
- CN115457953A, publication date 2022-12-09
Family
ID=84295896
Family Applications (1)
- CN202210888530.9A, priority date 2022-07-27, filing date 2022-07-27: Neural network multi-command word recognition method and system based on wearable device
Country Status (1)
- CN: CN115457953A
Cited By (1)
- WO2023141701A1 (Blumind Inc.), priority date 2022-01-25, publication date 2023-08-03: Analog systems and methods for audio feature extraction and natural language processing
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination