CN114937450B - Voice keyword recognition method and system - Google Patents
- Publication number
- CN114937450B (Application CN202110164790.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- sub
- keyword
- network
- voice keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition; G10L15/08—Speech classification or search
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; G10L15/063—Training
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; G10L25/03—characterised by the type of extracted parameters; G10L25/18—the extracted parameters being spectral information of each sub-band
- G10L2015/088—Word spotting
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE; Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]; Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a voice keyword recognition method and system, wherein the voice keyword recognition method comprises the following steps: extracting features from voice data to be recognized to obtain a voice keyword feature vector; inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model with sample voice keyword feature vectors marked with keyword type labels, and the neural network model comprises a plurality of sub-networks, each sub-network recognizing one corresponding keyword; and executing opening and closing operations on the corresponding sub-networks in the trained voice keyword recognition model according to the voice keyword recognition result. While maintaining recognition accuracy, the invention reduces the parameter count and the amount of computation, and further reduces the overall power consumption of the system by dynamically adjusting the sub-network channels responsible for recognizing each keyword.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and a system for recognizing speech keywords.
Background
With the progress of machine learning technology, applications such as speech recognition and image target recognition have developed rapidly. Taking voice interaction as an example, a voice keyword recognition wake-up (Keyword Spotting, abbreviated as KWS) module is needed. It is a normally-open functional module: in the normal standby state it detects sound in the environment at all times and judges whether a preset wake-up instruction exists; if a keyword is detected, the subsequent functional modules of the device can be opened to complete more complex interaction functions. If an internet of things (Internet of Things, IoT) device sent the original data to a central processor for processing and then received the result back, it would face high transmission energy consumption, high delay and threats to privacy; therefore the KWS module is generally embedded in the IoT device itself. At the same time, IoT devices face limitations in terms of energy, computing power and the like.
Fig. 1 is a schematic diagram of an application scenario of existing IoT devices. As shown in fig. 1, different IoT devices need to recognize different kinds and numbers of keywords; for example, the recognition module of a curtain needs to recognize "open", "close" and "pause", while a desk lamp only requires "turn on" and "turn off". The traditional solution, in which a central processor processes all data uniformly, is not suitable for a scenario with many IoT devices, because original sound data is generated continuously while valid sound data is very sparse, i.e. the time during which the user wakes up a device is very short compared with the whole day.
In the prior art, the implementation and optimization of the KWS module mostly concentrate on the neural network (Neural Network, abbreviated as NN) computing part. One proposal is a low-power NN accelerator that can complete real-time keyword detection at a power consumption of 172 μW; however, that scheme does not study the actual deployment of the KWS module in depth, but simply runs it on the device. Another proposal is a KWS implementation with low storage and low computation, whose power consumption is below 1 μW, but at the cost of latency and accuracy. The implementations and optimizations in the prior art thus remain at the recognition performance of a single network: although a balance is struck between network structure and recognition accuracy, the characteristics of actually deployed KWS applications are ignored, and the requirements for keywords of different kinds and numbers in different scenarios cannot be met. Therefore, a voice keyword recognition method and system are needed to solve the above problems.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a voice keyword recognition method and a voice keyword recognition system.
The invention provides a voice keyword recognition method, which comprises the following steps:
extracting features of voice data to be identified to obtain voice keyword feature vectors;
Inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword;
and executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
According to the voice keyword recognition method provided by the invention, the trained voice keyword recognition model is obtained through training the following steps:
Constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, training corresponding sub-networks according to each training sample set, and obtaining a plurality of trained sub-networks;
constructing a sub-network keyword output probability sample set according to the output results of the plurality of trained sub-networks;
And inputting the sub-network output probability sample set to a full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
According to the voice keyword recognition method provided by the invention, the operation of opening and closing the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result comprises the following steps:
According to the voice keyword recognition result obtained at the current moment, controlling the target equipment, and closing the first sub-network;
after the control of the target equipment is completed, a second sub-network is started;
the first sub-network is a sub-network for recognizing the voice keywords at the current moment, and the second sub-network is a sub-network for recognizing the voice keywords at the next moment.
According to the voice keyword recognition method provided by the invention, the sub-network is a gating neural network.
According to the method for recognizing the voice keywords provided by the invention, the voice data to be recognized is subjected to feature extraction to obtain the voice keyword feature vectors, and the method comprises the following steps:
Performing analog-to-digital conversion processing on the voice data to be recognized to obtain a voice digital signal;
Performing fast Fourier transform processing on the voice digital signal to obtain voice signal spectrum information;
acquiring voice energy spectrum information corresponding to the voice signal spectrum information based on a squarer;
filtering the voice energy spectrum information to obtain average energy information of a plurality of frequency bands;
and acquiring the characteristic vector of the voice keyword according to the average energy information of the frequency bands.
According to the voice keyword recognition method provided by the invention, before the voice data to be recognized is subjected to feature extraction to obtain the voice keyword feature vector, the method further comprises the following steps:
Preprocessing voice data to be recognized to obtain preprocessed voice data, and extracting features according to the preprocessed voice data, wherein the preprocessing comprises pre-emphasis processing, framing processing and windowing processing.
The invention also provides a voice keyword recognition system, which comprises:
the feature extraction module is used for extracting features of the voice keywords to be identified and obtaining feature vectors of the voice keywords;
The voice keyword recognition module is used for inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword;
And the recognition result execution module is used for executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
According to the voice keyword recognition system provided by the invention, the system further comprises:
The sub-network training module is used for constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, training corresponding sub-networks according to each training sample set, and obtaining a plurality of trained sub-networks;
The sub-network output probability sample set construction module is used for constructing a sub-network keyword output probability sample set according to the output results of the plurality of trained sub-networks;
And the full-connection layer training module is used for inputting the sub-network output probability sample set into the full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any one of the above-mentioned speech keyword recognition methods when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech keyword recognition method as described in any of the above.
According to the voice keyword recognition method and system provided by the invention, features are extracted from the voice data, the voice keyword feature vector obtained after feature extraction is input into the voice keyword recognition model to obtain the voice keyword recognition result, and the corresponding sub-networks in the voice keyword recognition model are then opened and closed according to the voice keyword recognition result. While maintaining recognition accuracy, this reduces the parameter count and the amount of computation, and dynamically adjusting the sub-network channels responsible for recognizing each keyword further reduces the overall power consumption of the system.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an existing IoT device application scenario provided by the present invention;
FIG. 2 is a schematic flow chart of a method for recognizing voice keywords;
FIG. 3 is a schematic diagram of a prior art KWS system;
FIG. 4 is a schematic diagram of a distributed neural network architecture of a speech keyword recognition model provided by the present invention;
FIG. 5 is a schematic diagram showing the comparison of the number of times of calculation of the method for recognizing a speech keyword provided by the present invention with the conventional method for recognizing a speech keyword in a KWS application scenario;
FIG. 6 is a schematic diagram of a voice keyword recognition system according to the present invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 2 is a flow chart of a voice keyword recognition method provided by the present invention, and as shown in fig. 2, the present invention provides a voice keyword recognition method, including:
Step 101, extracting features of the voice data to be recognized, and obtaining voice keyword feature vectors;
Step 102, inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword;
Step 103, executing opening and closing operations on the corresponding sub-networks in the trained voice keyword recognition model according to the voice keyword recognition result.
Fig. 3 is a schematic structural diagram of a KWS system in the prior art. As shown in fig. 3, the KWS system includes a microphone, an analog-to-digital converter (Analog-to-Digital Converter, abbreviated as ADC) and a KWS processing module, where the KWS processing module can be further divided into three parts: feature extraction, feature recognition and back-end processing; feature recognition uses a single neural network to recognize all keywords. First, the microphone collects data and converts the sound into an analog signal; the analog-to-digital converter then converts the analog signal into a digital signal, which is input into the KWS processing module. The KWS processing module comprises a voice feature extraction unit, a neural network (Neural Network, NN) computing unit and the necessary parameter storage: the feature extraction unit extracts and compresses the original voice signal to output the feature vector corresponding to the voice, the NN unit then computes the recognition probability from the feature vector, and the final result is output through back-end processing. The neural network model is pre-trained and loaded into the memory of the KWS module.
In the present invention, in step 101, the voice signal to be recognized is used as input data; to reduce the complexity of the signal and improve its discriminability, the data is extracted and compressed, and the feature vector corresponding to the voice keyword is output. The voice signal to be recognized is the digital signal obtained by collecting voice data through a microphone, converting it into an analog signal and performing analog-to-digital conversion.
In the present invention, in step 102, the voice keyword feature vector is input into a voice keyword recognition model constructed from a neural network. The neural network is distributed and comprises a plurality of sub-networks, so that the recognition of the individual keywords can be separated; each sub-network is responsible for recognizing only one keyword, and the voice keyword recognition result is thereby obtained.
In step 103, the present invention dynamically adjusts the current working state of the KWS system according to the voice keyword recognition result, controlling the voice keyword recognition model to open and close the corresponding sub-network channels.
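By way of illustration and not limitation, the flow of steps 101-103 can be sketched as follows. The feature extractor and recognizer below are mere stubs; all function names, keyword labels and the 0.9 threshold are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np

def extract_features(pcm: np.ndarray) -> np.ndarray:          # step 101
    energy = np.abs(np.fft.rfft(pcm, 512)) ** 2               # energy spectrum
    return energy[:256].reshape(16, 16).mean(axis=1)          # crude 16-band average

def recognize(feat: np.ndarray, open_channels: set) -> dict:  # step 102
    # Stub: a real model runs one small sub-network per open channel only.
    return {kw: 0.0 for kw in open_channels}

open_channels = {"turn_on"}                    # standby: a single open channel
for pcm_frame in [np.zeros(480)] * 3:          # stream of incoming audio frames
    probs = recognize(extract_features(pcm_frame), open_channels)
    hits = [kw for kw, p in probs.items() if p > 0.9]
    if hits:                                   # step 103: act, then re-gate channels
        print("control device:", hits[0])
        open_channels = ({"turn_off", "turn_up"} if hits[0] == "turn_on"
                         else {"turn_on"})
```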
According to the voice keyword recognition method, the voice data are subjected to feature extraction, the voice keyword feature vectors obtained after feature extraction are input into the voice keyword recognition model, and the voice keyword recognition result is obtained, so that the corresponding sub-networks in the voice keyword recognition model are subjected to opening and closing operations according to the voice keyword recognition result.
On the basis of the above embodiment, the trained speech keyword recognition model is obtained by training the following steps:
Step 201, constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, and training corresponding sub-networks according to each training sample set to obtain a plurality of trained sub-networks;
Step 202, constructing a sub-network keyword output probability sample set according to output results of a plurality of trained sub-networks;
And 203, inputting the sub-network output probability sample set to a full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
In the present invention, fig. 4 is a schematic diagram of the distributed neural network architecture of the voice keyword recognition model provided by the invention. Referring to fig. 4, the distributed neural network comprises a plurality of sub-networks with mutually independent channels, each sub-network recognizing one corresponding keyword. The outputs of the sub-networks are further connected to a full-connection layer that fuses them: the input of the full-connection layer is the combination of the output values of the sub-networks, and its output is the keyword recognition result.
Specifically, in step 201, a plurality of training sample sets are constructed from the sample voice keyword feature vectors marked with labels of different keyword types, and the constructed training sample sets are used as input of the voice keyword recognition model. First, the sub-network corresponding to each training sample set is individually trained to recognize a single keyword, so that each sub-network retains the ability to recognize its keyword independently; a plurality of trained sub-networks are thereby obtained.
Further, in step 202, the corresponding sub-network is trained on each training sample set to obtain the output probability of each sub-network for its keyword, and a sub-network keyword output probability sample set is constructed from the output probabilities of the sub-networks. The statistical distribution of the output probabilities differs between sub-networks.
Further, in step 203, the output probability sample sets of the sub-networks are input to the full-connection layer for training. The full-connection layer weights the output probability recognition results of the sub-network channels; since the parameters of the sub-networks are fixed after their training, only the parameters of the full-connection layer are trained. The trained full-connection layer is then fused with the trained sub-networks to produce the final keyword recognition result, yielding the trained voice keyword recognition model.
Table 1 shows the flow of the training method of the distributed neural network according to the present invention.
TABLE 1
It should be noted that the full-connection layer acts as the overall classifier of the distributed neural network and screens out the final correct keyword result. Since the statistical distributions of the output probabilities of the different small networks differ, directly stacking their output results would cause conflicts between channels, making it impossible to determine which sub-network channel carries the correct keyword recognition result. Taking the control of a desk lamp as an example, suppose the small network recognizing "on" outputs a probability of 95% and the small network recognizing "off" outputs 97%: both channels consider their own keyword recognized with high probability, and it cannot simply be concluded that "off" is the correct result merely because its probability is higher. Therefore, the full-connection classification layer is adopted to arbitrate and balance the output probabilities of the multiple network channels. Meanwhile, since the recognition process of each sub-network should be synchronized with that of the whole distributed neural network and the sub-networks must not interfere with each other, the same data distribution should be used when training the sub-networks and when training the full-connection layer.
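By way of illustration and not limitation, the distributed architecture of fig. 4 and the two-stage training of steps 201-203 can be sketched in PyTorch as follows. The GRU sub-networks follow the preferred embodiment described below; all layer sizes, names and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistributedKWS(nn.Module):
    """One small GRU sub-network per keyword, fused by a full-connection layer."""
    def __init__(self, n_keywords: int, feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.subnets = nn.ModuleList(
            nn.GRU(feat_dim, hidden, batch_first=True) for _ in range(n_keywords))
        self.heads = nn.ModuleList(            # per-channel keyword probability
            nn.Linear(hidden, 1) for _ in range(n_keywords))
        self.fusion = nn.Linear(n_keywords, n_keywords + 1)   # +1 for "no keyword"

    def forward(self, x, open_mask=None):
        # x: (batch, time, feat_dim); open_mask: per-channel open/closed flags
        probs = []
        for i, (gru, head) in enumerate(zip(self.subnets, self.heads)):
            if open_mask is not None and not open_mask[i]:
                probs.append(x.new_zeros(x.size(0), 1))    # closed channel: no compute
            else:
                _, h = gru(x)                              # h: (1, batch, hidden)
                probs.append(torch.sigmoid(head(h[-1])))   # channel output probability
        return self.fusion(torch.cat(probs, dim=1))        # arbitrated recognition result

# Stage 1 (step 201): train each sub-network alone on its own keyword's sample set,
# then fix its parameters. Stage 2 (steps 202-203): train only the fusion layer on
# the stacked sub-network output probabilities.
model = DistributedKWS(n_keywords=3)
for p in model.subnets.parameters():
    p.requires_grad = False                                # sub-network parameters fixed
for p in model.heads.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.fusion.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                            # keyword-type label as target
```

Freezing the sub-network parameters in stage 2 mirrors the statement above that only the full-connection layer is trained, so each sub-network keeps its ability to recognize its own keyword independently.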
Table 2 shows the test results on a keyword recognition data set using different neural network structures. As shown in table 2, directly combining small networks suffers a serious accuracy loss at the same network scale, whereas, compared with the traditional neural network structure, the distributed neural network structure provided by the invention saves 55% of the number of computations, achieves the same level of multi-word recognition accuracy, and maintains the recognition accuracy of each single small network for its single keyword.
TABLE 2
Experimental tests show that, compared with traditional single-network multi-keyword recognition, the invention can recognize 2-3 keywords with essentially the same accuracy (reaching 95%-96%) while reducing the number of parameters by 55%. Meanwhile, the distributed network structure provided by the invention supports dynamically adjusting each channel responsible for single-keyword recognition, further reducing the overall power consumption; when the system is required to recognize 3 keywords, the amount of computation in the standby state is less than 1/6 of that of the traditional neural network.
On the basis of the foregoing embodiment, the performing, according to the voice keyword recognition result, the opening and closing operations on the corresponding subnetwork in the trained voice keyword recognition model includes:
According to the voice keyword recognition result obtained at the current moment, controlling the target equipment, and closing the first sub-network;
after the control of the target equipment is completed, a second sub-network is started;
the first sub-network is a sub-network for recognizing the voice keywords at the current moment, and the second sub-network is a sub-network for recognizing the voice keywords at the next moment.
In the invention, taking a desk lamp as an example, the voice wake-up module of the desk lamp needs to recognize three keywords: "turn on", "turn off" and "turn up". The desk lamp is controlled according to the voice keyword recognition result obtained from the voice keyword recognition model at the current moment. If the recognition result is "turn on", the lamp is turned on and the sub-network corresponding to "turn on" is closed; after the turn-on control of the desk lamp is completed, the KWS module needs to keep recognizing at the next moment whether "turn off" or "turn up" is input, so the sub-networks corresponding to "turn off" and "turn up" are opened. If the recognition result is "turn off", the lamp is turned off and the sub-networks corresponding to "turn off" and "turn up" are closed; after the turn-off control of the desk lamp is completed, the KWS module only needs to keep recognizing at the next moment whether "turn on" is input, so the sub-network corresponding to "turn on" is opened. If the recognition result is "turn up", the lamp is brightened and the sub-network corresponding to "turn up" is closed; after the turn-up control of the desk lamp is completed, the KWS module needs to keep recognizing at the next moment whether "turn off" is input, so the sub-network corresponding to "turn off" is opened.
According to the voice keyword recognition result, the sub-network channels responsible for single keyword recognition are dynamically adjusted by opening and closing the corresponding sub-networks in the trained voice keyword recognition model, so that the overall power consumption of the KWS system is further reduced.
In the present invention, fig. 5 is a schematic diagram comparing the number of computations of the voice keyword recognition method provided by the invention with that of the traditional voice keyword recognition method in the KWS application scenario. The voice keyword recognition method provided by the invention adopts the distributed neural network to separate the recognition of each keyword, each sub-network being responsible for recognizing only one keyword. As shown in fig. 5, the number of computations of the method provided by the invention in the KWS application scenario is significantly smaller than that of the traditional voice keyword recognition method.
In the present invention, adopting the distributed neural network architecture eliminates much redundant computation. Taking the desk lamp as an example, before "turn on" is input the system is in standby mode: the lamp is off and only the "turn on" keyword needs to be recognized; referring to fig. 5, the number of computations of the voice keyword recognition method provided by the invention in standby mode is 30K in the KWS application scenario. The computing parts corresponding to "turn off" and "turn up" do not run, which saves power. After "turn on" is recognized, the KWS module continuously recognizes whether "turn off" or "turn up" is input and no longer needs to attend to the "turn on" command, so after the desk lamp is turned on the small network corresponding to "turn on" is closed; at this point the number of computations is 60K, i.e. the amount of computation only becomes 2 times that of the standby mode. Therefore, the voice keyword recognition method provided by the invention can save much unnecessary computation power in practical application scenarios.
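By way of illustration, the channel-gating policy of the desk-lamp example can be written as a small transition table. The keyword labels and transitions below are assumptions read off the example above, not a disclosed implementation; a real deployment would derive the table from the device's states.

```python
TRANSITIONS = {
    "standby":  {"turn_on"},                # lamp off: only "turn on" is open
    "turn_on":  {"turn_off", "turn_up"},    # lamp on: 2x the standby computation
    "turn_up":  {"turn_off"},               # brightened: only "turn off" remains
    "turn_off": {"turn_on"},                # lamp off again: back to the standby set
}

def next_open_channels(recognized: str) -> set:
    """Sub-network channels to keep open after acting on the recognized keyword."""
    return TRANSITIONS[recognized]

open_channels = TRANSITIONS["standby"]          # {"turn_on"}: 30K computations
open_channels = next_open_channels("turn_on")   # {"turn_off", "turn_up"}: 60K
open_channels = next_open_channels("turn_up")   # {"turn_off"}
```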
Based on the above embodiments, the sub-networks may be convolutional neural networks, gated neural networks (Gated Recurrent Unit, abbreviated as GRU) or Long Short-Term Memory networks (LSTM). Preferably, the sub-network is a gated neural network, which achieves excellent performance in terms of parameter scale and recognition accuracy.
On the basis of the above embodiment, the feature extraction of the voice data to be identified to obtain a voice keyword feature vector includes:
Performing analog-to-digital conversion processing on the voice data to be recognized to obtain a voice digital signal;
Performing fast Fourier transform processing on the voice digital signal to obtain voice signal spectrum information;
acquiring voice energy spectrum information corresponding to the voice signal spectrum information based on a squarer;
filtering the voice energy spectrum information to obtain average energy information of a plurality of frequency bands;
and acquiring the characteristic vector of the voice keyword according to the average energy information of the frequency bands.
In the invention, the feature extraction part of KWS is the first processing stage of the voice data to be recognized. The feature extraction method adopted is the mel spectrum feature, which outputs the average energy of the voice data in each frequency band. Traditional feature extraction is done in the digital domain and mainly comprises the ADC, a fast Fourier transform (FFT), a squarer and a mel filter bank. Specifically, the ADC converts the analog sound data collected by the microphone into a digital signal at a sampling frequency of 16 kHz; front-end processing and the FFT then yield the spectrum information of the signal; the squarer squares the input data to obtain the energy spectrum; finally, the mel filter bank, whose center frequencies increase exponentially, takes the average of the signal energy distributed over the different frequency bands and outputs a 16-dimensional feature vector, where each dimension represents the average energy in one frequency band. After feature extraction, the input voice signal is greatly compressed into a feature vector.
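By way of illustration and not limitation, the mel-spectrum feature extraction described above can be sketched with NumPy as follows. The 16 kHz sampling rate and the 16 frequency bands follow the description; the FFT size, the triangular filter shape and all function names are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands=16, n_fft=512, sr=16000):
    # Triangular filters whose centre frequencies are spaced linearly on the
    # mel scale, i.e. grow exponentially in Hz, as described above.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for i in range(n_bands):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mel_features(frame, n_fft=512, n_bands=16, sr=16000):
    spectrum = np.fft.rfft(frame, n_fft)                 # FFT: spectrum information
    energy = np.abs(spectrum) ** 2                       # squarer: energy spectrum
    return mel_filterbank(n_bands, n_fft, sr) @ energy   # 16-dim band-energy vector

feat = mel_features(np.random.randn(400))                # one 25 ms frame at 16 kHz
```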
On the basis of the foregoing embodiment, before extracting the features of the voice data to be recognized to obtain the feature vector of the voice keyword, the method further includes:
Preprocessing voice data to be recognized to obtain preprocessed voice data, and extracting features according to the preprocessed voice data, wherein the preprocessing comprises pre-emphasis processing, framing processing and windowing processing.
In the present invention, the audio data is preferably first pre-processed before feature extraction, including pre-emphasis (Pre-emphasis), framing (Framing) and windowing (Windowing). Pre-emphasis aims to emphasize the high-frequency part of the speech, remove the influence of lip radiation and increase the high-frequency resolution of the speech. A voice signal is short-time stationary (it can be considered approximately unchanged within 10-30 ms), so it can be divided into short segments; framing is realized by weighting with a movable window of finite length. The number of frames per second is about 33 to 100, as the case may be. Framing uses overlapped segmentation, so that successive frames overlap; the offset from one frame to the next is called the frame shift, and the ratio of the frame shift to the frame length is generally 0-0.5. Windowing is applied before the Fourier analysis, making the signal more continuous overall and avoiding the Gibbs effect; at the same time, the originally non-periodic voice signal comes to exhibit some characteristics of a periodic function.
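By way of illustration, the preprocessing chain can be sketched as follows. The 25 ms frame length and 10 ms frame shift are illustrative values consistent with the 10-30 ms stationarity window and the 0-0.5 shift-to-length ratio stated above; the Hamming window is likewise an assumption, as any common window function could be used.

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    # Pre-emphasis: boost the high-frequency part of the speech.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)        # 400 samples per frame
    shift = int(sr * shift_ms / 1000)            # 160 samples -> 100 frames/s
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)
    window = np.hamming(frame_len)               # windowing against the Gibbs effect
    return np.stack([emphasized[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])

frames = preprocess(np.random.randn(16000))      # 1 s of audio -> shape (98, 400)
```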
Fig. 6 is a schematic structural diagram of the voice keyword recognition system provided by the invention. As shown in fig. 6, the invention provides a voice keyword recognition system comprising a feature extraction module 601, a voice keyword recognition module 602 and a recognition result execution module 603. The feature extraction module 601 is configured to extract features from the voice keywords to be recognized and obtain voice keyword feature vectors; the voice keyword recognition module 602 is configured to input the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, where the trained model is obtained by training a neural network model with sample voice keyword feature vectors marked with keyword type labels, the neural network model comprising a plurality of sub-networks, each sub-network recognizing one corresponding keyword; the recognition result execution module 603 is configured to execute the opening and closing operations on the corresponding sub-networks in the trained voice keyword recognition model according to the voice keyword recognition result.
In the voice keyword recognition system provided by the invention, the feature extraction module extracts features from the voice data, the voice keyword feature vector obtained after feature extraction is input into the voice keyword recognition module to obtain the voice keyword recognition result, and the recognition result execution module then executes the opening and closing operations on the corresponding sub-networks in the voice keyword recognition model.
On the basis of the embodiment, the system further comprises a sub-network training module, a sub-network output probability sample set construction module and a full-connection layer training module, wherein the sub-network training module is used for constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, training corresponding sub-networks according to each training sample set, and obtaining a plurality of trained sub-networks; the sub-network output probability sample set construction module is used for constructing a sub-network keyword output probability sample set according to the output results of the plurality of trained sub-networks; the full-connection layer training module is used for inputting the sub-network output probability sample set into the full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
The system provided by the invention is used for executing the method embodiments, and specific flow and details refer to the embodiments and are not repeated herein.
Fig. 7 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 7, the electronic device may include: a processor (processor) 701, a communication interface (Communications Interface) 702, a memory (memory) 703 and a communication bus 704, wherein the processor 701, the communication interface 702 and the memory 703 communicate with each other through the communication bus 704. The processor 701 may invoke logic instructions in the memory 703 to perform a voice keyword recognition method comprising: extracting features of voice data to be identified to obtain voice keyword feature vectors; inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword; and executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
Further, the logic instructions in the memory 703 may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of speech keyword recognition provided by the methods described above, the method comprising: extracting features of voice data to be identified to obtain voice keyword feature vectors; inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword; and executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
In still another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the voice keyword recognition method provided in the above embodiments, the method comprising: extracting features of voice data to be identified to obtain voice keyword feature vectors; inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword; and executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for recognizing a voice keyword, comprising:
extracting features of voice data to be identified to obtain voice keyword feature vectors;
Inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword;
and executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
2. The method for recognizing speech keywords according to claim 1, wherein the trained speech keyword recognition model is trained by:
Constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, training corresponding sub-networks according to each training sample set, and obtaining a plurality of trained sub-networks;
constructing a sub-network keyword output probability sample set according to the output results of the plurality of trained sub-networks;
And inputting the sub-network output probability sample set to a full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
3. The method for recognizing a voice keyword according to claim 1, wherein the performing on and off operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result includes:
According to the voice keyword recognition result obtained at the current moment, controlling the target equipment, and closing the first sub-network;
after the control of the target equipment is completed, a second sub-network is started;
the first sub-network is a sub-network for recognizing the voice keywords at the current moment, and the second sub-network is a sub-network for recognizing the voice keywords at the next moment.
4. The method of claim 1, wherein the sub-network is a gated neural network.
5. The method for recognizing a voice keyword according to claim 1, wherein the feature extraction of the voice data to be recognized to obtain a voice keyword feature vector comprises:
Performing analog-to-digital conversion processing on the voice data to be recognized to obtain a voice digital signal;
Performing fast Fourier transform processing on the voice digital signal to obtain voice signal spectrum information;
acquiring voice energy spectrum information corresponding to the voice signal spectrum information based on a squarer;
filtering the voice energy spectrum information to obtain average energy information of a plurality of frequency bands;
and acquiring the characteristic vector of the voice keyword according to the average energy information of the frequency bands.
6. The method for recognizing a voice keyword according to claim 1, wherein before extracting features of the voice data to be recognized to obtain a voice keyword feature vector, the method further comprises:
Preprocessing voice data to be recognized to obtain preprocessed voice data, and extracting features according to the preprocessed voice data, wherein the preprocessing comprises pre-emphasis processing, framing processing and windowing processing.
7. A voice keyword recognition system, comprising:
the feature extraction module is used for extracting features of the voice keywords to be identified and obtaining feature vectors of the voice keywords;
The voice keyword recognition module is used for inputting the voice keyword feature vector into a trained voice keyword recognition model to obtain a voice keyword recognition result, wherein the trained voice keyword recognition model is obtained by training a neural network model through a sample voice keyword feature vector marked with a keyword type label, and the neural network model comprises a plurality of sub-networks and each sub-network recognizes a corresponding keyword;
And the recognition result execution module is used for executing opening and closing operations on the corresponding sub-network in the trained voice keyword recognition model according to the voice keyword recognition result.
8. The voice keyword recognition system of claim 7, wherein the system further comprises:
The sub-network training module is used for constructing a plurality of training sample sets according to sample voice keyword feature vectors marked with different keyword type labels, training corresponding sub-networks according to each training sample set, and obtaining a plurality of trained sub-networks;
The sub-network output probability sample set construction module is used for constructing a sub-network keyword output probability sample set according to the output results of the plurality of trained sub-networks;
And the full-connection layer training module is used for inputting the sub-network output probability sample set into the full-connection layer for training, and fusing the trained full-connection layer with a plurality of trained sub-networks to obtain a trained voice keyword recognition model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the speech keyword recognition method according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the speech keyword recognition method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110164790.7A (granted as CN114937450B) | 2021-02-05 | 2021-02-05 | Voice keyword recognition method and system
Publications (2)
Publication Number | Publication Date |
---|---|
CN114937450A (en) | 2022-08-23
CN114937450B (en) | 2024-07-09
Family
ID=82861406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110164790.7A (CN114937450B, Active) | Voice keyword recognition method and system | 2021-02-05 | 2021-02-05
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114937450B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103971686A (en) * | 2013-01-30 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and system for automatically recognizing voice |
CN109272988A (en) * | 2018-09-30 | 2019-01-25 | 江南大学 | Audio recognition method based on multichannel convolutional neural networks |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102033929B1 (en) * | 2017-06-28 | 2019-10-18 | 포항공과대학교 산학협력단 | A real-time speech-recognition device using an ASIC chip and a smart-phone |
CN108305617B (en) * | 2018-01-31 | 2020-09-08 | 腾讯科技(深圳)有限公司 | Method and device for recognizing voice keywords |
CN111583940A (en) * | 2020-04-20 | 2020-08-25 | 东南大学 | Very low power consumption keyword awakening neural network circuit |
CN111862957A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Single track voice keyword low-power consumption real-time detection method |
- 2021-02-05: application CN202110164790.7A filed in CN; granted as patent CN114937450B (status: Active)
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |