CN110648668A - Keyword detection device and method - Google Patents

Keyword detection device and method

Info

Publication number
CN110648668A
CN110648668A (Application CN201910906551.7A)
Authority
CN
China
Prior art keywords
recognition device
voice recognition
speech recognition
training
keyword detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910906551.7A
Other languages
Chinese (zh)
Inventor
赖家豪
郑达
李索恒
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co Ltd
Original Assignee
Shanghai Yitu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Information Technology Co Ltd filed Critical Shanghai Yitu Information Technology Co Ltd
Priority to CN201910906551.7A
Publication of CN110648668A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The invention discloses a keyword detection apparatus comprising a plurality of speech recognition devices. Each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data. In the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score. The invention also discloses a keyword detection method. The invention can improve the recall rate when used in various environments and reduce false alarms.

Description

Keyword detection device and method
Technical Field
The invention relates to speech recognition, and in particular to a keyword detection device; the invention also relates to a keyword detection method.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), is a technology that converts an input speech signal, i.e., an audio signal, into corresponding text for output, and it has important applications in Artificial Intelligence (AI).
An existing speech recognition device usually includes a Neural Network (NN) and must be trained before use; training the device mainly means training the neural network, which forms a corresponding model through training. Using the trained model, a speech signal, i.e., an audio signal, is first processed by feature extraction and then input to the neural network, which selects an optimal output path according to the trained model and produces the corresponding text output. Such neural networks include Recurrent Neural Networks (RNNs), which are typically trained with a criterion based on Connectionist Temporal Classification (CTC), i.e., a CTC-based training criterion. During CTC-based training, training samples are provided, each consisting of an input audio signal and a label corresponding to the true output. Each node in the RNN has an initial weight. After the input audio signal is fed into the RNN, the RNN generates output data according to the weights of its internal nodes; the difference between the output data and the true output label is computed by a CTC loss function, and the CTC loss is back-propagated to adjust the weight of each node in the RNN. Training ends when the difference between the output data and the true output label is reduced to the required value, or when this difference changes only slightly; each node of the trained RNN then has its final weight, and the trained network is applied to actual speech recognition. In actual speech recognition, the feature-extracted audio signal is input to the RNN, the RNN selects the output path with the highest score according to the training result, i.e., the path whose product of node probabilities along the path is largest, and the corresponding text information is finally obtained by text decoding.
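For illustration only, the CTC-based training described above may be sketched as follows. This is a minimal example assuming a hypothetical GRU-based recognizer implemented with PyTorch and its nn.CTCLoss; the layer sizes, vocabulary size, and optimizer settings are illustrative choices, not details taken from the patent.

```python
# Minimal sketch of CTC-based training for one recognizer (illustrative only).
import torch
import torch.nn as nn

class Recognizer(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab + 1)   # +1 for the CTC blank label

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(dim=-1)   # (batch, time, vocab+1)

model = Recognizer()
ctc_loss = nn.CTCLoss(blank=0)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, feat_lens, labels, label_lens):
    # Forward pass: per-frame outputs of the RNN, scored against the text label by CTC.
    log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, vocab+1)
    loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
    # Back-propagation of the CTC loss adjusts the weight of every node, as described above.
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```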
In some applications, keyword detection is also required; keyword detection can obtain a command needed for automatic control, monitor sensitive information appearing in communication voice, and so on. A conventional keyword detection apparatus is usually implemented with a single speech recognition device that is trained, with a CTC-based training criterion, under one specific environment; its training samples use input audio data containing the relevant keywords together with text labels corresponding to the keywords in the input audio data. Such a keyword detection device built from a single speech recognition device has the drawback that, when the use environment differs from the training environment, keywords easily go unrecognized, the recall rate drops, and false alarms may even occur.
Disclosure of Invention
The invention aims to provide a keyword detection device that can improve the recall rate when used in various environments and reduce false alarms. To this end, the invention also provides a keyword detection method.
In order to solve the above technical problem, the keyword detection apparatus provided by the present invention includes a plurality of speech recognition devices.
Each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data.
In the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
In a further refinement, the type of neural network comprises a recurrent neural network.
In a further improvement, the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
In a further refinement, each of said speech recognition devices is trained using a CTC loss function.
In a further improvement, the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
In a further refinement, the input audio data is subjected to feature processing before being input to the speech recognition device.
In a further refinement, the combination of weights forming the total score is a weighted average.
In a further refinement, the weight of each of the speech recognition devices is the reciprocal of the number of the speech recognition devices; alternatively, the weight of each of the speech recognition devices is determined by a grid search.
In order to solve the above technical problem, the keyword detection method provided by the present invention employs a keyword detection apparatus including a plurality of speech recognition devices.
The input of each speech recognition device receives audio data, and each speech recognition device comprises a neural network.
The training method of each speech recognition device comprises the following steps: each speech recognition device is trained with the CTC-based training criterion, the environments in which the speech recognition devices are trained are different, and the training of each speech recognition device includes training its neural network according to the CTC-based training criterion.
The following inference method is adopted in the inference stage: the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices operate in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
In a further refinement, the type of neural network comprises a recurrent neural network.
In a further improvement, the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
In a further refinement, each of said speech recognition devices is trained using a CTC loss function.
In a further improvement, the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
In a further refinement, the input audio data is subjected to feature processing before being input to the speech recognition device.
In a further refinement, the combination of weights forming the total score is a weighted average.
In a further refinement, the weight of each of the speech recognition devices is the reciprocal of the number of the speech recognition devices; alternatively, the weight of each of the speech recognition devices is determined by a grid search.
The keyword detection apparatus of the invention comprises a plurality of speech recognition devices, each trained in a different environment. In the inference stage, the speech recognition devices run in parallel, and their keyword scores are combined by weights to form the total score of the keyword detection apparatus, from which the keyword prediction result signal is finally obtained. Because each speech recognition device can achieve a good keyword detection result in an environment similar to its own training environment, running the speech recognition devices in parallel achieves good keyword detection results across various different environments. This overcomes the drawback of prior-art keyword detection devices built from a single speech recognition device, whose recall rate drops when the environment changes, so the recall rate of the keyword detection apparatus in various environments can be improved and false alarms can be reduced.
Drawings
The invention is described in further detail below with reference to the drawings and specific embodiments:
FIG. 1 is a schematic structural diagram of a keyword detection apparatus according to an embodiment of the present invention;
fig. 2 is a flowchart of an inference phase of the keyword detection apparatus according to the embodiment of the present invention.
Detailed Description
The keyword detection device of the embodiment of the invention is as follows.
Fig. 1 is a schematic structural diagram of a keyword detection apparatus according to an embodiment of the present invention. The keyword detection apparatus of this embodiment comprises a plurality of speech recognition devices 102; Fig. 1 shows n speech recognition devices, denoted speech recognition device 1, speech recognition device 2 through speech recognition device n in the corresponding boxes.
The keyword detection apparatus further includes a speech feature processing module 101; the input audio data is subjected to feature processing by the speech feature processing module 101 before being input to the speech recognition devices 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
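As an illustration of the preferred feature processing, the following is a minimal sketch of short-time Fourier transform spectral feature extraction using NumPy; the frame length, hop size, and log-magnitude output are illustrative assumptions rather than parameters specified in the patent.

```python
# Minimal sketch of STFT-based spectral feature extraction (illustrative only).
import numpy as np

def stft_features(audio, frame_len=400, hop=160):
    """Return a (frames, frame_len//2 + 1) log-magnitude spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=-1))   # magnitude of each frame's FFT
    return np.log(spectrum + 1e-8)                    # log compression for stability
```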
Each speech recognition device 102 is trained using a CTC-based training criterion, and the environment in which each speech recognition device 102 is trained is different; each speech recognition device 102 comprises a neural network 103 obtained through CTC-based training; the input of each speech recognition device 102 receives audio data. The neural networks 103 include recurrent neural networks. There are likewise n neural networks 103, denoted neural network 1, neural network 2 through neural network n in the corresponding boxes.
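The parallel structure of Fig. 1 can be sketched as follows, assuming the n recognizers are instances of the hypothetical Recognizer model from the training sketch above; the way a per-frame posterior is reduced to a single keyword score (taking the best frame posterior of the keyword output units) is an illustrative assumption, not the patent's prescribed scoring.

```python
# Minimal sketch of the ensemble of n independently trained recognizers (illustrative only).
import torch

class KeywordEnsemble(torch.nn.Module):
    def __init__(self, recognizers, keyword_ids):
        super().__init__()
        self.recognizers = torch.nn.ModuleList(recognizers)  # speech recognition devices 1..n
        self.keyword_ids = keyword_ids                        # output units belonging to the keyword

    @torch.no_grad()
    def keyword_scores(self, feats):
        """Return one keyword score per recognizer for the same input audio."""
        scores = []
        for rec in self.recognizers:
            log_probs = rec(feats)                            # (batch, time, vocab+1)
            # Illustrative scoring: best frame posterior of each keyword unit, summed.
            kw = log_probs[..., self.keyword_ids].max(dim=1).values.sum(dim=-1)
            scores.append(kw)
        return torch.stack(scores, dim=-1)                    # (batch, n)
```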
The training samples of each speech recognition device 102 use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
Each of the speech recognition devices 102 is trained using a CTC loss function.
The CTC loss function and the training samples employed by each speech recognition device 102 are different, and they are adapted to the environment for which that device is trained.
Fig. 2 is a flowchart of the inference phase of the keyword detection apparatus according to the embodiment of the present invention. In the inference phase, the same first input audio data, shown as reference numeral 104, is fed to the input of each speech recognition device 102.
The speech recognition devices 102 operate in parallel, and each outputs a corresponding keyword score, as indicated at 105. For the same first input audio data, the closer the training environment of a speech recognition device 102 is to the input environment of the first input audio data, the higher the keyword score of that speech recognition device 102.
The keyword scores of the speech recognition devices 102 are combined by weights to form the total score of the keyword detection apparatus, as indicated at 106. In the embodiment of the present invention, the weight combination forming the total score is a weighted average. The weight of each speech recognition device 102 is the reciprocal of the number of speech recognition devices 102; alternatively, the weight of each speech recognition device 102 is determined by a grid search.
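The two weighting options can be illustrated with a short sketch: a weighted average whose default weight is the reciprocal of the number of recognizers, and a grid search over weight vectors on a development set. The 0.1 grid step, the 0.5 decision threshold, and the use of accuracy as the search objective are illustrative assumptions, not values specified in the patent.

```python
# Minimal sketch of the score combination and of grid search over weights (illustrative only).
import itertools
import numpy as np

def combine(scores, weights=None):
    """Weighted average of per-recognizer keyword scores."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:                        # default: reciprocal of the number of recognizers
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

def grid_search_weights(dev_scores, dev_labels, step=0.1, threshold=0.5):
    """Pick the weight vector that maximizes accuracy on a development set.

    dev_scores: (num_utterances, num_recognizers) keyword scores
    dev_labels: 1 if the keyword is present in the utterance, else 0
    """
    n = dev_scores.shape[1]
    best_w, best_acc = None, -1.0
    for w in itertools.product(np.arange(0.0, 1.0 + step, step), repeat=n):
        if abs(sum(w) - 1.0) > 1e-6:           # keep only weight vectors that sum to 1
            continue
        preds = (dev_scores @ np.array(w)) > threshold
        acc = float((preds == dev_labels).mean())
        if acc > best_acc:
            best_w, best_acc = np.array(w), acc
    return best_w
```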
A keyword prediction result signal corresponding to the first input audio data is output according to the total score; the keyword prediction result is shown as reference numeral 106.
The keyword detection apparatus of this embodiment comprises a plurality of speech recognition devices 102, each trained in a different environment. In the inference stage, the speech recognition devices 102 run in parallel, and their keyword scores are combined by weights to form the total score of the keyword detection apparatus, from which the keyword prediction result signal is finally obtained. Because each speech recognition device 102 can achieve a good keyword detection result in an environment similar to its own training environment, running the speech recognition devices 102 in parallel achieves good keyword detection results across various different environments. This overcomes the drawback of prior-art keyword detection devices built from a single speech recognition device 102, whose recall rate drops when the environment changes, so the recall rate in various environments can be improved and false alarms can be reduced.
The keyword detection method of the embodiment of the invention comprises the following steps:
the keyword detection method of the embodiment of the present invention employs a keyword detection apparatus including a plurality of speech recognition apparatuses 102.
The input of each speech recognition device 102 receives audio data, and each speech recognition device 102 comprises a neural network 103. Fig. 1 shows n speech recognition devices, denoted speech recognition device 1, speech recognition device 2 through speech recognition device n in the corresponding boxes.
The neural networks 103 include recurrent neural networks. There are likewise n neural networks 103, denoted neural network 1, neural network 2 through neural network n in the corresponding boxes.
The keyword detection apparatus further includes a speech feature processing module 101; the input audio data is subjected to feature processing by the speech feature processing module 101 before being input to the speech recognition devices 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
The training method of each speech recognition device 102 comprises the following: each speech recognition device 102 is trained with the CTC-based training criterion, the environments in which the speech recognition devices 102 are trained are different, and the training of each speech recognition device 102 includes training its neural network 103 according to the CTC-based training criterion.
The training samples of each speech recognition device 102 use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
Each speech recognition device 102 is trained using a CTC loss function.
The CTC loss function and the training samples employed by each speech recognition device 102 are different, and they are adapted to the environment for which that device is trained.
The following inference method is adopted in the inference stage:
the input of each of the speech recognition devices 102 is input with the same first input audio data, as indicated by reference numeral 104.
Each of the speech recognition devices 102 operates in parallel and outputs a corresponding keyword score, as indicated at 105. For the same first input audio data, the closer the training environment of the speech recognition device 102 and the input environment of the first input audio data are, the higher the keyword score of the corresponding speech recognition device 102 is.
The keyword scores of the speech recognition devices 102 are weighted and combined to form a total score of the keyword detection device, which is shown as 106. In the embodiment of the present invention, the weight combination mode for forming the total score is weighted average. The weight of each speech recognition device 102 is the reciprocal of the number of speech recognition devices 102; alternatively, the weight of each of the speech recognition devices 102 is determined by a grid search.
And outputting a keyword prediction result signal corresponding to the first input audio data according to the total score, wherein the keyword prediction result is shown as a mark 106.
The present invention has been described in detail with reference to specific embodiments, but these should not be construed as limiting the present invention. Many variations and modifications may be made by those of ordinary skill in the art without departing from the principles of the present invention, and these should also be considered within the scope of the present invention.

Claims (16)

1. A keyword detection apparatus, comprising: a plurality of speech recognition devices;
each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data;
in the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
2. The keyword detection apparatus according to claim 1, characterized in that: the type of neural network includes a recurrent neural network.
3. The keyword detection apparatus according to claim 1, characterized in that: the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
4. The keyword detection apparatus according to claim 3, characterized in that: each of the speech recognition devices is trained using a CTC loss function.
5. The keyword detection apparatus according to claim 4, characterized in that: the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
6. The keyword detection apparatus according to claim 1, characterized in that: the input audio data is subjected to feature processing before being input to the speech recognition device.
7. The keyword detection apparatus according to claim 1, characterized in that: the weight combination mode for forming the total score is weighted average.
8. The keyword detection apparatus according to claim 7, characterized in that: the weight of each speech recognition device is the reciprocal of the number of speech recognition devices; alternatively, the weight of each speech recognition device is determined by a grid search.
9. A keyword detection method, characterized in that: a keyword detection apparatus comprising a plurality of speech recognition devices is employed;
the input of each speech recognition device receives audio data, and each speech recognition device comprises a neural network;
the training method of each speech recognition device comprises the following steps:
each speech recognition device is trained with a CTC-based training criterion, wherein the environments in which the speech recognition devices are trained are different, and the training of each speech recognition device comprises training of its neural network according to the CTC-based training criterion;
the following inference method is adopted in the inference stage:
the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices operate in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
10. The keyword detection method according to claim 9, characterized in that: the type of neural network includes a recurrent neural network.
11. The keyword detection method according to claim 9, characterized in that: the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
12. The keyword detection method according to claim 11, characterized in that: each of the speech recognition devices is trained using a CTC loss function.
13. The keyword detection method according to claim 12, characterized in that: the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
14. The keyword detection method according to claim 9, characterized in that: the input audio data is subjected to feature processing before being input to the speech recognition device.
15. The keyword detection method according to claim 9, characterized in that: the weight combination mode for forming the total score is weighted average.
16. The keyword detection method according to claim 9, characterized in that: the weight of each speech recognition device is the reciprocal of the number of speech recognition devices; alternatively, the weight of each speech recognition device is determined by a grid search.
CN201910906551.7A 2019-09-24 2019-09-24 Keyword detection device and method Pending CN110648668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910906551.7A CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910906551.7A CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Publications (1)

Publication Number Publication Date
CN110648668A (en) 2020-01-03

Family

ID=69011139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910906551.7A Pending CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Country Status (1)

Country Link
CN (1) CN110648668A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
US20180190295A1 (en) * 2016-12-31 2018-07-05 Lenovo (Beijing) Co., Ltd. Voice recognition
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance

Similar Documents

Publication Publication Date Title
CN110648659B (en) Voice recognition and keyword detection device and method based on multitask model
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Thomas et al. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
CN108847238B (en) Service robot voice recognition method
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN106098059B (en) Customizable voice awakening method and system
CN111627458B (en) Sound source separation method and equipment
CN110136749A (en) The relevant end-to-end speech end-point detecting method of speaker and device
CN109272988A (en) Audio recognition method based on multichannel convolutional neural networks
CN110349597B (en) Voice detection method and device
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN106161209B (en) A kind of method for filtering spam short messages and system based on depth self study
CN110610709A (en) Identity distinguishing method based on voiceprint recognition
CN106898354B (en) Method for estimating number of speakers based on DNN model and support vector machine model
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110930995A (en) Voice recognition model applied to power industry
CN112418175A (en) Rolling bearing fault diagnosis method and system based on domain migration and storage medium
CN113129900A (en) Voiceprint extraction model construction method, voiceprint identification method and related equipment
Adiba et al. Towards immediate backchannel generation using attention-based early prediction model
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Tan et al. Selective mutual learning: an efficient approach for single channel speech separation
CN110648668A (en) Keyword detection device and method
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
Yang et al. Linguistically-Informed Training of Acoustic Word Embeddings for Low-Resource Languages.
Menon et al. ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103