CN110648668A - Keyword detection device and method - Google Patents

Keyword detection device and method

Info

Publication number
CN110648668A
CN110648668A (Application CN201910906551.7A)
Authority
CN
China
Prior art keywords
recognition device
voice recognition
speech recognition
training
keyword detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910906551.7A
Other languages
Chinese (zh)
Inventor
赖家豪
郑达
李索恒
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co Ltd
Original Assignee
Shanghai Yitu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Information Technology Co Ltd filed Critical Shanghai Yitu Information Technology Co Ltd
Priority to CN201910906551.7A
Publication of CN110648668A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The invention discloses a keyword detection apparatus comprising a plurality of speech recognition devices. Each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data. In the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score. The invention also discloses a keyword detection method. The invention can improve the recall rate when used in various environments and reduce false alarms.

Description

Keyword detection device and method
Technical Field
The invention relates to speech recognition, and in particular to a keyword detection device; the invention also relates to a keyword detection method.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), is a technology that converts an input speech signal, i.e., an audio signal, into corresponding text for output, and it has important applications in Artificial Intelligence (AI).
An existing speech recognition device usually includes a Neural Network (NN) and must be trained before use; training the device mainly means training the neural network, which forms a corresponding model through training. Using the trained model, a speech signal, i.e., an audio signal, is first processed by feature extraction and then input to the neural network, which selects an optimal output path according to the trained model and produces the corresponding text output. Such neural networks include Recurrent Neural Networks (RNNs), which are typically trained with a criterion based on Connectionist Temporal Classification (CTC), i.e., a CTC-based training criterion. During CTC-based training, training samples are provided, each consisting of an input audio signal and a label corresponding to the true output. Each node in the RNN has an initial weight. After the input audio signal is fed into the RNN, the RNN generates output data according to the weights of its internal nodes; the difference between the output data and the true output label is computed by a CTC loss function, and the CTC loss is back-propagated to adjust the weight of each node in the RNN. Training ends when the difference between the output data and the true output label is reduced to the required value, or when this difference changes only slightly; each node of the trained RNN then has its final weight, and the trained network is applied to actual speech recognition. In actual speech recognition, the feature-extracted audio signal is input to the RNN, the RNN selects the output path with the highest score according to the training result, i.e., the path whose product of node probabilities along the path is largest, and the corresponding text information is finally obtained by text decoding.
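For illustration only, the CTC-based training described above may be sketched as follows. This is a minimal example assuming a hypothetical GRU-based recognizer implemented with PyTorch and its nn.CTCLoss; the layer sizes, vocabulary size, and optimizer settings are illustrative choices, not details taken from the patent.

```python
# Minimal sketch of CTC-based training for one recognizer (illustrative only).
import torch
import torch.nn as nn

class Recognizer(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab + 1)   # +1 for the CTC blank label

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(dim=-1)   # (batch, time, vocab+1)

model = Recognizer()
ctc_loss = nn.CTCLoss(blank=0)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, feat_lens, labels, label_lens):
    # Forward pass: per-frame outputs of the RNN, scored against the text label by CTC.
    log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, vocab+1)
    loss = ctc_loss(log_probs, labels, feat_lens, label_lens)
    # Back-propagation of the CTC loss adjusts the weight of every node, as described above.
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```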
In some applications, keyword detection is also required; keyword detection can obtain a command needed for automatic control, monitor sensitive information appearing in communication voice, and so on. A conventional keyword detection apparatus is usually implemented with a single speech recognition device that is trained, with a CTC-based training criterion, under one specific environment; its training samples use input audio data containing the relevant keywords together with text labels corresponding to the keywords in the input audio data. Such a keyword detection device built from a single speech recognition device has the drawback that, when the use environment differs from the training environment, keywords easily go unrecognized, the recall rate drops, and false alarms may even occur.
Disclosure of Invention
The invention aims to provide a keyword detection device that can improve the recall rate when used in various environments and reduce false alarms. To this end, the invention also provides a keyword detection method.
In order to solve the above technical problem, the keyword detection apparatus provided by the present invention includes a plurality of speech recognition devices.
Each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data.
In the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
In a further refinement, the type of neural network comprises a recurrent neural network.
In a further improvement, the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
In a further refinement, each of said speech recognition devices is trained using a CTC loss function.
In a further improvement, the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
In a further refinement, the input audio data is subjected to feature processing before being input to the speech recognition device.
In a further refinement, the combination of weights forming the total score is a weighted average.
In a further refinement, the weight of each of the speech recognition devices is the reciprocal of the number of the speech recognition devices; alternatively, the weight of each of the speech recognition devices is determined by a grid search.
In order to solve the above technical problem, the keyword detection method provided by the present invention employs a keyword detection apparatus including a plurality of speech recognition devices.
The input of each speech recognition device receives audio data, and each speech recognition device comprises a neural network.
The training method of each speech recognition device comprises the following steps: each speech recognition device is trained with the CTC-based training criterion, the environments in which the speech recognition devices are trained are different, and the training of each speech recognition device includes training its neural network according to the CTC-based training criterion.
The following inference method is adopted in the inference stage: the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices operate in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
In a further refinement, the type of neural network comprises a recurrent neural network.
In a further improvement, the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
In a further refinement, each of said speech recognition devices is trained using a CTC loss function.
In a further improvement, the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
In a further refinement, the input audio data is subjected to feature processing before being input to the speech recognition device.
In a further refinement, the combination of weights forming the total score is a weighted average.
In a further refinement, the weight of each of the speech recognition devices is the reciprocal of the number of the speech recognition devices; alternatively, the weight of each of the speech recognition devices is determined by a grid search.
The keyword detection apparatus of the invention comprises a plurality of speech recognition devices, each trained in a different environment. In the inference stage, the speech recognition devices run in parallel, and their keyword scores are combined by weights to form the total score of the keyword detection apparatus, from which the keyword prediction result signal is finally obtained. Because each speech recognition device can achieve a good keyword detection result in an environment similar to its own training environment, running the speech recognition devices in parallel achieves good keyword detection results across various different environments. This overcomes the drawback of prior-art keyword detection devices built from a single speech recognition device, whose recall rate drops when the environment changes, so the recall rate of the keyword detection apparatus in various environments can be improved and false alarms can be reduced.
Drawings
The invention is described in further detail below with reference to the drawings and specific embodiments:
FIG. 1 is a schematic structural diagram of a keyword detection apparatus according to an embodiment of the present invention;
fig. 2 is a flowchart of an inference phase of the keyword detection apparatus according to the embodiment of the present invention.
Detailed Description
The keyword detection device of the embodiment of the invention is as follows.
Fig. 1 is a schematic structural diagram of a keyword detection apparatus according to an embodiment of the present invention. The keyword detection apparatus of this embodiment comprises a plurality of speech recognition devices 102; Fig. 1 shows n speech recognition devices, denoted speech recognition device 1, speech recognition device 2 through speech recognition device n in the corresponding boxes.
The keyword detection apparatus further includes a speech feature processing module 101; the input audio data is subjected to feature processing by the speech feature processing module 101 before being input to the speech recognition devices 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
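As an illustration of the preferred feature processing, the following is a minimal sketch of short-time Fourier transform spectral feature extraction using NumPy; the frame length, hop size, and log-magnitude output are illustrative assumptions rather than parameters specified in the patent.

```python
# Minimal sketch of STFT-based spectral feature extraction (illustrative only).
import numpy as np

def stft_features(audio, frame_len=400, hop=160):
    """Return a (frames, frame_len//2 + 1) log-magnitude spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=-1))   # magnitude of each frame's FFT
    return np.log(spectrum + 1e-8)                    # log compression for stability
```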
Each speech recognition device 102 is trained using a CTC-based training criterion, and the environment in which each speech recognition device 102 is trained is different; each speech recognition device 102 comprises a neural network 103 obtained through CTC-based training; the input of each speech recognition device 102 receives audio data. The neural networks 103 include recurrent neural networks. There are likewise n neural networks 103, denoted neural network 1, neural network 2 through neural network n in the corresponding boxes.
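The parallel structure of Fig. 1 can be sketched as follows, assuming the n recognizers are instances of the hypothetical Recognizer model from the training sketch above; the way a per-frame posterior is reduced to a single keyword score (taking the best frame posterior of the keyword output units) is an illustrative assumption, not the patent's prescribed scoring.

```python
# Minimal sketch of the ensemble of n independently trained recognizers (illustrative only).
import torch

class KeywordEnsemble(torch.nn.Module):
    def __init__(self, recognizers, keyword_ids):
        super().__init__()
        self.recognizers = torch.nn.ModuleList(recognizers)  # speech recognition devices 1..n
        self.keyword_ids = keyword_ids                        # output units belonging to the keyword

    @torch.no_grad()
    def keyword_scores(self, feats):
        """Return one keyword score per recognizer for the same input audio."""
        scores = []
        for rec in self.recognizers:
            log_probs = rec(feats)                            # (batch, time, vocab+1)
            # Illustrative scoring: best frame posterior of each keyword unit, summed.
            kw = log_probs[..., self.keyword_ids].max(dim=1).values.sum(dim=-1)
            scores.append(kw)
        return torch.stack(scores, dim=-1)                    # (batch, n)
```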
The training samples of each speech recognition device 102 use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
Each of the speech recognition devices 102 is trained using a CTC loss function.
The CTC loss function and the training samples employed by each speech recognition device 102 are different, and they are adapted to the environment for which that device is trained.
Fig. 2 is a flowchart of the inference phase of the keyword detection apparatus according to the embodiment of the present invention. In the inference phase, the same first input audio data, shown as reference numeral 104, is fed to the input of each speech recognition device 102.
The speech recognition devices 102 operate in parallel, and each outputs a corresponding keyword score, as indicated at 105. For the same first input audio data, the closer the training environment of a speech recognition device 102 is to the input environment of the first input audio data, the higher the keyword score of that speech recognition device 102.
The keyword scores of the speech recognition devices 102 are combined by weights to form the total score of the keyword detection apparatus, as indicated at 106. In the embodiment of the present invention, the weight combination forming the total score is a weighted average. The weight of each speech recognition device 102 is the reciprocal of the number of speech recognition devices 102; alternatively, the weight of each speech recognition device 102 is determined by a grid search.
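The two weighting options can be illustrated with a short sketch: a weighted average whose default weight is the reciprocal of the number of recognizers, and a grid search over weight vectors on a development set. The 0.1 grid step, the 0.5 decision threshold, and the use of accuracy as the search objective are illustrative assumptions, not values specified in the patent.

```python
# Minimal sketch of the score combination and of grid search over weights (illustrative only).
import itertools
import numpy as np

def combine(scores, weights=None):
    """Weighted average of per-recognizer keyword scores."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:                        # default: reciprocal of the number of recognizers
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

def grid_search_weights(dev_scores, dev_labels, step=0.1, threshold=0.5):
    """Pick the weight vector that maximizes accuracy on a development set.

    dev_scores: (num_utterances, num_recognizers) keyword scores
    dev_labels: 1 if the keyword is present in the utterance, else 0
    """
    n = dev_scores.shape[1]
    best_w, best_acc = None, -1.0
    for w in itertools.product(np.arange(0.0, 1.0 + step, step), repeat=n):
        if abs(sum(w) - 1.0) > 1e-6:           # keep only weight vectors that sum to 1
            continue
        preds = (dev_scores @ np.array(w)) > threshold
        acc = float((preds == dev_labels).mean())
        if acc > best_acc:
            best_w, best_acc = np.array(w), acc
    return best_w
```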
A keyword prediction result signal corresponding to the first input audio data is output according to the total score; the keyword prediction result is shown as reference numeral 106.
The keyword detection apparatus of this embodiment comprises a plurality of speech recognition devices 102, each trained in a different environment. In the inference stage, the speech recognition devices 102 run in parallel, and their keyword scores are combined by weights to form the total score of the keyword detection apparatus, from which the keyword prediction result signal is finally obtained. Because each speech recognition device 102 can achieve a good keyword detection result in an environment similar to its own training environment, running the speech recognition devices 102 in parallel achieves good keyword detection results across various different environments. This overcomes the drawback of prior-art keyword detection devices built from a single speech recognition device 102, whose recall rate drops when the environment changes, so the recall rate in various environments can be improved and false alarms can be reduced.
The keyword detection method of the embodiment of the invention comprises the following steps:
the keyword detection method of the embodiment of the present invention employs a keyword detection apparatus including a plurality of speech recognition apparatuses 102.
The input of each speech recognition device 102 receives audio data, and each speech recognition device 102 comprises a neural network 103. Fig. 1 shows n speech recognition devices, denoted speech recognition device 1, speech recognition device 2 through speech recognition device n in the corresponding boxes.
The neural networks 103 include recurrent neural networks. There are likewise n neural networks 103, denoted neural network 1, neural network 2 through neural network n in the corresponding boxes.
The keyword detection apparatus further includes a speech feature processing module 101; the input audio data is subjected to feature processing by the speech feature processing module 101 before being input to the speech recognition devices 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
The training method of each speech recognition device 102 comprises the following: each speech recognition device 102 is trained with the CTC-based training criterion, the environments in which the speech recognition devices 102 are trained are different, and the training of each speech recognition device 102 includes training its neural network 103 according to the CTC-based training criterion.
The training samples of each speech recognition device 102 use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
Each speech recognition device 102 is trained using a CTC loss function.
The CTC loss function and the training samples employed by each speech recognition device 102 are different, and they are adapted to the environment for which that device is trained.
The following inference method is adopted in the inference stage:
the input of each of the speech recognition devices 102 is input with the same first input audio data, as indicated by reference numeral 104.
Each of the speech recognition devices 102 operates in parallel and outputs a corresponding keyword score, as indicated at 105. For the same first input audio data, the closer the training environment of the speech recognition device 102 and the input environment of the first input audio data are, the higher the keyword score of the corresponding speech recognition device 102 is.
The keyword scores of the speech recognition devices 102 are weighted and combined to form a total score of the keyword detection device, which is shown as 106. In the embodiment of the present invention, the weight combination mode for forming the total score is weighted average. The weight of each speech recognition device 102 is the reciprocal of the number of speech recognition devices 102; alternatively, the weight of each of the speech recognition devices 102 is determined by a grid search.
And outputting a keyword prediction result signal corresponding to the first input audio data according to the total score, wherein the keyword prediction result is shown as a mark 106.
The present invention has been described in detail with reference to specific embodiments, but these should not be construed as limiting the present invention. Many variations and modifications may be made by those of ordinary skill in the art without departing from the principles of the present invention, and these should also be considered within the scope of the present invention.

Claims (16)

1. A keyword detection apparatus, comprising: a plurality of speech recognition devices;
each speech recognition device is trained using a CTC-based training criterion, and the speech recognition devices are trained in different environments; each speech recognition device comprises a neural network obtained through CTC-based training; the input of each speech recognition device receives audio data;
in the inference stage, the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices run in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
2. The keyword detection apparatus according to claim 1, characterized in that: the type of neural network includes a recurrent neural network.
3. The keyword detection apparatus according to claim 1, characterized in that: the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
4. The keyword detection apparatus according to claim 3, characterized in that: each of the speech recognition devices is trained using a CTC loss function.
5. The keyword detection apparatus according to claim 4, characterized in that: the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
6. The keyword detection apparatus according to claim 1, characterized in that: the input audio data is subjected to feature processing before being input to the speech recognition device.
7. The keyword detection apparatus according to claim 1, characterized in that: the weight combination mode for forming the total score is weighted average.
8. The keyword detection apparatus according to claim 7, characterized in that: the weight of each speech recognition device is the reciprocal of the number of speech recognition devices; alternatively, the weight of each speech recognition device is determined by a grid search.
9. A keyword detection method, characterized in that: a keyword detection apparatus comprising a plurality of speech recognition devices is employed;
the input of each speech recognition device receives audio data, and each speech recognition device comprises a neural network;
the training method of each speech recognition device comprises the following steps:
each speech recognition device is trained with a CTC-based training criterion, wherein the environments in which the speech recognition devices are trained are different, and the training of each speech recognition device comprises training of its neural network according to the CTC-based training criterion;
the following inference method is adopted in the inference stage:
the same first input audio data is fed to the input of each speech recognition device, the speech recognition devices operate in parallel and each outputs a corresponding keyword score, the keyword scores of the speech recognition devices are combined by weights to form a total score of the keyword detection apparatus, and a keyword prediction result signal corresponding to the first input audio data is output according to the total score.
10. The keyword detection method according to claim 9, characterized in that: the type of neural network includes a recurrent neural network.
11. The keyword detection method according to claim 9, characterized in that: the training samples of each speech recognition device use second input audio data that contains the relevant keywords, together with text labels corresponding to the keywords in the second input audio data.
12. The keyword detection method according to claim 11, characterized in that: each of the speech recognition devices is trained using a CTC loss function.
13. The keyword detection method according to claim 12, characterized in that: the CTC loss function and the training samples adopted by each speech recognition device are different, and the CTC loss function and the training samples adopted by each speech recognition device are adapted to the environment for which that device is trained.
14. The keyword detection method according to claim 9, characterized in that: the input audio data is subjected to feature processing before being input to the speech recognition device.
15. The keyword detection method according to claim 9, characterized in that: the weight combination mode for forming the total score is weighted average.
16. The keyword detection method according to claim 9, characterized in that: the weight of each speech recognition device is the reciprocal of the number of speech recognition devices; alternatively, the weight of each speech recognition device is determined by a grid search.
CN201910906551.7A 2019-09-24 2019-09-24 Keyword detection device and method Pending CN110648668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910906551.7A CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910906551.7A CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Publications (1)

Publication Number Publication Date
CN110648668A (en) 2020-01-03

Family

ID=69011139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910906551.7A Pending CN110648668A (en) 2019-09-24 2019-09-24 Keyword detection device and method

Country Status (1)

Country Link
CN (1) CN110648668A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
US20180190295A1 (en) * 2016-12-31 2018-07-05 Lenovo (Beijing) Co., Ltd. Voice recognition
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance

Similar Documents

Publication Publication Date Title
CN110648659B (en) Voice recognition and keyword detection device and method based on multitask model
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Thomas et al. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
CN108847238B (en) Service robot voice recognition method
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN106098059B (en) Customizable voice awakening method and system
CN111627458B (en) Sound source separation method and equipment
CN110136749A (en) The relevant end-to-end speech end-point detecting method of speaker and device
CN109272988A (en) Audio recognition method based on multichannel convolutional neural networks
CN110349597B (en) Voice detection method and device
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN106161209B (en) A kind of method for filtering spam short messages and system based on depth self study
CN110610709A (en) Identity distinguishing method based on voiceprint recognition
CN106898354B (en) Method for estimating number of speakers based on DNN model and support vector machine model
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110930995A (en) Voice recognition model applied to power industry
CN112418175A (en) Rolling bearing fault diagnosis method and system based on domain migration and storage medium
CN113129900A (en) Voiceprint extraction model construction method, voiceprint identification method and related equipment
Adiba et al. Towards immediate backchannel generation using attention-based early prediction model
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Tan et al. Selective mutual learning: an efficient approach for single channel speech separation
CN110648668A (en) Keyword detection device and method
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
Yang et al. Linguistically-Informed Training of Acoustic Word Embeddings for Low-Resource Languages.
Menon et al. ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103