CN111477219A - Keyword distinguishing method and device, electronic equipment and readable storage medium - Google Patents

Keyword distinguishing method and device, electronic equipment and readable storage medium

Info

Publication number
CN111477219A
CN111477219A (application number CN202010383187.3A)
Authority
CN
China
Prior art keywords
keyword
audio
sample
distinguishing
suspected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010383187.3A
Other languages
Chinese (zh)
Inventor
夏静雯
方磊
吴明辉
周振昆
唐磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd filed Critical Hefei Ustc Iflytek Co ltd
Priority to CN202010383187.3A
Publication of CN111477219A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Abstract

The embodiment of the invention provides a keyword distinguishing method, a keyword distinguishing device, electronic equipment and a readable storage medium, wherein the method comprises the following steps: determining the audio characteristics of the keyword suspected fragment of the audio to be detected; inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments. According to the keyword distinguishing method provided by the embodiment of the invention, the audio characteristics of the suspected keyword segment of the audio to be detected are used as the input of the keyword distinguishing model, the suspected keyword segment with low identification accuracy is further distinguished, and the false alarm in the suspected keyword segment is filtered out, so that accurate keyword information is distinguished.

Description

Keyword distinguishing method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the field of voice recognition, in particular to a keyword distinguishing method and device, electronic equipment and a readable storage medium.
Background
Keyword recognition is an important branch in speech recognition, and is widely applied to speech control, speech monitoring, speech input and other speech recognition application scenarios.
However, when keyword recognition is performed on audio from an ultrashort-wave channel, the recognition accuracy drops significantly: the audio signal has a low signal-to-noise ratio, the audio quality fluctuates greatly with the environment, and the audio contains a large number of keywords whose pronunciations are similar and hard to distinguish.
Disclosure of Invention
In view of at least one of the above technical problems in the prior art, embodiments of the present invention provide a keyword distinguishing method, apparatus, electronic device and readable storage medium.
In a first aspect, an embodiment of the present invention provides a keyword distinguishing method, including:
determining the audio characteristics of the keyword suspected fragment of the audio to be detected;
inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
Optionally, the sample keyword suspected segments of the sample audio include positive samples and negative samples; the positive sample and the negative sample are determined by a keyword identification result obtained by identifying the sample audio based on a keyword identification model, the keyword identification result corresponding to the positive sample is consistent with the sample keyword labeled content, and the keyword identification result corresponding to the negative sample is inconsistent with the sample keyword labeled content.
Optionally, the keyword differentiation model is obtained by an iterative training mode based on positive samples and negative samples included in the sample keyword suspected fragment.
Optionally, the iterative training comprises a plurality of rounds of training processes, wherein each round of training process comprises:
according to the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round and the keyword labeling contents corresponding to the positive sample and the negative sample corresponding to the current round, performing parameter adjustment on the keyword distinguishing model, and updating and determining the positive sample and the negative sample corresponding to the next round; the keyword distinguishing result corresponding to the positive sample corresponding to the next round is consistent with the sample keyword labeling content, and the keyword distinguishing result corresponding to the negative sample corresponding to the next round is inconsistent with the sample keyword labeling content;
the positive sample and the negative sample included in the sample keyword suspected fragment are the positive sample and the negative sample corresponding to the initial round in the iterative training.
Optionally, the determining the audio features of the suspected keyword segment of the audio to be detected includes:
determining the audio characteristics of the audio to be detected;
inputting the audio features of the audio to be detected into the keyword recognition model to obtain a keyword recognition result output by the keyword recognition model; the keyword recognition model is obtained by training based on the audio features of the sample audio and sample keyword labeling results corresponding to the sample audio, and the sample keyword labeling results comprise the sample keyword labeling content and the sample keyword labeling positions;
and determining the audio characteristics of the keyword suspected segments of the audio to be detected based on the keyword positions in the keyword identification results and the audio characteristics of the audio to be detected.
Optionally, the inputting the audio features of the suspected keyword segment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model includes:
inputting the audio features of the suspected keyword segments of the audio to be detected into a classification layer of the keyword distinguishing model to obtain a confidence score for each keyword output by the classification layer;
inputting the confidence score of any keyword into a confidence judgment layer of the keyword distinguishing model to obtain a keyword distinguishing result of any keyword output by the confidence judgment layer; the keyword distinguishing result of any keyword is determined based on the confidence score and the confidence threshold of any keyword, and the keywords are in one-to-one correspondence with the confidence threshold.
Optionally, the confidence threshold of any keyword is determined based on an average of confidence scores corresponding to all the positive examples of any keyword.
Optionally, the inputting the audio features of the suspected keyword segment of the audio to be detected into the classification layer of the keyword differentiation model to obtain a confidence score for each keyword output by the classification layer specifically includes:
inputting the audio features of the keyword suspected fragments of the audio to be detected into a hidden layer feature extraction layer of the classification layer to obtain hidden layer features output by the hidden layer feature extraction layer;
inputting the hidden layer features into an activation function layer of the classification layer to obtain a confidence score for each keyword output by the activation function layer; the activation function applied by the activation function layer is obtained by taking the negative logarithm of the calculation result of the softmax function.
Optionally, the sample keyword labeling position is determined by a frame expansion process.
Optionally, the audio features of the sample keyword suspected segments of the sample audio are obtained by performing masking processing on a time domain and/or a frequency domain.
Optionally, the sample audio includes original sample audio and processed original sample audio, and the manner of processing the original sample audio includes at least one of: noise addition, noise reduction and speed change.
In a second aspect, an embodiment of the present invention provides a keyword differentiating apparatus, including:
the characteristic determining module is used for determining the audio characteristics of the keyword suspected fragment of the audio to be detected;
the identification result determining module is used for inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the keyword distinguishing method according to the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the keyword distinguishing method according to the first aspect.
According to the keyword distinguishing method provided by the embodiment of the invention, the audio characteristics of the suspected keyword segment of the audio to be detected are used as the input of the keyword distinguishing model, the suspected keyword segment with low identification accuracy is further distinguished, and the false alarm in the suspected keyword segment is filtered out, so that accurate keyword information is distinguished.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a keyword differentiating method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for training a keyword spotting model according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for determining audio features of a keyword suspected fragment of an audio to be detected according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for determining a keyword differentiation result according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a confidence score calculation method according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating a model training method according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of feature extraction model training according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a keyword differentiating method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a keyword differentiating apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Keyword recognition is an important branch in speech recognition, and is widely applied to speech control, speech monitoring, speech input and other speech recognition application scenarios.
However, when keyword recognition is performed on audio from an ultrashort-wave channel, the recognition accuracy drops significantly: the audio signal has a low signal-to-noise ratio, the audio quality fluctuates greatly with the environment, and the audio contains a large number of keywords whose pronunciations are similar and hard to distinguish.
In view of the above, the embodiment of the present invention provides a keyword distinguishing method. Fig. 1 is a schematic flowchart of a keyword distinguishing method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
s110, determining the audio characteristics of the keyword suspected fragment of the audio to be detected.
Specifically, the audio to be tested is the audio from which the keywords need to be identified and distinguished in the embodiment of the present invention. The audio to be tested may be pre-stored audio, or audio received from the outside, or audio generated by a specific method.
The suspected keyword segment in the audio to be detected may be a keyword segment identified from the audio to be detected according to a specific keyword identification method. The audio features of the suspected keyword segment in the audio to be detected may be feature representations of audio information in the suspected keyword segment in the audio to be detected.
Because the voice data under the ultrashort wave channel in the embodiment of the invention contains a large number of words with similar pronunciations, the keyword suspected fragment identified from the audio to be detected by the specific keyword identification method cannot ensure the identification accuracy, and can be used as the input of the embodiment of the invention for further processing.
S120, inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
Specifically, because the voice data in the ultrashort-wave channel contains a large number of words with similar pronunciations, the recognition result for the suspected keyword segment whose audio features are input to the keyword distinguishing model is not highly accurate. Therefore, the keyword distinguishing model further analyzes the differences between different keywords, thereby enhancing the degree of distinction between keywords and distinguishing the correct keyword information.
The keyword distinguishing result output by the keyword distinguishing model in the embodiment of the invention can be a result of judging whether the suspected keyword segment of the audio to be detected is the keyword. Specifically, the keyword distinguishing model may be configured to distinguish a plurality of preset keywords, and accordingly, the keyword distinguishing result may be a result of whether the suspected keyword segment of the audio to be detected is determined to be the keyword for each preset keyword.
Before the keyword distinguishing model is applied, the keyword distinguishing model can be obtained through pre-training, and the keyword distinguishing model can be obtained through training in the following way: firstly, collecting audio characteristics of sample keyword suspected fragments of a large amount of sample audio and sample keyword labeled content corresponding to the sample keyword suspected fragments. And then, training the keyword distinguishing model by taking the audio characteristics of the sample keyword suspected fragment of the sample audio as a training sample and the sample keyword labeled content corresponding to the sample keyword suspected fragment as a training label, thereby obtaining the trained keyword distinguishing model. The sample keyword annotation content may be artificial annotation information of a specific keyword contained in the sample keyword suspected fragment.
Further, because the audio quality under the ultrashort wave channel is greatly influenced by the channel and the signal is very unstable, the audio to be tested in practical application has distortion phenomenon and even loses signals on partial time domain and frequency domain. In order to enhance the robustness of the keyword distinguishing model and enable the model to have certain prediction capability on a lost audio signal, a mask strategy is introduced in the training process of the keyword distinguishing model in the embodiment of the invention, that is, in the audio features of the sample keyword suspected segments of the sample audio, training data with a certain proportion (for example, 10%) is randomly extracted for mask processing.
For example, the audio features of the sample keyword suspected fragments of the sample audio are expressed as a Frame × Time × Frequency matrix, where Frame is the number of frames of the audio, Time = (t1, t2, ..., tm) is the audio time-domain feature vector, and Frequency = (f1, f2, ..., fn) is the audio frequency-domain feature vector, with m and n being the vector dimensions of the time-domain and frequency-domain feature vectors respectively. When the audio features of the sample keyword suspected fragments of the sample audio are input into the keyword distinguishing model, a random number mr is selected from the range [1, m], the mean value μ1 of the feature time-domain vector is calculated, and all data in [0, mr] are replaced with μ1; then a random number nr is selected from the range [1, n], the mean value μ2 of the feature frequency-domain vector is calculated, and all data in [0, nr] are replaced with μ2. The mean values of the feature time-domain vector and the feature frequency-domain vector are calculated respectively as:

μ1 = (t1 + t2 + ... + tm) / m,  μ2 = (f1 + f2 + ... + fn) / n
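The following Python sketch illustrates one way this mean-masking could be implemented. It operates on the per-segment time-domain and frequency-domain feature vectors exactly as the formulas above describe; how these two vectors are laid out inside the full Frame × Time × Frequency matrix is left open by the text, so the sketch keeps them separate, and all sizes in the usage line are illustrative assumptions.

```python
import numpy as np

def mean_mask(time_vec, freq_vec, rng):
    # time_vec: (m,) time-domain feature vector; freq_vec: (n,) frequency-domain vector.
    t, f = time_vec.copy(), freq_vec.copy()
    m, n = len(t), len(f)
    m_r = rng.integers(1, m + 1)   # random number drawn from [1, m]
    t[:m_r] = t.mean()             # mu1: mean of the time-domain vector
    n_r = rng.integers(1, n + 1)   # random number drawn from [1, n]
    f[:n_r] = f.mean()             # mu2: mean of the frequency-domain vector
    return t, f

# Mask a randomly chosen ~10% of training segments, as described above.
rng = np.random.default_rng(0)
if rng.random() < 0.1:
    t_masked, f_masked = mean_mask(rng.random(80), rng.random(40), rng)
```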
according to the keyword distinguishing method provided by the embodiment of the invention, the audio characteristics of the suspected keyword segment of the audio to be detected are used as the input of the keyword distinguishing model, the suspected keyword segment with low identification accuracy is further distinguished, and the false alarm in the suspected keyword segment is filtered out, so that accurate keyword information is distinguished.
Based on the above embodiment, fig. 2 is a schematic flow chart of a training method of a keyword differentiation model provided in an embodiment of the present invention, and as shown in fig. 2, the training method of the keyword differentiation model specifically includes the following steps:
the sample keyword suspected fragment of the sample audio comprises a positive sample and a negative sample; the positive sample and the negative sample are determined by a keyword identification result obtained by identifying the sample audio based on a keyword identification model, the keyword identification result corresponding to the positive sample is consistent with the sample keyword labeled content, and the keyword identification result corresponding to the negative sample is inconsistent with the sample keyword labeled content.
It is understood that the voice data in the ultrashort-wave channel contains a large number of words with similar pronunciations that are difficult to tell apart, which lowers the accuracy of the keyword recognition result; in particular, non-keywords whose pronunciation is similar to a keyword may be recognized as that keyword. For example, if a keyword is preset in the system and the audio contains a non-keyword whose pronunciation is very close to that keyword, the non-keyword may also be recognized as the keyword, which is a case of inaccurate recognition.
In consideration of the situation, the sample keyword suspected fragment of the sample audio is divided into the positive sample and the negative sample, which is beneficial to further training the distinguishing capability of the keyword distinguishing model, so that the keyword distinguishing model can accurately filter the non-keywords with similar pronunciation.
Correspondingly, a part of the keywords identified in the keyword identification result belong to the identification result with correct identification, and the sample keyword suspected fragment of the corresponding sample audio is a positive sample; the other part of the keyword recognition results belongs to recognition results with wrong recognition, that is, non-keywords with similar pronunciation are mistakenly recognized as keywords, and the sample keyword suspected fragment of the corresponding sample audio is a negative sample.
The keyword distinguishing model in the embodiment of the invention is obtained by an iterative training mode based on positive samples and negative samples included in the sample keyword suspected fragments. The iterative training comprises a training process of multiple rounds, and positive samples and negative samples included in the sample keyword suspected segments are positive samples and negative samples corresponding to an initial round in the iterative training, namely initial input of the whole iterative training process.
Specifically, for the whole iterative training process of the keyword differentiation model, taking the training process of two adjacent rounds as an example, the method specifically includes:
firstly, determining a positive sample and a negative sample corresponding to the current round, and inputting the audio characteristics of the positive sample and the audio characteristics of the negative sample corresponding to the current round into the keyword distinguishing model to obtain the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round.
And according to the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round and the keyword labeling contents corresponding to the positive sample and the negative sample corresponding to the current round, performing parameter adjustment on the keyword distinguishing model, and updating and determining the positive sample and the negative sample corresponding to the next round.
Specifically, the keyword distinguishing result corresponding to the positive sample for the next round is consistent with the sample keyword labeling content, and the keyword distinguishing result corresponding to the negative sample for the next round is inconsistent with the sample keyword labeling content. That is, each round of iterative training updates the sets of positive and negative samples according to the keyword distinguishing results obtained in the current round. It can be understood that, as training proceeds, the number of positive samples grows and the number of negative samples shrinks.
And acquiring the audio features of the positive example samples and the audio features of the negative example samples corresponding to the next round, and inputting the audio features of the positive example samples and the audio features of the negative example samples corresponding to the next round into the keyword distinguishing model for iterative training until the keyword distinguishing model reaches a preset convergence condition. It can be understood that the convergence condition may be that the number of the positive example samples and the negative example samples in the iterative process does not change significantly any more, which represents that the keyword distinguishing model has a better ability to distinguish similar keywords.
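As a rough sketch of this iterative scheme, the loop below re-partitions the positive and negative samples after each round and stops once the counts no longer change. The classifier and its fit/predict interface are hypothetical placeholders, not the patent's actual training code.

```python
def iterative_train(model, segments, labels, init_hits, max_rounds=20):
    # Initial split: positives are segments whose recognition result matched the label.
    positives = [(s, l) for s, l, h in zip(segments, labels, init_hits) if h == l]
    negatives = [(s, l) for s, l, h in zip(segments, labels, init_hits) if h != l]
    prev = (len(positives), len(negatives))
    for _ in range(max_rounds):
        model.fit(positives, negatives)          # parameter adjustment (hypothetical API)
        preds = [model.predict(s) for s in segments]
        positives = [(s, l) for s, l, p in zip(segments, labels, preds) if p == l]
        negatives = [(s, l) for s, l, p in zip(segments, labels, preds) if p != l]
        cur = (len(positives), len(negatives))
        if cur == prev:                          # counts stable: preset convergence condition
            break
        prev = cur
    return model
```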
According to the keyword distinguishing model training method provided by the embodiment of the invention, the updated positive sample and the updated negative sample can be sent into the keyword distinguishing model again for iterative training according to the keyword distinguishing result obtained by each iterative training in the training process and by combining the labeled content of the keywords until the keyword distinguishing model is converged, so that the iterative updating of the keyword distinguishing model can be automatically realized, and manual intervention is not needed in the iterative training process.
Based on any of the above embodiments, fig. 3 is a schematic flowchart of a method for determining an audio feature of a keyword suspected fragment of an audio to be tested according to an embodiment of the present invention, as shown in fig. 3, step S110 specifically includes:
and S111, determining the audio characteristics of the audio to be detected.
Specifically, to determine the audio characteristics of the keyword suspected segment of the audio to be tested, the audio characteristics of the whole audio to be tested need to be determined first. Audio features are typically used to characterize the information contained in audio in the frequency domain and the time domain, and audio has many forms of feature representation. For example, the audio feature of the audio to be tested determined in the embodiment of the present invention may be a posterior BN (BottleNeck) feature of the audio to be tested. Compared with low-level acoustic features, BN features have stronger language-information representation capability, stronger resistance to interference, and better robustness.
Further, the audio features of the audio to be detected are determined, and the corresponding audio feature extraction model can be used to extract the corresponding audio features. For example, if the embodiment of the present invention uses a BN feature as an audio feature, a corresponding DNN (Deep Neural Networks) model may be designed as a BN feature extraction model, where the model has a stronger clustering capability for phonemes with the same pronunciation and a stronger distinguishing capability for different pronunciations. The embodiment of the present invention may also adopt other audio features with better information representation capability and a feature extraction model with stronger pronunciation distinguishing capability, which is not specifically limited herein.
S112, inputting the audio frequency characteristics of the audio frequency to be detected into the keyword recognition model to obtain a keyword recognition result output by the keyword recognition model.
Specifically, the keyword recognition model adopted in the embodiment of the present invention is used for finding the position of the keyword appearing in the audio to be detected for the given keyword to be recognized and the audio feature data of the audio to be detected. After the keyword recognition model is modeled according to the specific keywords, the audio frequency characteristics of the audio frequency to be detected are input, and then the keyword recognition result output by the keyword recognition model can be obtained.
The keyword recognition model adopted in the embodiment of the invention can be a keyword recognition model obtained based on language-independent keyword recognition technology. Language independent keyword recognition refers to a technology for recognizing and determining the content and position of a specific keyword in a given voice only by pronunciation similarity without a keyword recognizer of a specific language. Since this technique is based on pronunciation similarity and does not depend on information of languages, it is called language-independent keyword recognition technique.
The language-independent keyword recognition technology is mainly based on a GMM-HMM/Filler keyword recognition framework: an HMM (Hidden Markov Model) is built from a plurality of pronunciation samples of each keyword to recognize suspected keyword segments in the audio, and a GMM Filler (Gaussian Mixture Model) is built for non-keywords to filter and absorb the non-keyword portions of the audio. Viterbi decoding is performed on the test audio with these models to obtain candidate voice segments for each keyword. The likelihood ratio of each keyword candidate segment under the keyword model and the Filler model is calculated as a confidence score, and a suitable screening threshold is selected to strike a balance between accuracy and recall.
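A hedged sketch of this confidence scoring follows. The per-frame normalization is an assumption; the text above only says the likelihood ratio between the keyword model and the Filler model is used as the score.

```python
def llr_confidence(kw_loglik: float, filler_loglik: float, n_frames: int) -> float:
    # Log-likelihood ratio of a candidate segment, normalized per frame (assumed).
    return (kw_loglik - filler_loglik) / max(n_frames, 1)

def screen(candidates, thresholds):
    # candidates: (keyword, kw_loglik, filler_loglik, n_frames) tuples;
    # keep those that clear the screening threshold for their keyword.
    return [(kw, llr_confidence(a, b, n)) for kw, a, b, n in candidates
            if llr_confidence(a, b, n) > thresholds[kw]]
```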
Before the keyword recognition model is applied, the keyword recognition model can be obtained through pre-training, and the keyword recognition model can be obtained through training in the following way: firstly, collecting audio features of a large number of sample audios and sample keyword labeling results corresponding to the sample audios. And then, training the keyword recognition model by taking the audio features of the sample audio as training samples and taking the sample keyword labeling result corresponding to the sample audio as a training label, thereby obtaining the trained keyword recognition model.
The sample keyword labeling result comprises the sample keyword labeling content and the sample keyword labeling position, namely, the sample audio is labeled manually and has specific keyword information at a specific time or frame number.
Further, in order to prevent the sample keyword labeling result from losing effective information at the head and tail of the audio interception, the embodiment of the invention performs frame expansion processing before and after the sample keyword labeling position. Such loss of useful information is caused by errors in the manual labeling of keyword boundaries, which are marked by ear.
For example, the sample keyword tagging position corresponding to the sample keyword tagging content in the sample keyword tagging result is (start, end), where start represents a start frame of the sample keyword tagging content, and end represents an end frame of the sample keyword tagging content. Now, if the sample keyword labeling positions are respectively expanded by a certain number of frames, for example, 5 frames, the position of the obtained sample keyword labeling position in the sample audio is (start-5, end + 5).
S113, determining the audio frequency characteristics of the keyword suspected fragments of the audio frequency to be detected based on the keyword positions in the keyword identification results and the audio frequency characteristics of the audio frequency to be detected.
Specifically, according to the audio features of the audio to be detected determined in step S111 and the keyword positions in the keyword recognition results determined in step S112, the audio features of the keyword suspected segments of the audio to be detected can be determined by intercepting the audio features of the audio to be detected based on the keyword positions. It is understood that the keyword positions in the keyword recognition result can be used to characterize the positions of the suspected keyword segments in the whole audio to be tested.
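A small sketch of this interception, including the 5-frame expansion from the example above. The boundary clipping is an added safeguard not spelled out in the text.

```python
import numpy as np

def cut_segment(feats: np.ndarray, start: int, end: int, pad: int = 5) -> np.ndarray:
    # feats: (num_frames, feat_dim) features of the whole audio to be detected.
    s = max(0, start - pad)                 # (start-5, end+5) as in the example
    e = min(feats.shape[0], end + pad)
    return feats[s:e]
```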
According to the method for determining the audio characteristics of the keyword suspected fragment of the audio to be detected, provided by the embodiment of the invention, the audio characteristics of the keyword suspected fragment of the audio to be detected are determined through the keyword identification model, so that the primary identification of the keyword is realized; and in the identification process, the frame expansion processing of the sample keyword labeling result is carried out, so that the loss of head and tail information caused by labeling the sample audio is avoided.
Based on any of the above embodiments, fig. 4 is a schematic flowchart of a method for determining a keyword differentiation result according to an embodiment of the present invention, and as shown in fig. 4, step S120 specifically includes:
s121, inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a classification layer of the keyword distinguishing model to obtain a confidence score aiming at each keyword output by the classification layer.
Specifically, the keyword distinguishing model in the embodiment of the present invention may be a bidirectional LSTM (Long Short-Term Memory) classifier composed of a plurality of keyword classifiers; because the classification task is simpler than the recognition task, and more hidden layers make the calculation process more complicated, the model is kept shallow. The classification layer outputs a score vector (Skw1, Skw2, ..., Sfiller), whose dimension is the number of keywords plus one Filler dimension; the Sfiller component of the vector is used for filtering non-keywords.
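A minimal PyTorch sketch of such a classification layer follows; the hidden size, the mean pooling over frames, and the single-LSTM layout are illustrative assumptions beyond what the text specifies.

```python
import torch
import torch.nn as nn

class KeywordClassifier(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_keywords=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, num_keywords + 1)  # + 1 Filler dimension

    def forward(self, x):            # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        pooled = out.mean(dim=1)     # segment-level summary (assumed pooling)
        return self.score(pooled)    # (Skw1, ..., SkwN, Sfiller)
```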
S122, inputting the confidence score of any keyword into a confidence judgment layer of the keyword distinguishing model to obtain a keyword distinguishing result of the keyword output by the confidence judgment layer; the keyword distinguishing result of any keyword is determined based on the confidence score and the confidence threshold of any keyword, and the keywords are in one-to-one correspondence with the confidence threshold.
Specifically, the keyword discrimination model in the embodiment of the present invention further includes a confidence level judgment layer after the classification layer. After the confidence score of the audio to be detected for each keyword is obtained, the confidence score is input into a confidence judgment layer, the confidence score corresponding to each keyword in the score vector is compared with the confidence threshold corresponding to each keyword, and a keyword distinguishing result of any keyword is obtained, namely whether the suspected keyword segment is a positive case or a negative case, wherein the positive case represents that the keyword classification result is accurate, and the negative case represents that the keyword identification result is a false alarm.
It can be understood that, in the embodiment of the present invention, the keywords and the confidence thresholds are in one-to-one correspondence, that is, because training samples are unbalanced, the number of samples is small, and confidence scores between different keywords are not comparable, a reasonable confidence threshold is calculated for different keywords according to actual situations.
Further, the confidence threshold S'kwi of any keyword i may be determined based on the confidence scores Skwi of all n positive example samples Vi of that keyword, by the following calculation formula:

S'kwi = (Skwi(V1) + Skwi(V2) + ... + Skwi(Vn)) / n
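In code, the per-keyword threshold is then simply the mean over that keyword's positive-sample scores. The keyword names and score values below are hypothetical toy data, not from the patent.

```python
import numpy as np

def keyword_threshold(pos_scores) -> float:
    # S'kwi: average confidence score over the n positive samples of keyword i.
    return float(np.mean(pos_scores))

pos_scores_by_kw = {"kw1": [2.1, 1.8, 2.4], "kw2": [3.0, 2.6]}   # hypothetical
thresholds = {kw: keyword_threshold(s) for kw, s in pos_scores_by_kw.items()}
```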
further, in the training process, the confidence scores calculated by all the positive examples corresponding to any keyword are averaged, and a specific range of the average value fluctuating up and down is set as an actual threshold range, so that the keyword discrimination result can be correctly obtained according to the threshold.
According to the method for determining the keyword distinguishing result provided by the embodiment of the invention, the confidence score of each keyword is compared with the specific confidence threshold corresponding to the keyword, so that the accurate keyword distinguishing result is obtained.
Based on any of the above embodiments, fig. 5 is a schematic flow chart of the confidence score calculation method provided by the embodiment of the present invention, and as shown in fig. 5, step S121 specifically includes:
and S1211, inputting the audio features of the keyword suspected fragment of the audio to be detected to the hidden layer feature extraction layer of the classification layer, and obtaining the hidden layer features output by the hidden layer feature extraction layer.
Specifically, the classification layer of the keyword distinguishing model includes a hidden layer feature extraction layer, which is used for calculating the hidden layer features corresponding to the keyword suspected segments of the audio to be detected. Because the classification task contained in the keyword distinguishing model is simpler than the identification task contained in the keyword identification model, and more hidden layers make the calculation process more complicated, the keyword distinguishing model in the embodiment of the invention can be designed to comprise only three hidden layer feature extraction layers.
S1212, inputting the hidden layer feature into an activation function layer of the classification layer, and obtaining a confidence score for each keyword output by the activation function layer; the activation function applied by the activation function layer is obtained by taking the negative logarithm of the calculation result of the softmax function.
Specifically, the classification layer in the embodiment of the present invention includes an activation function layer, and the activation function layer is located between the hidden layer feature extraction layer and the confidence level determination layer, and is configured to calculate a confidence level score for each keyword through an activation function.
The activation function used in a conventional classification model is typically softmax, whose role is to present the result of the classification task in the form of probabilities whose sum is 1. The keyword distinguishing model, in contrast, is mainly used for distinguishing non-keywords whose pronunciation is similar to the keywords, so as to eliminate false alarms. With softmax as the activation function, however, the confidence score of such a non-keyword comes out very close to that of the keyword, and the discrimination is too weak to tell these close pronunciations apart.
In order to increase the degree of distinction between keywords and non-keywords, and to widen the score gap between keywords and non-keywords at the activation function layer of the keyword distinguishing model, the embodiment of the invention may use log_softmax as the activation function in place of the conventional softmax: the score computed by softmax is taken with a negative logarithm as the confidence score. The improved log_softmax calculation formula is:

Si = -log( exp(xi) / Σj exp(xj) )

where xi is the calculation result of the classification task for each keyword i.
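A direct numpy rendering of this scoring follows; the max-subtraction is a standard numerical-stability detail not part of the formula itself.

```python
import numpy as np

def neg_log_softmax(x: np.ndarray) -> np.ndarray:
    # Si = -log(exp(xi) / sum_j exp(xj))
    shifted = x - x.max()
    return -(shifted - np.log(np.exp(shifted).sum()))
```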
According to the confidence score calculation method provided by the embodiment of the invention, the score obtained by softmax calculation is subjected to negative logarithm as the confidence score, and the discrimination between the keywords and the non-keywords is increased, so that the keyword discrimination model can output a more accurate keyword discrimination result.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of a model training method in the keyword distinguishing method provided by the embodiment of the present invention, and as shown in fig. 6, the method specifically includes the following contents.
First, raw sample audio is obtained, which may be unprocessed audio data. In addition to keeping the original sample audio, the original sample audio is processed in at least one of the following ways: noise addition, noise reduction, and speed change. For example, applying noise addition, noise reduction, and speed change to the original sample audio respectively yields four kinds of audio data as sample audio: the original sample audio, the noise-added audio, the noise-reduced audio, and the speed-changed audio.
Keyword recognition and differentiation tasks often suffer from a lack of annotation data. By the aid of the method for amplifying training data, the problem that the capacity of a keyword recognition model or a keyword distinguishing model is insufficient due to lack of training corpora can be effectively solved without additionally collecting a large number of training corpora.
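Two simple stand-ins for the augmentation operations above are sketched below. The patent does not specify how noise addition or speed change is performed, so both methods here are assumptions.

```python
import numpy as np

def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    # Add white noise at a target signal-to-noise ratio.
    sig_pow = np.mean(wave ** 2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_pow), wave.shape)

def change_speed(wave: np.ndarray, rate: float) -> np.ndarray:
    # Naive speed change by linear resampling (also shifts pitch).
    idx = np.arange(0, len(wave) - 1, rate)
    return np.interp(idx, np.arange(len(wave)), wave)
```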
Next, the sample audio is preprocessed, and the processing method may include audio format conversion, noise reduction enhancement, endpoint detection, and the like, so as to meet the actual requirement of the model training scenario on the sample quality, which is not specifically limited in the embodiments of the present invention.
Next, a feature extraction model needs to be trained, and audio features are extracted from the sample audio obtained after preprocessing. For example, the audio feature of the audio to be tested determined in the embodiment of the present invention may be a posterior BN (BottleNeck) feature. Compared with low-level acoustic features, BN features have stronger language-information representation capability, stronger resistance to interference, and better robustness. Accordingly, a DNN (Deep Neural Networks) model may be designed as the feature extraction model for BN features.
The process of training the feature extraction model may be unsupervised training with sample audio as a training sample. The trained feature extraction model can more accurately extract the BN feature of the audio frequency for the audio frequency data under the ultrashort wave channel.
Further, the feature extraction model for extracting BN features may include 3 hidden layers and 1 bottleneck layer. Fig. 7 is a schematic flowchart of the feature extraction model training; as shown in fig. 7, the model first extracts 39-dimensional PLP (Perceptual Linear Predictive) features from the sample audio.
Then, in order to avoid the problem that the keyword feature segment loses effective information from head to tail in the audio capture, frame expansion processing is carried out on the front and the back of the keyword segment feature. According to the labeling result, the position of the original keyword feature segment in the audio is (start, end), the start represents the start frame of the keyword, and the end represents the end frame of the keyword. Now, 5 frames are respectively expanded before and after the keyword feature segment, and the position of the obtained new keyword feature segment in the audio is (start-5, end + 5).
The process of training the feature extraction model can be unsupervised training with the frame-expanded PLP features as training samples, and the output of the bottleneck layer is taken as the final BN feature.
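A minimal PyTorch sketch of such a bottleneck network follows. The layer widths, the ReLU nonlinearity, and the output head are illustrative assumptions; the text above specifies only the 3 hidden layers, the 1 bottleneck layer, and the 39-dimensional PLP input.

```python
import torch
import torch.nn as nn

class BNExtractor(nn.Module):
    def __init__(self, in_dim=39, hidden=512, bn_dim=64, out_dim=128):
        super().__init__()
        self.hidden = nn.Sequential(                 # the 3 hidden layers
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(hidden, bn_dim)  # BN features are read here
        self.head = nn.Linear(bn_dim, out_dim)       # training head (assumed)

    def forward(self, x):
        bn = self.bottleneck(self.hidden(x))
        return self.head(bn), bn                     # outputs for training, BN feature

# After training, only the bottleneck output is kept as the audio feature:
model = BNExtractor()
_, bn_feats = model(torch.randn(200, 39))            # 200 frames of 39-dim PLP
```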
And secondly, after the audio features of the sample audio are obtained according to the trained feature extraction model, training a keyword recognition model by using the audio features of the sample audio and the sample keyword labeling result corresponding to the sample audio. Taking a keyword recognition model based on a language-independent keyword recognition technology as an example, in a model training stage, an HMM model needs to be established for a keyword, and a GMM-Filler model needs to be established for a non-keyword, so that a trained keyword recognition model is obtained.
The trained keyword recognition model is then used to perform keyword recognition on a training set composed of sample audio. Different preliminary screening thresholds are set for different keywords in the keyword recognition model, so that all suspected keyword segments are recalled as far as possible and no keyword segment is missed. The false alarm rate is temporarily not considered at this step; false alarms are handled by the subsequent keyword distinguishing model.
And next, according to the audio features of the sample audio determined in the feature extraction model and the keyword positions in the keyword recognition results corresponding to the sample audio determined in the keyword recognition model, intercepting the audio features through the keyword positions, and determining the audio features of the suspected keyword segments. It is understood that the keyword locations in the keyword recognition results can be used to characterize the locations of the suspected keyword segments in the entire sample audio.
Next, the keyword differentiation model needs to be trained in an iterative training manner based on positive samples and negative samples included in the sample keyword suspected fragments. The positive sample and the negative sample are determined based on a keyword recognition result obtained by recognizing the sample audio by the keyword recognition model.
The iterative training comprises a training process of multiple rounds, and positive samples and negative samples included in the sample keyword suspected segments are positive samples and negative samples corresponding to an initial round in the iterative training, namely initial input of the whole iterative training process.
For the whole iterative training process of the keyword differentiation model, taking the training process of two adjacent rounds as an example, the method specifically includes:
firstly, determining a positive sample and a negative sample corresponding to the current round, and inputting the audio characteristics of the positive sample and the audio characteristics of the negative sample corresponding to the current round into the keyword distinguishing model to obtain the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round.
And according to the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round and the keyword labeling contents corresponding to the positive sample and the negative sample corresponding to the current round, performing parameter adjustment on the keyword distinguishing model, and updating and determining the positive sample and the negative sample corresponding to the next round.
The model training method in the keyword distinguishing method provided by the embodiment of the invention obtains the feature extraction model, the keyword recognition model and the keyword distinguishing model used in the embodiment of the invention through training in sequence, and prepares for the model application stage of the keyword distinguishing method.
Based on any of the above embodiments, fig. 8 is a schematic flow chart of the keyword distinguishing method provided by the embodiment of the present invention, and as shown in fig. 8, the method specifically includes the following contents.
Firstly, the audio to be detected needs to be preprocessed, and the processing method may include audio format conversion, noise reduction enhancement, endpoint detection, and the like, so as to meet the requirement of the actual application scene on the audio quality, and the embodiment of the present invention is not particularly limited.
Secondly, inputting the preprocessed audio to be tested into the feature extraction model to obtain the audio features of the audio to be tested.
And then, inputting the audio features of the audio to be detected into the keyword recognition model to obtain a keyword recognition result output by the keyword recognition model, and determining the audio features of the keyword suspected fragments of the audio to be detected based on the positions of the keywords in the keyword recognition result and the audio features of the audio to be detected.
And finally, inputting the audio characteristics of the suspected keyword segment of the audio to be detected into the keyword distinguishing model to obtain the confidence score of the suspected keyword segment on each keyword classifier. If the confidence score is larger than the corresponding confidence threshold value on a certain keyword classifier, determining the keyword distinguishing result of the suspected keyword segment as the keyword; otherwise, determining that the keyword distinguishing result of the suspected keyword segment is a non-keyword.
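The final per-keyword decision reduces to a comparison against each classifier's own threshold, for example (reusing the hypothetical thresholds dictionary from the earlier sketch):

```python
def decide(scores: dict, thresholds: dict) -> dict:
    # True: segment accepted as that keyword; False: non-keyword (false alarm).
    return {kw: s > thresholds[kw] for kw, s in scores.items()}
```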
According to the keyword distinguishing method provided by the embodiment of the invention, the audio characteristics of the suspected keyword segment of the audio to be detected are used as the input of the keyword distinguishing model, the suspected keyword segment with low identification accuracy is further distinguished, and the false alarm in the suspected keyword segment is filtered out, so that accurate keyword information is distinguished.
Based on any of the above embodiments, fig. 9 is a keyword distinguishing apparatus provided in an embodiment of the present invention, as shown in fig. 9, the apparatus specifically includes:
the characteristic determining module 910 is configured to determine an audio characteristic of the keyword suspected fragment of the audio to be detected.
Specifically, the audio to be tested is the audio from which the keywords need to be identified and distinguished in the embodiment of the present invention. The audio to be tested may be pre-stored audio, or audio received from the outside, or audio generated by a specific method.
The suspected keyword segment in the audio to be detected may be a keyword segment identified from the audio to be detected according to a specific keyword identification method. The audio features of the suspected keyword segment in the audio to be detected may be feature representations of audio information in the suspected keyword segment in the audio to be detected.
Because the voice data under the ultrashort wave channel in the embodiment of the invention contains a large number of words with similar pronunciations, the keyword suspected fragment identified from the audio to be detected by the specific keyword identification method cannot ensure the identification accuracy, and can be used as the input of the embodiment of the invention for further processing.
An identification result determining module 920, configured to input the audio features of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model, so as to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
Specifically, because the voice data in the ultrashort-wave channel contains a large number of words with similar pronunciations, the recognition result for the suspected keyword segment whose audio features are input to the keyword distinguishing model is not highly accurate. Therefore, the keyword distinguishing model further analyzes the differences between different keywords, thereby enhancing the degree of distinction between keywords and distinguishing the correct keyword information.
Before the keyword distinguishing model is applied, the keyword distinguishing model can be obtained through pre-training, and the keyword distinguishing model can be obtained through training in the following way: firstly, collecting audio characteristics of sample keyword suspected fragments of a large amount of sample audio and sample keyword labeled content corresponding to the sample keyword suspected fragments. And then, training the keyword distinguishing model by taking the audio characteristics of the sample keyword suspected fragment of the sample audio as a training sample and the sample keyword labeled content corresponding to the sample keyword suspected fragment as a training label, thereby obtaining the trained keyword distinguishing model. The sample keyword annotation content can be artificial annotation information of whether the audio contains a specific keyword.
The keyword distinguishing device provided by the embodiment of the invention takes the audio characteristics of the suspected keyword segment of the audio to be detected as the input of the keyword distinguishing model, further distinguishes the suspected keyword segment with low identification accuracy, and filters the false alarm in the suspected keyword segment, so as to distinguish accurate keyword information.
Based on any one of the above embodiments, in the apparatus, the sample keyword suspected fragment of the sample audio includes a positive sample and a negative sample; the positive sample and the negative sample are determined by a keyword identification result obtained by identifying the sample audio based on a keyword identification model, the keyword identification result corresponding to the positive sample is consistent with the sample keyword labeled content, and the keyword identification result corresponding to the negative sample is inconsistent with the sample keyword labeled content.
Based on any one of the embodiments, in the device, the keyword discrimination model is obtained by an iterative training mode based on positive samples and negative samples included in the sample keyword suspected segments.
The iterative training includes a plurality of rounds of training processes, wherein each round of training process includes:
according to the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round and the keyword labeling contents corresponding to the positive sample and the negative sample corresponding to the current round, performing parameter adjustment on the keyword distinguishing model, and updating and determining the positive sample and the negative sample corresponding to the next round; the keyword distinguishing result corresponding to the positive sample corresponding to the next round is consistent with the sample keyword labeling content, and the keyword distinguishing result corresponding to the negative sample corresponding to the next round is inconsistent with the sample keyword labeling content;
the positive sample and the negative sample included in the sample keyword suspected fragment are the positive sample and the negative sample corresponding to the initial round in the iterative training.
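The round structure might look like the following sketch, reusing `split_pos_neg` from above; `model.fit` and `model.predict` stand in for whatever parameter-adjustment and inference routines the real model exposes, and are placeholder APIs.

```python
def iterative_train(model, segments, rounds=5):
    """Each round adjusts parameters on the current positives/negatives,
    then re-partitions every segment by comparing the updated model's own
    distinguishing result against the labeled content."""
    pos, neg = split_pos_neg(segments)   # initial round: split by recognition result
    for _ in range(rounds):
        model.fit(pos, neg)              # parameter adjustment (placeholder API)
        for s in segments:
            s['distinguished_kw'] = model.predict(s['features'])
        # positives/negatives for the next round come from the model itself
        pos = [s for s in segments if s['distinguished_kw'] == s['labeled_kw']]
        neg = [s for s in segments if s['distinguished_kw'] != s['labeled_kw']]
    return model
```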
Based on any one of the above embodiments, in the apparatus, the determining of the audio characteristics of the suspected keyword segment of the audio to be detected includes the following steps, sketched in code after the list:
determining the audio characteristics of the audio to be detected;
inputting the audio features of the audio to be detected into the keyword recognition model to obtain a keyword recognition result output by the keyword recognition model; the keyword recognition model is obtained by training based on the audio features of the sample audio and sample keyword labeling results corresponding to the sample audio, and the sample keyword labeling results comprise the sample keyword labeling content and the sample keyword labeling positions;
and determining the audio characteristics of the keyword suspected segments of the audio to be detected based on the keyword positions in the keyword identification results and the audio characteristics of the audio to be detected.
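A minimal sketch of the last step, assuming the recognition model reports each hit as a (keyword_id, start_frame, end_frame) triple over the same frame axis as the feature matrix; that triple format is an assumption for illustration.

```python
def suspected_segment_features(audio_features, keyword_hits):
    """Slice the full-audio feature matrix at each recognized keyword position.
    audio_features: (frames, dims) array for the whole audio under test.
    keyword_hits: list of (keyword_id, start_frame, end_frame) triples."""
    return [(kw, audio_features[start:end])
            for kw, start, end in keyword_hits]
```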
Based on any one of the above embodiments, in the apparatus, the inputting of the audio features of the suspected keyword segment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model includes the following two steps, with a combined sketch after them:
inputting the audio features of the suspected keyword segments of the audio to be detected into a classification layer of the keyword distinguishing model to obtain a confidence score for each keyword output by the classification layer;
inputting the confidence score of any keyword into a confidence judgment layer of the keyword distinguishing model to obtain the keyword distinguishing result of that keyword output by the confidence judgment layer; the keyword distinguishing result of any keyword is determined based on that keyword's confidence score and confidence threshold, the keywords being in one-to-one correspondence with the confidence thresholds.
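Put together, the two layers might behave like the sketch below, reusing `pool` and `softmax` from the training sketch. Plain softmax scores with a "higher is more confident" convention are an assumption here; the per-keyword thresholds are looked up one-to-one.

```python
def distinguish(segment_feats, W, thresholds):
    """Classification layer: one confidence score per keyword.
    Confidence judgment layer: compare each keyword's score against
    that keyword's own threshold."""
    scores = softmax(pool(segment_feats) @ W)     # (n_keywords,) confidence scores
    return {kw: bool(score >= thresholds[kw])     # True = keyword confirmed
            for kw, score in enumerate(scores)}
```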
Based on any one of the above embodiments, in the apparatus, the confidence threshold of any keyword is determined based on the average of the confidence scores corresponding to all positive samples of that keyword.
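That rule is simple to state in code; `pos_scores_by_kw` below is an assumed structure mapping each keyword to the confidence scores its positive samples received.

```python
import numpy as np

def per_keyword_thresholds(pos_scores_by_kw):
    """One threshold per keyword: the mean confidence score over all of
    that keyword's positive samples."""
    return {kw: float(np.mean(scores))
            for kw, scores in pos_scores_by_kw.items()}
```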
Based on any one of the above embodiments, in the apparatus, the inputting of the audio features of the suspected keyword segment of the audio to be detected into the classification layer of the keyword distinguishing model to obtain the confidence score for each keyword output by the classification layer specifically includes the following, with a sketch of the activation after these steps:
inputting the audio features of the keyword suspected fragments of the audio to be detected into a hidden layer feature extraction layer of the classification layer to obtain hidden layer features output by the hidden layer feature extraction layer;
inputting the hidden layer features into an activation function layer of the classification layer to obtain a confidence score for each keyword output by the activation function layer; the activation function applied by the activation function layer is obtained by taking the negative logarithm of the calculation result of the softmax function.
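A numerically stable version of that activation is sketched below. Note that under -log(softmax) a confident class scores near 0, so under this convention smaller means more confident and any threshold comparison has to flip accordingly.

```python
import numpy as np

def neg_log_softmax(logits):
    """-log(softmax(logits)), computed via the log-sum-exp trick so the
    exponentials cannot overflow."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs
```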
According to any one of the above embodiments, in the apparatus, the sample keyword labeling position is determined by frame expansion processing.
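Frame expansion in this sense could be as simple as the sketch below; the margin size is an assumed parameter, not a value given by the patent.

```python
def expand_frames(start, end, n_frames, margin=5):
    """Widen a labeled keyword span by `margin` frames on each side,
    clipped to the audio's frame range."""
    return max(0, start - margin), min(n_frames, end + margin)
```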
Based on any of the above embodiments, in the apparatus, the audio features of the sample keyword suspected segments of the sample audio are obtained by performing masking processing on a time domain and/or a frequency domain.
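A SpecAugment-style sketch of such masking follows; the mask widths are illustrative defaults, and zeroing (rather than mean-filling) the masked region is an assumption.

```python
import numpy as np

def mask_features(feats, rng, max_t=10, max_f=8):
    """Zero out one random time span and one random frequency band of a
    (frames, dims) feature matrix."""
    feats = feats.copy()
    t0 = int(rng.integers(0, max(1, feats.shape[0] - max_t)))
    f0 = int(rng.integers(0, max(1, feats.shape[1] - max_f)))
    feats[t0:t0 + int(rng.integers(1, max_t + 1)), :] = 0.0   # time-domain mask
    feats[:, f0:f0 + int(rng.integers(1, max_f + 1))] = 0.0   # frequency-domain mask
    return feats

# Usage: rng = np.random.default_rng(0); augmented = mask_features(feats, rng)
```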
Based on any of the above embodiments, fig. 10 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 10, the electronic device may include: a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communications interface 1020, and the memory 1030 communicate with one another via the communication bus 1040. The processor 1010 may call logic instructions in the memory 1030 to perform the following method: determining the audio features of the suspected keyword segment of the audio to be detected; inputting the audio features of the suspected keyword segment into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected segments of sample audio and the sample keyword labeled content corresponding to those segments.
Furthermore, the logic instructions in the memory 1030 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the keyword distinguishing method provided in the foregoing embodiments, the method including, for example: determining the audio features of the suspected keyword segment of the audio to be detected; inputting the audio features of the suspected keyword segment into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected segments of sample audio and the sample keyword labeled content corresponding to those segments.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A keyword distinguishing method is characterized by comprising the following steps:
determining the audio characteristics of the keyword suspected fragment of the audio to be detected;
inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
2. The keyword distinguishing method according to claim 1, wherein the sample keyword suspected fragments of the sample audio include positive samples and negative samples; the positive and negative samples are determined from a keyword identification result obtained by identifying the sample audio based on a keyword identification model, the keyword identification result corresponding to a positive sample being consistent with the sample keyword labeled content and the keyword identification result corresponding to a negative sample being inconsistent with the sample keyword labeled content.
3. The keyword distinguishing method according to claim 2, wherein the keyword distinguishing model is obtained by iterative training based on the positive samples and negative samples included in the sample keyword suspected fragments.
4. The keyword differentiating method according to claim 3, wherein the iterative training comprises a plurality of rounds of training process, wherein each round of training process comprises:
according to the keyword distinguishing results corresponding to the positive sample and the negative sample corresponding to the current round and the keyword labeling contents corresponding to the positive sample and the negative sample corresponding to the current round, performing parameter adjustment on the keyword distinguishing model, and updating and determining the positive sample and the negative sample corresponding to the next round; the keyword distinguishing result corresponding to the positive sample corresponding to the next round is consistent with the sample keyword labeling content, and the keyword distinguishing result corresponding to the negative sample corresponding to the next round is inconsistent with the sample keyword labeling content;
the positive sample and the negative sample included in the sample keyword suspected fragment are the positive sample and the negative sample corresponding to the initial round in the iterative training.
5. The method according to claim 2, wherein the determining the audio characteristics of the suspected keyword segment of the audio to be tested comprises:
determining the audio characteristics of the audio to be detected;
inputting the audio features of the audio to be detected into the keyword recognition model to obtain a keyword recognition result output by the keyword recognition model; the keyword recognition model is obtained by training based on the audio features of the sample audio and sample keyword labeling results corresponding to the sample audio, and the sample keyword labeling results comprise the sample keyword labeling content and the sample keyword labeling positions;
and determining the audio characteristics of the keyword suspected segments of the audio to be detected based on the keyword positions in the keyword identification results and the audio characteristics of the audio to be detected.
6. The method for distinguishing keywords according to claim 2, wherein the step of inputting the audio features of the suspected keyword segment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model comprises:
inputting the audio features of the suspected keyword segments of the audio to be detected into a classification layer of the keyword distinguishing model to obtain a confidence score for each keyword output by the classification layer;
inputting the confidence score of any keyword into a confidence judgment layer of the keyword distinguishing model to obtain a keyword distinguishing result of any keyword output by the confidence judgment layer; the keyword distinguishing result of any keyword is determined based on the confidence score and the confidence threshold of any keyword, and the keywords are in one-to-one correspondence with the confidence threshold.
7. The keyword distinguishing method according to claim 6, wherein the confidence threshold of any keyword is determined based on the average of the confidence scores corresponding to all positive samples of that keyword.
8. The keyword differentiation method according to claim 6, wherein the step of inputting the audio features of the suspected keyword segment of the audio to be detected into the classification layer of the keyword differentiation model to obtain the confidence score for each keyword output by the classification layer specifically comprises:
inputting the audio features of the keyword suspected fragments of the audio to be detected into a hidden layer feature extraction layer of the classification layer to obtain hidden layer features output by the hidden layer feature extraction layer;
inputting the hidden layer features into an activation function layer of the classification layer to obtain a confidence score for each keyword output by the activation function layer; the activation function applied by the activation function layer is obtained by taking the negative logarithm of the calculation result of the softmax function.
9. The keyword distinguishing method according to claim 5, wherein the sample keyword labeling position is determined by frame expansion processing.
10. The keyword differentiating method according to any one of claims 1 to 9, wherein the audio features of the sample keyword suspected segments of the sample audio are obtained by performing a masking process on a time domain and/or a frequency domain.
11. The keyword differentiating method according to any one of claims 1 to 9, wherein the sample audio comprises original sample audio and processed original sample audio, and the processing manner of the original sample audio comprises at least one of the following: noise addition, noise reduction and speed change.
12. A keyword differentiating apparatus, comprising:
the characteristic determining module is used for determining the audio characteristics of the keyword suspected fragment of the audio to be detected;
the identification result determining module is used for inputting the audio characteristics of the keyword suspected fragment of the audio to be detected into a keyword distinguishing model to obtain a keyword distinguishing result output by the keyword distinguishing model; the keyword distinguishing model is obtained by training based on audio features of sample keyword suspected fragments of sample audio and sample keyword labeled contents corresponding to the sample keyword suspected fragments.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the keyword differentiating method according to any one of claims 1 to 11 when executing the program.
14. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the keyword distinguishing method according to any one of claims 1 to 11.
CN202010383187.3A 2020-05-08 2020-05-08 Keyword distinguishing method and device, electronic equipment and readable storage medium Pending CN111477219A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010383187.3A CN111477219A (en) 2020-05-08 2020-05-08 Keyword distinguishing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111477219A true CN111477219A (en) 2020-07-31

Family

ID=71762166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010383187.3A Pending CN111477219A (en) 2020-05-08 2020-05-08 Keyword distinguishing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111477219A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101447185A (en) * 2008-12-08 2009-06-03 深圳市北科瑞声科技有限公司 Audio frequency rapid classification method based on content
CN102411563A (en) * 2010-09-26 2012-04-11 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
CN103793444A (en) * 2012-11-05 2014-05-14 江苏苏大大数据科技有限公司 Method for acquiring user requirements
CN103500579A (en) * 2013-10-10 2014-01-08 中国联合网络通信集团有限公司 Voice recognition method, device and system
CN105069470A (en) * 2015-07-29 2015-11-18 腾讯科技(深圳)有限公司 Classification model training method and device
CN107452401A (en) * 2017-05-27 2017-12-08 北京字节跳动网络技术有限公司 A kind of advertising pronunciation recognition methods and device
CN107452371A (en) * 2017-05-27 2017-12-08 北京字节跳动网络技术有限公司 A kind of construction method and device of Classification of Speech model
CN110427542A (en) * 2018-04-26 2019-11-08 北京市商汤科技开发有限公司 Sorter network training and data mask method and device, equipment, medium
CN108922521A (en) * 2018-08-15 2018-11-30 合肥讯飞数码科技有限公司 A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN109242165A (en) * 2018-08-24 2019-01-18 蜜小蜂智慧(北京)科技有限公司 A kind of model training and prediction technique and device based on model training
CN109669964A (en) * 2018-11-08 2019-04-23 建湖云飞数据科技有限公司 Model repetitive exercise method and device
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765335A (en) * 2021-01-27 2021-05-07 上海三菱电梯有限公司 Voice calling landing system
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system
CN113345422A (en) * 2021-04-23 2021-09-03 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN113345422B (en) * 2021-04-23 2024-02-20 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN113129900A (en) * 2021-04-29 2021-07-16 科大讯飞股份有限公司 Voiceprint extraction model construction method, voiceprint identification method and related equipment

Similar Documents

Publication Publication Date Title
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
US6535851B1 (en) Segmentation approach for speech recognition systems
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN101118745A (en) Confidence degree quick acquiring method in speech identification system
CN110189746B (en) Voice recognition method applied to ground-air communication
CN106847259B (en) Method for screening and optimizing audio keyword template
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
Ghahabi et al. A robust voice activity detection for real-time automatic speech recognition
CN111128128A (en) Voice keyword detection method based on complementary model scoring fusion
CN111081223A (en) Voice recognition method, device, equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
KR102406512B1 (en) Method and apparatus for voice recognition
CN110858477A (en) Language identification and classification method and device based on noise reduction automatic encoder
JP6148150B2 (en) Acoustic analysis frame reliability calculation device, acoustic model adaptation device, speech recognition device, their program, and acoustic analysis frame reliability calculation method
CN110827809B (en) Language identification and classification method based on condition generation type confrontation network
CN112825250A (en) Voice wake-up method, apparatus, storage medium and program product
KR102429656B1 (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
CN110299133B (en) Method for judging illegal broadcast based on keyword
CN110910902B (en) Mixed model speech emotion recognition method and system based on ensemble learning
CN111914803A (en) Lip language keyword detection method, device, equipment and storage medium
CN113077784B (en) Intelligent voice equipment for role recognition
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
CN113539266A (en) Command word recognition method and device, electronic equipment and storage medium
Fujita et al. Robust DNN-Based VAD Augmented with Phone Entropy Based Rejection of Background Speech.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20200731)