CN110265003B - Method for recognizing voice keywords in broadcast signal - Google Patents


Info

Publication number
CN110265003B
CN110265003B (application CN201910596186.4A)
Authority
CN
China
Prior art keywords
broadcast
keyword
model
keywords
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910596186.4A
Other languages
Chinese (zh)
Other versions
CN110265003A (en)
Inventor
雒瑞森
孙超
武瑞娟
杜淼
余艳梅
龚晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dagong Bochuang Information Technology Co ltd
Sichuan University
Original Assignee
Chengdu Dagong Bochuang Information Technology Co ltd
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Dagong Bochuang Information Technology Co ltd, Sichuan University filed Critical Chengdu Dagong Bochuang Information Technology Co ltd
Priority to CN201910596186.4A priority Critical patent/CN110265003B/en
Publication of CN110265003A publication Critical patent/CN110265003A/en
Application granted granted Critical
Publication of CN110265003B publication Critical patent/CN110265003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/0631: Creating reference templates; Clustering

Abstract

The invention discloses a method for recognizing voice keywords in a broadcast signal, comprising the following steps: first, establish an acoustic model; second, establish a word-phoneme mapping table for the keywords; third, establish a keyword sequence dictionary; fourth, train the acoustic model with the keywords in the dictionary to obtain the mapping between the speech features of the keywords and their phonemes, and load the mapping into the acoustic model; fifth, define the mapping between the phonemes and the specified keywords and store it in the word-phoneme mapping table; sixth, load the trained acoustic model, the word-phoneme mapping table and the keyword sequence dictionary into a decoder; seventh, denoise the broadcast signal to be recognized, extract its features, and load it into the decoder to obtain the keyword recognition result. Because the acoustic model and the word-phoneme mapping table are trained on samples recorded from the keywords, and no complete, meaningful sentences are required, the fault tolerance of the word-phoneme mapping table is improved.

Description

Method for recognizing voice keywords in broadcast signal
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method for recognizing voice keywords in a broadcast signal.
Background
With the continuous development of the economy and of radio communication, the importance of radio spectrum resources has become increasingly prominent. Driven by economic interests, many illegal broadcasters privately buy and erect broadcast base stations, broadcast false advertisements or conduct economic fraud, disturbing the economic order, interfering with normal communication signals and even causing safety accidents. Designing an effective method for efficient, automatic discrimination and control of illegal broadcast signals therefore has very important social and economic value. Traditional discrimination of illegal broadcasts mostly relies on manual listening, which is labor-intensive and error-prone. Although automatic speech recognition is already used in daily life, automatic discrimination of illegal broadcasts in spectrum management remains a difficult problem, because illegal broadcast signals differ significantly from everyday speech. There are two main reasons: broadcast signals are very noisy and are not pure Mandarin speech; and, for confidentiality, systems in this application field cannot be connected to the external Internet, so most existing automatic speech recognition models cannot be used, and the insufficient number of training samples further limits the use of existing models.
Although Chinese speech recognition has achieved certain results and applications, some characteristics of broadcast signals still make it difficult to apply automatic speech recognition directly to broadcast-signal legitimacy discrimination. In particular, for security reasons legitimacy screening must be performed in an offline environment, so many existing commercial online models cannot be used directly.
Disclosure of Invention
The invention aims to solve the problems in the prior art that, when illegal broadcasts are detected with automatic speech recognition, keywords cannot be identified and many irrelevant words are misrecognized. In the proposed method for recognizing speech keywords in broadcast signals, the acoustic model and the word-phoneme mapping table are trained with samples recorded from the keywords; and because no complete, meaningful sentence is needed (only specific keywords in the speech must be recognized to judge whether it is a normal or an illegal broadcast), efficiency is improved and the recognition process is simplified.
The invention is realized by the following technical scheme:
a method of recognizing a speech keyword in a broadcast signal, comprising the steps of:
step one, establishing a plurality of acoustic models;
step two, establishing a word-phoneme mapping table of the keywords;
step three, establishing a keyword sequence dictionary and recording voice fragments containing the keywords;
step four, training the plurality of acoustic models with the keywords in the keyword sequence dictionary to obtain the mapping between the speech features of the keywords and their phonemes, and loading the mapping into the acoustic models;
step five, defining the mapping between the phonemes and the specified keywords, and storing it in the word-phoneme mapping table;
step six, loading the trained acoustic models, the word-phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising the broadcast signal to be recognized and extracting its features, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and making the keyword recognition result into a feature vector;
step eight, taking a plurality of normal-broadcast and illegal-broadcast voice segments, labeling them with electronic tags of 'normal broadcast' and 'illegal broadcast' respectively to form a training set, and training an SVM classifier with 'normal broadcast/illegal broadcast' as the binary classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal or an illegal broadcast.
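Steps seven and eight say the keyword recognition result is "made into a feature vector" but do not specify the encoding. The sketch below assumes one plausible choice, a bag-of-keywords count vector over the keyword sequence dictionary; the dictionary entries and function name are illustrative, not from the patent.

```python
# Hypothetical sketch: encode recognized keywords as a count vector over the
# keyword sequence dictionary. The patent does not fix the encoding; a
# bag-of-keywords count is one plausible reading of "feature vector".
KEYWORD_DICT = ["keyword1", "keyword2", "keyword3"]  # placeholder entries

def to_feature_vector(recognized, dictionary=KEYWORD_DICT):
    counts = {k: 0 for k in dictionary}
    for k in recognized:
        if k in counts:  # ignore anything outside the dictionary
            counts[k] += 1
    return [counts[k] for k in dictionary]

vec = to_feature_vector(["keyword1", "keyword3", "keyword1"])
```

The resulting fixed-length vector is what the SVM classifier of steps eight and nine would consume.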
Broadcast signals are typically continuously noisy and have a distinctive public-address timbre, sometimes mixed with non-speech signals. A general speech recognition model trained on quiet-background, standard conversational speech is therefore difficult to use directly on this problem; a special speech recognition model adapted to broadcast speech signals must be constructed. Specifically, for the automatic speech recognition model the inventors set up an acoustic model, a word-phoneme mapping table and a dictionary; the design centers on how to train or adjust these three components so that the speech recognition system performs well on broadcast signals. Since only a limited number of broadcast signal samples can be obtained, the whole system cannot be trained from scratch; the existing automatic speech recognition system PocketSphinx is therefore taken as the base model, and its parameters are adjusted for broadcast signals so that recorded broadcast signals can be recognized accurately. After the acoustic model, the word-phoneme mapping table and the keyword sequence dictionary are established, the training data should share as many characteristics as possible with actual broadcast speech in order to improve recognition efficiency; hence, in step four, the keywords in the keyword sequence dictionary are recorded as training samples and the acoustic models are trained on those samples. Compared with traditional methods, the difference is that by defining keywords only the keywords are recognized, not the meanings of whole sentences, so recognition is much faster, which improves the applicability of the method in practice.
The mapping between the phonemes and the defined keywords is obtained by adapting the word-phoneme mapping table, and finally the trained acoustic model, the adapted word-phoneme mapping table and the keyword sequence dictionary are loaded into the decoders. To improve the accuracy of the subsequent SVM classification, a plurality of acoustic models are set up and trained simultaneously on different sample sets; the trained acoustic models are loaded into a plurality of decoders, each of which is also loaded with the word-phoneme mapping table and the keyword sequence dictionary. The same denoised broadcast signal to be recognized is loaded into each decoder, and each decoder produces a recognition result. For example, with ten decoders in total, if eight decoders recognize the same keyword in the fourth sentence of the signal and two do not, the keyword is retained as one of the keyword recognition results of this step; this is done for every sentence of the broadcast signal to be recognized. Dispersing the training task over several models allows them to be trained simultaneously, which saves time; most importantly, it limits the influence of missed and false recognitions and improves recognition accuracy, while combining multiple classifiers enlarges the hypothesis space so that a better, approximately optimal solution can be learned. This helps improve the accuracy of the word-phoneme mapping table, and on the recognition side accurate speech-keyword recognition is achieved by adding a small number of keywords to the dictionary. Because the acoustic model and the word-phoneme mapping table are trained on samples recorded from the keywords, and no complete, meaningful sentences are required, the fault tolerance of the designed word-phoneme mapping table is much higher.
Further, the acoustic model in step one is a CMU Sphinx model, and the specific operations for training the acoustic models in step four are as follows:
step 4.1, recording a plurality of sample broadcast audio files containing the keywords as training samples, and randomly forming a plurality of training sets from the training samples;
step 4.2, extracting the audio signal features of the sample broadcast audio files in each training set, and training the CMU Sphinx models with the extracted audio signal features.
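Step 4.1 randomly forms several training sets from the recorded samples; one way to read this is bootstrap sampling (bagging), which also matches the later use of several independently trained models and a vote. The function below is an assumed sketch of that reading, not the patent's exact procedure.

```python
import random

def make_training_sets(samples, n_sets, set_size, seed=0):
    # Sample with replacement, bagging-style. The "with replacement" choice
    # is an assumption; the patent only says "randomly forming".
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in range(set_size)]
            for _ in range(n_sets)]

# Placeholder sample names; each inner list would train one acoustic model.
sets = make_training_sets(["s1", "s2", "s3", "s4"], n_sets=3, set_size=4)
```

Training each acoustic model on a different bootstrap set is what makes the later majority vote meaningful, since the models then make partially independent errors.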
To improve the accuracy of the acoustic model during recognition, the training samples in this method are made by recording broadcast audio files containing the keyword samples and using them as sample files. Generating the parameter-adapted acoustic model in this way preserves its keyword-recognition ability while strengthening its ability to handle the specific radio-broadcast environment; the audio feature signals of the sample files are then extracted and stored in the CMU Sphinx model.
Further, the CMU sphinx model is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
p(x) = Σ_{m=1}^{M} p(m) · p(x | m),  where  p(x | m) = N(x; μ_m, σ_m² · I)
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e. the m-th sub-model; M is the total number of sub-models; N(·) denotes a multivariate Gaussian distribution; I is the identity matrix of the data dimension; and p(x | m) is the probability that the m-th sub-model outputs the syllable.
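The joint expression above can be evaluated directly. The scalar sketch below computes p(x) = Σ_m p(m)·N(x; μ_m, σ_m²) for a one-dimensional x (with covariance σ_m²·I the multivariate density factors into such terms per dimension); it is an illustration with made-up parameters, not CMU Sphinx code.

```python
import math

def gmm_pdf(x, weights, means, variances):
    # p(x) = sum_m p(m) * N(x; mu_m, sigma_m^2), the scalar version of the
    # joint expression of the Gaussian mixture model above
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        total += w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

p = gmm_pdf(0.0, weights=[0.5, 0.5], means=[0.0, 2.0], variances=[1.0, 1.0])
```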
Because a Gaussian mixture model is a joint expression of several Gaussian distributions and has several distribution centers, it is well suited to acoustic modeling; from the formula, its probability density function can be seen as a combination of several Gaussians. Since sound signals always follow a multi-center, variance-attenuated distribution, the Gaussian mixture model is very suitable as the basis of an acoustic model. The Gaussian mixture model has strong expressive power, but training it is not simple: for a probability distribution function, the prior art usually trains by maximizing a log-likelihood function. However, the log-likelihood function of a Gaussian mixture model is not continuously differentiable, so after investigation the inventors use a heuristic algorithm for training. The most common such heuristic is the E-M algorithm, which naturally guarantees that probabilities sum/integrate to 1 and is therefore widely adopted for extremum problems of probability density functions. The E-M algorithm can be expressed as follows:
Assume the parameter to be learned is θ, the hidden variables of the mixture model are Z (in the Gaussian mixture model, p(m), the coefficient of each Gaussian distribution), the single-model variables are X (the mean and variance of each Gaussian component), and the log-likelihood is log L(θ; X, Z). The E-M algorithm can then be expressed as:
Step E: compute the lower bound of the objective function
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]
Step M: optimize the parameters for the current step
θ^(t+1) = argmax_θ Q(θ | θ^(t))
By looping through the above steps, we can make the parameter θ gradually converge to an optimal value.
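The E and M steps above have closed forms for a Gaussian mixture. The toy one-dimensional, two-component implementation below is an illustrative sketch (not the Sphinx trainer): the E step computes posterior responsibilities, and the M step re-estimates p(m), μ_m and σ_m² until θ converges.

```python
import math

def em_gmm_1d(data, n_iter=50):
    # Minimal E-M for a two-component 1-D GMM; toy sketch with
    # data-driven initialisation, not production training code.
    w = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    for _ in range(n_iter):
        # E step: responsibilities r[i][m] proportional to p(m) N(x_i; mu_m, var_m)
        resp = []
        for x in data:
            dens = [w[m] * math.exp(-(x - mu[m]) ** 2 / (2 * var[m]))
                    / math.sqrt(2 * math.pi * var[m]) for m in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M step: closed-form updates that maximise the lower bound Q
        for m in range(2):
            nm = sum(r[m] for r in resp)
            w[m] = nm / len(data)
            mu[m] = sum(r[m] * x for r, x in zip(resp, data)) / nm
            var[m] = sum(r[m] * (x - mu[m]) ** 2 for r, x in zip(resp, data)) / nm + 1e-6
    return w, mu, var

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]  # two well-separated toy clusters
w, mu, var = em_gmm_1d(data)
```

On this toy data the component means converge to the two cluster centers, illustrating the gradual convergence of θ described above.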
In addition, the complete acoustic model is designed on the basis of a Gaussian mixture model-Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables connect with one another to finally form language; a Markov chain can learn the time-varying characteristics of such a system and capture the interactions between syllables over time, so it is widely used in acoustic modeling for speech recognition. A hidden Markov model consists of explicit states and hidden states: the explicit states are the part we observe directly, such as the data in the speech signal, while the hidden states are variables that the model assumes but that are not visible to us. In a Markov model the transitions between states take place among the hidden states, but each hidden state needs a distribution that maps it to the observation of an explicit state; this is why it is called a "hidden" Markov model. It is worth noting that the value of a hidden variable s at the current moment depends only on the previous moment, while the current observation depends only on the hidden variable at that moment; this property is called the Markov property, and an algorithm with this property, when drawn as a graph, forms a chain, hence the name hidden Markov chain. A hidden Markov chain involves two important forward equations:
a_{kj} = P(S = j | S = k)
b_j(x) = p(x | S = j)
wherein the second formula is the probability density function that models the characteristic signal of each frame, sometimes called the "emission function"; in acoustic signal modeling we let this function follow a Gaussian mixture model, thereby obtaining the overall HMM-GMM model. The first formula describes the transitions between the hidden states, and the transitions between states can be computed with a dynamic programming method.
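The dynamic programming mentioned above is the forward algorithm: α_t(j) = b_j(x_t) · Σ_k α_{t-1}(k) · a_{kj}, summed over states at the last frame to give the total likelihood. A minimal sketch, with precomputed per-frame emission probabilities standing in for b_j(x) (all numbers are toy values, not from the patent):

```python
def forward(obs_probs, trans, init):
    # obs_probs[t][j] plays the role of b_j(x_t); trans[k][j] is a_kj.
    # alpha_t(j) = b_j(x_t) * sum_k alpha_{t-1}(k) * a_kj
    n = len(init)
    alpha = [init[j] * obs_probs[0][j] for j in range(n)]
    for t in range(1, len(obs_probs)):
        alpha = [obs_probs[t][j] * sum(alpha[k] * trans[k][j] for k in range(n))
                 for j in range(n)]
    return sum(alpha)  # total likelihood p(x_1 .. x_T)

p = forward(obs_probs=[[0.9, 0.2], [0.8, 0.3]],
            trans=[[0.7, 0.3], [0.4, 0.6]],
            init=[0.5, 0.5])
```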
Furthermore, in step three the keyword sequence dictionary is defined manually. The keywords are defined with an expert system, which distills the relevant knowledge into expressions according to the experience of illegal-broadcast detection experts. For example, with three candidate keywords, expert experience might set "keyword 1 + keyword 2" as an illegal broadcast and "keyword 1 + keyword 3" as a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to automatically discriminate between illegal and normal broadcasts, which greatly increases the recall rate for normal broadcasts. Moreover, the method outputs not only the illegal/normal judgment but also a confidence level; when the confidence is low, human intervention can be requested to determine whether the broadcast is illegal.
Further, the voting in the seventh step specifically operates as follows:
step 7.1, loading the voice to be recognized into each decoder;
step 7.2, obtaining the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment, recorded as 1 when the keyword is recognized; with N models trained in total, if the number of occurrences m is greater than N/2, the keyword is taken as the recognition result y_i, recorded as:
y_i = 1 if Σ_{j=1}^{N} x_{i,j} > N/2, and y_i = 0 otherwise
that is, if a keyword recognized in a speech segment receives more than half of the votes, it is retained as a recognition result; otherwise it is discarded.
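The voting rule above takes only a few lines. A sketch with an illustrative ten-decoder vote (variable names are not from the patent):

```python
def majority_vote(votes):
    # votes[j] is x_ij: 1 if decoder j recognised the keyword in segment i.
    # Keep the keyword only if strictly more than half of the N decoders
    # agree, i.e. y_i = 1 iff sum_j x_ij > N/2.
    return 1 if sum(votes) > len(votes) / 2 else 0

kept = majority_vote([1, 1, 1, 0, 1, 0, 1, 1, 0, 0])  # 6 of 10 decoders agree
```

Note the comparison is strict, so an exact tie (e.g. 5 of 10) discards the keyword.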
Loading the speech to be recognized into every decoder and integrating all the recognition results to retain or discard keywords minimizes the probability of misrecognition. For example, with only one decoder, a keyword in a sentence that the decoder misses is simply lost; with several decoders recognizing the same speech, this problem is largely avoided, and the keywords receiving more than half of the votes are selected by the statistical formula above, which improves the accuracy of the subsequent classification.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the acoustic model and the word-phoneme mapping table are trained with samples recorded from the keywords, and since the method requires no complete, meaningful sentences, the fault tolerance of the word-phoneme mapping table is significantly improved;
2. when making training samples in this method, broadcast audio files containing the keyword samples are recorded first and used as sample files; generating the parameter-adapted acoustic model in this way preserves its keyword-recognition ability and strengthens its ability to handle the specific radio-broadcast environment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a signal processing flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
As shown in fig. 1, a method of recognizing a voice keyword in a broadcast signal includes the steps of:
step one, establishing an acoustic model; the acoustic model in this embodiment is a CMU Sphinx model, which is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
p(x) = Σ_{m=1}^{M} p(m) · p(x | m),  where  p(x | m) = N(x; μ_m, σ_m² · I)
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e. the m-th sub-model; M is the total number of sub-models; N(·) denotes a multivariate Gaussian distribution; I is the identity matrix of the data dimension; and p(x | m) is the probability that the m-th sub-model outputs the syllable. In addition, the complete acoustic model is designed on the basis of a Gaussian mixture model-Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables connect with one another to finally form language; a Markov chain can learn the time-varying characteristics of such a system and capture the interactions between syllables over time, so it is widely used in acoustic modeling for speech recognition. A hidden Markov model consists of explicit states and hidden states: the explicit states are the part we observe directly, such as the data in the speech signal, while the hidden states are variables that the model assumes but that are not visible to us. In a Markov model the transitions between states take place among the hidden states, but each hidden state needs a distribution that maps it to the observation of an explicit state; this is why it is called a "hidden" Markov model. It is worth noting that the value of a hidden variable s at the current moment depends only on the previous moment, while the current observation depends only on the hidden variable at that moment; this property is called the Markov property, and an algorithm with this property, when drawn as a graph, forms a chain, hence the name hidden Markov chain. A hidden Markov chain involves two important forward equations:
a_{kj} = P(S = j | S = k)
b_j(x) = p(x | S = j)
wherein the second formula is the probability density function that models the characteristic signal of each frame, sometimes called the "emission function"; in acoustic signal modeling we let this function follow a Gaussian mixture model, thereby obtaining the overall HMM-GMM model. The first formula describes the transitions between the hidden states, and the transitions between states can be computed with a dynamic programming method.
Step two, establishing a word phoneme mapping table of key words;
step three, establishing a keyword sequence dictionary. In this embodiment the keywords are defined with an expert system, which distills the relevant knowledge into expressions according to the experience of illegal-broadcast detection experts. For example, with three candidate keywords, "keyword 1 + keyword 2" is set as an illegal broadcast and "keyword 1 + keyword 3" as a normal broadcast according to expert experience. When subsequently judging whether a broadcast is illegal, the inventors add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to automatically judge whether it is a normal or an illegal broadcast, which greatly improves the recall rate for normal broadcasts. The method outputs not only the illegal/normal judgment but also a confidence level; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
step four, randomly extracting subsets of the voice-fragment samples from the training samples, training the plurality of acoustic models with the keywords in the keyword sequence dictionary, obtaining the mapping between the speech features of the keywords and their phonemes, and loading the mapping into the corresponding acoustic models;
step five, defining the mapping between the phonemes and the specified keywords, and storing it in the word-phoneme mapping table;
step six, loading the trained acoustic model, the word phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising the broadcast signal to be recognized and extracting its features, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and making the keyword recognition result into a feature vector;
in this embodiment, the voting method is to establish a voting mechanism, and the specific operations are as follows:
step 7.1, loading the voice to be recognized into each decoder;
step 7.2, obtaining the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment, recorded as 1 when the keyword is recognized; with N models trained in total, if the number of occurrences m is greater than N/2, the keyword is taken as the recognition result y_i, recorded as:
y_i = 1 if Σ_{j=1}^{N} x_{i,j} > N/2, and y_i = 0 otherwise
that is, if a keyword recognized in a speech segment receives more than half of the votes, it is retained as a recognition result; otherwise it is discarded.
Step eight, making the recognition result of the keyword into a feature vector, taking a plurality of normal broadcast voice segments and illegal broadcast voice segments, respectively marking electronic tags of 'normal broadcast' and 'illegal broadcast' to form a training set, and training an SVM classifier by taking 'normal broadcast/illegal broadcast' as a two-classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal or an illegal broadcast. In this embodiment, the decoder is the existing PocketSphinx decoder.
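The patent does not give the SVM details. Below is a self-contained sketch of a linear SVM trained by stochastic sub-gradient descent on the hinge loss, applied to toy keyword-count feature vectors; the data, labels and hyperparameters are all illustrative, and a production system would more likely use an off-the-shelf SVM library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, epochs=500, lam=0.001, lr=0.1):
    # Minimise lam/2 * ||w||^2 + mean hinge loss by sub-gradient steps.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # margin violated: hinge gradient
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:  # only the regularisation pull
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Toy keyword-count vectors [count(kw1), count(kw2), count(kw3)];
# +1 = 'illegal broadcast', -1 = 'normal broadcast' (labels illustrative)
X = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 2]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

The trained classifier separates the two toy classes; in the patent's setting the inputs would be the feature vectors produced by step seven.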
In this embodiment, the specific operations for training the acoustic model in step four are as follows:
step 4.1, recording sample broadcast audio files containing the keywords;
step 4.2, extracting the audio signal features of the sample broadcast audio files and storing them in the CMU Sphinx model.
In addition to the above steps, this embodiment further includes an adaptive improvement of the acoustic model parameters. In this embodiment, the maximum posterior probability model expression is:
λ_MAP = argmax_λ P(O | λ) · P(λ)
where P (λ) is the prior probability and P (O | λ) is the likelihood function, i.e., a measure of the likelihood of characterizing the data under a particular model setting. In the acoustic model parameter adaptation improvement, P (λ) is the parameter of the chinese basic acoustic model in the speech recognition model, and P (O | λ) is the likelihood function of the data that we newly add.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A method of recognizing a speech keyword in a broadcast signal, comprising the steps of:
step one, establishing a plurality of acoustic models;
step two, establishing a word phoneme mapping table of key words;
step three, establishing a keyword sequence dictionary and a voice fragment containing keywords
Training a plurality of acoustic models by using keywords in the keyword sequence dictionary to obtain mapping between the speech characteristics and phonemes of the keywords, and loading the mapping into the acoustic models;
defining mapping between phonemes and the specified key words, and storing the mapping into a character phoneme mapping table;
step six, loading the trained acoustic model, the word phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising and extracting features from the broadcast signal to be recognized, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and forming the keyword recognition result into a feature vector;
step eight, forming the keyword recognition result into a feature vector, taking a plurality of normal broadcast voice segments and illegal broadcast voice segments, labeling them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and training an SVM classifier with "normal broadcast/illegal broadcast" as the binary classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain a judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast.
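Steps eight and nine above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the keyword list and the feature layout (one binary component per keyword) are hypothetical, since the claim does not fix a specific encoding.

```python
# Illustrative sketch of the SVM stage (assumes scikit-learn is available;
# the keyword list and feature encoding are hypothetical, not from the patent).
from sklearn.svm import SVC

KEYWORDS = ["invoice", "loan", "gambling"]  # hypothetical keyword dictionary

def to_feature_vector(recognized):
    # One component per keyword: 1 if the voting stage kept it, else 0.
    return [1 if kw in recognized else 0 for kw in KEYWORDS]

# Step eight: labeled normal / illegal broadcast segments form the training set.
X = [to_feature_vector(r) for r in ([], ["invoice"], ["loan", "gambling"], [])]
y = [0, 1, 1, 0]  # 0 = "normal broadcast", 1 = "illegal broadcast"
clf = SVC(kernel="linear").fit(X, y)

# Step nine: judge a new broadcast segment from its keyword feature vector.
print(clf.predict([to_feature_vector(["loan", "gambling"])]))  # → [1]
```

The binary labels correspond to the "normal broadcast/illegal broadcast" classification standard of step eight; in practice the training set would contain many labeled segments rather than four toy examples.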
2. The method according to claim 1, wherein the acoustic model in the first step is a CMU sphinx model, and the specific operations of training the acoustic model in the fourth step are as follows:
step 4.1, recording a plurality of sample broadcast audio files containing the keywords as training samples, and randomly forming the training samples into a plurality of training sets;
step 4.2, extracting the audio signal features of the sample broadcast audio files in each training set, and training the CMU Sphinx models with the extracted audio signal features.
3. The method of claim 2, wherein the CMU Sphinx model is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
$$p(x) = \sum_{m=1}^{M} p(m)\, p(x \mid m) = \sum_{m=1}^{M} p(m)\, N\!\left(x;\ \mu_m,\ \sigma_m^2 I\right)$$
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e., the m-th sub-model; M is the total number of sub-models; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; p(x|m) is the probability that the m-th sub-model outputs the syllable.
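The mixture density of claim 3 can be evaluated directly from the formula. The sketch below assumes the spherical covariance σ_m²·I stated in the claim; the weights, means, and variances are illustrative values, not model parameters from the patent.

```python
import numpy as np

def gaussian(x, mu, var):
    """N(x; mu, var * I): multivariate Gaussian with spherical covariance."""
    d = len(x)
    diff = x - mu
    norm = (2 * np.pi * var) ** (-d / 2)
    return norm * np.exp(-np.dot(diff, diff) / (2 * var))

def gmm_density(x, weights, means, variances):
    # p(x) = sum_m p(m) * N(x; mu_m, sigma_m^2 * I)
    return sum(w * gaussian(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Two illustrative sub-models (M = 2) in a 2-dimensional feature space.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [1.0, 1.0]
print(gmm_density(np.array([0.0, 0.0]), weights, means, variances))
```

At x = (0, 0) the first component dominates, so the density is close to 0.5/(2π) ≈ 0.0796, with a negligible contribution from the distant second component.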
4. The method of claim 1, wherein in step three, the keyword sequence dictionary is created by manual definition.
5. The method of claim 1, wherein the voting in the step seven specifically operates as follows:
step 7.1, loading the voice to be recognized into each decoder respectively;
step 7.2, obtaining the recognition result generated by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment and is recorded as 1 through the voting mechanism; that is, if N models are trained in total and the number of occurrences m is greater than N/2, the result is taken as the recognition result y_i, recorded as:
$$y_i = \begin{cases} 1, & \sum_{j=1}^{N} x_{i,j} > N/2 \\ 0, & \text{otherwise} \end{cases}$$
that is, if a keyword recognized in a certain speech segment occurs in more than half of the votes, the keyword is retained and used as a recognition result; otherwise, the keyword is discarded.
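The majority vote of claim 5 can be sketched as follows: a keyword is kept only if more than half of the N decoders recognized it. The decoder outputs here are hypothetical keyword lists.

```python
# Minimal sketch of the claim-5 voting mechanism over N decoder outputs.
from collections import Counter

def vote(decoder_outputs):
    """decoder_outputs: one recognized-keyword list per decoder (N lists).
    A keyword is retained if its occurrence count m exceeds N/2."""
    n = len(decoder_outputs)
    counts = Counter(kw for out in decoder_outputs for kw in set(out))
    return [kw for kw, m in counts.items() if m > n / 2]

# "loan" appears in 3 of 3 decoders (> 3/2) and is kept;
# "invoice" appears in only 1 of 3 and is discarded.
outputs = [["loan", "invoice"], ["loan"], ["loan"]]
print(vote(outputs))  # → ['loan']
```

Using `set(out)` per decoder ensures a keyword repeated within one decoder's output still counts as a single vote from that model.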
CN201910596186.4A 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal Active CN110265003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910596186.4A CN110265003B (en) 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal


Publications (2)

Publication Number Publication Date
CN110265003A CN110265003A (en) 2019-09-20
CN110265003B true CN110265003B (en) 2021-06-11

Family

ID=67924229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910596186.4A Active CN110265003B (en) 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal

Country Status (1)

Country Link
CN (1) CN110265003B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096650B (en) * 2021-03-03 2023-12-08 河海大学 Acoustic decoding method based on prior probability

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894125A (en) * 2010-05-13 2010-11-24 复旦大学 Content-based video classification method
CN102156885B (en) * 2010-02-12 2014-03-26 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
CN104268557A (en) * 2014-09-15 2015-01-07 西安电子科技大学 Polarization SAR classification method based on cooperative training and depth SVM
CN107274889A (en) * 2017-06-19 2017-10-20 北京紫博光彦信息技术有限公司 A kind of method and device according to speech production business paper
CN108703824A (en) * 2018-03-15 2018-10-26 哈工大机器人(合肥)国际创新研究院 A kind of bionic hand control system and control method based on myoelectricity bracelet


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于Sphinx的机器人语音识别系统构建与研究";袁翔;《电脑知识与技术》;20170331;第13卷(第7期);全文 *
"基于关键词识别的可离线无线电电磁频谱管控系统研究";雒瑞森;《电子测试》;20181130;全文 *
"非法调频广播信号的识别问题研究";谭旭;《西部广播电视》;20180825;全文 *


Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN103971675B (en) Automatic speech recognition method and system
KR100655491B1 (en) Two stage utterance verification method and device of speech recognition system
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108269133A (en) A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition
CN108447471A (en) Audio recognition method and speech recognition equipment
CN111243569B (en) Emotional voice automatic generation method and device based on generation type confrontation network
CN102982811A (en) Voice endpoint detection method based on real-time decoding
CN112927679B (en) Method for adding punctuation marks in voice recognition and voice recognition device
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN109887511A (en) A kind of voice wake-up optimization method based on cascade DNN
CN111145763A (en) GRU-based voice recognition method and system in audio
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN112259085A (en) Two-stage voice awakening algorithm based on model fusion framework
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN110299133B (en) Method for judging illegal broadcast based on keyword
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant