CN110265003B - Method for recognizing voice keywords in broadcast signal - Google Patents


Info

Publication number
CN110265003B
CN110265003B (application CN201910596186.4A)
Authority
CN
China
Prior art keywords
broadcast
keyword
model
keywords
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910596186.4A
Other languages
Chinese (zh)
Other versions
CN110265003A (en)
Inventor
雒瑞森
孙超
武瑞娟
杜淼
余艳梅
龚晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dagong Bochuang Information Technology Co ltd
Sichuan University
Original Assignee
Chengdu Dagong Bochuang Information Technology Co ltd
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Dagong Bochuang Information Technology Co ltd, Sichuan University filed Critical Chengdu Dagong Bochuang Information Technology Co ltd
Priority to CN201910596186.4A priority Critical patent/CN110265003B/en
Publication of CN110265003A publication Critical patent/CN110265003A/en
Application granted granted Critical
Publication of CN110265003B publication Critical patent/CN110265003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/0631: Creating reference templates; Clustering

Abstract

The invention discloses a method for recognizing voice keywords in a broadcast signal, comprising the following steps: first, establish an acoustic model; second, establish a word-phoneme mapping table for the keywords; third, establish a keyword sequence dictionary; fourth, train the acoustic model with the keywords in the dictionary to obtain the mapping between the speech features of the keywords and their phonemes, and load the mapping into the acoustic model; fifth, define the mapping between the phonemes and the specified keywords and store it in the word-phoneme mapping table; sixth, load the trained acoustic model, the word-phoneme mapping table and the keyword sequence dictionary into a decoder; seventh, denoise the broadcast signal to be recognized, extract its features, and load it into the decoder to obtain the keyword recognition result. Because the acoustic model and the word-phoneme mapping table are trained on samples recorded from the keywords, and no complete, meaningful sentences are required, the fault tolerance of the word-phoneme mapping table is improved.

Description

Method for recognizing voice keywords in broadcast signal
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method for recognizing voice keywords in a broadcast signal.
Background
With the continuous development of the economy and of radio communication, the importance of radio spectrum resources has become increasingly prominent. Driven by economic interests, many illegal broadcasters privately buy and erect broadcast base stations, broadcast false advertisements or conduct economic fraud, disturbing the economic order, interfering with normal communication signals and even causing safety accidents. Designing an effective method for efficient, automatic discrimination and control of illegal broadcast signals therefore has very important social and economic value. Traditional discrimination of illegal broadcasts mostly relies on manual listening, which is labor-intensive and error-prone. Although automatic speech recognition is already used in daily life, automatic discrimination of illegal broadcasts in spectrum management remains a difficult problem, because illegal broadcast signals differ significantly from everyday speech. There are two main reasons: broadcast signals are very noisy and are not pure Mandarin speech; and, for confidentiality, systems in this application field cannot be connected to the external Internet, so most existing automatic speech recognition models cannot be used, and the insufficient number of training samples further limits the use of existing models.
Although Chinese speech recognition has achieved certain results and applications, some characteristics of broadcast signals still make it difficult to apply automatic speech recognition directly to broadcast-signal legitimacy discrimination. In particular, for security reasons legitimacy screening must be performed in an offline environment, so many existing commercial online models cannot be used directly.
Disclosure of Invention
The invention aims to solve the problems in the prior art that, when illegal broadcasts are detected with automatic speech recognition, keywords cannot be identified and many irrelevant words are misrecognized. In the proposed method for recognizing speech keywords in broadcast signals, the acoustic model and the word-phoneme mapping table are trained with samples recorded from the keywords; and because no complete, meaningful sentence is needed (only specific keywords in the speech must be recognized to judge whether it is a normal or an illegal broadcast), efficiency is improved and the recognition process is simplified.
The invention is realized by the following technical scheme:
a method of recognizing a speech keyword in a broadcast signal, comprising the steps of:
step one, establishing a plurality of acoustic models;
step two, establishing a word-phoneme mapping table of the keywords;
step three, establishing a keyword sequence dictionary and recording voice fragments containing the keywords;
step four, training the plurality of acoustic models with the keywords in the keyword sequence dictionary to obtain the mapping between the speech features of the keywords and their phonemes, and loading the mapping into the acoustic models;
step five, defining the mapping between the phonemes and the specified keywords, and storing it in the word-phoneme mapping table;
step six, loading the trained acoustic models, the word-phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising the broadcast signal to be recognized and extracting its features, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and making the keyword recognition result into a feature vector;
step eight, taking a plurality of normal-broadcast and illegal-broadcast voice segments, labeling them with electronic tags of 'normal broadcast' and 'illegal broadcast' respectively to form a training set, and training an SVM classifier with 'normal broadcast/illegal broadcast' as the binary classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal or an illegal broadcast.
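Steps seven and eight say the keyword recognition result is "made into a feature vector" but do not specify the encoding. The sketch below assumes one plausible choice, a bag-of-keywords count vector over the keyword sequence dictionary; the dictionary entries and function name are illustrative, not from the patent.

```python
# Hypothetical sketch: encode recognized keywords as a count vector over the
# keyword sequence dictionary. The patent does not fix the encoding; a
# bag-of-keywords count is one plausible reading of "feature vector".
KEYWORD_DICT = ["keyword1", "keyword2", "keyword3"]  # placeholder entries

def to_feature_vector(recognized, dictionary=KEYWORD_DICT):
    counts = {k: 0 for k in dictionary}
    for k in recognized:
        if k in counts:  # ignore anything outside the dictionary
            counts[k] += 1
    return [counts[k] for k in dictionary]

vec = to_feature_vector(["keyword1", "keyword3", "keyword1"])
```

The resulting fixed-length vector is what the SVM classifier of steps eight and nine would consume.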
Broadcast signals are typically continuously noisy and have a distinctive public-address timbre, sometimes mixed with non-speech signals. A general speech recognition model trained on quiet-background, standard conversational speech is therefore difficult to use directly on this problem; a special speech recognition model adapted to broadcast speech signals must be constructed. Specifically, for the automatic speech recognition model the inventors set up an acoustic model, a word-phoneme mapping table and a dictionary; the design centers on how to train or adjust these three components so that the speech recognition system performs well on broadcast signals. Since only a limited number of broadcast signal samples can be obtained, the whole system cannot be trained from scratch; the existing automatic speech recognition system PocketSphinx is therefore taken as the base model, and its parameters are adjusted for broadcast signals so that recorded broadcast signals can be recognized accurately. After the acoustic model, the word-phoneme mapping table and the keyword sequence dictionary are established, the training data should share as many characteristics as possible with actual broadcast speech in order to improve recognition efficiency; hence, in step four, the keywords in the keyword sequence dictionary are recorded as training samples and the acoustic models are trained on those samples. Compared with traditional methods, the difference is that by defining keywords only the keywords are recognized, not the meanings of whole sentences, so recognition is much faster, which improves the applicability of the method in practice.
The mapping between the phonemes and the defined keywords is obtained by adapting the word-phoneme mapping table, and finally the trained acoustic model, the adapted word-phoneme mapping table and the keyword sequence dictionary are loaded into the decoders. To improve the accuracy of the subsequent SVM classification, a plurality of acoustic models are set up and trained simultaneously on different sample sets; the trained acoustic models are loaded into a plurality of decoders, each of which is also loaded with the word-phoneme mapping table and the keyword sequence dictionary. The same denoised broadcast signal to be recognized is loaded into each decoder, and each decoder produces a recognition result. For example, with ten decoders in total, if eight decoders recognize the same keyword in the fourth sentence of the signal and two do not, the keyword is retained as one of the keyword recognition results of this step; this is done for every sentence of the broadcast signal to be recognized. Dispersing the training task over several models allows them to be trained simultaneously, which saves time; most importantly, it limits the influence of missed and false recognitions and improves recognition accuracy, while combining multiple classifiers enlarges the hypothesis space so that a better, approximately optimal solution can be learned. This helps improve the accuracy of the word-phoneme mapping table, and on the recognition side accurate speech-keyword recognition is achieved by adding a small number of keywords to the dictionary. Because the acoustic model and the word-phoneme mapping table are trained on samples recorded from the keywords, and no complete, meaningful sentences are required, the fault tolerance of the designed word-phoneme mapping table is much higher.
Further, the acoustic model in step one is a CMU Sphinx model, and the specific operations for training the acoustic models in step four are as follows:
step 4.1, recording a plurality of sample broadcast audio files containing the keywords as training samples, and randomly forming a plurality of training sets from the training samples;
step 4.2, extracting the audio signal features of the sample broadcast audio files in each training set, and training the CMU Sphinx models with the extracted audio signal features.
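Step 4.1 randomly forms several training sets from the recorded samples; one way to read this is bootstrap sampling (bagging), which also matches the later use of several independently trained models and a vote. The function below is an assumed sketch of that reading, not the patent's exact procedure.

```python
import random

def make_training_sets(samples, n_sets, set_size, seed=0):
    # Sample with replacement, bagging-style. The "with replacement" choice
    # is an assumption; the patent only says "randomly forming".
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in range(set_size)]
            for _ in range(n_sets)]

# Placeholder sample names; each inner list would train one acoustic model.
sets = make_training_sets(["s1", "s2", "s3", "s4"], n_sets=3, set_size=4)
```

Training each acoustic model on a different bootstrap set is what makes the later majority vote meaningful, since the models then make partially independent errors.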
To improve the accuracy of the acoustic model during recognition, the training samples in this method are made by recording broadcast audio files containing the keyword samples and using them as sample files. Generating the parameter-adapted acoustic model in this way preserves its keyword-recognition ability while strengthening its ability to handle the specific radio-broadcast environment; the audio feature signals of the sample files are then extracted and stored in the CMU Sphinx model.
Further, the CMU sphinx model is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
p(x) = Σ_{m=1}^{M} p(m) · p(x | m),  where  p(x | m) = N(x; μ_m, σ_m² · I)
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e. the m-th sub-model; M is the total number of sub-models; N(·) denotes a multivariate Gaussian distribution; I is the identity matrix of the data dimension; and p(x | m) is the probability that the m-th sub-model outputs the syllable.
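The joint expression above can be evaluated directly. The scalar sketch below computes p(x) = Σ_m p(m)·N(x; μ_m, σ_m²) for a one-dimensional x (with covariance σ_m²·I the multivariate density factors into such terms per dimension); it is an illustration with made-up parameters, not CMU Sphinx code.

```python
import math

def gmm_pdf(x, weights, means, variances):
    # p(x) = sum_m p(m) * N(x; mu_m, sigma_m^2), the scalar version of the
    # joint expression of the Gaussian mixture model above
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        total += w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

p = gmm_pdf(0.0, weights=[0.5, 0.5], means=[0.0, 2.0], variances=[1.0, 1.0])
```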
Because a Gaussian mixture model is a joint expression of several Gaussian distributions and has several distribution centers, it is well suited to acoustic modeling; from the formula, its probability density function can be seen as a combination of several Gaussians. Since sound signals always follow a multi-center, variance-attenuated distribution, the Gaussian mixture model is very suitable as the basis of an acoustic model. The Gaussian mixture model has strong expressive power, but training it is not simple: for a probability distribution function, the prior art usually trains by maximizing a log-likelihood function. However, the log-likelihood function of a Gaussian mixture model is not continuously differentiable, so after investigation the inventors use a heuristic algorithm for training. The most common such heuristic is the E-M algorithm, which naturally guarantees that probabilities sum/integrate to 1 and is therefore widely adopted for extremum problems of probability density functions. The E-M algorithm can be expressed as follows:
Assume the parameter to be learned is θ, the hidden variables of the mixture model are Z (in the Gaussian mixture model, p(m), the coefficient of each Gaussian distribution), the single-model variables are X (the mean and variance of each Gaussian component), and the log-likelihood is log L(θ; X, Z). The E-M algorithm can then be expressed as:
Step E: compute the lower bound of the objective function
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]
Step M: optimize the parameters for the current step
θ^(t+1) = argmax_θ Q(θ | θ^(t))
By looping through the above steps, we can make the parameter θ gradually converge to an optimal value.
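The E and M steps above have closed forms for a Gaussian mixture. The toy one-dimensional, two-component implementation below is an illustrative sketch (not the Sphinx trainer): the E step computes posterior responsibilities, and the M step re-estimates p(m), μ_m and σ_m² until θ converges.

```python
import math

def em_gmm_1d(data, n_iter=50):
    # Minimal E-M for a two-component 1-D GMM; toy sketch with
    # data-driven initialisation, not production training code.
    w = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    for _ in range(n_iter):
        # E step: responsibilities r[i][m] proportional to p(m) N(x_i; mu_m, var_m)
        resp = []
        for x in data:
            dens = [w[m] * math.exp(-(x - mu[m]) ** 2 / (2 * var[m]))
                    / math.sqrt(2 * math.pi * var[m]) for m in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M step: closed-form updates that maximise the lower bound Q
        for m in range(2):
            nm = sum(r[m] for r in resp)
            w[m] = nm / len(data)
            mu[m] = sum(r[m] * x for r, x in zip(resp, data)) / nm
            var[m] = sum(r[m] * (x - mu[m]) ** 2 for r, x in zip(resp, data)) / nm + 1e-6
    return w, mu, var

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]  # two well-separated toy clusters
w, mu, var = em_gmm_1d(data)
```

On this toy data the component means converge to the two cluster centers, illustrating the gradual convergence of θ described above.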
In addition, the complete acoustic model is designed on the basis of a Gaussian mixture model-Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables connect with one another to finally form language; a Markov chain can learn the time-varying characteristics of such a system and capture the interactions between syllables over time, so it is widely used in acoustic modeling for speech recognition. A hidden Markov model consists of explicit states and hidden states: the explicit states are the part we observe directly, such as the data in the speech signal, while the hidden states are variables that the model assumes but that are not visible to us. In a Markov model the transitions between states take place among the hidden states, but each hidden state needs a distribution that maps it to the observation of an explicit state; this is why it is called a "hidden" Markov model. It is worth noting that the value of a hidden variable s at the current moment depends only on the previous moment, while the current observation depends only on the hidden variable at that moment; this property is called the Markov property, and an algorithm with this property, when drawn as a graph, forms a chain, hence the name hidden Markov chain. A hidden Markov chain involves two important forward equations:
a_{kj} = P(S = j | S = k)
b_j(x) = p(x | S = j)
wherein the second formula is the probability density function that models the characteristic signal of each frame, sometimes called the "emission function"; in acoustic signal modeling we let this function follow a Gaussian mixture model, thereby obtaining the overall HMM-GMM model. The first formula describes the transitions between the hidden states, and the transitions between states can be computed with a dynamic programming method.
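The dynamic programming mentioned above is the forward algorithm: α_t(j) = b_j(x_t) · Σ_k α_{t-1}(k) · a_{kj}, summed over states at the last frame to give the total likelihood. A minimal sketch, with precomputed per-frame emission probabilities standing in for b_j(x) (all numbers are toy values, not from the patent):

```python
def forward(obs_probs, trans, init):
    # obs_probs[t][j] plays the role of b_j(x_t); trans[k][j] is a_kj.
    # alpha_t(j) = b_j(x_t) * sum_k alpha_{t-1}(k) * a_kj
    n = len(init)
    alpha = [init[j] * obs_probs[0][j] for j in range(n)]
    for t in range(1, len(obs_probs)):
        alpha = [obs_probs[t][j] * sum(alpha[k] * trans[k][j] for k in range(n))
                 for j in range(n)]
    return sum(alpha)  # total likelihood p(x_1 .. x_T)

p = forward(obs_probs=[[0.9, 0.2], [0.8, 0.3]],
            trans=[[0.7, 0.3], [0.4, 0.6]],
            init=[0.5, 0.5])
```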
Furthermore, in step three the keyword sequence dictionary is defined manually. The keywords are defined with an expert system, which distills the relevant knowledge into expressions according to the experience of illegal-broadcast detection experts. For example, with three candidate keywords, expert experience might set "keyword 1 + keyword 2" as an illegal broadcast and "keyword 1 + keyword 3" as a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to automatically discriminate between illegal and normal broadcasts, which greatly increases the recall rate for normal broadcasts. Moreover, the method outputs not only the illegal/normal judgment but also a confidence level; when the confidence is low, human intervention can be requested to determine whether the broadcast is illegal.
Further, the voting in the seventh step specifically operates as follows:
step 7.1, loading the voice to be recognized into each decoder;
step 7.2, obtaining the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment, recorded as 1 when the keyword is recognized; with N models trained in total, if the number of occurrences m is greater than N/2, the keyword is taken as the recognition result y_i, recorded as:
y_i = 1 if Σ_{j=1}^{N} x_{i,j} > N/2, and y_i = 0 otherwise
that is, if a keyword recognized in a speech segment receives more than half of the votes, it is retained as a recognition result; otherwise it is discarded.
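The voting rule above takes only a few lines. A sketch with an illustrative ten-decoder vote (variable names are not from the patent):

```python
def majority_vote(votes):
    # votes[j] is x_ij: 1 if decoder j recognised the keyword in segment i.
    # Keep the keyword only if strictly more than half of the N decoders
    # agree, i.e. y_i = 1 iff sum_j x_ij > N/2.
    return 1 if sum(votes) > len(votes) / 2 else 0

kept = majority_vote([1, 1, 1, 0, 1, 0, 1, 1, 0, 0])  # 6 of 10 decoders agree
```

Note the comparison is strict, so an exact tie (e.g. 5 of 10) discards the keyword.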
Loading the speech to be recognized into every decoder and integrating all the recognition results to retain or discard keywords minimizes the probability of misrecognition. For example, with only one decoder, a keyword in a sentence that the decoder misses is simply lost; with several decoders recognizing the same speech, this problem is largely avoided, and the keywords receiving more than half of the votes are selected by the statistical formula above, which improves the accuracy of the subsequent classification.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the acoustic model and the word-phoneme mapping table are trained with samples recorded from the keywords, and since the method requires no complete, meaningful sentences, the fault tolerance of the word-phoneme mapping table is significantly improved;
2. when making training samples in this method, broadcast audio files containing the keyword samples are recorded first and used as sample files; generating the parameter-adapted acoustic model in this way preserves its keyword-recognition ability and strengthens its ability to handle the specific radio-broadcast environment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a signal processing flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
As shown in fig. 1, a method of recognizing a voice keyword in a broadcast signal includes the steps of:
step one, establishing an acoustic model; the acoustic model in this embodiment is a CMU Sphinx model, which is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
p(x) = Σ_{m=1}^{M} p(m) · p(x | m),  where  p(x | m) = N(x; μ_m, σ_m² · I)
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e. the m-th sub-model; M is the total number of sub-models; N(·) denotes a multivariate Gaussian distribution; I is the identity matrix of the data dimension; and p(x | m) is the probability that the m-th sub-model outputs the syllable. In addition, the complete acoustic model is designed on the basis of a Gaussian mixture model-Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables connect with one another to finally form language; a Markov chain can learn the time-varying characteristics of such a system and capture the interactions between syllables over time, so it is widely used in acoustic modeling for speech recognition. A hidden Markov model consists of explicit states and hidden states: the explicit states are the part we observe directly, such as the data in the speech signal, while the hidden states are variables that the model assumes but that are not visible to us. In a Markov model the transitions between states take place among the hidden states, but each hidden state needs a distribution that maps it to the observation of an explicit state; this is why it is called a "hidden" Markov model. It is worth noting that the value of a hidden variable s at the current moment depends only on the previous moment, while the current observation depends only on the hidden variable at that moment; this property is called the Markov property, and an algorithm with this property, when drawn as a graph, forms a chain, hence the name hidden Markov chain. A hidden Markov chain involves two important forward equations:
a_{kj} = P(S = j | S = k)
b_j(x) = p(x | S = j)
wherein the second formula is the probability density function that models the characteristic signal of each frame, sometimes called the "emission function"; in acoustic signal modeling we let this function follow a Gaussian mixture model, thereby obtaining the overall HMM-GMM model. The first formula describes the transitions between the hidden states, and the transitions between states can be computed with a dynamic programming method.
Step two, establishing a word phoneme mapping table of key words;
step three, establishing a keyword sequence dictionary. In this embodiment the keywords are defined with an expert system, which distills the relevant knowledge into expressions according to the experience of illegal-broadcast detection experts. For example, with three candidate keywords, "keyword 1 + keyword 2" is set as an illegal broadcast and "keyword 1 + keyword 3" as a normal broadcast according to expert experience. When subsequently judging whether a broadcast is illegal, the inventors add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to automatically judge whether it is a normal or an illegal broadcast, which greatly improves the recall rate for normal broadcasts. The method outputs not only the illegal/normal judgment but also a confidence level; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
step four, randomly extracting subsets of the voice-fragment samples from the training samples, training the plurality of acoustic models with the keywords in the keyword sequence dictionary, obtaining the mapping between the speech features of the keywords and their phonemes, and loading the mapping into the corresponding acoustic models;
step five, defining the mapping between the phonemes and the specified keywords, and storing it in the word-phoneme mapping table;
step six, loading the trained acoustic model, the word phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising the broadcast signal to be recognized and extracting its features, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and making the keyword recognition result into a feature vector;
in this embodiment, the voting method is to establish a voting mechanism, and the specific operations are as follows:
step 7.1, loading the voice to be recognized into each decoder;
step 7.2, obtaining the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment, recorded as 1 when the keyword is recognized; with N models trained in total, if the number of occurrences m is greater than N/2, the keyword is taken as the recognition result y_i, recorded as:
y_i = 1 if Σ_{j=1}^{N} x_{i,j} > N/2, and y_i = 0 otherwise
that is, if a keyword recognized in a speech segment receives more than half of the votes, it is retained as a recognition result; otherwise it is discarded.
Step eight, making the recognition result of the keyword into a feature vector, taking a plurality of normal broadcast voice segments and illegal broadcast voice segments, respectively marking electronic tags of 'normal broadcast' and 'illegal broadcast' to form a training set, and training an SVM classifier by taking 'normal broadcast/illegal broadcast' as a two-classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal or an illegal broadcast. In this embodiment, the decoder is the existing PocketSphinx decoder.
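The patent does not give the SVM details. Below is a self-contained sketch of a linear SVM trained by stochastic sub-gradient descent on the hinge loss, applied to toy keyword-count feature vectors; the data, labels and hyperparameters are all illustrative, and a production system would more likely use an off-the-shelf SVM library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, epochs=500, lam=0.001, lr=0.1):
    # Minimise lam/2 * ||w||^2 + mean hinge loss by sub-gradient steps.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # margin violated: hinge gradient
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:  # only the regularisation pull
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Toy keyword-count vectors [count(kw1), count(kw2), count(kw3)];
# +1 = 'illegal broadcast', -1 = 'normal broadcast' (labels illustrative)
X = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 2]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

The trained classifier separates the two toy classes; in the patent's setting the inputs would be the feature vectors produced by step seven.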
In this embodiment, the specific operations for training the acoustic model in step four are as follows:
step 4.1, recording sample broadcast audio files containing the keywords;
step 4.2, extracting the audio signal features of the sample broadcast audio files and storing them in the CMU Sphinx model.
In addition to the above steps, this embodiment further includes an adaptive improvement of the acoustic model parameters. In this embodiment, the maximum posterior probability model expression is:
λ_MAP = argmax_λ P(O | λ) · P(λ)
where P (λ) is the prior probability and P (O | λ) is the likelihood function, i.e., a measure of the likelihood of characterizing the data under a particular model setting. In the acoustic model parameter adaptation improvement, P (λ) is the parameter of the chinese basic acoustic model in the speech recognition model, and P (O | λ) is the likelihood function of the data that we newly add.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A method of recognizing a speech keyword in a broadcast signal, comprising the steps of:
step one, establishing a plurality of acoustic models;
step two, establishing a word phoneme mapping table of key words;
step three, establishing a keyword sequence dictionary and a voice fragment containing keywords
Training a plurality of acoustic models by using keywords in the keyword sequence dictionary to obtain mapping between the speech characteristics and phonemes of the keywords, and loading the mapping into the acoustic models;
defining mapping between phonemes and the specified key words, and storing the mapping into a character phoneme mapping table;
step six, loading the trained acoustic model, the word phoneme mapping table and the keyword sequence dictionary into a plurality of decoders respectively;
step seven, denoising and extracting features from the broadcast signal to be recognized, then loading it into the plurality of decoders respectively to obtain the keyword recognition results of the plurality of models, obtaining a single recognition result by a voting method, and forming the keyword recognition result into a feature vector;
step eight, forming the keyword recognition result into a feature vector, taking a plurality of normal broadcast voice segments and illegal broadcast voice segments, labeling them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and training an SVM classifier with "normal broadcast/illegal broadcast" as the binary classification standard;
step nine, loading the feature vectors obtained in step seven into the SVM classifier to obtain a judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast.
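Steps eight and nine above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the keyword list and the feature layout (one binary component per keyword) are hypothetical, since the claim does not fix a specific encoding.

```python
# Illustrative sketch of the SVM stage (assumes scikit-learn is available;
# the keyword list and feature encoding are hypothetical, not from the patent).
from sklearn.svm import SVC

KEYWORDS = ["invoice", "loan", "gambling"]  # hypothetical keyword dictionary

def to_feature_vector(recognized):
    # One component per keyword: 1 if the voting stage kept it, else 0.
    return [1 if kw in recognized else 0 for kw in KEYWORDS]

# Step eight: labeled normal / illegal broadcast segments form the training set.
X = [to_feature_vector(r) for r in ([], ["invoice"], ["loan", "gambling"], [])]
y = [0, 1, 1, 0]  # 0 = "normal broadcast", 1 = "illegal broadcast"
clf = SVC(kernel="linear").fit(X, y)

# Step nine: judge a new broadcast segment from its keyword feature vector.
print(clf.predict([to_feature_vector(["loan", "gambling"])]))  # → [1]
```

The binary labels correspond to the "normal broadcast/illegal broadcast" classification standard of step eight; in practice the training set would contain many labeled segments rather than four toy examples.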
2. The method according to claim 1, wherein the acoustic model in the first step is a CMU sphinx model, and the specific operations of training the acoustic model in the fourth step are as follows:
step 4.1, recording a plurality of sample broadcast audio files containing the keywords as training samples, and randomly forming the training samples into a plurality of training sets;
step 4.2, extracting the audio signal features of the sample broadcast audio files in each training set, and training the CMU Sphinx models with the extracted audio signal features.
3. The method of claim 2, wherein the CMU Sphinx model is a hidden Markov-Gaussian mixture model;
the joint expression formula of the Gaussian mixture model is as follows:
$$p(x) = \sum_{m=1}^{M} p(m)\, p(x \mid m) = \sum_{m=1}^{M} p(m)\, N\!\left(x;\ \mu_m,\ \sigma_m^2 I\right)$$
wherein x represents a syllable; p(x) is the probability of outputting the syllable; p(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of the sub-model, i.e., the m-th sub-model; M is the total number of sub-models; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; p(x|m) is the probability that the m-th sub-model outputs the syllable.
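The mixture density of claim 3 can be evaluated directly from the formula. The sketch below assumes the spherical covariance σ_m²·I stated in the claim; the weights, means, and variances are illustrative values, not model parameters from the patent.

```python
import numpy as np

def gaussian(x, mu, var):
    """N(x; mu, var * I): multivariate Gaussian with spherical covariance."""
    d = len(x)
    diff = x - mu
    norm = (2 * np.pi * var) ** (-d / 2)
    return norm * np.exp(-np.dot(diff, diff) / (2 * var))

def gmm_density(x, weights, means, variances):
    # p(x) = sum_m p(m) * N(x; mu_m, sigma_m^2 * I)
    return sum(w * gaussian(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Two illustrative sub-models (M = 2) in a 2-dimensional feature space.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [1.0, 1.0]
print(gmm_density(np.array([0.0, 0.0]), weights, means, variances))
```

At x = (0, 0) the first component dominates, so the density is close to 0.5/(2π) ≈ 0.0796, with a negligible contribution from the distant second component.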
4. The method of claim 1, wherein in step three, the keyword sequence dictionary is created by manual definition.
5. The method of claim 1, wherein the voting in the step seven specifically operates as follows:
step 7.1, loading the voice to be recognized into each decoder respectively;
step 7.2, obtaining the recognition result generated by each decoder, where x_{i,j} denotes the recognition result of the j-th model for the i-th speech segment and is recorded as 1 through the voting mechanism; that is, if N models are trained in total and the number of occurrences m is greater than N/2, the result is taken as the recognition result y_i, recorded as:
$$y_i = \begin{cases} 1, & \sum_{j=1}^{N} x_{i,j} > N/2 \\ 0, & \text{otherwise} \end{cases}$$
that is, if a keyword recognized in a certain speech segment occurs in more than half of the votes, the keyword is retained and used as a recognition result; otherwise, the keyword is discarded.
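The majority vote of claim 5 can be sketched as follows: a keyword is kept only if more than half of the N decoders recognized it. The decoder outputs here are hypothetical keyword lists.

```python
# Minimal sketch of the claim-5 voting mechanism over N decoder outputs.
from collections import Counter

def vote(decoder_outputs):
    """decoder_outputs: one recognized-keyword list per decoder (N lists).
    A keyword is retained if its occurrence count m exceeds N/2."""
    n = len(decoder_outputs)
    counts = Counter(kw for out in decoder_outputs for kw in set(out))
    return [kw for kw, m in counts.items() if m > n / 2]

# "loan" appears in 3 of 3 decoders (> 3/2) and is kept;
# "invoice" appears in only 1 of 3 and is discarded.
outputs = [["loan", "invoice"], ["loan"], ["loan"]]
print(vote(outputs))  # → ['loan']
```

Using `set(out)` per decoder ensures a keyword repeated within one decoder's output still counts as a single vote from that model.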
CN201910596186.4A 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal Active CN110265003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910596186.4A CN110265003B (en) 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal


Publications (2)

Publication Number Publication Date
CN110265003A CN110265003A (en) 2019-09-20
CN110265003B true CN110265003B (en) 2021-06-11

Family

ID=67924229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910596186.4A Active CN110265003B (en) 2019-07-03 2019-07-03 Method for recognizing voice keywords in broadcast signal

Country Status (1)

Country Link
CN (1) CN110265003B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096650B (en) * 2021-03-03 2023-12-08 河海大学 Acoustic decoding method based on prior probability

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894125A (en) * 2010-05-13 2010-11-24 复旦大学 Content-based video classification method
CN102156885B (en) * 2010-02-12 2014-03-26 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
CN104268557A (en) * 2014-09-15 2015-01-07 西安电子科技大学 Polarization SAR classification method based on cooperative training and depth SVM
CN107274889A (en) * 2017-06-19 2017-10-20 北京紫博光彦信息技术有限公司 A kind of method and device according to speech production business paper
CN108703824A (en) * 2018-03-15 2018-10-26 哈工大机器人(合肥)国际创新研究院 A kind of bionic hand control system and control method based on myoelectricity bracelet


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于Sphinx的机器人语音识别系统构建与研究";袁翔;《电脑知识与技术》;20170331;第13卷(第7期);全文 *
"基于关键词识别的可离线无线电电磁频谱管控系统研究";雒瑞森;《电子测试》;20181130;全文 *
"非法调频广播信号的识别问题研究";谭旭;《西部广播电视》;20180825;全文 *


Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN103971675B (en) Automatic speech recognition method and system
KR100655491B1 (en) Two stage utterance verification method and device of speech recognition system
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108269133A (en) A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition
CN108447471A (en) Audio recognition method and speech recognition equipment
CN111243569B (en) Emotional voice automatic generation method and device based on generation type confrontation network
CN102982811A (en) Voice endpoint detection method based on real-time decoding
CN112927679B (en) Method for adding punctuation marks in voice recognition and voice recognition device
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN109887511A (en) A kind of voice wake-up optimization method based on cascade DNN
CN111145763A (en) GRU-based voice recognition method and system in audio
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN112259085A (en) Two-stage voice awakening algorithm based on model fusion framework
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN110299133B (en) Method for judging illegal broadcast based on keyword
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant