CN110265003A - A method for identifying voice keywords in broadcast signals - Google Patents
A method for identifying voice keywords in broadcast signals Download PDF Info
- Publication number
- CN110265003A CN110265003A CN201910596186.4A CN201910596186A CN110265003A CN 110265003 A CN110265003 A CN 110265003A CN 201910596186 A CN201910596186 A CN 201910596186A CN 110265003 A CN110265003 A CN 110265003A
- Authority
- CN
- China
- Prior art keywords
- keyword
- broadcast
- model
- trained
- broadcast signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The invention discloses a method for identifying voice keywords in a broadcast signal, comprising the following steps: (1) establish an acoustic model; (2) establish a text-phoneme mapping table for the keywords; (3) establish a keyword dictionary; (4) train the acoustic model with the keywords in the dictionary, obtain the mapping between the speech features of the keywords and their phonemes, and load that mapping into the acoustic model; (5) define the mapping between phonemes and the specified keywords, and save it into the text-phoneme mapping table; (6) load the trained acoustic model, the text-phoneme mapping table, and the keyword dictionary into a decoder; (7) denoise the broadcast signal to be identified, extract its features, and feed them to the decoder to obtain the keyword recognition result. Because the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords alone, complete, meaningful sentences are not required, which improves the fault tolerance of the text-phoneme mapping table.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a method for identifying voice keywords in broadcast signals.
Background technique
With the continuous development of the economy and of radio communication, the importance of radio spectrum resources has become increasingly prominent. Driven by economic interests, many illegal broadcasters privately purchase equipment, set up broadcast base stations, and transmit false advertising or conduct economic fraud. Such behavior disrupts economic order, interferes with legitimate communication signals, and can even cause safety accidents. Designing effective methods for the efficient, automatic screening and control of illegal broadcast signals therefore has great social and economic value. Traditional screening of illegal broadcasts mostly relies on manual listening, which is labor-intensive and error-prone. Although automatic speech recognition has found many applications in daily life, illegal broadcast signals differ markedly from everyday speech, so the automatic screening of illegal broadcasts in spectrum monitoring remains a stubborn problem. There are two main reasons. First, broadcast signals carry substantial noise, and their speech is often not plain Mandarin. Second, for confidentiality reasons, systems that screen illegal broadcasts usually cannot connect to the external Internet, so most existing automatic speech recognition models cannot be used; in addition, the shortage of training samples significantly limits the use of existing models.
Although Chinese speech recognition has achieved considerable results and applications, certain characteristics of broadcast signals make it difficult to apply automatic speech recognition technology directly to the legality screening of broadcast signals. In particular, for security reasons, legality screening must be carried out offline, so many existing commercial online models cannot be used directly.
Summary of the invention
The present invention aims to solve the problems that arise in the prior art when detecting illegal broadcasts with automatic speech recognition: keywords that cannot be identified, and many irrelevant words that are falsely "recognized". It provides a method for identifying voice keywords in broadcast signals in which the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords themselves. Because complete, meaningful sentences are not required, and only specific keywords in the speech need to be identified in order to judge whether a broadcast is normal or illegal, the method improves efficiency and simplifies the recognition process.
The present invention is achieved through the following technical solutions:
A method for identifying voice keywords in a broadcast signal comprises the following steps:
Step 1: establish multiple acoustic models;
Step 2: establish the text-phoneme mapping table of the keywords;
Step 3: establish a keyword dictionary and speech segments containing the keywords;
Step 4: train the multiple acoustic models with the keywords in the dictionary, obtain the mapping between the speech features of the keywords and their phonemes, and load that mapping into the acoustic models;
Step 5: define the mapping between phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table;
Step 6: load the trained acoustic models, the text-phoneme mapping table, and the keyword dictionary into multiple decoders;
Step 7: denoise the broadcast signal to be identified, extract its features, and load them into the multiple decoders; obtain the keyword recognition results of the multiple models, combine them into a single recognition result by voting, and convert the keyword recognition result into a feature vector;
Step 8: take several normal-broadcast speech segments and several illegal-broadcast speech segments, label them "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the two classes;
Step 9: load the feature vector obtained in Step 7 into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast.
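Steps 8 and 9 hinge on a two-class SVM trained on "normal broadcast" / "illegal broadcast" feature vectors. As a minimal, self-contained sketch of that stage (the keyword-count feature vectors, labels, and training constants below are invented for illustration, and a Pegasos-style sub-gradient trainer is used as a stand-in since the patent does not specify an SVM implementation):

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=500, seed=0):
    """Pegasos-style sub-gradient training of a linear SVM.
    samples: list of feature vectors; labels: +1 (illegal) / -1 (normal)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(samples)), len(samples)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [wj * (1 - eta * lam) for wj in w]   # regularization shrink
            if margin < 1:                           # hinge-loss violation
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def classify(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score >= 0 else -1   # +1: illegal broadcast, -1: normal

# Hypothetical feature vectors: [count(kw1), count(kw2), count(kw3)]
illegal = [[2, 3, 0], [1, 2, 0], [3, 4, 1]]   # keyword 1 + keyword 2 pattern
normal  = [[2, 0, 3], [1, 0, 2], [0, 0, 1]]   # keyword 1 + keyword 3 pattern
X = illegal + normal
y = [+1] * 3 + [-1] * 3
w, b = train_linear_svm(X, y)
print(classify(w, b, [2, 2, 0]))
```

In production one would use a mature SVM library rather than this sketch; the point is only that the keyword feature vectors of Step 7 feed a standard two-class margin classifier.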
Broadcast signals generally carry persistent noise, their announcing tone is unusual, and they are sometimes mixed with non-speech signals. General-purpose speech recognition models, trained on quiet backgrounds and standard conversational speech, are therefore difficult to use directly on our problem. To solve this, a dedicated speech recognition model adapted to broadcast speech must be constructed. Specifically, the inventors structure the automatic speech recognition model into three parts: the acoustic model, the text-phoneme mapping table, and the dictionary. The key design question is how to train or adjust these three parts so that the recognition system performs well on broadcast signals. Since only a limited number of broadcast signal samples are available, the whole system cannot be trained from scratch; the existing automatic speech recognition system PocketSphinx is therefore used as the base model, and its parameters are adjusted according to the broadcast signals so that it can accurately recognize recorded broadcast speech. After the acoustic model, the text-phoneme mapping table, and the keyword dictionary have been established, and in order to improve recognition efficiency, the training data for the text-phoneme mapping table should match the characteristics of the speech in real broadcast signals as closely as possible. Hence, in Step 4, the keywords in the keyword dictionary are first recorded as training samples, and the acoustic model is trained on those samples. A difference from conventional methods is that, by defining keywords, only the keywords are recognized during decoding rather than full, meaningful sentences, which makes recognition much faster than conventional methods and improves the practical applicability of the method. The text-phoneme mapping table is then adapted to obtain the mapping between phonemes and the defined keywords. Finally, the trained acoustic models, the adapted text-phoneme mapping table, and the keyword dictionary are loaded into the decoders. To improve the accuracy of the subsequent SVM classification, this scheme uses multiple acoustic models, trains them on different sample sets, and loads each trained acoustic model, together with a text-phoneme mapping table and a keyword dictionary, into its own decoder. The same denoised broadcast signal to be identified is loaded into all of these decoders, and each decoder produces its own recognition result. For example, with ten decoders in total, if eight of them recognize the same keyword in the fourth sentence of the broadcast signal while two do not, the keyword is retained as a recognized keyword in this step; this procedure is applied to every sentence of the broadcast signal to be identified. Distributing the training task over multiple models allows the acoustic models to be trained simultaneously, which saves time, and, most importantly, counteracts both missed recognitions and false recognitions, improving recognition accuracy. Combining multiple classifiers also enlarges the hypothesis space, making it possible to learn a better approximation of the optimal solution. This in turn improves the accuracy of the text-phoneme mapping table. On the recognition side, only a small set of keywords is added to the dictionary, enabling accurate speech-to-keyword recognition. Because the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords, and complete, meaningful sentences are not required, the fault tolerance of the designed text-phoneme mapping table is considerably higher.
Further, the acoustic model in Step 1 is a CMU Sphinx model. The specific operations for training the acoustic model in Step 4 are as follows:
Step 4.1: record several sample broadcast audio files containing the keywords as training samples, and randomly partition the training samples into multiple training sets;
Step 4.2: extract the audio signal features of the sample broadcast audio files in each training set, and use the extracted features to train the multiple CMU Sphinx models.
To improve the accuracy of the acoustic model during recognition, sample broadcast audio files containing the keywords are recorded first when making the training samples, and these files serve as the sample files. In this way, the acoustic model obtained after parameter adaptation both retains its keyword recognition ability and gains the ability to perform targeted recognition in the specific radio broadcasting environment. The audio features of the sample files are then extracted and saved for the CMU Sphinx model.
Further, the CMU Sphinx model is a hidden Markov model / Gaussian mixture model (HMM-GMM).
The combined expression of the Gaussian mixture model is as follows:

p(x) = Σ_{m=1}^{M} P(m) · N(x; μ_m, σ_m² I)

where x denotes a syllable; p(x) is the probability of emitting that syllable; P(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of a submodel, i.e. the m-th submodel; M is the total number of submodels; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; and p(x | m) is the probability of emitting the syllable under the m-th submodel.
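To make the mixture expression concrete, the following is a minimal one-dimensional evaluation of p(x) = Σ_m P(m) · N(x; μ_m, σ_m²); the component weights and parameters are illustrative, not taken from the patent:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """p(x) = sum_m P(m) * N(x; mu_m, sigma_m^2); weights must sum to 1."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Two illustrative components, e.g. two acoustic sub-classes of one syllable
weights   = [0.6, 0.4]
means     = [-1.0, 2.0]
variances = [1.0, 0.5]
print(gmm_density(0.0, weights, means, variances))
```

As long as the weights P(m) sum to 1, the mixture remains a valid probability density with multiple distribution centers, which is the property the text relies on.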
Because the Gaussian mixture model is a combined expression of multiple Gaussian distributions, it has multiple distribution centers, which makes it well suited to acoustic modeling: the formula shows that the probability density function can be viewed as a combination of Gaussians, and speech signals are often distributed around multiple centers. Gaussian mixture models are highly expressive, but training them is not simple. For probability distribution functions, the prior art usually trains by maximizing the log-likelihood function. However, since the log-likelihood of the Gaussian mixture model cannot be maximized directly in closed form, the inventors instead train with a heuristic algorithm. The most common such algorithm is the E-M algorithm, which naturally guarantees that the probabilities sum (integrate) to 1, making it widely used for extremum problems of probability density functions. The E-M algorithm can be stated as follows.

Suppose the parameters to be learned are θ, the latent variables of the mixture model are Z (in a Gaussian mixture model, the component coefficients P(m)), the per-component variables are X (in a Gaussian mixture model, the mean and variance of each Gaussian), and the log loss function is log L(θ; X, Z). Then the E-M algorithm is:

E step: compute a lower bound of the objective function;
M step: optimize the parameters in the current step.

By iterating these two steps, the parameter θ gradually converges to an optimal value.
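The E/M loop described above can be sketched for a one-dimensional two-component Gaussian mixture; the synthetic data and initial values below are invented for illustration, and a production trainer such as the one inside CMU Sphinx is far more elaborate:

```python
import math

def em_gmm_1d(data, weights, means, variances, iters=50):
    """E-M for a 1-D Gaussian mixture: the E step computes responsibilities
    (posteriors P(m | x)), the M step re-estimates P(m), mu_m, sigma_m^2."""
    for _ in range(iters):
        # E step: responsibility r[i][m] proportional to P(m) * N(x_i; mu_m, var_m)
        resp = []
        for x in data:
            comps = [w * math.exp(-(x - mu) ** 2 / (2 * var)) /
                     math.sqrt(2 * math.pi * var)
                     for w, mu, var in zip(weights, means, variances)]
            s = sum(comps)
            resp.append([c / s for c in comps])
        # M step: responsibility-weighted re-estimation
        n = len(data)
        for m in range(len(weights)):
            nm = sum(r[m] for r in resp)
            weights[m] = nm / n
            means[m] = sum(r[m] * x for r, x in zip(resp, data)) / nm
            variances[m] = sum(r[m] * (x - means[m]) ** 2
                               for r, x in zip(resp, data)) / nm
            variances[m] = max(variances[m], 1e-6)  # guard against collapse
    return weights, means, variances

# Two well-separated synthetic clusters around 0 and 5
data = [0.1, -0.2, 0.3, 0.0, -0.1, 4.9, 5.2, 5.0, 4.8, 5.1]
w, mu, var = em_gmm_1d(data, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
print(sorted(mu))
```

The update keeps the weights summing to 1 by construction, which is the property of E-M that the text highlights.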
In addition, the complete acoustic model is designed on the basis of the Gaussian mixture model combined with a Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables are linked to one another to form language. Because a Markov chain can learn the time-varying characteristics of a system and capture the temporal interactions between syllables, it is widely used for acoustic modeling in speech recognition. A hidden Markov model consists of two parts, observed states and hidden states. The observed states are the part we observe directly, such as the data in the speech signal; the hidden states are variables that the model assumes to exist but that cannot be observed. In a hidden Markov model, state transitions take place among the hidden states, but each hidden state needs a distribution to convert it into an observation of the observed states; this is why it is called a "hidden" Markov model. Notably, in a hidden Markov model the value of the hidden variable s at the current time depends only on the previous time step, while the current observation depends only on the hidden variable at the current time step. This property is called the Markov property, and because it makes the model's graphical representation a chain, the model is also called a hidden Markov chain. A hidden Markov chain involves the following two important formulas:

a_{kj} = P(S_t = j | S_{t-1} = k)
b_j(x) = p(x | S = j)

The second formula is the probability density function that models the feature signal of each frame, sometimes called the emission (or "transmitting") function. In acoustic signal modeling we let this function follow a Gaussian mixture model, which yields the overall HMM-GMM model. The first formula describes the transitions between hidden states; the transitions between states can be computed by dynamic programming.
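The role of the transition probabilities a_{kj} and the emission probabilities b_j(x) can be illustrated with the forward algorithm on a tiny discrete HMM; the states, transition matrix, and emission table below are invented, and the patent's emission densities are GMMs rather than discrete tables:

```python
def forward(obs, init, trans, emit):
    """Forward algorithm: alpha[t][j] = P(o_1..o_t, S_t = j).
    trans[k][j] is a_kj = P(S_t = j | S_{t-1} = k); emit[j][o] is b_j(o)."""
    n = len(init)
    alpha = [[init[j] * emit[j][obs[0]] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[k] * trans[k][j] for k in range(n)) * emit[j][o]
                      for j in range(n)])
    return sum(alpha[-1])  # P(observation sequence)

# Toy 2-state model: hidden state = "keyword" vs "background"
init  = [0.5, 0.5]
trans = [[0.8, 0.2], [0.3, 0.7]]
emit  = [{"hi": 0.9, "lo": 0.1}, {"hi": 0.2, "lo": 0.8}]
print(forward(["hi", "hi", "lo"], init, trans, emit))
```

The dynamic-programming recurrence over alpha is exactly the "transfer between states computed by dynamic programming" that the text refers to.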
Further, in Step 3, the keyword dictionary is established by manual definition. In this method keywords are defined with an expert system: the experience of illegal-broadcast detection experts is extracted into expressions of relevant knowledge. For example, given three candidate keywords, expert experience may state that keyword 1 + keyword 2 indicates an illegal broadcast while keyword 1 + keyword 3 indicates a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to discriminate automatically between illegal and normal broadcasts. This greatly improves the recall of normal broadcasts, and the method outputs not only the judgment of whether a broadcast is illegal but also its confidence; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
Further, the specific voting operations in Step 7 are as follows:
Step 7.1: load the speech to be identified into each decoder;
Step 7.2: obtain the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model on the i-th speech segment. Under the voting mechanism each occurrence counts as 1; if N models have been trained in total and the number of occurrences m is greater than N/2, this result is taken as the recognition result y_i. That is, if the votes for a keyword recognized in a speech segment exceed half, the keyword is retained; otherwise it is discarded. The retained keywords form the recognition result.

By loading the speech to be identified into every decoder and combining all recognition results to retain or reject keywords, the probability of misrecognition in this process is reduced as much as possible. For example, with only one decoder, a keyword in a sentence that is not recognized is simply missed; with multiple decoders recognizing the same speech, this problem is largely avoided. The counting rule above then filters out the keywords whose votes exceed half, which helps improve accuracy in the subsequent classification.
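The majority-vote rule just described can be sketched directly (N decoders, keep a keyword when its vote count m exceeds N/2; the decoder outputs below are illustrative):

```python
from collections import Counter

def vote_keywords(decoder_outputs):
    """decoder_outputs: one list of recognized keywords per decoder.
    Returns the keywords recognized by a strict majority of decoders."""
    n = len(decoder_outputs)
    votes = Counter()
    for out in decoder_outputs:
        for kw in set(out):       # each decoder votes at most once per keyword
            votes[kw] += 1
    return sorted(kw for kw, m in votes.items() if m > n / 2)

# Ten decoders: eight recognize "kw1" in a sentence, two miss it;
# three additionally produce a spurious "kw9"
outputs = [["kw1"]] * 8 + [[]] * 2
outputs = [o + (["kw9"] if i < 3 else []) for i, o in enumerate(outputs)]
print(vote_keywords(outputs))  # -> ['kw1']
```

With 8 of 10 votes, "kw1" survives the N/2 threshold, while the 3-vote "kw9" is rejected, matching the ten-decoder example in the text.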
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords, and complete, meaningful sentences are not needed, which significantly improves the fault tolerance of the text-phoneme mapping table in this method;
2. When making the training samples, sample broadcast audio files containing the keywords are recorded first and used as the sample files, so that the acoustic model obtained after parameter adaptation both retains its keyword recognition ability and gains the ability to perform targeted recognition in the specific radio broadcasting environment.
Description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and constitute a part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is the signal processing flow chart of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiment and the drawings. The exemplary embodiment of the invention and its explanation serve only to explain the invention and do not limit it.
Embodiment
As shown in Fig. 1, a method for identifying voice keywords in a broadcast signal comprises the following steps:
Step 1: establish the acoustic model. The acoustic model in this embodiment is a CMU Sphinx model, which is a hidden Markov model / Gaussian mixture model (HMM-GMM).
The combined expression of the Gaussian mixture model is as follows:

p(x) = Σ_{m=1}^{M} P(m) · N(x; μ_m, σ_m² I)

where x denotes a syllable; p(x) is the probability of emitting that syllable; P(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of a submodel, i.e. the m-th submodel; M is the total number of submodels; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; and p(x | m) is the probability of emitting the syllable under the m-th submodel. In addition, the complete acoustic model is designed on the basis of the Gaussian mixture model combined with a Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables are linked to one another to form language. Because a Markov chain can learn the time-varying characteristics of a system and capture the temporal interactions between syllables, it is widely used for acoustic modeling in speech recognition. A hidden Markov model consists of two parts, observed states and hidden states. The observed states are the part we observe directly, such as the data in the speech signal; the hidden states are variables that the model assumes to exist but that cannot be observed. In a hidden Markov model, state transitions take place among the hidden states, but each hidden state needs a distribution to convert it into an observation of the observed states; this is why it is called a "hidden" Markov model. Notably, in a hidden Markov model the value of the hidden variable s at the current time depends only on the previous time step, while the current observation depends only on the hidden variable at the current time step. This property is called the Markov property, and because it makes the model's graphical representation a chain, the model is also called a hidden Markov chain. A hidden Markov chain involves the following two important formulas:

a_{kj} = P(S_t = j | S_{t-1} = k)
b_j(x) = p(x | S = j)

The second formula is the probability density function that models the feature signal of each frame, sometimes called the emission (or "transmitting") function. In acoustic signal modeling we let this function follow a Gaussian mixture model, which yields the overall HMM-GMM model. The first formula describes the transitions between hidden states; the transitions between states can be computed by dynamic programming.
Step 2: establish the text-phoneme mapping table of the keywords.
Step 3: establish the keyword dictionary. In this embodiment, keywords are defined with an expert system: the experience of illegal-broadcast detection experts is extracted into expressions of relevant knowledge. For example, given three candidate keywords, expert experience may state that keyword 1 + keyword 2 indicates an illegal broadcast while keyword 1 + keyword 3 indicates a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to discriminate automatically between illegal and normal broadcasts. This greatly improves the recall of normal broadcasts, and the method outputs not only the judgment of whether the broadcast is illegal but also its confidence; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
Step 4: randomly select some of the speech-segment samples from the training samples, train the acoustic models with the keywords in the dictionary to obtain the multiple acoustic models, then obtain the mapping between the speech features of the keywords and their phonemes and load the mapping into the corresponding acoustic models.
Step 5: define the mapping between phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table.
Step 6: load the trained multiple acoustic models, the text-phoneme mapping table, and the keyword dictionary into multiple decoders.
Step 7: denoise the broadcast signal to be identified and extract its features, then load them into the multiple decoders; obtain the keyword recognition results of the multiple models, obtain a single recognition result by voting, and convert the keyword recognition result into a feature vector.
In the present embodiment, the method for ballot is to establish voting mechanism, and concrete operations are as follows:
Voice to be identified is loaded into each decoder by step 7.1;
Step 7.2 obtains the recognition result that each decoder generates, wherein xi,jIndicate i-th voice of j-th of model
Recognition result, by voting mechanism, that is, occur once being denoted as 1, N number of model had trained if had altogether, if frequency of occurrence m
Greater than N/2, then using this result as recognition result yi, it is denoted as:
The keyword frequency of occurrence gained vote of even certain speech recognition is more than half, retains, otherwise will give up, then by the key
Word is as recognition result.
Step 8: convert the keyword recognition result into a feature vector; take several normal-broadcast speech segments and several illegal-broadcast speech segments, label them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the two classes.
Step 9: load the feature vector obtained in Step 7 into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast. In this embodiment, the decoder is the existing PocketSphinx decoder.
In the present embodiment, the concrete operations being trained in step 4 to acoustic model are as follows:
Step 1.1 records the sample broadcast audio file containing keyword;
CMU sphinx model is given in step 1.2, the audio signal characteristic for extracting sample broadcast audio file, preservation.
In addition to the above steps, the present embodiment further includes an acoustic model parameter adaptation improvement: maximum a posteriori probability (MAP) estimation. In the present embodiment, the maximum a posteriori probability model expression is:

λ_MAP = argmax_λ P(λ | O) = argmax_λ P(O | λ) P(λ)

where P(λ) is the prior probability and P(O | λ) is the likelihood function, i.e. a measure of the likelihood of the data under a specific model setting. In the acoustic model parameter adaptation improvement, P(λ) corresponds to the parameters of the basic Chinese acoustic model of the speech recognition model, and P(O | λ) is the likelihood function of the newly added data.
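For a single Gaussian mean, the MAP estimate described above has a well-known closed form that interpolates between the prior (the base acoustic model) and the newly added data. A sketch, not the patent's implementation, with the relevance factor tau as an assumed hyperparameter:

```python
def map_adapt_mean(prior_mean, new_data, tau=10.0):
    """MAP re-estimate of a Gaussian mean.

    prior_mean: mean from the basic acoustic model, playing the role of
                the prior P(lambda).
    new_data:   newly added adaptation samples, entering through the
                likelihood P(O | lambda).
    tau:        relevance factor weighting the prior (assumed value).
    """
    n = len(new_data)
    # Posterior mode: (tau * prior_mean + sum(data)) / (tau + n).
    return (tau * prior_mean + sum(new_data)) / (tau + n)

# With no data the prior is returned unchanged; with abundant data the
# estimate approaches the sample mean of the new data.
print(map_adapt_mean(0.0, [1.0] * 90))  # 0.9
```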
The specific embodiments described above further describe the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (5)
1. A method of recognizing voice keywords in a broadcast signal, characterized by comprising the following steps:
Step 1: establish multiple acoustic models;
Step 2: establish a text-phoneme mapping table of the keywords;
Step 3: establish a keyword sequence dictionary and speech fragments containing the keywords;
Step 4: train the multiple acoustic models using the keywords in the dictionary, obtain the mapping between the speech features and the phonemes of the keywords, and load the mapping into the acoustic models;
Step 5: define the mapping between the phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table;
Step 6: load the trained multiple acoustic models, the text-phoneme mapping table, and the keyword sequence dictionary into multiple decoders respectively;
Step 7: denoise the broadcast signal to be recognized and perform feature extraction on it, then load it into the multiple decoders respectively to obtain the keyword recognition results of the multiple models, obtain a single recognition result by the method of voting, and then assemble the keyword recognition results into feature vectors;
Step 8: assemble the keyword recognition results into feature vectors; take several normal broadcast speech segments and illegal broadcast speech segments, label them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the binary classification standard;
Step 9: load the feature vector obtained in step 7 into the SVM classifier to obtain the judgement result of whether the broadcast signal is a normal broadcast or an illegal broadcast.
2. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that step 1
Described in acoustic model be CMU sphinx model, the concrete operations being trained to acoustic model in step 4 are as follows:
Step 4.1 records several sample broadcast audio files containing keyword, as training sample;And by training sample with
Machine forms multiple training sets;
Step 4.2, the audio signal characteristic for extracting sample broadcast audio file in each training set, are believed using the audio extracted
Number feature is trained multiple CMU sphinx models.
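The random composition of training sets in step 4.1 might be sketched as follows (the claim does not specify whether sampling is with or without replacement; sampling without replacement within each set is assumed here, and the file names are hypothetical):

```python
import random

def make_training_sets(samples, n_models, set_size, seed=0):
    """Compose one randomly drawn training set per acoustic model."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return [rng.sample(samples, set_size) for _ in range(n_models)]

# Hypothetical recorded keyword samples from step 4.1.
samples = [f"keyword_sample_{k}.wav" for k in range(10)]
training_sets = make_training_sets(samples, n_models=5, set_size=6)
print(len(training_sets), len(training_sets[0]))  # 5 6
```

Drawing distinct random subsets gives each CMU Sphinx model a different view of the data, which is what makes the later majority vote informative.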
3. the method for voiced keyword in a kind of identification broadcast singal according to claim 2, which is characterized in that described
CMU sphinx model is hidden Markov-gauss hybrid models;
The Combined expression formula of the gauss hybrid models is as follows:
Wherein x indicates a syllable;P (x) is the probability for exporting the syllable;P (m) is the power of corresponding Gaussian probability-density function
Value;μmAnd σm 2It is the parameter of corresponding Gaussian Profile;M is the index of submodel, i.e. m-th of submodel;M is submodel in total
Quantity;N () is multivariate Gaussian distribution;I is the unit matrix of corresponding data dimension;P (x | m) it is for m-th of model, output
The probability of the syllable.
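The mixture density of claim 3 can be evaluated directly. A scalar sketch (with the diagonal covariance σ_m²·I, the multivariate density factorizes into such one-dimensional terms; the component parameters below are illustrative):

```python
import math

def gmm_density(x, weights, means, variances):
    """P(x) = sum_m P(m) * N(x; mu_m, sigma_m^2) for scalar x."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)  # Gaussian normalizer
        total += w * norm * math.exp(-(x - mu) ** 2 / (2.0 * var))
    return total

# Two equally weighted unit-variance components centered at 0 and 1.
p = gmm_density(0.0, weights=[0.5, 0.5], means=[0.0, 1.0], variances=[1.0, 1.0])
print(round(p, 4))
```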
4. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that the step
In rapid three, the method for establishing keyword sequences dictionary is using Manual definition.
5. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that described
The concrete operations voted in step 7 are as follows:
Voice to be identified loading is loaded into each decoder by step 7.1 respectively;
Step 7.2 obtains the recognition result that each decoder generates, wherein xi,jIndicate the knowledge of i-th voice of j-th of model
Not as a result, by voting mechanism, that is, occur once being denoted as 1, if N number of model is had trained altogether, if frequency of occurrence m is greater than
N/2, then using this result as recognition result yi, it is denoted as:
The keyword frequency of occurrence gained vote of even certain speech recognition is more than half, retains, otherwise will give up, then makees the keyword
For recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910596186.4A CN110265003B (en) | 2019-07-03 | 2019-07-03 | Method for recognizing voice keywords in broadcast signal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110265003A true CN110265003A (en) | 2019-09-20 |
CN110265003B CN110265003B (en) | 2021-06-11 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096650A (en) * | 2021-03-03 | 2021-07-09 | 河海大学 | Acoustic decoding method based on prior probability |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894125A (en) * | 2010-05-13 | 2010-11-24 | 复旦大学 | Content-based video classification method |
CN102156885B (en) * | 2010-02-12 | 2014-03-26 | 中国科学院自动化研究所 | Image classification method based on cascaded codebook generation |
CN104268557A (en) * | 2014-09-15 | 2015-01-07 | 西安电子科技大学 | Polarization SAR classification method based on cooperative training and depth SVM |
CN107274889A (en) * | 2017-06-19 | 2017-10-20 | 北京紫博光彦信息技术有限公司 | A kind of method and device according to speech production business paper |
CN108703824A (en) * | 2018-03-15 | 2018-10-26 | 哈工大机器人(合肥)国际创新研究院 | A kind of bionic hand control system and control method based on myoelectricity bracelet |
Non-Patent Citations (3)
Title |
---|
Yuan Xiang: "Construction and Research of a Robot Speech Recognition System Based on Sphinx", Computer Knowledge and Technology * |
Tan Xu: "Research on the Identification of Illegal FM Broadcast Signals", West China Broadcasting TV * |
Luo Ruisen: "Research on an Offline Radio Electromagnetic Spectrum Control System Based on Keyword Recognition", Electronic Test * |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |