CN110265003A - A method for identifying voice keywords in broadcast signals - Google Patents
A method for identifying voice keywords in broadcast signals Download PDF Info
- Publication number
- CN110265003A CN110265003A CN201910596186.4A CN201910596186A CN110265003A CN 110265003 A CN110265003 A CN 110265003A CN 201910596186 A CN201910596186 A CN 201910596186A CN 110265003 A CN110265003 A CN 110265003A
- Authority
- CN
- China
- Prior art keywords
- keyword
- broadcast
- model
- trained
- broadcast signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The invention discloses a method for identifying voice keywords in a broadcast signal, comprising the following steps: (1) establish an acoustic model; (2) establish a text-phoneme mapping table for the keywords; (3) establish a keyword dictionary; (4) train the acoustic model with the keywords in the dictionary, obtain the mapping between the speech features of the keywords and their phonemes, and load that mapping into the acoustic model; (5) define the mapping between phonemes and the specified keywords, and save it into the text-phoneme mapping table; (6) load the trained acoustic model, the text-phoneme mapping table, and the keyword dictionary into a decoder; (7) denoise the broadcast signal to be identified, extract its features, and feed them to the decoder to obtain the keyword recognition result. Because the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords alone, complete, meaningful sentences are not required, which improves the fault tolerance of the text-phoneme mapping table.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a method for identifying voice keywords in broadcast signals.
Background technique
With the continuous development of the economy and of radio communication, the importance of radio spectrum resources has become increasingly prominent. Driven by economic interests, many illegal broadcasters privately purchase equipment, set up broadcast base stations, and transmit false advertising or conduct economic fraud. Such behavior disrupts economic order, interferes with legitimate communication signals, and can even cause safety accidents. Designing effective methods for the efficient, automatic screening and control of illegal broadcast signals therefore has great social and economic value. Traditional screening of illegal broadcasts mostly relies on manual listening, which is labor-intensive and error-prone. Although automatic speech recognition has found many applications in daily life, illegal broadcast signals differ markedly from everyday speech, so the automatic screening of illegal broadcasts in spectrum monitoring remains a stubborn problem. There are two main reasons. First, broadcast signals carry substantial noise, and their speech is often not plain Mandarin. Second, for confidentiality reasons, systems that screen illegal broadcasts usually cannot connect to the external Internet, so most existing automatic speech recognition models cannot be used; in addition, the shortage of training samples significantly limits the use of existing models.
Although Chinese speech recognition has achieved considerable results and applications, certain characteristics of broadcast signals make it difficult to apply automatic speech recognition technology directly to the legality screening of broadcast signals. In particular, for security reasons, legality screening must be carried out offline, so many existing commercial online models cannot be used directly.
Summary of the invention
The present invention aims to solve the problems that arise in the prior art when detecting illegal broadcasts with automatic speech recognition: keywords that cannot be identified, and many irrelevant words that are falsely "recognized". It provides a method for identifying voice keywords in broadcast signals in which the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords themselves. Because complete, meaningful sentences are not required, and only specific keywords in the speech need to be identified in order to judge whether a broadcast is normal or illegal, the method improves efficiency and simplifies the recognition process.
The present invention is achieved through the following technical solutions:
A method for identifying voice keywords in a broadcast signal comprises the following steps:
Step 1: establish multiple acoustic models;
Step 2: establish the text-phoneme mapping table of the keywords;
Step 3: establish a keyword dictionary and speech segments containing the keywords;
Step 4: train the multiple acoustic models with the keywords in the dictionary, obtain the mapping between the speech features of the keywords and their phonemes, and load that mapping into the acoustic models;
Step 5: define the mapping between phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table;
Step 6: load the trained acoustic models, the text-phoneme mapping table, and the keyword dictionary into multiple decoders;
Step 7: denoise the broadcast signal to be identified, extract its features, and load them into the multiple decoders; obtain the keyword recognition results of the multiple models, combine them into a single recognition result by voting, and convert the keyword recognition result into a feature vector;
Step 8: take several normal-broadcast speech segments and several illegal-broadcast speech segments, label them "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the two classes;
Step 9: load the feature vector obtained in Step 7 into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast.
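Steps 8 and 9 hinge on a two-class SVM trained on "normal broadcast" / "illegal broadcast" feature vectors. As a minimal, self-contained sketch of that stage (the keyword-count feature vectors, labels, and training constants below are invented for illustration, and a Pegasos-style sub-gradient trainer is used as a stand-in since the patent does not specify an SVM implementation):

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=500, seed=0):
    """Pegasos-style sub-gradient training of a linear SVM.
    samples: list of feature vectors; labels: +1 (illegal) / -1 (normal)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(samples)), len(samples)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [wj * (1 - eta * lam) for wj in w]   # regularization shrink
            if margin < 1:                           # hinge-loss violation
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def classify(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score >= 0 else -1   # +1: illegal broadcast, -1: normal

# Hypothetical feature vectors: [count(kw1), count(kw2), count(kw3)]
illegal = [[2, 3, 0], [1, 2, 0], [3, 4, 1]]   # keyword 1 + keyword 2 pattern
normal  = [[2, 0, 3], [1, 0, 2], [0, 0, 1]]   # keyword 1 + keyword 3 pattern
X = illegal + normal
y = [+1] * 3 + [-1] * 3
w, b = train_linear_svm(X, y)
print(classify(w, b, [2, 2, 0]))
```

In production one would use a mature SVM library rather than this sketch; the point is only that the keyword feature vectors of Step 7 feed a standard two-class margin classifier.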
Broadcast signals generally carry persistent noise, their announcing tone is unusual, and they are sometimes mixed with non-speech signals. General-purpose speech recognition models, trained on quiet backgrounds and standard conversational speech, are therefore difficult to use directly on our problem. To solve this, a dedicated speech recognition model adapted to broadcast speech must be constructed. Specifically, the inventors structure the automatic speech recognition model into three parts: the acoustic model, the text-phoneme mapping table, and the dictionary. The key design question is how to train or adjust these three parts so that the recognition system performs well on broadcast signals. Since only a limited number of broadcast signal samples are available, the whole system cannot be trained from scratch; the existing automatic speech recognition system PocketSphinx is therefore used as the base model, and its parameters are adjusted according to the broadcast signals so that it can accurately recognize recorded broadcast speech. After the acoustic model, the text-phoneme mapping table, and the keyword dictionary have been established, and in order to improve recognition efficiency, the training data for the text-phoneme mapping table should match the characteristics of the speech in real broadcast signals as closely as possible. Hence, in Step 4, the keywords in the keyword dictionary are first recorded as training samples, and the acoustic model is trained on those samples. A difference from conventional methods is that, by defining keywords, only the keywords are recognized during decoding rather than full, meaningful sentences, which makes recognition much faster than conventional methods and improves the practical applicability of the method. The text-phoneme mapping table is then adapted to obtain the mapping between phonemes and the defined keywords. Finally, the trained acoustic models, the adapted text-phoneme mapping table, and the keyword dictionary are loaded into the decoders. To improve the accuracy of the subsequent SVM classification, this scheme uses multiple acoustic models, trains them on different sample sets, and loads each trained acoustic model, together with a text-phoneme mapping table and a keyword dictionary, into its own decoder. The same denoised broadcast signal to be identified is loaded into all of these decoders, and each decoder produces its own recognition result. For example, with ten decoders in total, if eight of them recognize the same keyword in the fourth sentence of the broadcast signal while two do not, the keyword is retained as a recognized keyword in this step; this procedure is applied to every sentence of the broadcast signal to be identified. Distributing the training task over multiple models allows the acoustic models to be trained simultaneously, which saves time, and, most importantly, counteracts both missed recognitions and false recognitions, improving recognition accuracy. Combining multiple classifiers also enlarges the hypothesis space, making it possible to learn a better approximation of the optimal solution. This in turn improves the accuracy of the text-phoneme mapping table. On the recognition side, only a small set of keywords is added to the dictionary, enabling accurate speech-to-keyword recognition. Because the acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords, and complete, meaningful sentences are not required, the fault tolerance of the designed text-phoneme mapping table is considerably higher.
Further, the acoustic model in Step 1 is a CMU Sphinx model. The specific operations for training the acoustic model in Step 4 are as follows:
Step 4.1: record several sample broadcast audio files containing the keywords as training samples, and randomly partition the training samples into multiple training sets;
Step 4.2: extract the audio signal features of the sample broadcast audio files in each training set, and use the extracted features to train the multiple CMU Sphinx models.
To improve the accuracy of the acoustic model during recognition, sample broadcast audio files containing the keywords are recorded first when making the training samples, and these files serve as the sample files. In this way, the acoustic model obtained after parameter adaptation both retains its keyword recognition ability and gains the ability to perform targeted recognition in the specific radio broadcasting environment. The audio features of the sample files are then extracted and saved for the CMU Sphinx model.
Further, the CMU Sphinx model is a hidden Markov model / Gaussian mixture model (HMM-GMM).
The combined expression of the Gaussian mixture model is as follows:

p(x) = Σ_{m=1}^{M} P(m) · N(x; μ_m, σ_m² I)

where x denotes a syllable; p(x) is the probability of emitting that syllable; P(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of a submodel, i.e. the m-th submodel; M is the total number of submodels; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; and p(x | m) is the probability of emitting the syllable under the m-th submodel.
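To make the mixture expression concrete, the following is a minimal one-dimensional evaluation of p(x) = Σ_m P(m) · N(x; μ_m, σ_m²); the component weights and parameters are illustrative, not taken from the patent:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """p(x) = sum_m P(m) * N(x; mu_m, sigma_m^2); weights must sum to 1."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Two illustrative components, e.g. two acoustic sub-classes of one syllable
weights   = [0.6, 0.4]
means     = [-1.0, 2.0]
variances = [1.0, 0.5]
print(gmm_density(0.0, weights, means, variances))
```

As long as the weights P(m) sum to 1, the mixture remains a valid probability density with multiple distribution centers, which is the property the text relies on.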
Because the Gaussian mixture model is a combined expression of multiple Gaussian distributions, it has multiple distribution centers, which makes it well suited to acoustic modeling: the formula shows that the probability density function can be viewed as a combination of Gaussians, and speech signals are often distributed around multiple centers. Gaussian mixture models are highly expressive, but training them is not simple. For probability distribution functions, the prior art usually trains by maximizing the log-likelihood function. However, since the log-likelihood of the Gaussian mixture model cannot be maximized directly in closed form, the inventors instead train with a heuristic algorithm. The most common such algorithm is the E-M algorithm, which naturally guarantees that the probabilities sum (integrate) to 1, making it widely used for extremum problems of probability density functions. The E-M algorithm can be stated as follows.

Suppose the parameters to be learned are θ, the latent variables of the mixture model are Z (in a Gaussian mixture model, the component coefficients P(m)), the per-component variables are X (in a Gaussian mixture model, the mean and variance of each Gaussian), and the log loss function is log L(θ; X, Z). Then the E-M algorithm is:

E step: compute a lower bound of the objective function;
M step: optimize the parameters in the current step.

By iterating these two steps, the parameter θ gradually converges to an optimal value.
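The E/M loop described above can be sketched for a one-dimensional two-component Gaussian mixture; the synthetic data and initial values below are invented for illustration, and a production trainer such as the one inside CMU Sphinx is far more elaborate:

```python
import math

def em_gmm_1d(data, weights, means, variances, iters=50):
    """E-M for a 1-D Gaussian mixture: the E step computes responsibilities
    (posteriors P(m | x)), the M step re-estimates P(m), mu_m, sigma_m^2."""
    for _ in range(iters):
        # E step: responsibility r[i][m] proportional to P(m) * N(x_i; mu_m, var_m)
        resp = []
        for x in data:
            comps = [w * math.exp(-(x - mu) ** 2 / (2 * var)) /
                     math.sqrt(2 * math.pi * var)
                     for w, mu, var in zip(weights, means, variances)]
            s = sum(comps)
            resp.append([c / s for c in comps])
        # M step: responsibility-weighted re-estimation
        n = len(data)
        for m in range(len(weights)):
            nm = sum(r[m] for r in resp)
            weights[m] = nm / n
            means[m] = sum(r[m] * x for r, x in zip(resp, data)) / nm
            variances[m] = sum(r[m] * (x - means[m]) ** 2
                               for r, x in zip(resp, data)) / nm
            variances[m] = max(variances[m], 1e-6)  # guard against collapse
    return weights, means, variances

# Two well-separated synthetic clusters around 0 and 5
data = [0.1, -0.2, 0.3, 0.0, -0.1, 4.9, 5.2, 5.0, 4.8, 5.1]
w, mu, var = em_gmm_1d(data, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
print(sorted(mu))
```

The update keeps the weights summing to 1 by construction, which is the property of E-M that the text highlights.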
In addition, the complete acoustic model is designed on the basis of the Gaussian mixture model combined with a Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables are linked to one another to form language. Because a Markov chain can learn the time-varying characteristics of a system and capture the temporal interactions between syllables, it is widely used for acoustic modeling in speech recognition. A hidden Markov model consists of two parts, observed states and hidden states. The observed states are the part we observe directly, such as the data in the speech signal; the hidden states are variables that the model assumes to exist but that cannot be observed. In a hidden Markov model, state transitions take place among the hidden states, but each hidden state needs a distribution to convert it into an observation of the observed states; this is why it is called a "hidden" Markov model. Notably, in a hidden Markov model the value of the hidden variable s at the current time depends only on the previous time step, while the current observation depends only on the hidden variable at the current time step. This property is called the Markov property, and because it makes the model's graphical representation a chain, the model is also called a hidden Markov chain. A hidden Markov chain involves the following two important formulas:

a_{kj} = P(S_t = j | S_{t-1} = k)
b_j(x) = p(x | S = j)

The second formula is the probability density function that models the feature signal of each frame, sometimes called the emission (or "transmitting") function. In acoustic signal modeling we let this function follow a Gaussian mixture model, which yields the overall HMM-GMM model. The first formula describes the transitions between hidden states; the transitions between states can be computed by dynamic programming.
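The role of the transition probabilities a_{kj} and the emission probabilities b_j(x) can be illustrated with the forward algorithm on a tiny discrete HMM; the states, transition matrix, and emission table below are invented, and the patent's emission densities are GMMs rather than discrete tables:

```python
def forward(obs, init, trans, emit):
    """Forward algorithm: alpha[t][j] = P(o_1..o_t, S_t = j).
    trans[k][j] is a_kj = P(S_t = j | S_{t-1} = k); emit[j][o] is b_j(o)."""
    n = len(init)
    alpha = [[init[j] * emit[j][obs[0]] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[k] * trans[k][j] for k in range(n)) * emit[j][o]
                      for j in range(n)])
    return sum(alpha[-1])  # P(observation sequence)

# Toy 2-state model: hidden state = "keyword" vs "background"
init  = [0.5, 0.5]
trans = [[0.8, 0.2], [0.3, 0.7]]
emit  = [{"hi": 0.9, "lo": 0.1}, {"hi": 0.2, "lo": 0.8}]
print(forward(["hi", "hi", "lo"], init, trans, emit))
```

The dynamic-programming recurrence over alpha is exactly the "transfer between states computed by dynamic programming" that the text refers to.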
Further, in Step 3, the keyword dictionary is established by manual definition. In this method keywords are defined with an expert system: the experience of illegal-broadcast detection experts is extracted into expressions of relevant knowledge. For example, given three candidate keywords, expert experience may state that keyword 1 + keyword 2 indicates an illegal broadcast while keyword 1 + keyword 3 indicates a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to discriminate automatically between illegal and normal broadcasts. This greatly improves the recall of normal broadcasts, and the method outputs not only the judgment of whether a broadcast is illegal but also its confidence; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
Further, the specific voting operations in Step 7 are as follows:
Step 7.1: load the speech to be identified into each decoder;
Step 7.2: obtain the recognition result produced by each decoder, where x_{i,j} denotes the recognition result of the j-th model on the i-th speech segment. Under the voting mechanism each occurrence counts as 1; if N models have been trained in total and the number of occurrences m is greater than N/2, this result is taken as the recognition result y_i. That is, if the votes for a keyword recognized in a speech segment exceed half, the keyword is retained; otherwise it is discarded. The retained keywords form the recognition result.

By loading the speech to be identified into every decoder and combining all recognition results to retain or reject keywords, the probability of misrecognition in this process is reduced as much as possible. For example, with only one decoder, a keyword in a sentence that is not recognized is simply missed; with multiple decoders recognizing the same speech, this problem is largely avoided. The counting rule above then filters out the keywords whose votes exceed half, which helps improve accuracy in the subsequent classification.
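The majority-vote rule just described can be sketched directly (N decoders, keep a keyword when its vote count m exceeds N/2; the decoder outputs below are illustrative):

```python
from collections import Counter

def vote_keywords(decoder_outputs):
    """decoder_outputs: one list of recognized keywords per decoder.
    Returns the keywords recognized by a strict majority of decoders."""
    n = len(decoder_outputs)
    votes = Counter()
    for out in decoder_outputs:
        for kw in set(out):       # each decoder votes at most once per keyword
            votes[kw] += 1
    return sorted(kw for kw, m in votes.items() if m > n / 2)

# Ten decoders: eight recognize "kw1" in a sentence, two miss it;
# three additionally produce a spurious "kw9"
outputs = [["kw1"]] * 8 + [[]] * 2
outputs = [o + (["kw9"] if i < 3 else []) for i, o in enumerate(outputs)]
print(vote_keywords(outputs))  # -> ['kw1']
```

With 8 of 10 votes, "kw1" survives the N/2 threshold, while the 3-vote "kw9" is rejected, matching the ten-decoder example in the text.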
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The acoustic model and the text-phoneme mapping table are trained on samples recorded from the keywords, and complete, meaningful sentences are not needed, which significantly improves the fault tolerance of the text-phoneme mapping table in this method;
2. When making the training samples, sample broadcast audio files containing the keywords are recorded first and used as the sample files, so that the acoustic model obtained after parameter adaptation both retains its keyword recognition ability and gains the ability to perform targeted recognition in the specific radio broadcasting environment.
Description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and constitute a part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is the signal processing flow chart of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiment and the drawings. The exemplary embodiment of the invention and its explanation serve only to explain the invention and do not limit it.
Embodiment
As shown in Fig. 1, a method for identifying voice keywords in a broadcast signal comprises the following steps:
Step 1: establish the acoustic model. The acoustic model in this embodiment is a CMU Sphinx model, which is a hidden Markov model / Gaussian mixture model (HMM-GMM).
The combined expression of the Gaussian mixture model is as follows:

p(x) = Σ_{m=1}^{M} P(m) · N(x; μ_m, σ_m² I)

where x denotes a syllable; p(x) is the probability of emitting that syllable; P(m) is the weight of the corresponding Gaussian probability density function; μ_m and σ_m² are the parameters of the corresponding Gaussian distribution; m is the index of a submodel, i.e. the m-th submodel; M is the total number of submodels; N(·) is the multivariate Gaussian distribution; I is the identity matrix of the corresponding data dimension; and p(x | m) is the probability of emitting the syllable under the m-th submodel. In addition, the complete acoustic model is designed on the basis of the Gaussian mixture model combined with a Markov chain. Specifically, in speech recognition a speech signal is composed of syllables, and the syllables are linked to one another to form language. Because a Markov chain can learn the time-varying characteristics of a system and capture the temporal interactions between syllables, it is widely used for acoustic modeling in speech recognition. A hidden Markov model consists of two parts, observed states and hidden states. The observed states are the part we observe directly, such as the data in the speech signal; the hidden states are variables that the model assumes to exist but that cannot be observed. In a hidden Markov model, state transitions take place among the hidden states, but each hidden state needs a distribution to convert it into an observation of the observed states; this is why it is called a "hidden" Markov model. Notably, in a hidden Markov model the value of the hidden variable s at the current time depends only on the previous time step, while the current observation depends only on the hidden variable at the current time step. This property is called the Markov property, and because it makes the model's graphical representation a chain, the model is also called a hidden Markov chain. A hidden Markov chain involves the following two important formulas:

a_{kj} = P(S_t = j | S_{t-1} = k)
b_j(x) = p(x | S = j)

The second formula is the probability density function that models the feature signal of each frame, sometimes called the emission (or "transmitting") function. In acoustic signal modeling we let this function follow a Gaussian mixture model, which yields the overall HMM-GMM model. The first formula describes the transitions between hidden states; the transitions between states can be computed by dynamic programming.
Step 2: establish the text-phoneme mapping table of the keywords.
Step 3: establish the keyword dictionary. In this embodiment, keywords are defined with an expert system: the experience of illegal-broadcast detection experts is extracted into expressions of relevant knowledge. For example, given three candidate keywords, expert experience may state that keyword 1 + keyword 2 indicates an illegal broadcast while keyword 1 + keyword 3 indicates a normal broadcast. When subsequently judging whether a broadcast is illegal, the inventors also add fuzzy logic and adaptive discrimination, using the SVM (support vector machine) method from machine learning to discriminate automatically between illegal and normal broadcasts. This greatly improves the recall of normal broadcasts, and the method outputs not only the judgment of whether the broadcast is illegal but also its confidence; when the confidence is low, manual intervention can be requested to determine whether the broadcast is illegal.
Step 4: randomly select some of the speech-segment samples from the training samples, train the acoustic models with the keywords in the dictionary to obtain the multiple acoustic models, then obtain the mapping between the speech features of the keywords and their phonemes and load the mapping into the corresponding acoustic models.
Step 5: define the mapping between phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table.
Step 6: load the trained multiple acoustic models, the text-phoneme mapping table, and the keyword dictionary into multiple decoders.
Step 7: denoise the broadcast signal to be identified and extract its features, then load them into the multiple decoders; obtain the keyword recognition results of the multiple models, obtain a single recognition result by voting, and convert the keyword recognition result into a feature vector.
In the present embodiment, the method for ballot is to establish voting mechanism, and concrete operations are as follows:
Voice to be identified is loaded into each decoder by step 7.1;
Step 7.2 obtains the recognition result that each decoder generates, wherein xi,jIndicate i-th voice of j-th of model
Recognition result, by voting mechanism, that is, occur once being denoted as 1, N number of model had trained if had altogether, if frequency of occurrence m
Greater than N/2, then using this result as recognition result yi, it is denoted as:
The keyword frequency of occurrence gained vote of even certain speech recognition is more than half, retains, otherwise will give up, then by the key
Word is as recognition result.
Step 8: convert the keyword recognition result into a feature vector; take several normal-broadcast speech segments and several illegal-broadcast speech segments, label them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the two classes.
Step 9: load the feature vector obtained in Step 7 into the SVM classifier to obtain the judgment of whether the broadcast signal is a normal broadcast or an illegal broadcast. In this embodiment, the decoder is the existing PocketSphinx decoder.
In the present embodiment, the concrete operations being trained in step 4 to acoustic model are as follows:
Step 1.1 records the sample broadcast audio file containing keyword;
CMU sphinx model is given in step 1.2, the audio signal characteristic for extracting sample broadcast audio file, preservation.
In addition to the above steps, the present embodiment further includes an acoustic model parameter adaptation improvement: maximum a posteriori probability (MAP) estimation. In the present embodiment, the maximum a posteriori probability model expression is:

λ_MAP = argmax_λ P(λ | O) = argmax_λ P(O | λ) P(λ)

where P(λ) is the prior probability and P(O | λ) is the likelihood function, i.e. a measure of the likelihood of the data under a specific model setting. In the acoustic model parameter adaptation improvement, P(λ) corresponds to the parameters of the basic Chinese acoustic model of the speech recognition model, and P(O | λ) is the likelihood function of the newly added data.
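For a single Gaussian mean, the MAP estimate described above has a well-known closed form that interpolates between the prior (the base acoustic model) and the newly added data. A sketch, not the patent's implementation, with the relevance factor tau as an assumed hyperparameter:

```python
def map_adapt_mean(prior_mean, new_data, tau=10.0):
    """MAP re-estimate of a Gaussian mean.

    prior_mean: mean from the basic acoustic model, playing the role of
                the prior P(lambda).
    new_data:   newly added adaptation samples, entering through the
                likelihood P(O | lambda).
    tau:        relevance factor weighting the prior (assumed value).
    """
    n = len(new_data)
    # Posterior mode: (tau * prior_mean + sum(data)) / (tau + n).
    return (tau * prior_mean + sum(new_data)) / (tau + n)

# With no data the prior is returned unchanged; with abundant data the
# estimate approaches the sample mean of the new data.
print(map_adapt_mean(0.0, [1.0] * 90))  # 0.9
```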
The specific embodiments described above further describe the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (5)
1. A method of recognizing voice keywords in a broadcast signal, characterized by comprising the following steps:
Step 1: establish multiple acoustic models;
Step 2: establish a text-phoneme mapping table of the keywords;
Step 3: establish a keyword sequence dictionary and speech fragments containing the keywords;
Step 4: train the multiple acoustic models using the keywords in the dictionary, obtain the mapping between the speech features and the phonemes of the keywords, and load the mapping into the acoustic models;
Step 5: define the mapping between the phonemes and the specified keywords, and save the mapping into the text-phoneme mapping table;
Step 6: load the trained multiple acoustic models, the text-phoneme mapping table, and the keyword sequence dictionary into multiple decoders respectively;
Step 7: denoise the broadcast signal to be recognized and perform feature extraction on it, then load it into the multiple decoders respectively to obtain the keyword recognition results of the multiple models, obtain a single recognition result by the method of voting, and then assemble the keyword recognition results into feature vectors;
Step 8: assemble the keyword recognition results into feature vectors; take several normal broadcast speech segments and illegal broadcast speech segments, label them with the electronic tags "normal broadcast" and "illegal broadcast" respectively to form a training set, and train an SVM classifier with "normal broadcast / illegal broadcast" as the binary classification standard;
Step 9: load the feature vector obtained in step 7 into the SVM classifier to obtain the judgement result of whether the broadcast signal is a normal broadcast or an illegal broadcast.
2. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that step 1
Described in acoustic model be CMU sphinx model, the concrete operations being trained to acoustic model in step 4 are as follows:
Step 4.1 records several sample broadcast audio files containing keyword, as training sample;And by training sample with
Machine forms multiple training sets;
Step 4.2, the audio signal characteristic for extracting sample broadcast audio file in each training set, are believed using the audio extracted
Number feature is trained multiple CMU sphinx models.
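The random composition of training sets in step 4.1 might be sketched as follows (the claim does not specify whether sampling is with or without replacement; sampling without replacement within each set is assumed here, and the file names are hypothetical):

```python
import random

def make_training_sets(samples, n_models, set_size, seed=0):
    """Compose one randomly drawn training set per acoustic model."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return [rng.sample(samples, set_size) for _ in range(n_models)]

# Hypothetical recorded keyword samples from step 4.1.
samples = [f"keyword_sample_{k}.wav" for k in range(10)]
training_sets = make_training_sets(samples, n_models=5, set_size=6)
print(len(training_sets), len(training_sets[0]))  # 5 6
```

Drawing distinct random subsets gives each CMU Sphinx model a different view of the data, which is what makes the later majority vote informative.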
3. the method for voiced keyword in a kind of identification broadcast singal according to claim 2, which is characterized in that described
CMU sphinx model is hidden Markov-gauss hybrid models;
The Combined expression formula of the gauss hybrid models is as follows:
Wherein x indicates a syllable;P (x) is the probability for exporting the syllable;P (m) is the power of corresponding Gaussian probability-density function
Value;μmAnd σm 2It is the parameter of corresponding Gaussian Profile;M is the index of submodel, i.e. m-th of submodel;M is submodel in total
Quantity;N () is multivariate Gaussian distribution;I is the unit matrix of corresponding data dimension;P (x | m) it is for m-th of model, output
The probability of the syllable.
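The mixture density of claim 3 can be evaluated directly. A scalar sketch (with the diagonal covariance σ_m²·I, the multivariate density factorizes into such one-dimensional terms; the component parameters below are illustrative):

```python
import math

def gmm_density(x, weights, means, variances):
    """P(x) = sum_m P(m) * N(x; mu_m, sigma_m^2) for scalar x."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)  # Gaussian normalizer
        total += w * norm * math.exp(-(x - mu) ** 2 / (2.0 * var))
    return total

# Two equally weighted unit-variance components centered at 0 and 1.
p = gmm_density(0.0, weights=[0.5, 0.5], means=[0.0, 1.0], variances=[1.0, 1.0])
print(round(p, 4))
```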
4. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that the step
In rapid three, the method for establishing keyword sequences dictionary is using Manual definition.
5. the method for voiced keyword in a kind of identification broadcast singal according to claim 1, which is characterized in that described
The concrete operations voted in step 7 are as follows:
Voice to be identified loading is loaded into each decoder by step 7.1 respectively;
Step 7.2 obtains the recognition result that each decoder generates, wherein xi,jIndicate the knowledge of i-th voice of j-th of model
Not as a result, by voting mechanism, that is, occur once being denoted as 1, if N number of model is had trained altogether, if frequency of occurrence m is greater than
N/2, then using this result as recognition result yi, it is denoted as:
The keyword frequency of occurrence gained vote of even certain speech recognition is more than half, retains, otherwise will give up, then makees the keyword
For recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910596186.4A CN110265003B (en) | 2019-07-03 | 2019-07-03 | Method for recognizing voice keywords in broadcast signal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110265003A true CN110265003A (en) | 2019-09-20 |
CN110265003B CN110265003B (en) | 2021-06-11 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096650A (en) * | 2021-03-03 | 2021-07-09 | 河海大学 | Acoustic decoding method based on prior probability |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894125A (en) * | 2010-05-13 | 2010-11-24 | 复旦大学 | Content-based video classification method |
CN102156885B (en) * | 2010-02-12 | 2014-03-26 | 中国科学院自动化研究所 | Image classification method based on cascaded codebook generation |
CN104268557A (en) * | 2014-09-15 | 2015-01-07 | 西安电子科技大学 | Polarization SAR classification method based on cooperative training and depth SVM |
CN107274889A (en) * | 2017-06-19 | 2017-10-20 | 北京紫博光彦信息技术有限公司 | A kind of method and device according to speech production business paper |
CN108703824A (en) * | 2018-03-15 | 2018-10-26 | 哈工大机器人(合肥)国际创新研究院 | A kind of bionic hand control system and control method based on myoelectricity bracelet |
Non-Patent Citations (3)
Title |
---|
Yuan Xiang: "Construction and Research of a Robot Speech Recognition System Based on Sphinx", Computer Knowledge and Technology * |
Tan Xu: "Research on the Identification of Illegal FM Broadcast Signals", West China Broadcasting TV * |
Luo Ruisen: "Research on an Offline Radio Electromagnetic Spectrum Control System Based on Keyword Recognition", Electronic Test * |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |