CN107610720A - Mispronunciation detection method, apparatus, storage medium and device - Google Patents
- Publication number
- CN107610720A CN107610720A CN201710895726.XA CN201710895726A CN107610720A CN 107610720 A CN107610720 A CN 107610720A CN 201710895726 A CN201710895726 A CN 201710895726A CN 107610720 A CN107610720 A CN 107610720A
- Authority
- CN
- China
- Prior art keywords
- phoneme
- voice
- error detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a mispronunciation detection method, apparatus, storage medium and device. The method includes: detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and performing mispronunciation detection on said phoneme in speech to be detected based on the landmark. Because the invention uses the CTC method to detect key frames as acoustic landmarks, no prior manual annotation of acoustic landmarks is required.
Description
Technical field
The present invention relates to the field of computer-assisted speech technology, and in particular to a mispronunciation detection method, apparatus, storage medium and device.
Background technology
Mispronunciation detection, an important technology in computer-assisted pronunciation training systems, provides an effective way for learners to improve their oral proficiency. Over the past few decades, a large number of segment-level mispronunciation detection methods have emerged. One route is based on automatic speech recognition (ASR), performing mispronunciation detection within a statistical speech recognition framework. By the form of feedback given, it can be further divided into two types. The first is the confidence-score-based approach: for example, the log-likelihood ratio ("Automatic detection of phone-level mispronunciation for language learning", Speech Communication, vol. 30, no. 2-3, pp. 95-108, 2000) measures the similarity between native and non-native acoustic phone models, as does its variant, the goodness-of-pronunciation score ("Phone-level pronunciation scoring of specific phone segments for language instruction", Speech Communication, vol. 30, no. 2, pp. 95-108, 2000). However, when learners are given a low score, they do not know how to correct themselves. The second is the rule-based approach, in which correct pronunciations and their error patterns are added to an extended recognition dictionary and decoded with an extended pronunciation recognition network. Two methods are used to collect the error patterns: one formulates pronunciation rules from expert knowledge; the other uses machine learning, i.e., automatically learning acoustic-phonological rules from annotations of correct and mispronounced speech to generate an acoustic-phonological model ("Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017), or counting mispronunciation rules and their frequencies from a corpus and then extending the pronunciation dictionary with these prior probabilities ("Improvement of segmental mispronunciation detection with prior knowledge extracted from large L2 speech corpus," in Interspeech 2011, pp. 1593-1596). Compared with confidence-score-based methods, rule-based methods can provide learners with richer feedback. The advantage of these ASR-framework methods is that mispronunciations are easy to predict. Their shortcomings are twofold. On the one hand, from the angle of feature discriminability, they all extract the same spectral parameters (such as mel-frequency cepstral coefficients) from every frame of speech; the discriminability of these parameters needs further study, and they implicitly assume that information is uniformly distributed over the speech signal. On the other hand, from the model angle, most ASR systems model the temporal information of phonemes with hidden Markov models (HMMs), but HMMs cannot strongly distinguish sounds that are spectrally similar yet differ in duration ("Comparing different approaches for automatic pronunciation error detection," Speech Communication, vol. 51, no. 10, pp. 845-852, 2009). These methods also depend on the language setting and the scale of the training data, and their detection accuracy for specific error types still needs to be improved.
For second-language (L2) learners, a major challenge of foreign language study lies in realizing specific phonemic contrasts. Such a contrast may exist in the target language but be absent from the learner's mother tongue. Under the influence of negative transfer from the mother tongue and similar effects, a learner's place of articulation often drifts toward that of a similar sound in the native language. The mispronunciations of foreign language learners therefore cannot be classified only as insertion, deletion and substitution errors. Wen Cao et al. defined mispronunciation tendencies according to place and manner of articulation, describing a plausible-but-wrong intermediate state between an L2 speaker's correct pronunciation and an outright mispronunciation ("Developing a Chinese L2 speech database of Japanese learners with narrow-phonetic labels for computer assisted pronunciation training", in Interspeech 2010, pp. 1922-1925). Such cases often appear in advanced learners. To identify these subtle variations, another route treats mispronunciation detection as a binary classification task, detecting incorrect pronunciations and their mispronunciation tendencies. However, finding discriminative features for each error type is often extremely difficult.
Stevens' acoustic landmark theory, starting from the mechanism of human speech production, defines landmarks as the transient regions that describe the quantal, nonlinear relation between articulation and acoustics (Acoustic Phonetics, MIT Press, 2000, vol. 30; "The quantal nature of speech: Evidence from articulatory-acoustic data," 1972, pp. 51-66; "On the quantal nature of speech," Journal of Phonetics, 1989, vol. 17, no. 2, pp. 3-45; "Quantal theory, enhancement and overlap," Journal of Phonetics, 2010, vol. 38, no. 1, pp. 10-19). In these regions the signal changes abruptly; they generally correspond to perceptual foci and articulatory targets, and carry rich phonetic information. A large body of perceptual experiments shows that listeners' attention concentrated at landmarks helps select potentially distinctive features ("Evidence for the role of acoustic boundaries in the perception of speech sounds," in Phonetic Linguistics: Essays in Honor of Peter Ladefoged, edited by V. Fromkin (Academic, New York), pp. 243-255). Extracting distinctive features at landmarks has achieved good results in mispronunciation detection. However, determining the landmark positions that distinguish phonetic categories is extremely difficult: it usually requires study of the speech production mechanism and a large amount of manual annotation, so it is inefficient.
To address the problems above, scholars at home and abroad have proposed a variety of improved methods, which can be roughly divided into three classes.
The first class obtains landmarks from the angle of signal detection, by detecting changes in characteristic parameters of the speech signal at different levels and in different dimensions. Commonly used parameters include short-time energy, zero-crossing rate and formants. Sharlene A. Liu proposed a method using sub-band energy features of speech to detect three kinds of consonant-related landmarks. According to the articulatory features of phonemes, the method divides the speech spectrum into six frequency bands, takes the peaks and valleys of the difference curve of each band's energy as landmark candidates, and obtains the landmark sequence of the speech signal through corresponding decision criteria ("Landmark detection for distinctive feature-based speech recognition," The Journal of the Acoustical Society of America, 1996, vol. 100, no. 5, pp. 3417-3430). A. R. Jayan and P. C. Pandey argued that the sub-band processing established by Liu depends on differences between speakers, so they modeled the smoothed spectral envelope with a Gaussian mixture model (GMM) and detected stop landmarks by applying a rate-of-rise (ROR) function to the GMM parameters ("Detection of stop landmarks using Gaussian mixture modeling of speech spectrum," ICASSP 2009, IEEE, 2009: 4681-4684). Dumpala, considering that earlier vowel landmark detection had ignored source characteristics, extracted acoustic features at glottal closure instants, including source features obtained by zero frequency filtering (ZFF) and vocal-tract features obtained by single frequency filtering (SFF), and then detected vowel landmarks with a rule-based algorithm ("Robust Vowel Landmark Detection Using Epoch-Based Features," in INTERSPEECH 2016, pp. 160-164).
The second class, from the angle of machine learning, selects different parameters for different landmark types. Starting from a convex-hull recursion algorithm for detecting syllable nuclei, Howitt extracted three acoustic features (valley depth, duration and level) and fed them to a multi-layer perceptron (MLP) to detect vowel landmarks, labeling speech frames as vowel or non-vowel ("Automatic Syllable Detection for Vowel Landmarks," doctoral thesis, MIT, 1999). Hasegawa-Johnson et al. built a landmark-based speech recognition system whose first step is landmark detection; their method detects every kind of landmark with a binary SVM classifier ("Landmark-based speech recognition: Report of the 2004 Johns Hopkins summer workshop," ICASSP'05, IEEE, 2005, vol. 1, pp. 213-216). Chi-Yueh Lin and Hsiao-Chuan Wang proposed detecting burst onset landmarks with random forests and bootstrapping, appending the detection results to the extracted MFCC feature vectors and incorporating them into an HMM-based speech recognition system, further improving the detection performance for plosives and affricates ("Burst onset landmark detection and its application to speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2011, vol. 19, no. 5, pp. 1253-1264).
The third class, from a linguistic angle, assumes that a fixed position within the phoneme duration is the landmark and applies it to mispronunciation detection. Yoon assumed that the landmark of an English vowel lies at the midpoint of the phoneme duration, and that the landmarks of a consonant lie at the onset, middle and offset of the phoneme ("Landmark-based automated pronunciation error detection," in Interspeech 2010, pp. 614-617). Using speech splicing synthesis combined with perceptual experiments, Yanlu Xie et al. concluded that Japanese learners perceive the key cue of Chinese nasal codas in the nasalized vowel segment, and took the midpoint of that segment as the landmark ("Landmark of mandarin nasal codas and its application in pronunciation error detection," ICASSP 2016, IEEE, 2016: 5370-5374). Because Chinese has no landmark scheme, Xuesong Yang et al. formulated two mapping schemes: one maps the fixed landmark positions of English phonemes to Chinese according to the International Phonetic Alphabet; in the other, linguists observe and gather statistics to formulate landmark position rules for some Chinese phonemes ("Landmark-based pronunciation error identification on Chinese learning," in Speech Prosody, 2016).
On the whole, previous research either studies the speech production mechanism from the angle of signal detection and designs different parameters for the landmark types of different phonemes, or starts from perceptual experiments and marks landmarks manually, or assumes a fixed position as the landmark. The benefit of the first class of methods is that they need no training data with manually annotated landmarks, but for the landmarks of different phonemes one must study the speech production mechanism and design different discriminative signal parameters; furthermore, fixed criteria are usually chosen, with insufficient consideration of differences between speakers. The benefit of the second class is that one only needs to select discriminative features and classify automatically by machine learning, but it generally relies on manually annotated data for training and needs different discriminative features for different landmarks; detecting all landmarks then requires repeated training. The benefit of the third class is that the assumed fixed positions are convenient to compute, but the phonetic context is not fully taken into account.
Summary of the invention
Embodiments of the present invention provide a mispronunciation detection method, apparatus, storage medium and device, to remedy one or more deficiencies in the prior art.
An embodiment of the present invention provides a mispronunciation detection method, including: detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and performing mispronunciation detection on said phoneme in speech to be detected based on the landmark.
In one embodiment, detecting the key-frame position of a phoneme in known correct speech using the CTC method, as an acoustic landmark, includes: training an RNN acoustic model with the CTC criterion; decoding the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame; computing the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence; computing the mean and variance of all spike-function values greater than zero; constructing a Chebyshev inequality from the mean and variance, and keeping the spike-function values that satisfy it; taking the maximum spike-function value within the set window length; and determining the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
In one embodiment, determining the key-frame position of the phoneme from the peak position of the maximum spike-function value includes: judging whether the phoneme corresponding to the peak position appears in the speech text corresponding to the processing unit of the known correct speech; if so, taking the peak position as the key-frame position; if not, rejecting the peak position, reacquiring the maximum spike-function value from the remaining spike-function values that satisfy the Chebyshev inequality, and determining the key-frame position of the phoneme from the peak position of the reacquired maximum spike-function value.
In one embodiment, determining the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark, includes: comparing the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, to determine the relative key-frame position of the phoneme; and averaging all relative key-frame positions of the phoneme to obtain its final key frame, as the landmark.
In one embodiment, performing mispronunciation detection on said phoneme in the speech to be detected based on the landmark includes: based on the landmark, extracting the acoustic features of said phoneme in speech with known error types and in known correct speech; training an SVM classifier with the acoustic features of said phoneme in the speech with known error types and in the known correct speech; and performing mispronunciation detection on said phoneme in the speech to be detected with the trained SVM classifier.
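As an illustration of the classifier stage described above, the sketch below trains a linear SVM by stochastic subgradient descent on the hinge loss. This is a stand-in for whatever SVM implementation an embodiment would actually use; the feature values, labels and hyper-parameters are invented for illustration. Acoustic features extracted at the landmark from correct speech would be labeled +1, and those from known error types -1.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Linear SVM via stochastic subgradient descent (Pegasos-style) on
    the hinge loss; X is a list of feature vectors, y holds +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]   # L2 shrinkage
            if margin < 1.0:                           # point violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def svm_predict(w, b, x):
    """Classify a landmark feature vector: +1 correct, -1 mispronounced."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1
```

For instance, with toy landmark features X = [[2, 2], [3, 3], [-2, -2], [-3, -3]] and labels y = [1, 1, -1, -1], the trained classifier separates the two groups.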
In one embodiment, the set spike function is:
Wherein, Si(k, i, xi, T) denotes the spike-function value, T denotes the sequence of posterior probabilities of the initial or final at each time frame in the speech of the processing unit, k denotes the window length, xi denotes the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
An embodiment of the present invention also provides a mispronunciation detection apparatus, including: an acoustic landmark determining unit, configured to detect the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and a mispronunciation detection unit, configured to perform mispronunciation detection on said phoneme in speech to be detected based on the landmark.
In one embodiment, the acoustic landmark determining unit includes: an acoustic model training module, configured to train an RNN acoustic model with the CTC criterion; a probability sequence generation module, configured to decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame; a spike-function value generation module, configured to compute the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence; an inequality parameter generation module, configured to compute the mean and variance of all spike-function values greater than zero; a spike-function value screening module, configured to construct a Chebyshev inequality from the mean and variance and keep the spike-function values that satisfy it; a maximum spike-function value determining module, configured to take the maximum spike-function value within the set window length; and an acoustic landmark determining module, configured to determine the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
In one embodiment, the acoustic landmark determining module includes: a phoneme judging module, configured to judge whether the phoneme corresponding to the peak position appears in the speech text corresponding to the processing unit of the known correct speech; and a key-frame position determining module, configured to take the peak position as the key-frame position if it does, and otherwise to reject the peak position, reacquire the maximum spike-function value from the remaining spike-function values satisfying the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak position of the reacquired maximum spike-function value.
In one embodiment, the acoustic landmark determining module includes: a relative key-frame position determining module, configured to compare the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, to determine the relative key-frame position of the phoneme; and a final key frame determining module, configured to average all relative key-frame positions of the phoneme to obtain its final key frame, as the landmark.
In one embodiment, the mispronunciation detection unit includes: an acoustic feature extraction module, configured to extract, based on the landmark, the acoustic features of said phoneme in speech with known error types and in known correct speech; an SVM classifier training module, configured to train an SVM classifier with the acoustic features of said phoneme in the speech with known error types and in the known correct speech; and a mispronunciation detection module, configured to perform mispronunciation detection on said phoneme in the speech to be detected with the trained SVM classifier.
In one embodiment, the spike-function value generation module is further configured to use the set spike function:
Wherein, Si(k, i, xi, T) denotes the spike-function value, T denotes the sequence of posterior probabilities of the initial or final at each time frame in the speech of the processing unit, k denotes the window length, xi denotes the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program realizes the steps of the methods of the embodiments described above.
An embodiment of the present invention also provides a computer device, including a memory, a processor and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the methods of the embodiments described above are realized.
With the mispronunciation detection method, apparatus, storage medium and device of the embodiments of the present invention, key frames are detected based on CTC: key-frame positions detected by the CTC method determine the landmarks, so no manual landmark annotation is needed in advance and the dependence on manual annotation is avoided; furthermore, a unified speech recognition framework is used, which facilitates mispronunciation detection.
Brief description of the drawings
To explain the technical schemes of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative work. In the drawings:
Fig. 1 is a schematic flow chart of the mispronunciation detection method of an embodiment of the present invention;
Fig. 2 is a schematic flow chart, in an embodiment of the present invention, of the method of detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark;
Fig. 3 is a schematic flow chart, in an embodiment of the present invention, of the method of determining the key-frame position of a phoneme from the peak position of the maximum spike-function value;
Fig. 4 is a schematic flow chart, in another embodiment of the present invention, of the method of determining the key-frame position of a phoneme from the peak position of the maximum spike-function value, as an acoustic landmark;
Fig. 5 is a schematic flow chart, in an embodiment of the present invention, of the method of performing mispronunciation detection on a phoneme in speech to be detected based on an acoustic landmark;
Fig. 6 is a schematic flow chart of the algorithm for extracting the spike of each phoneme in a sentence, in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the spike phenomenon of CTC in an embodiment of the present invention;
Fig. 8 is a schematic block diagram of the flow of mispronunciation detection in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of the mispronunciation detection apparatus of an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of the acoustic landmark determining unit in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the acoustic landmark determining module in an embodiment of the present invention;
Fig. 12 is a schematic structural diagram of the acoustic landmark determining module in another embodiment of the present invention;
Fig. 13 is a schematic structural diagram of the mispronunciation detection unit in an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of the computer device of an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical scheme and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings. Here, the schematic descriptions of the present invention are used to explain it, and not as a limitation of it.
To avoid dependence on manual landmark annotation, embodiments of the invention provide a mispronunciation detection method. Fig. 1 is a schematic flow chart of the mispronunciation detection method of an embodiment of the present invention. As shown in Fig. 1, the method may include:
Step S110: detect the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark;
Step S120: perform mispronunciation detection on said phoneme in speech to be detected based on the landmark.
The CTC method can perform sequence labeling learning with a recurrent neural network. A main problem in speech recognition is transforming a sequence of acoustic features into a sequence of text labels, such as a sequence of Chinese initials and finals, where the former is usually longer than the latter. In an embodiment, CTC can introduce a blank label to absorb the easily confused or uncertain boundary between two pronunciation units, and allows labels to repeat, so as to obtain the best alignment between speech frames and output labels. In an embodiment, CTC can use the softmax layer of the RNN to give the posterior probability of each modeling unit at each time step. In an embodiment, a many-to-one mapping can map multiple output label sequences to one sequence without repeated labels and blank labels. In an embodiment, CTC can sum over all possible alignments of the target sequence by the forward-backward algorithm.
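The many-to-one mapping mentioned above can be sketched as follows (a minimal illustration, not the patent's implementation; the pinyin labels are invented for the example). Consecutive repeated labels are merged first, and blank labels are then removed.

```python
def ctc_collapse(path, blank="-"):
    """CTC's many-to-one mapping B: merge consecutive repeats of a label,
    then drop blanks, turning a frame-level path into the output sequence."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

For example, the frame-level path ['-', 'zh', 'zh', '-', 'ong', 'ong'] collapses to the initial/final sequence ['zh', 'ong'], while ['a', 'a', '-', 'a'] collapses to ['a', 'a'], since the blank keeps the two a's distinct.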
In step S110 above, the known correct speech can be obtained from an existing native-speaker corpus. The phoneme can, for example, be an initial or a final. In step S120 above, based on the determined landmark, mispronunciation detection can be performed by various different methods, for example with an SVM (support vector machine) classifier.
In the embodiments of the present invention, key-frame positions are detected by the CTC method to determine the landmarks, without manual landmark annotation in advance, avoiding the dependence on manual annotation. Moreover, a unified speech recognition framework is used, and the detection results are highly consistent.
Fig. 2 is a schematic flow chart, in an embodiment of the present invention, of the method of detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark. As shown in Fig. 2, in step S110 above, this method may include:
Step S111: train an RNN acoustic model with the CTC criterion;
Step S112: decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame;
Step S113: compute the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence;
Step S114: compute the mean and variance of all spike-function values greater than zero;
Step S115: construct a Chebyshev inequality from the mean and variance, and keep the spike-function values that satisfy it;
Step S116: take the maximum spike-function value within the set window length;
Step S117: determine the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
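Steps S114 to S117 can be sketched as below, assuming the per-frame spike-function values of step S113 have already been computed. The screening rule derived from the Chebyshev inequality is rendered here as "at least a standard deviations above the mean", which is an assumption about how the inequality is applied; the factor a is an invented parameter.

```python
from statistics import mean, pvariance

def detect_landmark(scores, a=0.5):
    """Given per-frame spike-function values, return the frame index of
    the surviving maximum spike, or None if no spike qualifies."""
    positive = [s for s in scores if s > 0]            # step S114
    if not positive:
        return None
    mu = mean(positive)
    sigma = pvariance(positive) ** 0.5
    # step S115 (assumed form of the Chebyshev screen): keep values the
    # inequality marks as rare, i.e. far enough above the mean
    candidates = [(s, i) for i, s in enumerate(scores) if s - mu >= a * sigma]
    if not candidates:
        return None
    # steps S116-S117: the largest spike value; its frame is the key frame
    return max(candidates)[1]
```

With scores = [0.0, -0.1, 0.49, 0.0, -0.2, 0.89, 0.0], the function keeps frame 5, whose spike value stands out from the rest.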
In step S111 above, an RNN (recurrent neural network) acoustic model can be trained with the correct speech in the native-speaker corpus as input. In other embodiments, other acoustic models can be used. In step S112 above, the processing unit can, for example, be a sentence, and the sequence is a time series. In step S115 above, the computed mean and variance can be substituted into the standard Chebyshev inequality to obtain a specific Chebyshev inequality; each spike-function value is then substituted into it as the variable to judge whether it satisfies the inequality. In an embodiment, when a spike-function value satisfying the Chebyshev inequality is kept, its original index can also be recorded, so that its time frame (peak position) can be retrieved. In step S116 above, spike-function values can be compared within the set window length (for example 2k), keeping the maximum. In step S117 above, the obtained key-frame position can be used directly as the landmark, or determined as the landmark after further screening or judgment. The inventors found that the output label posterior probabilities of an RNN model trained with the CTC criterion show an obvious spike phenomenon, and this feature can be used to determine landmarks effectively.
In an embodiment, the set peaking function can be:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2,

where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the
initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior
probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.

In this embodiment, the larger the peaking-function value S_i(k, i, x_i, T), the more likely the position is a spike,
so the set peaking function can effectively pick out maximum-spike positions.
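The set peaking function can be sketched directly in Python (an illustrative sketch; the function name, argument order, and the handling of frames near the sequence boundary are assumptions not fixed by the text):

```python
def peak_value(T, i, k):
    """Peaking function S_i(k, i, x_i, T): the average of how far x_i = T[i]
    rises above its k left neighbours and its k right neighbours in the
    posterior-probability sequence T."""
    left = max((T[i] - T[j] for j in range(max(0, i - k), i)), default=0.0)
    right = max((T[i] - T[j] for j in range(i + 1, min(len(T), i + k + 1))), default=0.0)
    return (left + right) / 2.0
```

A frame whose posterior rises sharply above both neighbourhoods receives a large value, while interior frames of a flat region score near zero.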
Fig. 3 is a flow diagram of a method, in an embodiment of the invention, for determining the key-frame position
of a phoneme from the peak location of the maximum peaking-function value. As shown in Fig. 3, in step S117, the method
of determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value may include:
Step S1171: judge whether the known speech text corresponding to the processing unit of the correct speech contains
the phoneme corresponding to the peak location;
Step S1172: if it does, take the peak location as the key-frame position;
Step S1173: if it does not, reject the peak location, reacquire the maximum peaking-function value from the
remaining peaking-function values that satisfy the Chebyshev inequality, and determine the key-frame position of the
phoneme from the peak location of the reacquired maximum.
When the computation is inaccurate, the selected peaking-function maximum may be very small, so that the sentence
(processing unit) does not actually contain the corresponding phoneme. Through steps S1171, S1172 and S1173, combined with the known text,
the peak locations of phonemes not contained in the sentence (processing unit) are rejected, which improves the accuracy of the key-frame position.
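The text-based rejection of steps S1171-S1173 can be sketched as follows (the candidate representation as `(frame, phoneme, value)` tuples is an assumption for illustration):

```python
def best_peak_in_text(candidates, text_phonemes):
    """candidates: list of (frame_index, phoneme, peak_value) that passed the
    Chebyshev screening. Return the frame of the largest-valued candidate
    whose phoneme actually occurs in the sentence's known text, rejecting
    peaks of phonemes the text does not contain (steps S1171-S1173)."""
    for frame, phoneme, value in sorted(candidates, key=lambda c: c[2], reverse=True):
        if phoneme in text_phonemes:
            return frame
    return None  # no peak corresponds to a phoneme of the known text
```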
Fig. 4 is a flow diagram of a method, in another embodiment of the invention, for determining the key-frame
position of a phoneme as the acoustic landmark from the peak location of the maximum peaking-function value. As shown in Fig. 4, in step S117, the
method of determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark, may include:
Step S1174: compare the key-frame position with the phoneme timing information of the annotated text corresponding
to the processing unit of the known correct speech, and determine the relative key-frame position of the phoneme;
Step S1175: average all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme,
which serves as the landmark.
In this embodiment, the phoneme timing information of the annotated text can be the timing information of the initials and finals. A processing
unit (a sentence) may contain several occurrences of the same phoneme; averaging the key-frame positions of these occurrences
yields a unified key-frame position, which is convenient for pronunciation error detection.
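Steps S1174-S1175 amount to averaging landmark offsets relative to annotated phoneme starts; a minimal sketch (the data layout as `(peak_frame, start_frame)` pairs is assumed):

```python
def final_key_frame(occurrences):
    """occurrences: (peak_frame, annotated_phoneme_start_frame) pairs for one
    phoneme across the corpus. The key-frame position relative to each
    annotated start is averaged to give the phoneme's final key frame
    (an offset from the phoneme start), used as the landmark."""
    offsets = [peak - start for peak, start in occurrences]
    return sum(offsets) / len(offsets)
```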
In an embodiment, the key-frame position, final key-frame position, or average key-frame position can be compared with a manually marked
landmark. If they are consistent, the key-frame position, final key-frame position, or average key-frame position
is used as the landmark for pronunciation error detection; if they are inconsistent, the manually marked landmark can be used for
pronunciation error detection instead. This improves the reliability of pronunciation error detection.
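The fallback logic of this embodiment can be sketched as follows (the frame tolerance `tol` used to judge "consistency" is an assumed parameter, not specified in the text):

```python
def choose_landmark(ctc_key_frame, manual_landmark, tol=3):
    """Use the CTC-driven key frame as the landmark when it is consistent
    with a manually marked landmark (within tol frames); otherwise fall
    back to the manual landmark."""
    if manual_landmark is None or abs(ctc_key_frame - manual_landmark) <= tol:
        return ctc_key_frame
    return manual_landmark
```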
Fig. 5 is a flow diagram of a method, in an embodiment of the invention, of performing pronunciation error detection
on a phoneme in speech to be tested based on the acoustic landmark. As shown in Fig. 5, in step S120, the method of performing
pronunciation error detection on the phoneme in the speech to be tested based on the landmark may include:
Step S121: based on the landmark, extract the acoustic features of the phoneme in speech of a known error type and in
known correct speech;
Step S122: train an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the
known correct speech;
Step S123: perform pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
In this embodiment, performing pronunciation error detection with the trained SVM classifier yields better detection
results.
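Steps S121-S123 map naturally onto a scikit-learn workflow (a sketch assuming scikit-learn's `SVC` is an acceptable SVM implementation; the extraction of landmark-anchored features is left abstract):

```python
import numpy as np
from sklearn.svm import SVC

def train_detector(correct_feats, error_feats):
    """Train a per-phoneme SVM on landmark-anchored acoustic features.
    correct_feats / error_feats: (n_samples, n_dims) arrays of features
    extracted around the phoneme's landmark frame.
    Label 0 = correct pronunciation, 1 = mispronounced."""
    X = np.vstack([correct_feats, error_feats])
    y = np.concatenate([np.zeros(len(correct_feats)), np.ones(len(error_feats))])
    return SVC(kernel="rbf").fit(X, y)

def detect(clf, feats):
    """Return a boolean array: True where a mispronunciation is flagged."""
    return clf.predict(feats) == 1
```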
Fig. 6 is a flow diagram of the algorithm, in an embodiment of the invention, for extracting the spike of each phoneme in a sentence.
As shown in Fig. 6, taking a sentence as the processing unit, the method of extracting the spike of each phoneme in the sentence may
include:
Step S301: decode a sentence directly with the trained CTC RNN acoustic model to obtain the probability sequence.
The posterior probability x_i of each modelling unit (for example, an initial or final) on each time frame is extracted from
the native pronunciation, forming a probability sequence T containing N points (N is the number of time steps of the sentence).
Step S302: calculate the peaking-function value a_i corresponding to each time frame and obtain the array of peaking-function values greater than zero.
In an embodiment, the peaking function is chosen as:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2.

S_i(k, i, x_i, T) expresses how salient the probability value x_i of the i-th point of the time series T is relative to the other points; the
larger the value, the more likely the point is a spike. The values S_i(k, i, x_i, T) greater than 0 (representing candidate spikes) are picked out
and added to an array a, keeping their original indices in the time series.
In an embodiment, the window length k can be set to half the average duration of each kind of phoneme counted from the corpus,
or chosen empirically, e.g. k = 4.
Step S303: calculate the mean m and standard deviation s of all elements in array a.
Step S304: screen the peaking-function values using Chebyshev's inequality (Chebyshev Inequality):

P(|X - μ| >= h·σ) <= 1 / h²,

where μ is the mean, σ is the standard deviation, and h is a constant greater than 0. The inequality assumes no particular
distribution for the random variable X; it states that values deviating this far from the mean are rare. If a candidate satisfies
x_i - m >= h·s, the candidate peak value x_i is retained and
its original index recorded. The constant h > 0 can be set manually.
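Steps S303-S304 can be sketched with the standard library (an illustrative sketch; the function name and the exact retention condition `x - m >= h*s` are assumptions consistent with the screening described above):

```python
import statistics

def chebyshev_filter(values, h=2.0):
    """Screen candidate peaks with Chebyshev's inequality:
    P(|X - mean| >= h*std) <= 1/h**2, so values deviating by more than
    h standard deviations are rare, salient peaks.
    values: list of (original_index, peak_value) with peak_value > 0."""
    data = [v for _, v in values]
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return [(i, v) for i, v in values if v - m >= h * s]
```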
Step S305: post-process by comparing the peak values within each window of length 2k and retaining only one maximum.
The spikes that remain are the real candidate spikes, and their original indices are the final candidate peak locations. Because the
maximum selected by this algorithm may still be very small, a sentence may appear not to contain a phoneme. For pronunciation error detection and
annotation tasks, the text is known, so the peak locations of phonemes not contained in the sentence must be rejected with the help of the known text.
In tasks such as speech recognition, by contrast, a threshold needs to be set for key-frame detection, and candidate peak locations
whose peak value is too small are weeded out.
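The window post-processing of step S305 can be sketched as a greedy non-maximum suppression (the greedy strategy is an assumption about how "keep one maximum per 2k window" is implemented):

```python
def suppress_within_window(candidates, k):
    """Within any window of length 2k, keep only the largest candidate peak
    (greedy non-maximum suppression). candidates: list of (frame, value)."""
    kept = []
    for frame, value in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(frame - f) >= 2 * k for f, _ in kept):
            kept.append((frame, value))
    return sorted(kept)
```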
Fig. 7 is a schematic diagram of the CTC spike phenomenon in an embodiment of the invention. As shown in Fig. 7, taking "We've done our
part" as an example, easily confused pronunciations and uncertain boundaries are absorbed by the blank label, and the label
posterior probabilities produced by CTC for the utterance "We've done our part" exhibit spikes for w, iy, v, d, ah, n, aa, r, p, t.
Fig. 8 is a block diagram of the pronunciation error detection flow in an embodiment of the invention. As shown in Fig. 8, the overall detection
framework can be divided into two stages. In the first stage, an RNN acoustic model is trained with the CTC criterion, using the speech of a native-speaker
corpus as input; following the spike-extraction algorithm above, the features extracted from native pronunciations are decoded to generate label posterior
probabilities and extract the peak locations; the key-frame positions are then determined by comparison with the phoneme timing information of the annotated
text (relative to the start time of each phoneme), counting the relative spike position of each kind of phoneme, and the key-frame positions of each kind of
phoneme are averaged to give its final key frame. In the second stage, pronunciation error detection is performed based on the key frames: using the key-frame
positions obtained in the first stage, acoustic features are extracted from samples of a particular phoneme and of its mispronounced variants, and an SVM
classifier trained on correct pronunciations and mispronunciations of the known error type detects the particular phoneme.
In an embodiment, the consistency between the CTC-driven peak locations and manually marked landmark positions can be verified first,
after which the data-driven spikes of the CTC system are used as key frames for pronunciation error detection. The benefit is that
landmarks need not be marked in advance, and a unified speech-recognition framework is used.
Based on the same inventive concept as the pronunciation error detection method shown in Fig. 1, an embodiment of the present application
further provides a pronunciation error detection apparatus, as described in the embodiments below. Since the principle by which the apparatus
solves the problem is similar to that of the pronunciation error detection method, the implementation of the apparatus can refer to the
implementation of the method, and repeated parts are not described again.
Fig. 9 is a schematic structural diagram of the pronunciation error detection apparatus of an embodiment of the invention. As shown in Fig. 9,
the apparatus may include an acoustic landmark determining unit 510 and a pronunciation error detection unit 520, connected
to each other.
The acoustic landmark determining unit 510 is configured to detect, using the connectionist temporal classification (CTC) method,
the key-frame position of a phoneme in known correct speech, as the acoustic landmark.
The pronunciation error detection unit 520 is configured to perform pronunciation error detection on the phoneme in speech
to be tested based on the landmark.
Figure 10 is a schematic structural diagram of the acoustic landmark determining unit in an embodiment of the invention. As shown in Figure 10,
the acoustic landmark determining unit 510 may include: an acoustic model training module 511, a probability sequence generation module 512,
a peaking-function value generation module 513, an inequality parameter generation module 514, a peaking-function value screening module 515,
a maximum peaking-function value determining module 516, and an acoustic landmark determining module 517, connected in sequence.
The acoustic model training module 511 is configured to train an RNN acoustic model with the CTC criterion.
The probability sequence generation module 512 is configured to decode the speech of a processing unit in the known correct speech
with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech.
The peaking-function value generation module 513 is configured to calculate the peaking-function value corresponding to each time frame,
using the set window length, the set peaking function, and each posterior probability in the sequence.
The inequality parameter generation module 514 is configured to calculate the mean and variance of all peaking-function values greater than zero.
The peaking-function value screening module 515 is configured to construct a Chebyshev inequality from the mean and variance and obtain
the peaking-function values that satisfy it.
The maximum peaking-function value determining module 516 is configured to obtain the maximum peaking-function value within a window of the set length.
The acoustic landmark determining module 517 is configured to determine the key-frame position of the phoneme from the peak location of the
maximum peaking-function value, as the landmark.
Figure 11 is a schematic structural diagram of the acoustic landmark determining module in an embodiment of the invention. As shown in Figure 11,
in an embodiment the acoustic landmark determining module 517 may include a phoneme judging module 5171 and a key-frame position determining module 5172,
connected to each other.
The phoneme judging module 5171 is configured to judge whether the known speech text corresponding to the processing unit of the correct speech
contains the phoneme corresponding to the peak location.
The key-frame position determining module 5172 is configured to take the peak location as the key-frame position if it does; if it
does not, to reject the peak location, reacquire the maximum peaking-function value from the remaining peaking-function values that satisfy
the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak location of the reacquired
maximum.
Figure 12 is a schematic structural diagram of the acoustic landmark determining module in another embodiment of the invention. As shown in Figure 12,
in an embodiment the acoustic landmark determining module 517 includes a key-frame relative position determining module 5173 and a final key-frame
determining module 5174, connected to each other.
The key-frame relative position determining module 5173 is configured to compare the key-frame position with the phoneme timing information
of the annotated text corresponding to the processing unit of the known correct speech, and determine the relative key-frame position of the phoneme.
The final key-frame determining module 5174 is configured to average all relative key-frame positions of the phoneme to obtain
the final key frame of the phoneme, as the landmark.
Figure 13 is a schematic structural diagram of the pronunciation error detection unit in an embodiment of the invention. As shown in Figure 13,
the pronunciation error detection unit 520 may include an acoustic feature extraction module 521, an SVM classifier training module 522, and a
pronunciation error detection module 523, connected in sequence.
The acoustic feature extraction module 521 is configured to extract, based on the landmark, the acoustic features of the phoneme
in known error-type speech and in known correct speech.
The SVM classifier training module 522 is configured to train an SVM classifier with the acoustic features of the phoneme in the known
error-type speech and in the known correct speech.
The pronunciation error detection module 523 is configured to perform pronunciation error detection on the phoneme in the speech to be tested
with the trained SVM classifier.
In an embodiment, the peaking-function value generation module 513 may further be configured to use the set peaking function:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2,

where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames
of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an
integer greater than or equal to zero.
An embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed
by a processor, the program implements the steps of the methods of the embodiments described above.
Figure 14 is a schematic structural diagram of the computer device of an embodiment of the invention. As shown in Figure 14, the computer device
600 includes a memory 610, a processor 620, and a computer program stored in the memory and executable on the processor; when the
processor 620 executes the program, the steps of the methods of the embodiments described above are implemented.
In summary, the pronunciation error detection method, apparatus, storage medium, and device of the embodiments of the invention detect
key frames based on CTC: the key-frame positions are detected with the CTC method to determine the landmarks, so landmarks need not be marked
manually in advance, the dependence on manually marked landmarks is avoided, a unified speech-recognition framework is used, and pronunciation
error detection is facilitated.
In the description of this specification, descriptions referring to the terms "one embodiment", "a specific embodiment", "some embodiments",
"for example", "an example", "a specific example", or "some examples" mean that specific features, structures, materials, or characteristics
described in connection with the embodiment or example are included in at least one embodiment or example of the invention. In this specification,
schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures,
materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of the steps
in each embodiment schematically illustrates the implementation of the invention; the step order is not limiting and may be adjusted
as needed.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program
product. Therefore, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment
combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more
computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable
program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products
according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and
combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be
provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing
device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device
produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable
data-processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of
manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more
blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing device, such that a series of
operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the
instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows
of the flowcharts and/or one or more blocks of the block diagrams.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the invention in
detail. It should be understood that the above are only specific embodiments of the invention and are not intended to limit its scope of
protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall be
included within the scope of protection of the invention.
Claims (14)
- 1. A pronunciation error detection method, characterized by comprising: detecting, using the connectionist temporal classification (CTC) method, the key-frame position of a phoneme in known correct speech, as the acoustic landmark; and performing pronunciation error detection on the phoneme in speech to be tested based on the landmark.
- 2. The pronunciation error detection method of claim 1, characterized in that detecting, using the CTC method, the key-frame position of the phoneme in known correct speech, as the acoustic landmark, comprises: training an RNN acoustic model with the CTC criterion; decoding the speech of a processing unit in the known correct speech with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech; calculating the peaking-function value corresponding to each time frame using a set window length, a set peaking function, and each posterior probability in the sequence; calculating the mean and variance of all peaking-function values greater than zero; constructing a Chebyshev inequality from the mean and variance and obtaining the peaking-function values that satisfy it; obtaining the maximum peaking-function value within a window of the set length; and determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark.
- 3. The pronunciation error detection method of claim 2, characterized in that determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value comprises: judging whether the known speech text corresponding to the processing unit of the correct speech contains the phoneme corresponding to the peak location; if it does, taking the peak location as the key-frame position; if it does not, rejecting the peak location, reacquiring the maximum peaking-function value from the remaining peaking-function values that satisfy the Chebyshev inequality, and determining the key-frame position of the phoneme from the peak location of the reacquired maximum.
- 4. The pronunciation error detection method of claim 2, characterized in that determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark, comprises: comparing the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, and determining the relative key-frame position of the phoneme; and averaging all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme, as the landmark.
- 5. The pronunciation error detection method of claim 1, characterized in that performing pronunciation error detection on the phoneme in the speech to be tested based on the landmark comprises: extracting, based on the landmark, the acoustic features of the phoneme in known error-type speech and in the known correct speech; training an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the known correct speech; and performing pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
- 6. The pronunciation error detection method of any one of claims 2 to 4, characterized in that the set peaking function is: S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2, where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.
- 7. A pronunciation error detection apparatus, characterized by comprising: an acoustic landmark determining unit configured to detect, using the connectionist temporal classification (CTC) method, the key-frame position of a phoneme in known correct speech, as the acoustic landmark; and a pronunciation error detection unit configured to perform pronunciation error detection on the phoneme in speech to be tested based on the landmark.
- 8. The pronunciation error detection apparatus of claim 7, characterized in that the acoustic landmark determining unit comprises: an acoustic model training module configured to train an RNN acoustic model with the CTC criterion; a probability sequence generation module configured to decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech; a peaking-function value generation module configured to calculate the peaking-function value corresponding to each time frame using a set window length, a set peaking function, and each posterior probability in the sequence; an inequality parameter generation module configured to calculate the mean and variance of all peaking-function values greater than zero; a peaking-function value screening module configured to construct a Chebyshev inequality from the mean and variance and obtain the peaking-function values that satisfy it; a maximum peaking-function value determining module configured to obtain the maximum peaking-function value within a window of the set length; and an acoustic landmark determining module configured to determine the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark.
- 9. The pronunciation error detection apparatus of claim 8, characterized in that the acoustic landmark determining module comprises: a phoneme judging module configured to judge whether the known speech text corresponding to the processing unit of the correct speech contains the phoneme corresponding to the peak location; and a key-frame position determining module configured to take the peak location as the key-frame position if it does, and, if it does not, to reject the peak location, reacquire the maximum peaking-function value from the remaining peaking-function values that satisfy the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak location of the reacquired maximum.
- 10. The pronunciation error detection apparatus of claim 8, characterized in that the acoustic landmark determining module comprises: a key-frame relative position determining module configured to compare the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech and determine the relative key-frame position of the phoneme; and a final key-frame determining module configured to average all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme, as the landmark.
- 11. The pronunciation error detection apparatus of claim 7, characterized in that the pronunciation error detection unit comprises: an acoustic feature extraction module configured to extract, based on the landmark, the acoustic features of the phoneme in known error-type speech and in the known correct speech; an SVM classifier training module configured to train an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the known correct speech; and a pronunciation error detection module configured to perform pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
- 12. The pronunciation error detection apparatus of any one of claims 8 to 10, characterized in that the peaking-function value generation module is further configured to use the set peaking function: S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2, where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.
- 13. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
- 14. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 6.
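Claim 11 describes a three-stage pipeline: extract landmark-anchored acoustic features from known deviant and known correct speech, train an SVM on both classes, then classify phonemes in the speech to be detected. The sketch below illustrates that pipeline only; the synthetic feature vectors, the feature dimensionality, and the use of scikit-learn's `SVC` are my assumptions, not taken from the patent.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical landmark-anchored acoustic features: one fixed-length
# vector per phoneme token, extracted around the detected landmark frame.
dim = 12
correct = rng.normal(loc=0.0, scale=1.0, size=(50, dim))  # known correct speech
deviant = rng.normal(loc=3.0, scale=1.0, size=(50, dim))  # known deviation-type speech

X_train = np.vstack([correct, deviant])
y_train = np.array([0] * 50 + [1] * 50)  # 0 = correct, 1 = deviant

# Training module (claim 11): fit the SVM on both classes.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

# Detection module: classify a phoneme token from the speech to be detected.
test_token = rng.normal(loc=3.0, scale=1.0, size=(1, dim))
print(clf.predict(test_token))  # [1] -> flagged as a pronunciation deviation
```

With two well-separated synthetic classes the classifier trivially flags the test token; in the patented system the features would instead come from the landmark neighbourhood of each phoneme.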
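The peaking function of claim 12 measures how far a posterior-probability value at frame i rises above its neighbours within a window of k frames on each side, averaging the best rise to the left and the best rise to the right. A minimal sketch of that formula (the function name and the clamping of window indices at the sequence boundaries are my own choices; the claim does not spell out boundary handling):

```python
def peaking_value(x, i, k):
    """Peaking function S_i(k, i, x_i, T): the average of the largest rise
    of x[i] over its k left neighbours and over its k right neighbours.
    Window indices are clamped to the sequence bounds (an assumption)."""
    left = max(x[i] - x[j] for j in range(max(0, i - k), i)) if i > 0 else 0.0
    right = (max(x[i] - x[j] for j in range(i + 1, min(len(x), i + k + 1)))
             if i < len(x) - 1 else 0.0)
    return (left + right) / 2.0

# x: posterior probabilities of one phoneme class over time frames.
x = [0.1, 0.2, 0.9, 0.3, 0.1]
print(peaking_value(x, 2, 2))  # ~0.8: average of the best left and right rises
```

A large value indicates a sharp local peak in the posterior sequence, which is what the method uses to locate the key frame serving as the acoustic landmark.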
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710895726.XA CN107610720B (en) | 2017-09-28 | 2017-09-28 | Pronunciation deviation detection method and device, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107610720A true CN107610720A (en) | 2018-01-19 |
CN107610720B CN107610720B (en) | 2020-08-04 |
Family
ID=61059289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710895726.XA Active CN107610720B (en) | 2017-09-28 | 2017-09-28 | Pronunciation deviation detection method and device, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107610720B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation |
CN113327595A (en) * | 2021-06-16 | 2021-08-31 | 北京语言大学 | Pronunciation deviation detection method and device and storage medium |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060212296A1 (en) * | 2004-03-17 | 2006-09-21 | Carol Espy-Wilson | System and method for automatic speech recognition from phonetic features and acoustic landmarks |
CN105551483A (en) * | 2015-12-11 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Speech recognition modeling method and speech recognition modeling device |
US20160372119A1 (en) * | 2015-06-19 | 2016-12-22 | Google Inc. | Speech recognition with acoustic models |
2017
- 2017-09-28 CN CN201710895726.XA patent/CN107610720B/en active Active
Non-Patent Citations (5)
Title |
---|
ALEX GRAVES: "Supervised Sequence Labelling with Recurrent Neural Networks", SPRINGER *
XUESONG YANG et al.: "Landmark-Based Pronunciation Error Identification on Chinese Learning", SPEECH PROSODY 2016 *
YANLU XIE et al.: "Landmark of Mandarin nasal codas and its application in pronunciation error detection", 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) *
YOON SY et al.: "Landmark-based Automated Pronunciation Error Detection", INTERSPEECH 2010 *
SUN Wang: "Research on speech recognition technology and its application in a pronunciation error recognition system", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation |
CN109377540B (en) * | 2018-09-30 | 2023-12-19 | 网易(杭州)网络有限公司 | Method and device for synthesizing facial animation, storage medium, processor and terminal |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
CN113327595A (en) * | 2021-06-16 | 2021-08-31 | 北京语言大学 | Pronunciation deviation detection method and device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107610720B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Strik et al. | Comparing different approaches for automatic pronunciation error detection | |
Li et al. | Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models. | |
US7962327B2 (en) | Pronunciation assessment method and system based on distinctive feature analysis | |
Shahin et al. | Tabby Talks: An automated tool for the assessment of childhood apraxia of speech | |
Arora et al. | Phonological feature-based speech recognition system for pronunciation training in non-native language learning | |
CN106782603B (en) | Intelligent voice evaluation method and system | |
Gao et al. | A study on robust detection of pronunciation erroneous tendency based on deep neural network. | |
CN110415725B (en) | Method and system for evaluating pronunciation quality of second language using first language data | |
CN107886968A (en) | Speech evaluating method and system | |
Li et al. | Improving mispronunciation detection of mandarin tones for non-native learners with soft-target tone labels and BLSTM-based deep tone models | |
Tabbaa et al. | Computer-aided training for Quranic recitation | |
CN107610720A | Pronunciation deviation detection method and device, storage medium and equipment | |
Mao et al. | Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech | |
Arora et al. | Phonological feature based mispronunciation detection and diagnosis using multi-task DNNs and active learning | |
Korzekwa et al. | Detection of lexical stress errors in non-native (l2) english with data augmentation and attention | |
Duan et al. | Effective articulatory modeling for pronunciation error detection of L2 learner without non-native training data | |
Chen et al. | A self-attention joint model for spoken language understanding in situational dialog applications | |
Mary et al. | Searching speech databases: features, techniques and evaluation measures | |
Shen et al. | Self-supervised pre-trained speech representation based end-to-end mispronunciation detection and diagnosis of Mandarin | |
US20210319786A1 (en) | Mispronunciation detection with phonological feedback | |
Kashif et al. | Consonant phoneme based extreme learning machine (ELM) recognition model for foreign accent identification | |
CN116597809A (en) | Multi-tone word disambiguation method, device, electronic equipment and readable storage medium | |
Niu et al. | A study on landmark detection based on CTC and its application to pronunciation error detection | |
Li et al. | Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models | |
Nakagawa et al. | A statistical method of evaluating pronunciation proficiency for English words spoken by Japanese |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||