CN107610720A - Mispronunciation detection method, apparatus, storage medium and device - Google Patents
- Publication number
- CN107610720A CN107610720A CN201710895726.XA CN201710895726A CN107610720A CN 107610720 A CN107610720 A CN 107610720A CN 201710895726 A CN201710895726 A CN 201710895726A CN 107610720 A CN107610720 A CN 107610720A
- Authority
- CN
- China
- Prior art keywords
- phoneme
- voice
- error detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a mispronunciation detection method, apparatus, storage medium and device. The method includes: detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and performing mispronunciation detection on said phoneme in speech to be detected based on the landmark. Because the invention uses the CTC method to detect key frames as acoustic landmarks, no prior manual annotation of acoustic landmarks is required.
Description
Technical field
The present invention relates to the field of computer-assisted speech technology, and in particular to a mispronunciation detection method, apparatus, storage medium and device.
Background technology
Mispronunciation detection, an important technology in computer-assisted pronunciation training systems, provides an effective way for learners to improve their oral proficiency. Over the past few decades, a large number of segment-level mispronunciation detection methods have emerged. One route is based on automatic speech recognition (ASR), performing mispronunciation detection within a statistical speech recognition framework. By the form of feedback given, it can be further divided into two types. The first is the confidence-score-based approach: for example, the log-likelihood ratio ("Automatic detection of phone-level mispronunciation for language learning", Speech Communication, vol. 30, no. 2-3, pp. 95-108, 2000) measures the similarity between native and non-native acoustic phone models, as does its variant, the goodness-of-pronunciation score ("Phone-level pronunciation scoring of specific phone segments for language instruction", Speech Communication, vol. 30, no. 2, pp. 95-108, 2000). However, when learners are given a low score, they do not know how to correct themselves. The second is the rule-based approach, in which correct pronunciations and their error patterns are added to an extended recognition dictionary and decoded with an extended pronunciation recognition network. Two methods are used to collect the error patterns: one formulates pronunciation rules from expert knowledge; the other uses machine learning, i.e., automatically learning acoustic-phonological rules from annotations of correct and mispronounced speech to generate an acoustic-phonological model ("Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017), or counting mispronunciation rules and their frequencies from a corpus and then extending the pronunciation dictionary with these prior probabilities ("Improvement of segmental mispronunciation detection with prior knowledge extracted from large L2 speech corpus," in Interspeech 2011, pp. 1593-1596). Compared with confidence-score-based methods, rule-based methods can provide learners with richer feedback. The advantage of these ASR-framework methods is that mispronunciations are easy to predict. Their shortcomings are twofold. On the one hand, from the angle of feature discriminability, they all extract the same spectral parameters (such as mel-frequency cepstral coefficients) from every frame of speech; the discriminability of these parameters needs further study, and they implicitly assume that information is uniformly distributed over the speech signal. On the other hand, from the model angle, most ASR systems model the temporal information of phonemes with hidden Markov models (HMMs), but HMMs cannot strongly distinguish sounds that are spectrally similar yet differ in duration ("Comparing different approaches for automatic pronunciation error detection," Speech Communication, vol. 51, no. 10, pp. 845-852, 2009). These methods also depend on the language setting and the scale of the training data, and their detection accuracy for specific error types still needs to be improved.
For second-language (L2) learners, a major challenge of foreign language study lies in realizing specific phonemic contrasts. Such a contrast may exist in the target language but be absent from the learner's mother tongue. Under the influence of negative transfer from the mother tongue and similar effects, a learner's place of articulation often drifts toward that of a similar sound in the native language. The mispronunciations of foreign language learners therefore cannot be classified only as insertion, deletion and substitution errors. Wen Cao et al. defined mispronunciation tendencies according to place and manner of articulation, describing a plausible-but-wrong intermediate state between an L2 speaker's correct pronunciation and an outright mispronunciation ("Developing a Chinese L2 speech database of Japanese learners with narrow-phonetic labels for computer assisted pronunciation training", in Interspeech 2010, pp. 1922-1925). Such cases often appear in advanced learners. To identify these subtle variations, another route treats mispronunciation detection as a binary classification task, detecting incorrect pronunciations and their mispronunciation tendencies. However, finding discriminative features for each error type is often extremely difficult.
Stevens' acoustic landmark theory, starting from the mechanism of human speech production, defines landmarks as the transient regions that describe the quantal, nonlinear relation between articulation and acoustics (Acoustic Phonetics, MIT Press, 2000, vol. 30; "The quantal nature of speech: Evidence from articulatory-acoustic data," 1972, pp. 51-66; "On the quantal nature of speech," Journal of Phonetics, 1989, vol. 17, no. 2, pp. 3-45; "Quantal theory, enhancement and overlap," Journal of Phonetics, 2010, vol. 38, no. 1, pp. 10-19). In these regions the signal changes abruptly; they generally correspond to perceptual foci and articulatory targets, and carry rich phonetic information. A large body of perceptual experiments shows that listeners' attention concentrated at landmarks helps select potentially distinctive features ("Evidence for the role of acoustic boundaries in the perception of speech sounds," in Phonetic Linguistics: Essays in Honor of Peter Ladefoged, edited by V. Fromkin (Academic, New York), pp. 243-255). Extracting distinctive features at landmarks has achieved good results in mispronunciation detection. However, determining the landmark positions that distinguish phonetic categories is extremely difficult: it usually requires study of the speech production mechanism and a large amount of manual annotation, so it is inefficient.
To address the problems above, scholars at home and abroad have proposed a variety of improved methods, which can be roughly divided into three classes.
The first class obtains landmarks from the angle of signal detection, by detecting changes in characteristic parameters of the speech signal at different levels and in different dimensions. Commonly used parameters include short-time energy, zero-crossing rate and formants. Sharlene A. Liu proposed a method using sub-band energy features of speech to detect three kinds of consonant-related landmarks. According to the articulatory features of phonemes, the method divides the speech spectrum into six frequency bands, takes the peaks and valleys of the difference curve of each band's energy as landmark candidates, and obtains the landmark sequence of the speech signal through corresponding decision criteria ("Landmark detection for distinctive feature-based speech recognition," The Journal of the Acoustical Society of America, 1996, vol. 100, no. 5, pp. 3417-3430). A. R. Jayan and P. C. Pandey argued that the sub-band processing established by Liu depends on differences between speakers, so they modeled the smoothed spectral envelope with a Gaussian mixture model (GMM) and detected stop landmarks by applying a rate-of-rise (ROR) function to the GMM parameters ("Detection of stop landmarks using Gaussian mixture modeling of speech spectrum," ICASSP 2009, IEEE, 2009: 4681-4684). Dumpala, considering that earlier vowel landmark detection had ignored source characteristics, extracted acoustic features at glottal closure instants, including source features obtained by zero frequency filtering (ZFF) and vocal-tract features obtained by single frequency filtering (SFF), and then detected vowel landmarks with a rule-based algorithm ("Robust Vowel Landmark Detection Using Epoch-Based Features," in INTERSPEECH 2016, pp. 160-164).
The second class, from the angle of machine learning, selects different parameters for different landmark types. Starting from a convex-hull recursion algorithm for detecting syllable nuclei, Howitt extracted three acoustic features (valley depth, duration and level) and fed them to a multi-layer perceptron (MLP) to detect vowel landmarks, labeling speech frames as vowel or non-vowel ("Automatic Syllable Detection for Vowel Landmarks," doctoral thesis, MIT, 1999). Hasegawa-Johnson et al. built a landmark-based speech recognition system whose first step is landmark detection; their method detects every kind of landmark with a binary SVM classifier ("Landmark-based speech recognition: Report of the 2004 Johns Hopkins summer workshop," ICASSP'05, IEEE, 2005, vol. 1, pp. 213-216). Chi-Yueh Lin and Hsiao-Chuan Wang proposed detecting burst onset landmarks with random forests and bootstrapping, appending the detection results to the extracted MFCC feature vectors and incorporating them into an HMM-based speech recognition system, further improving the detection performance for plosives and affricates ("Burst onset landmark detection and its application to speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2011, vol. 19, no. 5, pp. 1253-1264).
The third class, from a linguistic angle, assumes that a fixed position within the phoneme duration is the landmark and applies it to mispronunciation detection. Yoon assumed that the landmark of an English vowel lies at the midpoint of the phoneme duration, and that the landmarks of a consonant lie at the onset, middle and offset of the phoneme ("Landmark-based automated pronunciation error detection," in Interspeech 2010, pp. 614-617). Using speech splicing synthesis combined with perceptual experiments, Yanlu Xie et al. concluded that Japanese learners perceive the key cue of Chinese nasal codas in the nasalized vowel segment, and took the midpoint of that segment as the landmark ("Landmark of mandarin nasal codas and its application in pronunciation error detection," ICASSP 2016, IEEE, 2016: 5370-5374). Because Chinese has no landmark scheme, Xuesong Yang et al. formulated two mapping schemes: one maps the fixed landmark positions of English phonemes to Chinese according to the International Phonetic Alphabet; in the other, linguists observe and gather statistics to formulate landmark position rules for some Chinese phonemes ("Landmark-based pronunciation error identification on Chinese learning," in Speech Prosody, 2016).
On the whole, previous research either studies the speech production mechanism from the angle of signal detection and designs different parameters for the landmark types of different phonemes, or starts from perceptual experiments and marks landmarks manually, or assumes a fixed position as the landmark. The benefit of the first class of methods is that they need no training data with manually annotated landmarks, but for the landmarks of different phonemes one must study the speech production mechanism and design different discriminative signal parameters; furthermore, fixed criteria are usually chosen, with insufficient consideration of differences between speakers. The benefit of the second class is that one only needs to select discriminative features and classify automatically by machine learning, but it generally relies on manually annotated data for training and needs different discriminative features for different landmarks; detecting all landmarks then requires repeated training. The benefit of the third class is that the assumed fixed positions are convenient to compute, but the phonetic context is not fully taken into account.
Summary of the invention
Embodiments of the present invention provide a mispronunciation detection method, apparatus, storage medium and device, to remedy one or more deficiencies in the prior art.
An embodiment of the present invention provides a mispronunciation detection method, including: detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and performing mispronunciation detection on said phoneme in speech to be detected based on the landmark.
In one embodiment, detecting the key-frame position of a phoneme in known correct speech using the CTC method, as an acoustic landmark, includes: training an RNN acoustic model with the CTC criterion; decoding the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame; computing the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence; computing the mean and variance of all spike-function values greater than zero; constructing a Chebyshev inequality from the mean and variance, and keeping the spike-function values that satisfy it; taking the maximum spike-function value within the set window length; and determining the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
In one embodiment, determining the key-frame position of the phoneme from the peak position of the maximum spike-function value includes: judging whether the phoneme corresponding to the peak position appears in the speech text corresponding to the processing unit of the known correct speech; if so, taking the peak position as the key-frame position; if not, rejecting the peak position, reacquiring the maximum spike-function value from the remaining spike-function values that satisfy the Chebyshev inequality, and determining the key-frame position of the phoneme from the peak position of the reacquired maximum spike-function value.
In one embodiment, determining the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark, includes: comparing the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, to determine the relative key-frame position of the phoneme; and averaging all relative key-frame positions of the phoneme to obtain its final key frame, as the landmark.
In one embodiment, performing mispronunciation detection on said phoneme in the speech to be detected based on the landmark includes: based on the landmark, extracting the acoustic features of said phoneme in speech with known error types and in known correct speech; training an SVM classifier with the acoustic features of said phoneme in the speech with known error types and in the known correct speech; and performing mispronunciation detection on said phoneme in the speech to be detected with the trained SVM classifier.
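As an illustration of the classifier stage described above, the sketch below trains a linear SVM by stochastic subgradient descent on the hinge loss. This is a stand-in for whatever SVM implementation an embodiment would actually use; the feature values, labels and hyper-parameters are invented for illustration. Acoustic features extracted at the landmark from correct speech would be labeled +1, and those from known error types -1.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Linear SVM via stochastic subgradient descent (Pegasos-style) on
    the hinge loss; X is a list of feature vectors, y holds +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]   # L2 shrinkage
            if margin < 1.0:                           # point violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def svm_predict(w, b, x):
    """Classify a landmark feature vector: +1 correct, -1 mispronounced."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1
```

For instance, with toy landmark features X = [[2, 2], [3, 3], [-2, -2], [-3, -3]] and labels y = [1, 1, -1, -1], the trained classifier separates the two groups.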
In one embodiment, the set spike function is:
Wherein, Si(k, i, xi, T) denotes the spike-function value, T denotes the sequence of posterior probabilities of the initial or final at each time frame in the speech of the processing unit, k denotes the window length, xi denotes the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
An embodiment of the present invention also provides a mispronunciation detection apparatus, including: an acoustic landmark determining unit, configured to detect the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark; and a mispronunciation detection unit, configured to perform mispronunciation detection on said phoneme in speech to be detected based on the landmark.
In one embodiment, the acoustic landmark determining unit includes: an acoustic model training module, configured to train an RNN acoustic model with the CTC criterion; a probability sequence generation module, configured to decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame; a spike-function value generation module, configured to compute the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence; an inequality parameter generation module, configured to compute the mean and variance of all spike-function values greater than zero; a spike-function value screening module, configured to construct a Chebyshev inequality from the mean and variance and keep the spike-function values that satisfy it; a maximum spike-function value determining module, configured to take the maximum spike-function value within the set window length; and an acoustic landmark determining module, configured to determine the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
In one embodiment, the acoustic landmark determining module includes: a phoneme judging module, configured to judge whether the phoneme corresponding to the peak position appears in the speech text corresponding to the processing unit of the known correct speech; and a key-frame position determining module, configured to take the peak position as the key-frame position if it does, and otherwise to reject the peak position, reacquire the maximum spike-function value from the remaining spike-function values satisfying the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak position of the reacquired maximum spike-function value.
In one embodiment, the acoustic landmark determining module includes: a relative key-frame position determining module, configured to compare the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, to determine the relative key-frame position of the phoneme; and a final key frame determining module, configured to average all relative key-frame positions of the phoneme to obtain its final key frame, as the landmark.
In one embodiment, the mispronunciation detection unit includes: an acoustic feature extraction module, configured to extract, based on the landmark, the acoustic features of said phoneme in speech with known error types and in known correct speech; an SVM classifier training module, configured to train an SVM classifier with the acoustic features of said phoneme in the speech with known error types and in the known correct speech; and a mispronunciation detection module, configured to perform mispronunciation detection on said phoneme in the speech to be detected with the trained SVM classifier.
In one embodiment, the spike-function value generation module is further configured to use the set spike function:
Wherein, Si(k, i, xi, T) denotes the spike-function value, T denotes the sequence of posterior probabilities of the initial or final at each time frame in the speech of the processing unit, k denotes the window length, xi denotes the posterior probability of the i-th time frame in the sequence T, and i is an integer greater than or equal to zero.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program realizes the steps of the methods of the embodiments described above.
An embodiment of the present invention also provides a computer device, including a memory, a processor and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the methods of the embodiments described above are realized.
With the mispronunciation detection method, apparatus, storage medium and device of the embodiments of the present invention, key frames are detected based on CTC: key-frame positions detected by the CTC method determine the landmarks, so no manual landmark annotation is needed in advance and the dependence on manual annotation is avoided; furthermore, a unified speech recognition framework is used, which facilitates mispronunciation detection.
Brief description of the drawings
To explain the technical schemes of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative work. In the drawings:
Fig. 1 is a schematic flow chart of the mispronunciation detection method of an embodiment of the present invention;
Fig. 2 is a schematic flow chart, in an embodiment of the present invention, of the method of detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark;
Fig. 3 is a schematic flow chart, in an embodiment of the present invention, of the method of determining the key-frame position of a phoneme from the peak position of the maximum spike-function value;
Fig. 4 is a schematic flow chart, in another embodiment of the present invention, of the method of determining the key-frame position of a phoneme from the peak position of the maximum spike-function value, as an acoustic landmark;
Fig. 5 is a schematic flow chart, in an embodiment of the present invention, of the method of performing mispronunciation detection on a phoneme in speech to be detected based on an acoustic landmark;
Fig. 6 is a schematic flow chart of the algorithm for extracting the spike of each phoneme in a sentence, in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the spike phenomenon of CTC in an embodiment of the present invention;
Fig. 8 is a schematic block diagram of the flow of mispronunciation detection in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of the mispronunciation detection apparatus of an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of the acoustic landmark determining unit in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the acoustic landmark determining module in an embodiment of the present invention;
Fig. 12 is a schematic structural diagram of the acoustic landmark determining module in another embodiment of the present invention;
Fig. 13 is a schematic structural diagram of the mispronunciation detection unit in an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of the computer device of an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical scheme and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings. Here, the schematic descriptions of the present invention are used to explain it, and not as a limitation of it.
To avoid dependence on manual landmark annotation, embodiments of the invention provide a mispronunciation detection method. Fig. 1 is a schematic flow chart of the mispronunciation detection method of an embodiment of the present invention. As shown in Fig. 1, the method may include:
Step S110: detect the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark;
Step S120: perform mispronunciation detection on said phoneme in speech to be detected based on the landmark.
The CTC method can perform sequence labeling learning with a recurrent neural network. A main problem in speech recognition is transforming a sequence of acoustic features into a sequence of text labels, such as a sequence of Chinese initials and finals, where the former is usually longer than the latter. In an embodiment, CTC can introduce a blank label to absorb the easily confused or uncertain boundary between two pronunciation units, and allows labels to repeat, so as to obtain the best alignment between speech frames and output labels. In an embodiment, CTC can use the softmax layer of the RNN to give the posterior probability of each modeling unit at each time step. In an embodiment, a many-to-one mapping can map multiple output label sequences to one sequence without repeated labels and blank labels. In an embodiment, CTC can sum over all possible alignments of the target sequence by the forward-backward algorithm.
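The many-to-one mapping mentioned above can be sketched as follows (a minimal illustration, not the patent's implementation; the pinyin labels are invented for the example). Consecutive repeated labels are merged first, and blank labels are then removed.

```python
def ctc_collapse(path, blank="-"):
    """CTC's many-to-one mapping B: merge consecutive repeats of a label,
    then drop blanks, turning a frame-level path into the output sequence."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

For example, the frame-level path ['-', 'zh', 'zh', '-', 'ong', 'ong'] collapses to the initial/final sequence ['zh', 'ong'], while ['a', 'a', '-', 'a'] collapses to ['a', 'a'], since the blank keeps the two a's distinct.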
In step S110 above, the known correct speech can be obtained from an existing native-speaker corpus. The phoneme can, for example, be an initial or a final. In step S120 above, based on the determined landmark, mispronunciation detection can be performed by various different methods, for example with an SVM (support vector machine) classifier.
In the embodiments of the present invention, key-frame positions are detected by the CTC method to determine the landmarks, without manual landmark annotation in advance, avoiding the dependence on manual annotation. Moreover, a unified speech recognition framework is used, and the detection results are highly consistent.
Fig. 2 is a schematic flow chart, in an embodiment of the present invention, of the method of detecting the key-frame position of a phoneme in known correct speech using the connectionist temporal classification (CTC) method, as an acoustic landmark. As shown in Fig. 2, in step S110 above, this method may include:
Step S111: train an RNN acoustic model with the CTC criterion;
Step S112: decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model to obtain the sequence of posterior probabilities of said phoneme at each time frame;
Step S113: compute the spike-function value corresponding to each time frame from a set window length, a set spike function and each posterior probability in the sequence;
Step S114: compute the mean and variance of all spike-function values greater than zero;
Step S115: construct a Chebyshev inequality from the mean and variance, and keep the spike-function values that satisfy it;
Step S116: take the maximum spike-function value within the set window length;
Step S117: determine the key-frame position of the phoneme from the peak position of the maximum spike-function value, as the landmark.
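Steps S114 to S117 can be sketched as below, assuming the per-frame spike-function values of step S113 have already been computed. The screening rule derived from the Chebyshev inequality is rendered here as "at least a standard deviations above the mean", which is an assumption about how the inequality is applied; the factor a is an invented parameter.

```python
from statistics import mean, pvariance

def detect_landmark(scores, a=0.5):
    """Given per-frame spike-function values, return the frame index of
    the surviving maximum spike, or None if no spike qualifies."""
    positive = [s for s in scores if s > 0]            # step S114
    if not positive:
        return None
    mu = mean(positive)
    sigma = pvariance(positive) ** 0.5
    # step S115 (assumed form of the Chebyshev screen): keep values the
    # inequality marks as rare, i.e. far enough above the mean
    candidates = [(s, i) for i, s in enumerate(scores) if s - mu >= a * sigma]
    if not candidates:
        return None
    # steps S116-S117: the largest spike value; its frame is the key frame
    return max(candidates)[1]
```

With scores = [0.0, -0.1, 0.49, 0.0, -0.2, 0.89, 0.0], the function keeps frame 5, whose spike value stands out from the rest.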
In step S111 above, an RNN (recurrent neural network) acoustic model can be trained with the correct speech in the native-speaker corpus as input. In other embodiments, other acoustic models can be used. In step S112 above, the processing unit can, for example, be a sentence, and the sequence is a time series. In step S115 above, the computed mean and variance can be substituted into the standard Chebyshev inequality to obtain a specific Chebyshev inequality; each spike-function value is then substituted into it as the variable to judge whether it satisfies the inequality. In an embodiment, when a spike-function value satisfying the Chebyshev inequality is kept, its original index can also be recorded, so that its time frame (peak position) can be retrieved. In step S116 above, spike-function values can be compared within the set window length (for example 2k), keeping the maximum. In step S117 above, the obtained key-frame position can be used directly as the landmark, or determined as the landmark after further screening or judgment. The inventors found that the output label posterior probabilities of an RNN model trained with the CTC criterion show an obvious spike phenomenon, and this feature can be used to determine landmarks effectively.
In an embodiment, the set peaking function can be:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2,

where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the
initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior
probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.

In this embodiment, the larger the peaking-function value S_i(k, i, x_i, T), the more likely the position is a spike,
so the set peaking function can effectively pick out maximum-spike positions.
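The set peaking function can be sketched directly in Python (an illustrative sketch; the function name, argument order, and the handling of frames near the sequence boundary are assumptions not fixed by the text):

```python
def peak_value(T, i, k):
    """Peaking function S_i(k, i, x_i, T): the average of how far x_i = T[i]
    rises above its k left neighbours and its k right neighbours in the
    posterior-probability sequence T."""
    left = max((T[i] - T[j] for j in range(max(0, i - k), i)), default=0.0)
    right = max((T[i] - T[j] for j in range(i + 1, min(len(T), i + k + 1))), default=0.0)
    return (left + right) / 2.0
```

A frame whose posterior rises sharply above both neighbourhoods receives a large value, while interior frames of a flat region score near zero.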
Fig. 3 is a flow diagram of a method, in an embodiment of the invention, for determining the key-frame position
of a phoneme from the peak location of the maximum peaking-function value. As shown in Fig. 3, in step S117, the method
of determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value may include:
Step S1171: judge whether the known speech text corresponding to the processing unit of the correct speech contains
the phoneme corresponding to the peak location;
Step S1172: if it does, take the peak location as the key-frame position;
Step S1173: if it does not, reject the peak location, reacquire the maximum peaking-function value from the
remaining peaking-function values that satisfy the Chebyshev inequality, and determine the key-frame position of the
phoneme from the peak location of the reacquired maximum.
When the computation is inaccurate, the selected peaking-function maximum may be very small, so that the sentence
(processing unit) does not actually contain the corresponding phoneme. Through steps S1171, S1172 and S1173, combined with the known text,
the peak locations of phonemes not contained in the sentence (processing unit) are rejected, which improves the accuracy of the key-frame position.
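The text-based rejection of steps S1171-S1173 can be sketched as follows (the candidate representation as `(frame, phoneme, value)` tuples is an assumption for illustration):

```python
def best_peak_in_text(candidates, text_phonemes):
    """candidates: list of (frame_index, phoneme, peak_value) that passed the
    Chebyshev screening. Return the frame of the largest-valued candidate
    whose phoneme actually occurs in the sentence's known text, rejecting
    peaks of phonemes the text does not contain (steps S1171-S1173)."""
    for frame, phoneme, value in sorted(candidates, key=lambda c: c[2], reverse=True):
        if phoneme in text_phonemes:
            return frame
    return None  # no peak corresponds to a phoneme of the known text
```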
Fig. 4 is a flow diagram of a method, in another embodiment of the invention, for determining the key-frame
position of a phoneme as the acoustic landmark from the peak location of the maximum peaking-function value. As shown in Fig. 4, in step S117, the
method of determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark, may include:
Step S1174: compare the key-frame position with the phoneme timing information of the annotated text corresponding
to the processing unit of the known correct speech, and determine the relative key-frame position of the phoneme;
Step S1175: average all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme,
which serves as the landmark.
In this embodiment, the phoneme timing information of the annotated text can be the timing information of the initials and finals. A processing
unit (a sentence) may contain several occurrences of the same phoneme; averaging the key-frame positions of these occurrences
yields a unified key-frame position, which is convenient for pronunciation error detection.
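Steps S1174-S1175 amount to averaging landmark offsets relative to annotated phoneme starts; a minimal sketch (the data layout as `(peak_frame, start_frame)` pairs is assumed):

```python
def final_key_frame(occurrences):
    """occurrences: (peak_frame, annotated_phoneme_start_frame) pairs for one
    phoneme across the corpus. The key-frame position relative to each
    annotated start is averaged to give the phoneme's final key frame
    (an offset from the phoneme start), used as the landmark."""
    offsets = [peak - start for peak, start in occurrences]
    return sum(offsets) / len(offsets)
```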
In an embodiment, the key-frame position, final key-frame position, or average key-frame position can be compared with a manually marked
landmark. If they are consistent, the key-frame position, final key-frame position, or average key-frame position
is used as the landmark for pronunciation error detection; if they are inconsistent, the manually marked landmark can be used for
pronunciation error detection instead. This improves the reliability of pronunciation error detection.
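The fallback logic of this embodiment can be sketched as follows (the frame tolerance `tol` used to judge "consistency" is an assumed parameter, not specified in the text):

```python
def choose_landmark(ctc_key_frame, manual_landmark, tol=3):
    """Use the CTC-driven key frame as the landmark when it is consistent
    with a manually marked landmark (within tol frames); otherwise fall
    back to the manual landmark."""
    if manual_landmark is None or abs(ctc_key_frame - manual_landmark) <= tol:
        return ctc_key_frame
    return manual_landmark
```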
Fig. 5 is a flow diagram of a method, in an embodiment of the invention, of performing pronunciation error detection
on a phoneme in speech to be tested based on the acoustic landmark. As shown in Fig. 5, in step S120, the method of performing
pronunciation error detection on the phoneme in the speech to be tested based on the landmark may include:
Step S121: based on the landmark, extract the acoustic features of the phoneme in speech of a known error type and in
known correct speech;
Step S122: train an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the
known correct speech;
Step S123: perform pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
In this embodiment, performing pronunciation error detection with the trained SVM classifier yields better detection
results.
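Steps S121-S123 map naturally onto a scikit-learn workflow (a sketch assuming scikit-learn's `SVC` is an acceptable SVM implementation; the extraction of landmark-anchored features is left abstract):

```python
import numpy as np
from sklearn.svm import SVC

def train_detector(correct_feats, error_feats):
    """Train a per-phoneme SVM on landmark-anchored acoustic features.
    correct_feats / error_feats: (n_samples, n_dims) arrays of features
    extracted around the phoneme's landmark frame.
    Label 0 = correct pronunciation, 1 = mispronounced."""
    X = np.vstack([correct_feats, error_feats])
    y = np.concatenate([np.zeros(len(correct_feats)), np.ones(len(error_feats))])
    return SVC(kernel="rbf").fit(X, y)

def detect(clf, feats):
    """Return a boolean array: True where a mispronunciation is flagged."""
    return clf.predict(feats) == 1
```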
Fig. 6 is a flow diagram of the algorithm, in an embodiment of the invention, for extracting the spike of each phoneme in a sentence.
As shown in Fig. 6, taking a sentence as the processing unit, the method of extracting the spike of each phoneme in the sentence may
include:
Step S301: decode a sentence directly with the trained CTC RNN acoustic model to obtain the probability sequence.
The posterior probability x_i of each modelling unit (for example, an initial or final) on each time frame is extracted from
the native pronunciation, forming a probability sequence T containing N points (N is the number of time steps of the sentence).
Step S302: calculate the peaking-function value a_i corresponding to each time frame and obtain the array of peaking-function values greater than zero.
In an embodiment, the peaking function is chosen as:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2.

S_i(k, i, x_i, T) expresses how salient the probability value x_i of the i-th point of the time series T is relative to the other points; the
larger the value, the more likely the point is a spike. The values S_i(k, i, x_i, T) greater than 0 (representing candidate spikes) are picked out
and added to an array a, keeping their original indices in the time series.
In an embodiment, the window length k can be set to half the average duration of each kind of phoneme counted from the corpus,
or chosen empirically, e.g. k = 4.
Step S303: calculate the mean m and standard deviation s of all elements in array a.
Step S304: screen the peaking-function values using Chebyshev's inequality (Chebyshev Inequality):

P(|X - μ| >= h·σ) <= 1 / h²,

where μ is the mean, σ is the standard deviation, and h is a constant greater than 0. The inequality assumes no particular
distribution for the random variable X; it states that values deviating this far from the mean are rare. If a candidate satisfies
x_i - m >= h·s, the candidate peak value x_i is retained and
its original index recorded. The constant h > 0 can be set manually.
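Steps S303-S304 can be sketched with the standard library (an illustrative sketch; the function name and the exact retention condition `x - m >= h*s` are assumptions consistent with the screening described above):

```python
import statistics

def chebyshev_filter(values, h=2.0):
    """Screen candidate peaks with Chebyshev's inequality:
    P(|X - mean| >= h*std) <= 1/h**2, so values deviating by more than
    h standard deviations are rare, salient peaks.
    values: list of (original_index, peak_value) with peak_value > 0."""
    data = [v for _, v in values]
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return [(i, v) for i, v in values if v - m >= h * s]
```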
Step S305: post-process by comparing the peak values within each window of length 2k and retaining only one maximum.
The spikes that remain are the real candidate spikes, and their original indices are the final candidate peak locations. Because the
maximum selected by this algorithm may still be very small, a sentence may appear not to contain a phoneme. For pronunciation error detection and
annotation tasks, the text is known, so the peak locations of phonemes not contained in the sentence must be rejected with the help of the known text.
In tasks such as speech recognition, by contrast, a threshold needs to be set for key-frame detection, and candidate peak locations
whose peak value is too small are weeded out.
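The window post-processing of step S305 can be sketched as a greedy non-maximum suppression (the greedy strategy is an assumption about how "keep one maximum per 2k window" is implemented):

```python
def suppress_within_window(candidates, k):
    """Within any window of length 2k, keep only the largest candidate peak
    (greedy non-maximum suppression). candidates: list of (frame, value)."""
    kept = []
    for frame, value in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(frame - f) >= 2 * k for f, _ in kept):
            kept.append((frame, value))
    return sorted(kept)
```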
Fig. 7 is a schematic diagram of the CTC spike phenomenon in an embodiment of the invention. As shown in Fig. 7, taking "We've done our
part" as an example, easily confused pronunciations and uncertain boundaries are absorbed by the blank label, and the label
posterior probabilities produced by CTC for the utterance "We've done our part" exhibit spikes for w, iy, v, d, ah, n, aa, r, p, t.
Fig. 8 is a block diagram of the pronunciation error detection flow in an embodiment of the invention. As shown in Fig. 8, the overall detection
framework can be divided into two stages. In the first stage, an RNN acoustic model is trained with the CTC criterion, using the speech of a native-speaker
corpus as input; following the spike-extraction algorithm above, the features extracted from native pronunciations are decoded to generate label posterior
probabilities and extract the peak locations; the key-frame positions are then determined by comparison with the phoneme timing information of the annotated
text (relative to the start time of each phoneme), counting the relative spike position of each kind of phoneme, and the key-frame positions of each kind of
phoneme are averaged to give its final key frame. In the second stage, pronunciation error detection is performed based on the key frames: using the key-frame
positions obtained in the first stage, acoustic features are extracted from samples of a particular phoneme and of its mispronounced variants, and an SVM
classifier trained on correct pronunciations and mispronunciations of the known error type detects the particular phoneme.
In an embodiment, the consistency between the CTC-driven peak locations and manually marked landmark positions can be verified first,
after which the data-driven spikes of the CTC system are used as key frames for pronunciation error detection. The benefit is that
landmarks need not be marked in advance, and a unified speech-recognition framework is used.
Based on the same inventive concept as the pronunciation error detection method shown in Fig. 1, an embodiment of the present application
further provides a pronunciation error detection apparatus, as described in the embodiments below. Since the principle by which the apparatus
solves the problem is similar to that of the pronunciation error detection method, the implementation of the apparatus can refer to the
implementation of the method, and repeated parts are not described again.
Fig. 9 is a schematic structural diagram of the pronunciation error detection apparatus of an embodiment of the invention. As shown in Fig. 9,
the apparatus may include an acoustic landmark determining unit 510 and a pronunciation error detection unit 520, connected
to each other.
The acoustic landmark determining unit 510 is configured to detect, using the connectionist temporal classification (CTC) method,
the key-frame position of a phoneme in known correct speech, as the acoustic landmark.
The pronunciation error detection unit 520 is configured to perform pronunciation error detection on the phoneme in speech
to be tested based on the landmark.
Figure 10 is a schematic structural diagram of the acoustic landmark determining unit in an embodiment of the invention. As shown in Figure 10,
the acoustic landmark determining unit 510 may include: an acoustic model training module 511, a probability sequence generation module 512,
a peaking-function value generation module 513, an inequality parameter generation module 514, a peaking-function value screening module 515,
a maximum peaking-function value determining module 516, and an acoustic landmark determining module 517, connected in sequence.
The acoustic model training module 511 is configured to train an RNN acoustic model with the CTC criterion.
The probability sequence generation module 512 is configured to decode the speech of a processing unit in the known correct speech
with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech.
The peaking-function value generation module 513 is configured to calculate the peaking-function value corresponding to each time frame,
using the set window length, the set peaking function, and each posterior probability in the sequence.
The inequality parameter generation module 514 is configured to calculate the mean and variance of all peaking-function values greater than zero.
The peaking-function value screening module 515 is configured to construct a Chebyshev inequality from the mean and variance and obtain
the peaking-function values that satisfy it.
The maximum peaking-function value determining module 516 is configured to obtain the maximum peaking-function value within a window of the set length.
The acoustic landmark determining module 517 is configured to determine the key-frame position of the phoneme from the peak location of the
maximum peaking-function value, as the landmark.
Figure 11 is a schematic structural diagram of the acoustic landmark determining module in an embodiment of the invention. As shown in Figure 11,
in an embodiment the acoustic landmark determining module 517 may include a phoneme judging module 5171 and a key-frame position determining module 5172,
connected to each other.
The phoneme judging module 5171 is configured to judge whether the known speech text corresponding to the processing unit of the correct speech
contains the phoneme corresponding to the peak location.
The key-frame position determining module 5172 is configured to take the peak location as the key-frame position if it does; if it
does not, to reject the peak location, reacquire the maximum peaking-function value from the remaining peaking-function values that satisfy
the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak location of the reacquired
maximum.
Figure 12 is a schematic structural diagram of the acoustic landmark determining module in another embodiment of the invention. As shown in Figure 12,
in an embodiment the acoustic landmark determining module 517 includes a key-frame relative position determining module 5173 and a final key-frame
determining module 5174, connected to each other.
The key-frame relative position determining module 5173 is configured to compare the key-frame position with the phoneme timing information
of the annotated text corresponding to the processing unit of the known correct speech, and determine the relative key-frame position of the phoneme.
The final key-frame determining module 5174 is configured to average all relative key-frame positions of the phoneme to obtain
the final key frame of the phoneme, as the landmark.
Figure 13 is a schematic structural diagram of the pronunciation error detection unit in an embodiment of the invention. As shown in Figure 13,
the pronunciation error detection unit 520 may include an acoustic feature extraction module 521, an SVM classifier training module 522, and a
pronunciation error detection module 523, connected in sequence.
The acoustic feature extraction module 521 is configured to extract, based on the landmark, the acoustic features of the phoneme
in known error-type speech and in known correct speech.
The SVM classifier training module 522 is configured to train an SVM classifier with the acoustic features of the phoneme in the known
error-type speech and in the known correct speech.
The pronunciation error detection module 523 is configured to perform pronunciation error detection on the phoneme in the speech to be tested
with the trained SVM classifier.
In an embodiment, the peaking-function value generation module 513 may further be configured to use the set peaking function:

S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2,

where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames
of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an
integer greater than or equal to zero.
An embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed
by a processor, the program implements the steps of the methods of the embodiments described above.
Figure 14 is a schematic structural diagram of the computer device of an embodiment of the invention. As shown in Figure 14, the computer device
600 includes a memory 610, a processor 620, and a computer program stored in the memory and executable on the processor; when the
processor 620 executes the program, the steps of the methods of the embodiments described above are implemented.
In summary, the pronunciation error detection method, apparatus, storage medium, and device of the embodiments of the invention detect
key frames based on CTC: the key-frame positions are detected with the CTC method to determine the landmarks, so landmarks need not be marked
manually in advance, the dependence on manually marked landmarks is avoided, a unified speech-recognition framework is used, and pronunciation
error detection is facilitated.
In the description of this specification, descriptions referring to the terms "one embodiment", "a specific embodiment", "some embodiments",
"for example", "an example", "a specific example", or "some examples" mean that specific features, structures, materials, or characteristics
described in connection with the embodiment or example are included in at least one embodiment or example of the invention. In this specification,
schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures,
materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of the steps
in each embodiment schematically illustrates the implementation of the invention; the step order is not limiting and may be adjusted
as needed.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program
product. Therefore, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment
combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more
computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable
program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products
according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and
combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be
provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing
device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device
produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable
data-processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of
manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more
blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing device, such that a series of
operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the
instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows
of the flowcharts and/or one or more blocks of the block diagrams.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the invention in
detail. It should be understood that the above are only specific embodiments of the invention and are not intended to limit its scope of
protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall be
included within the scope of protection of the invention.
Claims (14)
- 1. A pronunciation error detection method, characterized by comprising: detecting, using the connectionist temporal classification (CTC) method, the key-frame position of a phoneme in known correct speech, as the acoustic landmark; and performing pronunciation error detection on the phoneme in speech to be tested based on the landmark.
- 2. The pronunciation error detection method of claim 1, characterized in that detecting, using the CTC method, the key-frame position of the phoneme in known correct speech, as the acoustic landmark, comprises: training an RNN acoustic model with the CTC criterion; decoding the speech of a processing unit in the known correct speech with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech; calculating the peaking-function value corresponding to each time frame using a set window length, a set peaking function, and each posterior probability in the sequence; calculating the mean and variance of all peaking-function values greater than zero; constructing a Chebyshev inequality from the mean and variance and obtaining the peaking-function values that satisfy it; obtaining the maximum peaking-function value within a window of the set length; and determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark.
- 3. The pronunciation error detection method of claim 2, characterized in that determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value comprises: judging whether the known speech text corresponding to the processing unit of the correct speech contains the phoneme corresponding to the peak location; if it does, taking the peak location as the key-frame position; if it does not, rejecting the peak location, reacquiring the maximum peaking-function value from the remaining peaking-function values that satisfy the Chebyshev inequality, and determining the key-frame position of the phoneme from the peak location of the reacquired maximum.
- 4. The pronunciation error detection method of claim 2, characterized in that determining the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark, comprises: comparing the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech, and determining the relative key-frame position of the phoneme; and averaging all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme, as the landmark.
- 5. The pronunciation error detection method of claim 1, characterized in that performing pronunciation error detection on the phoneme in the speech to be tested based on the landmark comprises: extracting, based on the landmark, the acoustic features of the phoneme in known error-type speech and in the known correct speech; training an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the known correct speech; and performing pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
- 6. The pronunciation error detection method of any one of claims 2 to 4, characterized in that the set peaking function is: S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2, where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.
- 7. A pronunciation error detection apparatus, characterized by comprising: an acoustic landmark determining unit configured to detect, using the connectionist temporal classification (CTC) method, the key-frame position of a phoneme in known correct speech, as the acoustic landmark; and a pronunciation error detection unit configured to perform pronunciation error detection on the phoneme in speech to be tested based on the landmark.
- 8. The pronunciation error detection apparatus of claim 7, characterized in that the acoustic landmark determining unit comprises: an acoustic model training module configured to train an RNN acoustic model with the CTC criterion; a probability sequence generation module configured to decode the speech of a processing unit in the known correct speech with the trained RNN acoustic model, obtaining the sequence of posterior probabilities of the phoneme on each time frame of the processing unit's speech; a peaking-function value generation module configured to calculate the peaking-function value corresponding to each time frame using a set window length, a set peaking function, and each posterior probability in the sequence; an inequality parameter generation module configured to calculate the mean and variance of all peaking-function values greater than zero; a peaking-function value screening module configured to construct a Chebyshev inequality from the mean and variance and obtain the peaking-function values that satisfy it; a maximum peaking-function value determining module configured to obtain the maximum peaking-function value within a window of the set length; and an acoustic landmark determining module configured to determine the key-frame position of the phoneme from the peak location of the maximum peaking-function value, as the landmark.
- 9. The pronunciation error detection apparatus of claim 8, characterized in that the acoustic landmark determining module comprises: a phoneme judging module configured to judge whether the known speech text corresponding to the processing unit of the correct speech contains the phoneme corresponding to the peak location; and a key-frame position determining module configured to take the peak location as the key-frame position if it does, and, if it does not, to reject the peak location, reacquire the maximum peaking-function value from the remaining peaking-function values that satisfy the Chebyshev inequality, and determine the key-frame position of the phoneme from the peak location of the reacquired maximum.
- 10. The pronunciation error detection apparatus of claim 8, characterized in that the acoustic landmark determining module comprises: a key-frame relative position determining module configured to compare the key-frame position with the phoneme timing information of the annotated text corresponding to the processing unit of the known correct speech and determine the relative key-frame position of the phoneme; and a final key-frame determining module configured to average all relative key-frame positions of the phoneme to obtain the final key frame of the phoneme, as the landmark.
- 11. The pronunciation error detection apparatus of claim 7, characterized in that the pronunciation error detection unit comprises: an acoustic feature extraction module configured to extract, based on the landmark, the acoustic features of the phoneme in known error-type speech and in the known correct speech; an SVM classifier training module configured to train an SVM classifier with the acoustic features of the phoneme in the known error-type speech and in the known correct speech; and a pronunciation error detection module configured to perform pronunciation error detection on the phoneme in the speech to be tested with the trained SVM classifier.
- 12. The pronunciation error detection apparatus of any one of claims 8 to 10, characterized in that the peaking-function value generation module is further configured to use the set peaking function: S_i(k, i, x_i, T) = ( max{x_i - x_{i-1}, ..., x_i - x_{i-k}} + max{x_i - x_{i+1}, ..., x_i - x_{i+k}} ) / 2, where S_i(k, i, x_i, T) is the peaking-function value, T is the sequence of posterior probabilities of the initials/finals over the time frames of the processing unit's speech, k is the window length, x_i is the posterior probability of the i-th time frame in sequence T, and i is an integer greater than or equal to zero.
- 13. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
- 14. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 6.
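Claim 11 describes a three-stage pipeline: extract landmark-anchored acoustic features from known deviant and known correct speech, train an SVM on both classes, then classify phonemes in the speech to be detected. The sketch below illustrates that pipeline only; the synthetic feature vectors, the feature dimensionality, and the use of scikit-learn's `SVC` are my assumptions, not taken from the patent.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical landmark-anchored acoustic features: one fixed-length
# vector per phoneme token, extracted around the detected landmark frame.
dim = 12
correct = rng.normal(loc=0.0, scale=1.0, size=(50, dim))  # known correct speech
deviant = rng.normal(loc=3.0, scale=1.0, size=(50, dim))  # known deviation-type speech

X_train = np.vstack([correct, deviant])
y_train = np.array([0] * 50 + [1] * 50)  # 0 = correct, 1 = deviant

# Training module (claim 11): fit the SVM on both classes.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

# Detection module: classify a phoneme token from the speech to be detected.
test_token = rng.normal(loc=3.0, scale=1.0, size=(1, dim))
print(clf.predict(test_token))  # [1] -> flagged as a pronunciation deviation
```

With two well-separated synthetic classes the classifier trivially flags the test token; in the patented system the features would instead come from the landmark neighbourhood of each phoneme.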
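The peaking function of claim 12 measures how far a posterior-probability value at frame i rises above its neighbours within a window of k frames on each side, averaging the best rise to the left and the best rise to the right. A minimal sketch of that formula (the function name and the clamping of window indices at the sequence boundaries are my own choices; the claim does not spell out boundary handling):

```python
def peaking_value(x, i, k):
    """Peaking function S_i(k, i, x_i, T): the average of the largest rise
    of x[i] over its k left neighbours and over its k right neighbours.
    Window indices are clamped to the sequence bounds (an assumption)."""
    left = max(x[i] - x[j] for j in range(max(0, i - k), i)) if i > 0 else 0.0
    right = (max(x[i] - x[j] for j in range(i + 1, min(len(x), i + k + 1)))
             if i < len(x) - 1 else 0.0)
    return (left + right) / 2.0

# x: posterior probabilities of one phoneme class over time frames.
x = [0.1, 0.2, 0.9, 0.3, 0.1]
print(peaking_value(x, 2, 2))  # ~0.8: average of the best left and right rises
```

A large value indicates a sharp local peak in the posterior sequence, which is what the method uses to locate the key frame serving as the acoustic landmark.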
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710895726.XA CN107610720B (en) | 2017-09-28 | 2017-09-28 | Pronunciation deviation detection method and device, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107610720A true CN107610720A (en) | 2018-01-19 |
CN107610720B CN107610720B (en) | 2020-08-04 |
Family
ID=61059289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710895726.XA Active CN107610720B (en) | 2017-09-28 | 2017-09-28 | Pronunciation deviation detection method and device, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107610720B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation |
CN113327595A (en) * | 2021-06-16 | 2021-08-31 | 北京语言大学 | Pronunciation deviation detection method and device and storage medium |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060212296A1 (en) * | 2004-03-17 | 2006-09-21 | Carol Espy-Wilson | System and method for automatic speech recognition from phonetic features and acoustic landmarks |
CN105551483A (en) * | 2015-12-11 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Speech recognition modeling method and speech recognition modeling device |
US20160372119A1 (en) * | 2015-06-19 | 2016-12-22 | Google Inc. | Speech recognition with acoustic models |
2017
- 2017-09-28 CN CN201710895726.XA patent/CN107610720B/en active Active
Non-Patent Citations (5)
Title |
---|
ALEX GRAVES: "Supervised Sequence Labelling with Recurrent Neural Networks", SPRINGER *
XUESONG YANG et al.: "Landmark-Based Pronunciation Error Identification on Chinese Learning", SPEECH PROSODY 2016 *
YANLU XIE et al.: "Landmark of Mandarin nasal codas and its application in pronunciation error detection", 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) *
YOON SY et al.: "Landmark-based Automated Pronunciation Error Detection", INTERSPEECH 2010 *
SUN Wang: "Research on speech recognition technology and its application in a pronunciation error recognition system", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation |
CN109377540B (en) * | 2018-09-30 | 2023-12-19 | 网易(杭州)网络有限公司 | Method and device for synthesizing facial animation, storage medium, processor and terminal |
CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
CN113327595A (en) * | 2021-06-16 | 2021-08-31 | 北京语言大学 | Pronunciation deviation detection method and device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107610720B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Strik et al. | Comparing different approaches for automatic pronunciation error detection | |
Li et al. | Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models. | |
US7962327B2 (en) | Pronunciation assessment method and system based on distinctive feature analysis | |
Shahin et al. | Tabby Talks: An automated tool for the assessment of childhood apraxia of speech | |
Arora et al. | Phonological feature-based speech recognition system for pronunciation training in non-native language learning | |
CN106782603B (en) | Intelligent voice evaluation method and system | |
Gao et al. | A study on robust detection of pronunciation erroneous tendency based on deep neural network. | |
CN110415725B (en) | Method and system for evaluating pronunciation quality of second language using first language data | |
CN107886968A (en) | Speech evaluating method and system | |
Li et al. | Improving mispronunciation detection of mandarin tones for non-native learners with soft-target tone labels and BLSTM-based deep tone models | |
Tabbaa et al. | Computer-aided training for Quranic recitation | |
CN107610720A | Pronunciation deviation detection method and device, storage medium and equipment | |
Mao et al. | Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech | |
Arora et al. | Phonological feature based mispronunciation detection and diagnosis using multi-task DNNs and active learning | |
Korzekwa et al. | Detection of lexical stress errors in non-native (l2) english with data augmentation and attention | |
Duan et al. | Effective articulatory modeling for pronunciation error detection of L2 learner without non-native training data | |
Chen et al. | A self-attention joint model for spoken language understanding in situational dialog applications | |
Mary et al. | Searching speech databases: features, techniques and evaluation measures | |
Shen et al. | Self-supervised pre-trained speech representation based end-to-end mispronunciation detection and diagnosis of Mandarin | |
US20210319786A1 (en) | Mispronunciation detection with phonological feedback | |
Kashif et al. | Consonant phoneme based extreme learning machine (ELM) recognition model for foreign accent identification | |
CN116597809A (en) | Multi-tone word disambiguation method, device, electronic equipment and readable storage medium | |
Niu et al. | A study on landmark detection based on CTC and its application to pronunciation error detection | |
Li et al. | Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models | |
Nakagawa et al. | A statistical method of evaluating pronunciation proficiency for English words spoken by Japanese |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||