CN103985391A - Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation - Google Patents


Info

Publication number
CN103985391A
CN103985391A (application CN201410229186.8A)
Authority
CN
China
Prior art keywords
phoneme
user
pronunciation
fst
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410229186.8A
Other languages
Chinese (zh)
Inventor
柳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201410229186.8A
Publication of CN103985391A
Legal status: Pending

Abstract

The invention discloses a phoneme-level, low-power-consumption spoken language evaluation and defect diagnosis method that requires no standard pronunciation. The method comprises the following steps: (1) acoustic features are extracted from the user's speech to obtain a feature vector sequence; (2) the feature vector sequence of the user's speech is decoded with the Viterbi algorithm based on a weighted finite-state transducer (WFST) Q, yielding the mapping from the feature vector sequence to a phoneme sequence; (3) for each phoneme, the user's pronunciation quality on that phoneme is evaluated by computing the goodness of fit between the feature vector set corresponding to the phoneme and the phoneme's mathematical representation in the acoustic model H. The method has the following advantages: it does not depend on a standard pronunciation; all heavy computation is executed on the server side, so the computation required of the user terminal is minimized, effectively reducing the terminal's load and power consumption; no network connection is needed while the terminal is in use, so no network traffic is consumed; and the phonemes the user pronounces poorly can be identified, so that targeted training can be provided.

Description

Phoneme-level low-power-consumption spoken language evaluation and defect diagnosis method without standard pronunciation
Technical field
The present invention relates to the fields of computer-assisted language learning and speech recognition technology, and specifically to a phoneme-level (phone-level), low-power-consumption spoken language evaluation and defect diagnosis method. Working at the phone level allows scoring and feedback to be refined to individual phonemes: after the user has read a series of texts, the core phonemes that keep the user's pronunciation from being accurate can be identified, so that corresponding training material and targeted practice can be provided. The method is applicable to the study of languages such as English, Chinese, and Spanish, and to the diagnosis and assessment of patients with speech disorders.
Background art
Language learning is largely imitation, especially on the phonetic side. Taking English as an example, the best way to train accurate spoken English is to read along with the pure pronunciation of native English speakers, and many existing courses and teaching materials are built on exactly this idea. Essentially, these materials only provide sample pronunciations: the student must judge the difference between his or her own pronunciation and the standard pronunciation, and then decide how to improve. The limitations of this approach are as follows:
1. Because the sound one hears of one's own voice differs from the sound others hear, the student perceives his or her own voice differently from how others perceive it, and therefore cannot objectively score the quality of his or her own pronunciation.
2. Recording can compensate for this defect, and this is the scheme adopted by various language repeaters, but switching back and forth between recordings for comparison is cumbersome and reduces learning efficiency.
3. Even setting the above factors aside, the evaluation by the student (or even a teacher) remains subjective and qualitative; objective, quantitative assessment is impossible, and the student does not know how to improve. Because users cannot accurately identify their own pronunciation defects, they cannot practice in a targeted way.
Summary of the invention
In view of the above defects of the prior art, the technical problem solved by the present invention is to provide a spoken language evaluation method refined to the phone level, which can evaluate the user's pronunciation phoneme by phoneme and provide a more accurate score.
To solve the above technical problem, the invention first provides a phoneme-level, low-power-consumption spoken language evaluation method without standard pronunciation, suitable for use when no standard pronunciation is available, comprising the following steps:
(1) Acoustic features are extracted from the user's speech, obtaining the feature vector corresponding to each frame and hence the feature vector sequence corresponding to the user's speech;
(2) For the given text, its corresponding phoneme sequence is denoted
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil},
where sil denotes the pause sound. Based on the weighted FST Q, the Viterbi algorithm is used to decode the feature vector sequence corresponding to the user's speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all. The count vector of this alignment α is denoted
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where n_i denotes the number of frames corresponding to the i-th non-pause phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of non-pause phonemes corresponding to the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector;
Here Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), where min denotes the minimization operation on weighted FSTs, det denotes the determinization operation, the symbol ∘ denotes the composition operation on weighted FSTs, and π_ε denotes the operation of removing epsilon symbols from a weighted FST;
The acoustic model H, the pronunciation dictionary model L, and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of WFST-based large-vocabulary continuous speech recognition; for the given text, the corresponding language model G is produced, and from it the weighted FST Q corresponding to that text;
(3) For each phoneme, the user's pronunciation quality on that phoneme can be evaluated by computing the goodness of fit between its corresponding feature vector (or feature vector group) and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
Further, the present invention also provides a phoneme-level, low-power-consumption spoken defect diagnosis method without standard pronunciation, characterized in that the user's speech is first evaluated with the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of any one of claims 1-9, and the method then further comprises the step of determining the phonemes in which the user has pronunciation defects: according to the sequences of first phoneme quality scores or second phoneme quality scores obtained when the user reads aloud a plurality of pronunciation units, the phoneme with the lowest score, or the several phonemes with the lowest scores, are taken as the phonemes with pronunciation defects.
The phoneme-level, low-power-consumption spoken evaluation and defect diagnosis methods without standard pronunciation of the present invention have the following beneficial effects:
1. No standard pronunciation is required, so the methods can be widely applied to evaluation and defect diagnosis scenarios without standard pronunciation, for example the evaluation of a language learner's dialogue or statements in natural settings, and the diagnosis of the language proficiency of speech-impaired persons in natural situations.
2. All heavy computation is performed on the server side, including producing the weighted FSTs H, C, L and producing the weighted FST Q on which the phoneme alignment depends.
3. Only the computation concerning the user's speech is executed on the user terminal, which effectively reduces the terminal's load and power consumption and lowers the hardware requirements of the terminal.
4. The user needs no network connection at all while using the terminal, avoiding network traffic consumption.
5. When a new example is added, it is first processed in the cloud to produce the weighted FST Q corresponding to the new text, which is then downloaded to the terminal. Because the weighted FST Q is small, the download is small and completes quickly; even when the user updates data examples over the network, the update finishes fast. Because the minimization technique for weighted FSTs is used, the download size is minimized.
Brief description of the drawings
Fig. 1 is a schematic diagram of the speech processing procedure.
Fig. 2 is a schematic diagram of the process of producing the weighted FSTs H, C, L by speech recognition training.
Fig. 3 is a schematic diagram of decoding a speech file based on the weighted FST Q in the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of an embodiment of the present invention.
Fig. 4 is a flowchart of the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of an embodiment of the present invention.
Fig. 5 is a schematic diagram of the hidden Markov model most commonly used as a phoneme model.
Fig. 6 is a structural block diagram of the connection between the device implementing the spoken evaluation method of the present invention (the terminal in the figure) and the server, in an embodiment of the present invention.
Fig. 7 is a structural schematic diagram of the connection between the device implementing the spoken evaluation method of the present invention and the server, in an embodiment of the present invention.
Specific embodiments
Before introducing embodiments of the invention, the relevant techniques of the speech processing field are introduced, in order to make the present invention easier to understand.
As shown in Fig. 1, the process by which speech processing extracts feature vectors is roughly divided into the following three steps:
i. The originally acquired sound is waveform data of amplitude over time;
ii. A fixed time length (e.g. 25 ms) is defined as one frame, and each frame is shifted forward by another time interval (e.g. 10 ms), so that consecutive frames overlap to a certain extent (e.g. 15 ms);
iii. Signal processing is performed on each frame to obtain each frame's feature vector; for example, the more common practice in industry today is to use MFCC (Mel-frequency cepstral coefficient) features together with their first- and second-order differences, 39 dimensions in total. The MFCC algorithm is a known technique in the art; see the Chinese invention patent application with publication number CN1763843A, which is not repeated here. Alternatively, linear prediction cepstral coefficients (LPCC) together with their first- and second-order differences can be used as feature vectors; LPCC is likewise a conventional technical means and is not repeated here.
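The framing scheme in step ii can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the 16 kHz sample rate and the use of NumPy are assumptions, and a real front end would follow this with pre-emphasis, windowing, and the MFCC transform (e.g. via a library such as librosa).

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames: 25 ms window, 10 ms hop,
    so consecutive frames overlap by 15 ms, as described above."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# one second of silence at 16 kHz yields 98 frames of 400 samples each
frames = frame_signal(np.zeros(16000))
```

Each frame would then be mapped to one 39-dimensional feature vector, giving the feature vector sequence used by the decoding step.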
Next, WFST-based large-vocabulary continuous speech recognition (LVCSR) is introduced. For the detailed process of LVCSR, refer to the Chinese invention patent application with publication number CN102779508A, entitled "Speech library generation device and method, and speech synthesis system and method"; LVCSR is an ordinary technical means in the art and is likewise not repeated here. For the method of obtaining the weighted FSTs H, C, L through the LVCSR training process, see "Speech Recognition with Weighted Finite-State Transducers" by Mehryar Mohri et al. (available on the New York University website at http://www.cs.nyu.edu/~mohri/pub/hbka.pdf). Specifically, with reference to Fig. 2, the weighted FSTs H, C, L can be obtained from a large corpus together with a pronunciation dictionary through the training process of speech recognition. In fact, the weighted FSTs H, C, L are respectively the acoustic mathematical model of phonemes, the context-dependent phoneme (in some documents, "phone") model, and the pronunciation dictionary model.
The weighted FSTs H, C, L are obtained on the server side through the training process of a speech recognition program. Once trained, they are reusable and need not be regenerated, unless retraining is required because the corpus has grown or changed, or because the language is different (a dialect is also considered a different language here).
The language model G is briefly described below. For a given text, its corresponding language model G (the FST concerning grammar) is simply a finite state automaton (FSA) of transitions from one word to the next, each with transition probability 1; a finite state automaton is a special kind of FST. In general, for a large corpus, the language model G is more complicated, but for a given text its language model G is determined, because for a given text the relations between phonemes, and between words (characters, in the case of Chinese), are determined.
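Under the assumption above — that G for a fixed text is a linear chain of probability-1 transitions — such an acceptor can be sketched as follows. The dict representation is purely illustrative; a production system would build G with an FST toolkit such as OpenFst.

```python
def linear_g(words):
    """Build the linear-chain acceptor G for a fixed text:
    state i --word_i--> state i+1 with probability 1
    (weight 0.0 in the -log semiring)."""
    arcs = [(i, i + 1, w, 0.0) for i, w in enumerate(words)]  # (src, dst, label, weight)
    return {"arcs": arcs, "start": 0, "final": len(words)}

# G for the three-word text "how are you": 4 states, 3 arcs
g = linear_g(["how", "are", "you"])
```

Because G is this small and fully determined by the text, the composed FST Q specific to the text is also small, which is what keeps the download to the terminal small.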
The language model G (the FST concerning grammar) and the weighted FSTs H, C, L are combined into one weighted FST Q for this given text (and the corresponding pronunciation unit), where Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))): min (minimization) denotes the minimization operation on weighted FSTs, det (determinization) denotes the determinization operation, the symbol ∘ denotes the composition operation on weighted FSTs, and π_ε denotes the operation of removing epsilon symbols from a weighted FST. For the relevant techniques of processing speech with weighted FSTs, refer to "Construction of a Speech Recognition System Based on Finite-State Directed Graphs" (author Xiao Ji, Tsinghua University master's thesis, published May 2011). Note that that thesis is written from the perspective of speech recognition, so the G it uses is more complicated, whereas the G and Q used in the present invention are for a specific text and therefore occupy little storage space, so the download is small.
Before describing the embodiments of the invention, it should be stated that a traditional speech recognition program composes H, C, L with a general language model, such as a general n-gram language model, to produce a general-purpose speech recognizer, which requires huge storage space and a complicated decoding algorithm (hence naturally high power consumption). In the present invention, by contrast, we compose the (determined) G corresponding to each given pronunciation unit (corresponding to a known text) with H, C, L, producing a speech recognizer used only to recognize this given pronunciation unit, namely the weighted FST Q mentioned above, and on this basis the alignment between the speech feature vector sequence and the phoneme sequence mentioned above is produced. The characteristic of this weighted FST Q is that it is produced for this given text (pronunciation unit), is specific to this pronunciation unit, and can only recognize this pronunciation unit; therefore it occupies less storage space, is convenient to store and use, and its corresponding decoding algorithm is simple (hence naturally low power consumption). Even if what the user actually reads is other speech, what the decoding algorithm based on the weighted FST Q produces is still the given text. That is to say, the text, the pronunciation unit, and the weighted FST Q are in one-to-one correspondence.
In addition, the abbreviations used in the claims and description of this application are explained once:
HMM: hidden Markov model
GMM: Gaussian mixture model
Viterbi algorithm: the problem this algorithm solves is to estimate the most probable hidden sequence behind a sequence of observations.
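The Viterbi algorithm just described can be illustrated with a minimal sketch for a discrete-observation HMM. This is a textbook implementation, not the patent's decoder, and all model numbers in the toy example are invented.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most probable hidden state sequence for a discrete-observation HMM.
    log_init[s]: log P(state s at t=0); log_trans[s, s2]: log P(s2 | s);
    log_emit[s, o]: log P(observation o | state s)."""
    T, S = len(obs), len(log_init)
    delta = np.full((T, S), -np.inf)     # best log score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# toy 2-state model: state 0 prefers symbol 0, state 1 prefers symbol 1
log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1, 1], log_init, log_trans, log_emit)  # → [0, 0, 1, 1]
```

In the method of this patent, the same dynamic program runs over the HMM states encoded in the weighted FST Q, producing both the decoded phoneme sequence and the frame-to-state alignment.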
In the claims and description of this application, a "pronunciation unit" generally refers to a sentence; of course, a "pronunciation unit" may also be a word, a phrase, a paragraph, or even an entire article. The only difference is that a word or phrase can be regarded as a short sentence, and a paragraph or article as a combination of several sentences. Therefore, the specific embodiments of the invention are generally described with the sentence as the unit, and apply equally to the evaluation of words, phrases, paragraphs, and even entire articles.
The specific embodiments of the present invention are described below with reference to the accompanying drawings.
The feature of the present embodiment is that no standard pronunciation is used: for a given text, the user's speech is evaluated directly. In one embodiment, the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation performs the following steps in order:
(1) Acoustic features are extracted from the user's speech, obtaining the feature vector corresponding to each frame and hence the feature vector sequence corresponding to the user's speech;
(2) For the given text, without considering pauses, the phoneme sequence it contains is determined, namely {p_1, p_2, …, p_M}; if pauses are considered, its corresponding phoneme sequence is denoted
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil},
where sil denotes the pause sound. Based on the weighted FST Q, the Viterbi algorithm is used to decode the feature vector sequence corresponding to the user's speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all.
The count vector of this alignment α is denoted
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where n_i denotes the number of frames corresponding to the i-th non-pause phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of non-pause phonemes corresponding to the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector. Here Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), where min denotes the minimization operation on weighted FSTs, det denotes the determinization operation, ∘ denotes the composition operation, and π_ε denotes the operation of removing epsilon symbols from a weighted FST;
The acoustic model H, the pronunciation dictionary model L, and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of WFST-based large-vocabulary continuous speech recognition; for the given text, the corresponding language model G is produced, and from it the weighted FST Q corresponding to that text;
(3) For each phoneme, the user's pronunciation quality on that phoneme can be evaluated by computing the goodness of fit between its corresponding feature vector (or feature vector group) and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
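The count vector β of step (2) can be illustrated with a small sketch: given hypothetical per-frame phoneme labels from an alignment α and the non-pause phoneme sequence, collect ns_0, n_1, ns_1, …, n_M, ns_M. The labels and phonemes below are invented, and this simplified walk assumes no two identical phonemes are adjacent without an intervening pause.

```python
def count_vector(frame_labels, phones):
    """Compute beta = [ns_0, n_1, ns_1, ..., n_M, ns_M] from per-frame
    phoneme labels. Expected label order is sil p_1 sil p_2 ... p_M sil;
    a pause segment of length zero simply contributes ns_i = 0."""
    expected = ["sil"]
    for p in phones:
        expected += [p, "sil"]
    beta, idx = [], 0
    for sym in expected:
        n = 0
        while idx < len(frame_labels) and frame_labels[idx] == sym:
            n += 1
            idx += 1
        beta.append(n)
    return beta

# hypothetical alignment: 2 pause frames, 3 frames of AH, 1 pause,
# 2 frames of B, and no final pause
beta = count_vector(["sil", "sil", "AH", "AH", "AH", "sil", "B", "B"],
                    ["AH", "B"])
# → [2, 3, 1, 2, 0]
```

The even positions of β (the ns_i) are what the fluency measure later operates on; the odd positions (the n_i) give each phoneme's frame span for scoring.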
Please refer to Fig. 3. When the speech file of the user's speech is processed, because the weighted FST Q corresponding to the given text is used, what decoding produces is exactly the given text (each weighted FST Q corresponds one-to-one with a text). The decoding process based on the weighted FST Q and the Viterbi algorithm is known prior art, and the Viterbi algorithm is a mature algorithm in the field of speech recognition, so neither is described further. The weighted FST Q in Fig. 3 is generated in advance from the given text and then stored in the terminal device, or downloaded to the terminal device over the network, and can be stored in correspondence with the text. In fact, because the weighted FST Q already contains the information of the text, it is possible to store only the weighted FST Q.
Referring to Fig. 4 for an understanding of embodiments of the invention: during decoding, the alignment between the phoneme sequence and the feature vector sequence is produced at the same time. By evaluating the goodness of fit between the feature vectors (or feature vector groups) in this alignment and their mathematical representations in the acoustic model H, the user's pronunciation quality on each phoneme can be evaluated; the higher the goodness of fit, the better the pronunciation quality.
The acoustic model H, pronunciation dictionary model L, and context-dependent phoneme model C are obtained by training; they only need to be trained once on the server side and are then reusable. The process of generating the weighted FST Q can be performed either on the server side or on the user terminal; it is preferably performed on the server side, to reduce the computational demand on the user terminal. As a preferred embodiment, in this embodiment the weighted FST Q is generated on the server side and then either stored directly in the terminal or downloaded to the terminal over the network. For a specified text, the weighted FST Q is small and the download is small, and the heavy computation is all carried out on the server side, minimizing the burden on the user terminal. Of course, if the computing power of the user terminal is not a concern, these computations can also be performed on the user terminal. User terminals include common smart devices such as desktop computers, notebook computers, tablet computers, and even smartphones, and can also be smart devices with computing power such as learning machines, language repeaters, and guided-reading machines.
Specifically, in step (3), to measure the user's pronunciation quality on each phoneme, the goodness of fit between each phoneme's corresponding feature vector (or feature vector group) and its mathematical representation can be used; the goodness of fit can be evaluated with the likelihood probability P(O_i | p_i). When a GMM-HMM model is used (see Fig. 5 for the graphical model of the GMM-HMM), the likelihood P(O_i | p_i) represents the probability that the feature vector group is generated by the HMM of its corresponding phoneme:
P(O_i | p_i) ≈ P(O_i | p_i, S_i) = ∏_{t=1}^{T_i} b_{s_t}(o_t) · a_{s_t s_{t+1}}
where the approximate equality follows the conventional Viterbi approximation; S_i = s_1, s_2, …, s_{T_i} is the HMM state sequence corresponding to phoneme p_i's feature vector sequence O_i = o_1, o_2, …, o_{T_i};
s_{T_i+1} is the exit state of phoneme p_i;
a_{s_t s_{t+1}} denotes the transition probability between states s_t and s_{t+1};
b_{s_t}(o_t) denotes the Gaussian mixture model (the output density) of state s_t.
Usually the likelihood P(O_i | p_i) can be used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i. In practice, however, to prevent numerical underflow, the logarithm of this likelihood, ln(P(O_i | p_i)), is usually used as the first phoneme quality score.
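Working in the log domain turns the product of probabilities into a sum, which is the underflow-avoiding trick just mentioned. A minimal sketch of the first phoneme quality score, with all per-frame numbers invented for illustration:

```python
def phone_log_likelihood(log_emits, log_trans):
    """First phoneme quality score ln P(O_i | p_i) under the Viterbi
    approximation: the sum of the per-frame log emission scores
    ln b_{s_t}(o_t) and the log transition probabilities
    ln a_{s_t s_{t+1}} along the aligned state path."""
    return sum(log_emits) + sum(log_trans)

# made-up log scores for a 3-frame phoneme segment
score = phone_log_likelihood([-2.1, -1.8, -2.4], [-0.1, -0.3, -0.2])  # ≈ -6.9
```

Multiplying, say, 300 frame probabilities each around 0.1 would underflow any floating-point type, whereas the corresponding log sum is simply about -690.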
In addition, the user's pronunciation quality can be evaluated from another dimension; continue to refer to the GMM-HMM schematic shown in Fig. 5. The goodness of fit between each phoneme and its corresponding feature vector (or feature vector group) is measured with Pr(p_i | O_i), the posterior probability that O_i belongs to its corresponding phoneme p_i. This posterior Pr(p_i | O_i) is computed from the likelihood P(O_i | p_i) and the prior probability Pr(p_i) by Bayes' formula, and serves as the second phoneme quality score of the user's pronunciation quality. The prior probability Pr(p_i) is obtained from large-scale statistics, a routine technical means in the prior art, and is not described further.
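The Bayes computation of the second score can be sketched as follows; the candidate phoneme set, likelihoods, and priors in the example are all invented, and normalizing over a set of competing phonemes is one common way to evaluate the denominator of Bayes' formula.

```python
import math

def phone_posterior(log_lik, log_prior, target):
    """Second phoneme quality score Pr(p | O) by Bayes' formula:
    Pr(p | O) = P(O | p) Pr(p) / sum_q P(O | q) Pr(q),
    normalized over a candidate phoneme set; log-sum-exp keeps it stable."""
    joint = {p: log_lik[p] + log_prior[p] for p in log_lik}
    m = max(joint.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return math.exp(joint[target] - log_z)

# made-up likelihoods and uniform priors over two candidate phonemes
log_lik = {"AH": math.log(0.6), "EH": math.log(0.3)}
log_prior = {"AH": math.log(0.5), "EH": math.log(0.5)}
post = phone_posterior(log_lik, log_prior, "AH")  # → 0.6/(0.6+0.3) = 2/3
```

Unlike the raw likelihood, the posterior is bounded in [0, 1], which makes it easier to compare across phonemes of different lengths.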
The above describes measuring the pronunciation quality of an individual phoneme (monophone or triphone) at the phone level; in addition, the overall pronunciation quality of a pronunciation unit composed into sentences can be evaluated. Because words and phrases can be regarded as short sentences, and a paragraph as a combination of several sentences, the sentence is used throughout as the pronunciation unit to be measured; the cases of words, phrases, and paragraphs follow by analogy with the sentence. The comprehensive phoneme score of the pronunciation unit read aloud by the user is computed with the following formula:
score = Σ_i (ω_{1,i} · υ_{1,i} + ω_{2,i} · υ_{2,i}),
where υ_{1,i} is the first phoneme quality score of the i-th phoneme in the user's speech, υ_{2,i} is the second phoneme quality score of the i-th phoneme in the user's speech, and ω_{1,i} and ω_{2,i} are the corresponding weights.
The weights ω_{1,i} and ω_{2,i} can be set by hand. They can also be obtained by machine learning: choose a plurality of texts corresponding to different pronunciation units, have different users read them aloud, have experts evaluate the quality of each user's reading and manually give the corresponding phone-level comprehensive scores, and obtain the optimal weight sequence by a machine learning method.
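One simple instance of the machine-learning fitting mentioned above is a least-squares fit of the weights to expert scores. This sketch assumes a single weight pair shared across phonemes for brevity (per-phoneme weights would fit one pair per phoneme), and the toy data are invented.

```python
import numpy as np

def fit_score_weights(v1, v2, expert):
    """Fit (omega_1, omega_2) so that omega_1 * v1 + omega_2 * v2
    best matches the expert scores in the least-squares sense."""
    X = np.stack([np.asarray(v1, float), np.asarray(v2, float)], axis=1)
    w, *_ = np.linalg.lstsq(X, np.asarray(expert, float), rcond=None)
    return w

# toy data: the expert scores happen to equal 2*v1 + 3*v2 exactly
w = fit_score_weights([1.0, 2.0, 3.0], [1.0, 1.0, 2.0], [5.0, 7.0, 12.0])  # ≈ [2, 3]
```

With more utterances than weights, the same call returns the least-squares optimum even when no exact fit exists, which is the usual situation with noisy expert ratings.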
In the first embodiment of the invention, in step (1), Mel-frequency cepstral coefficients (MFCC) together with their first- and second-order differences, or linear prediction cepstral coefficients (LPCC) together with their first- and second-order differences, are used as the feature vectors.
All of the above evaluates the pronunciation quality of individual phonemes of the user. In addition, the user's fluency also needs to be evaluated, which is done by quantitatively evaluating the pause sounds sil in the user's speech: the number of feature vectors corresponding to the i-th pause sil is ns_i, and the more counts with ns_i > 0 there are, the more pauses there are and the worse the pronunciation quality. Alternatively, the pronunciation quality of the pronunciation unit can be measured by computing the proportion of pause sound, namely the ratio of the number of feature vectors corresponding to pauses to the total number of feature vectors of the whole pronunciation unit. Specifically, the following formula can be used:
ratio = Σ_{i=0}^{M} ns_i / (Σ_{i=1}^{M} n_i + Σ_{i=0}^{M} ns_i).
The larger this ratio, the more pauses there are and the worse the fluency. This ratio should lie in a reasonable range: too large means there are too many pauses, and too small means there is no pause where there should be one; the reasonable interval can be determined from large-scale statistics.
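The pause-based fluency measure operates directly on the count vector β from the alignment; the sample β below is invented for illustration.

```python
def pause_ratio(beta):
    """Fluency measure: fraction of all frames that belong to pauses.
    beta = [ns_0, n_1, ns_1, ..., n_M, ns_M]; the even indices hold the
    pause counts ns_i and the odd indices the phoneme counts n_i."""
    sil_frames = sum(beta[0::2])
    total_frames = sum(beta)
    return sil_frames / total_frames if total_frames else 0.0

# sample beta: pauses of 2, 1, 0 frames around phonemes of 3 and 2 frames
ratio = pause_ratio([2, 3, 1, 2, 0])  # → 3 / 8 = 0.375
```

Whether 0.375 is "too many pauses" would be judged against the statistically determined reasonable interval the text mentions.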
To give the user targeted guidance once pronunciation defects are identified, on the basis of the above spoken evaluation method the present invention further comprises the step of determining the user's pronunciation defects:
According to the sequences of first phoneme quality scores or second phoneme quality scores obtained when the user reads aloud a plurality of pronunciation units, the phoneme with the lowest score, or the several phonemes with lower scores, are determined by statistics and regarded as the phonemes with pronunciation defects. The one or several lowest-scoring phonemes can be selected and prompted to the user. Further, pronunciation units containing the phonemes in which the user has defects can be selected from a database for the user to practice. This effectively solves the prior art's inability to accurately evaluate the user's pronunciation quality on a particular phoneme or phonemes, and makes it possible to give targeted prompts and provide targeted practice material.
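The defect-diagnosis statistics can be sketched as a simple aggregation: average each phoneme's quality score over everything the user read, and flag the lowest-scoring phonemes. The phoneme labels and scores below are invented.

```python
from collections import defaultdict

def weakest_phonemes(phone_scores, k=3):
    """Average each phoneme's quality score across all read-aloud units
    and return the k lowest-scoring phonemes -- the candidates flagged
    as pronunciation defects. phone_scores: iterable of (phoneme, score)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for phone, score in phone_scores:
        sums[phone] += score
        counts[phone] += 1
    mean = {p: sums[p] / counts[p] for p in sums}
    return sorted(mean, key=mean.get)[:k]

# hypothetical per-phoneme log-likelihood scores pooled over several sentences
worst = weakest_phonemes([("AH", -5.0), ("AH", -7.0),
                          ("TH", -20.0), ("S", -10.0)], k=2)
# → ["TH", "S"]
```

The flagged phonemes would then drive the selection of practice material containing those phonemes from the database.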
Please refer to Fig. 6 and Fig. 7 for the system architecture implementing the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of the present invention. The terminal device comprises:
an audio processing unit, for receiving the user's speech and performing acoustic feature extraction, obtaining the feature vector sequence corresponding to the user's speech;
a storage unit, for storing the weighted FST Q, which corresponds to the specified text and is used for decoding the user's speech;
a decoding unit: the feature vector sequence produced by the audio processing unit and the weighted FST Q from the storage unit are both delivered to the decoding unit, which uses the weighted FST Q and the Viterbi algorithm to decode the feature vector sequence corresponding to the user's speech; taking the pause sound sil into account, it obtains the alignment α from the feature vector sequence to the pause-inclusive phoneme sequence of the text from which the weighted FST Q was produced. The phoneme sequence obtained by this decoding is {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil}, and the count vector of this alignment α is
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where sil denotes the pause sound, n_i denotes the number of frames corresponding to the i-th phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of phonemes contained in the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector;
a pronunciation quality evaluation unit, which computes the goodness of fit between each phoneme and its corresponding feature vector (or feature vector group) to evaluate the phoneme quality score of the user's pronunciation quality on each phoneme.
For its working process, see the description above of the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation.
With reference to Fig. 6, text 1 corresponds to weighted FST Q_1, text 2 corresponds to weighted FST Q_2, and so on, so that each text has its corresponding Q. The weighted FSTs H, C, L are obtained through training in advance; because the minimization and determinization operations have been applied, the Q generated for each given text is small and can easily be downloaded to or stored on the terminal. Regarding Fig. 6, it should be noted that for different texts the weighted FSTs H, C, L used are pre-trained and identical; the figure symbolically shows only two terminals, but in fact the number of terminals is not limited.
The present invention has the following beneficial effects:
(1) No standard pronunciation is required, so the method can be widely applied to evaluation and defect-diagnosis scenarios where no standard pronunciation exists, for example, the evaluation of dialogues or statements produced by a language learner in a natural setting, and the diagnosis of the language proficiency of persons with language disorders in natural situations.
(2) All heavy computation is performed at the server end, including producing the weighted FSTs H, C, L and producing the alignment-dependent weighted FST Q; this advantage of the present invention can be understood with reference to Fig. 6.
(3) Only the computation concerning the user speech is performed on the user terminal, which effectively reduces the load and power consumption of the terminal and lowers its hardware requirements.
(4) The user needs no network access at all while using the terminal, avoiding network traffic consumption.
(5) When a new example is added, it is first processed in the cloud to produce the weighted FST Q corresponding to the new text, which is then downloaded to the terminal. Because the weighted FST Q is small, the download is small and completes quickly. Even when the user goes online to update example data, the update finishes fast. Because the minimization technique for weighted FSTs is used, the download size is minimized.
(6) The evaluation of the user's pronunciation is accurate to the phoneme level and takes the pronunciation quality of context-dependent phonemes into account; the method can identify the phonemes the user pronounces poorly and provide corresponding materials (containing the phonemes with lower pronunciation quality) for targeted practice.
The present invention is applicable to the learning of languages such as English, Chinese and Spanish, and to the diagnosis and evaluation of patients with speech disorders.
Of course, the above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also considered within the protection scope of the present invention.

Claims (11)

1. A phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation, characterized by comprising the following steps:
(1) performing acoustic feature extraction on the user speech to obtain the feature vector corresponding to each frame, and thereby the feature vector sequence corresponding to the user speech;
(2) for a given text, denoting its corresponding phoneme sequence as
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, ..., p_(M-1), sil, p_M, sil}, where sil represents the pause sound; based on the weighted FST Q, using the Viterbi algorithm to perform a decoding operation on the feature vector sequence corresponding to the user speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all;
the count vector of this alignment α is denoted as
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, ..., n_(M-1), ns_(M-1), n_M, ns_M},
where n_i represents the number of frames corresponding to the i-th non-pause phoneme, ns_i represents the number of frames corresponding to the (i+1)-th pause sound, and M is the number of non-pause phonemes corresponding to the sample text; the above decoding process based on the weighted FST Q and the Viterbi algorithm, in addition to providing the alignment, also provides the HMM state corresponding to each feature vector;
wherein Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), in which min represents the minimization operation on weighted FSTs, det represents the determinization operation on weighted FSTs, the symbol ∘ represents the composition operation on weighted FSTs, and π_ε represents the operation of removing epsilon symbols from a weighted FST;
the acoustic model H, the pronunciation dictionary model L and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of weighted-FST-based large-vocabulary continuous speech recognition; for a given text, the language model G corresponding to that text is produced, thereby producing the weighted FST Q corresponding to the text;
(3) for each phoneme, evaluating the pronunciation quality of the user on that phoneme by calculating the goodness of fit between its corresponding feature vector or feature vector group and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
2. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 1, characterized in that, in step (3), the goodness of fit between each phoneme's corresponding feature vector or feature vector group and its mathematical representation is evaluated using the likelihood probability P(O_i|p_i); when the GMM-HMM model is used, the likelihood probability P(O_i|p_i) is the probability that this feature vector group is generated by the HMM of its corresponding phoneme,
P(O_i|p_i) ≈ P(O_i|p_i, S_i) = ∏_{t=1..T_i} b_{s_t}(o_t) · a_{s_t s_{t+1}}
where the approximate equality follows the conventional Viterbi approximation technique; S_i = s_1, s_2, ..., s_{T_i}, s_{T_i+1} is the HMM state sequence corresponding to the feature vector sequence O_i = o_1, o_2, ..., o_{T_i} of phoneme p_i;
s_{T_i+1} is the exit state of phoneme p_i;
a_{s_t s_{t+1}} represents the transition probability between states s_t and s_{t+1};
b_{s_t}(o_t) represents the Gaussian mixture model output probability of state s_t for observation o_t;
the likelihood probability P(O_i|p_i) is used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i;
or the logarithm of the above likelihood probability, ln(P(O_i|p_i)), is used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i.
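The Viterbi-approximated likelihood of claim 2 can be sketched in the log domain, where the product of output and transition probabilities becomes a sum. The sketch below is illustrative only; the function names, the diagonal-covariance GMM parameterization, and the "exit" transition key are assumptions, not the patent's notation:

```python
import math

def log_gauss(o, mean, var):
    """Log-density of a diagonal-covariance Gaussian at observation o."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def log_gmm(o, weights, means, vars_):
    """Log of the GMM output probability b_s(o) = sum_k w_k N(o; mu_k, var_k)."""
    comps = [math.log(w) + log_gauss(o, m, v)
             for w, m, v in zip(weights, means, vars_)]
    mx = max(comps)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

def phoneme_log_likelihood(frames, states, gmms, log_trans):
    """Viterbi-approximated ln P(O_i | p_i): sum, over the aligned path, of the
    state output log-probability plus the transition log-probability; the last
    state transitions to the phoneme's exit state (keyed "exit" here)."""
    total = 0.0
    for t, (o, s) in enumerate(zip(frames, states)):
        weights, means, vars_ = gmms[s]
        total += log_gmm(o, weights, means, vars_)
        nxt = states[t + 1] if t + 1 < len(states) else "exit"
        total += log_trans[(s, nxt)]
    return total
```

With a single one-dimensional Gaussian component of mean 0 and variance 1, one zero-valued frame, and a free exit transition, the result reduces to the standard normal log-density at 0.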
3. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 1, characterized in that the goodness of fit in step (3) between each phoneme and its corresponding feature vector or feature vector group is measured using Pr(p_i|O_i); said Pr(p_i|O_i) is the posterior probability that the feature vector group O_i belongs to its corresponding phoneme p_i, calculated from the likelihood probability P(O_i|p_i) and the prior probability Pr(p_i) by the Bayesian formula, and used as the second phoneme quality score evaluating the user's pronunciation quality.
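The posterior of claim 3 follows from Bayes' rule once a set of competing phonemes is fixed; normalizing over all phonemes of the language is one common choice, assumed here purely for illustration (the function name is also an assumption):

```python
import math

def phoneme_posterior(log_likelihoods, priors, target):
    """Pr(target | O) = P(O|target) Pr(target) / sum_p P(O|p) Pr(p),
    computed in the log domain for numerical stability.

    log_likelihoods: {phoneme: ln P(O | phoneme)}
    priors:          {phoneme: Pr(phoneme)}
    """
    scores = {p: ll + math.log(priors[p]) for p, ll in log_likelihoods.items()}
    mx = max(scores.values())
    # log of the Bayes denominator via log-sum-exp
    denom = mx + math.log(sum(math.exp(s - mx) for s in scores.values()))
    return math.exp(scores[target] - denom)
```

With equal priors, the posterior is simply each likelihood divided by the sum of the competing likelihoods.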
4. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 2 or 3, characterized in that the following formula is used to calculate the comprehensive phoneme-level score of the user reading aloud one pronunciation unit:
where υ_{1,i} is the first phoneme quality score of the i-th phoneme in the user speech, υ_{2,i} is the second phoneme quality score of the i-th phoneme in the user speech, and ω_{1,i} and ω_{2,i} are the corresponding weights.
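The formula referenced in claim 4 is rendered as an image in the source and is not reproduced here. Purely for illustration, and assuming the combination is a per-phoneme weighted sum averaged over the unit (an assumption, not the claimed formula), the computation could look like:

```python
def combined_score(v1, v2, w1, w2):
    """Hypothetical comprehensive phoneme-level score for one pronunciation
    unit: the average over phonemes of w1_i * v1_i + w2_i * v2_i.
    (The exact formula of claim 4 is an image in the source; this weighted
    sum is an assumed form, not the patent's.)"""
    assert len(v1) == len(v2) == len(w1) == len(w2)
    return sum(a * x + b * y for x, y, a, b in zip(v1, v2, w1, w2)) / len(v1)
```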
5. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 4, characterized in that
the weights ω_{1,i} and ω_{2,i} are set manually;
or the weights ω_{1,i} and ω_{2,i} are obtained by machine learning: a plurality of texts corresponding to different pronunciation units are chosen and read aloud by different users; experts evaluate the quality of each user's reading and manually give the corresponding comprehensive phoneme-level score; the optimal weight sequence is then obtained by a machine-learning method.
6. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-5, characterized in that in step (1) Mel-frequency cepstral coefficients and their first- and second-order differences, or linear prediction cepstral coefficients and their first- and second-order differences, are used as the feature vectors.
7. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-5, characterized in that the weighted FST Q is generated at the server end and is either stored directly on the terminal or downloaded to the terminal over the network.
8. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 7, characterized in that the acoustic model H, the pronunciation dictionary model L and the context-dependent phoneme model C are all generated at the server end.
9. The low-power-consumption spoken language evaluation method according to any one of claims 1-5, characterized by further comprising a step of evaluating the fluency of the user speech: the user's fluency is evaluated by counting the pause sounds between non-pause phonemes in the user speech; the more entries with ns_i > 0 there are, the more pauses there are and the poorer the fluency.
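The fluency measure of claim 9 reduces to counting the positive internal pause entries of β. A hypothetical sketch, assuming β is stored as a flat list [ns_0, n_1, ns_1, ..., n_M, ns_M] (the function name and layout are illustrative assumptions):

```python
def fluency_pause_count(beta):
    """Count the internal pauses in the count vector
    beta = [ns_0, n_1, ns_1, ..., n_M, ns_M]: the entries ns_1 .. ns_{M-1}
    (pauses between non-pause phonemes) that are greater than zero.
    More such pauses indicates poorer fluency."""
    internal_pauses = beta[2:-1:2]  # picks out ns_1 .. ns_{M-1}
    return sum(1 for ns in internal_pauses if ns > 0)
```

For β = [2, 3, 1, 2, 0, 4, 1] (M = 3), the internal pauses are ns_1 = 1 and ns_2 = 0, giving a count of 1.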
10. A phoneme-level low-power-consumption spoken language defect diagnosis method without standard pronunciation, characterized in that the user speech is first processed using the phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-9, and a step of determining the phonemes for which the user has pronunciation defects is then performed: according to the sequences of first phoneme quality scores or second phoneme quality scores obtained from the user reading aloud a plurality of pronunciation units, the phoneme with the user's lowest score, or the several phonemes with lower scores, are identified as the phonemes with pronunciation defects.
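Claim 10's selection of defective phonemes can be sketched as ranking phonemes by score across many pronunciation units. Averaging repeated observations of a phoneme and taking a fixed cutoff k are illustrative assumptions not specified by the claim:

```python
from collections import defaultdict

def defective_phonemes(scored_phonemes, k=3):
    """From (phoneme, score) pairs gathered across many pronunciation units,
    average each phoneme's scores and return the k lowest-scoring phonemes
    as likely pronunciation defects. The averaging and the cutoff k are
    assumptions made for this sketch, not part of claim 10."""
    scores_by_phoneme = defaultdict(list)
    for phoneme, score in scored_phonemes:
        scores_by_phoneme[phoneme].append(score)
    means = {p: sum(s) / len(s) for p, s in scores_by_phoneme.items()}
    return sorted(means, key=means.get)[:k]
```

The returned phonemes can then drive claim 11's selection of practice material containing those phonemes.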
11. The phoneme-level low-power-consumption spoken language defect diagnosis method without standard pronunciation according to claim 10, characterized in that pronunciation units containing the phonemes with said pronunciation defects are selected from a database for the user to practise.
CN201410229186.8A 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation Pending CN103985391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410229186.8A CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201410151506.2 2014-04-16
CN201410151506 2014-04-16
CN201410229186.8A CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Publications (1)

Publication Number Publication Date
CN103985391A true CN103985391A (en) 2014-08-13

Family

ID=51277334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410229186.8A Pending CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Country Status (1)

Country Link
CN (1) CN103985391A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN107077863A (en) * 2014-08-15 2017-08-18 智能-枢纽私人有限公司 Method and system for the auxiliary improvement user speech in appointed language
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN109300345A (en) * 2018-11-20 2019-02-01 深圳市神经科学研究院 A kind of shorthand nomenclature training method and device
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN110010123A (en) * 2018-01-16 2019-07-12 上海异构网络科技有限公司 English phonetic word pronunciation learning evaluation system and method
CN110047466A (en) * 2019-04-16 2019-07-23 深圳市数字星河科技有限公司 A kind of method of open creation massage voice reading standard reference model
CN110111779A (en) * 2018-01-29 2019-08-09 阿里巴巴集团控股有限公司 Syntactic model generation method and device, audio recognition method and device
CN110136747A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112908361A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation system based on small granularity
CN112906369A (en) * 2021-02-19 2021-06-04 脸萌有限公司 Lyric file generation method and device
CN113421587A (en) * 2021-06-02 2021-09-21 网易有道信息技术(北京)有限公司 Voice evaluation method and device, computing equipment and storage medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008040035A (en) * 2006-08-04 2008-02-21 Advanced Telecommunication Research Institute International Pronunciation evaluation apparatus and program
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101887725A (en) * 2010-04-30 2010-11-17 中国科学院声学研究所 Phoneme confusion network-based phoneme posterior probability calculation method
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008040035A (en) * 2006-08-04 2008-02-21 Advanced Telecommunication Research Institute International Pronunciation evaluation apparatus and program
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101645271B (en) * 2008-12-23 2011-12-07 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101887725A (en) * 2010-04-30 2010-11-17 中国科学院声学研究所 Phoneme confusion network-based phoneme posterior probability calculation method
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiao Ji: "Construction of a Speech Recognition System Based on Finite-State Graphs", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077863A (en) * 2014-08-15 2017-08-18 智能-枢纽私人有限公司 Method and system for the auxiliary improvement user speech in appointed language
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107644638B (en) * 2017-10-17 2019-01-04 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer readable storage medium
CN110010123A (en) * 2018-01-16 2019-07-12 上海异构网络科技有限公司 English phonetic word pronunciation learning evaluation system and method
CN110111779B (en) * 2018-01-29 2023-12-26 阿里巴巴集团控股有限公司 Grammar model generation method and device and voice recognition method and device
CN110111779A (en) * 2018-01-29 2019-08-09 阿里巴巴集团控股有限公司 Syntactic model generation method and device, audio recognition method and device
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN110148427B (en) * 2018-08-22 2024-04-19 腾讯数码(天津)有限公司 Audio processing method, device, system, storage medium, terminal and server
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN109300345A (en) * 2018-11-20 2019-02-01 深圳市神经科学研究院 A kind of shorthand nomenclature training method and device
CN110047466B (en) * 2019-04-16 2021-04-13 深圳市数字星河科技有限公司 Method for openly creating voice reading standard reference model
CN110047466A (en) * 2019-04-16 2019-07-23 深圳市数字星河科技有限公司 A kind of method of open creation massage voice reading standard reference model
CN110136747A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
CN112466335B (en) * 2020-11-04 2023-09-29 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112397056B (en) * 2021-01-20 2021-04-09 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112908361A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation system based on small granularity
CN112906369A (en) * 2021-02-19 2021-06-04 脸萌有限公司 Lyric file generation method and device
CN113421587A (en) * 2021-06-02 2021-09-21 网易有道信息技术(北京)有限公司 Voice evaluation method and device, computing equipment and storage medium
CN113421587B (en) * 2021-06-02 2023-10-13 网易有道信息技术(北京)有限公司 Voice evaluation method, device, computing equipment and storage medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103985391A (en) Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN103985392A (en) Phoneme-level low-power consumption spoken language assessment and defect diagnosis method
US7840404B2 (en) Method and system for using automatic generation of speech features to provide diagnostic feedback
US7392187B2 (en) Method and system for the automatic generation of speech features for scoring high entropy speech
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
US20110123965A1 (en) Speech Processing and Learning
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
JPH10222190A (en) Sounding measuring device and method
Cheng Automatic assessment of prosody in high-stakes English tests.
CN103594087A (en) Method and system for improving oral evaluation performance
Cucchiarini et al. Automatic speech recognition for second language pronunciation training
CN109326281A (en) Prosodic labeling method, apparatus and equipment
Duan et al. A Preliminary study on ASR-based detection of Chinese mispronunciation by Japanese learners
CN112802456A (en) Voice evaluation scoring method and device, electronic equipment and storage medium
CN109697975B (en) Voice evaluation method and device
Peabody et al. Towards automatic tone correction in non-native mandarin
Zechner et al. Automatic scoring of children’s read-aloud text passages and word lists
US8768697B2 (en) Method for measuring speech characteristics
Nakagawa et al. A statistical method of evaluating pronunciation proficiency for English words spoken by Japanese
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Bang et al. An automatic feedback system for English speaking integrating pronunciation and prosody assessments
Li et al. English sentence pronunciation evaluation using rhythm and intonation
CN111508523A (en) Voice training prompting method and system
Li et al. Speech interaction of educational robot based on Ekho and Sphinx
Bao et al. An Auxiliary Teaching System for Spoken English Based on Speech Recognition Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140813

WD01 Invention patent application deemed withdrawn after publication