CN103985391A - Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation - Google Patents


Info

Publication number
CN103985391A
CN103985391A (application CN201410229186.8A)
Authority
CN
China
Prior art keywords
phoneme
user
pronunciation
fst
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410229186.8A
Other languages
Chinese (zh)
Inventor
柳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201410229186.8A
Publication of CN103985391A
Legal status: Pending

Abstract

The invention discloses a phoneme-level, low-power-consumption spoken language evaluation and defect diagnosis method that requires no standard pronunciation. The method comprises the following steps: (1) acoustic features are extracted from the user's speech to obtain a feature vector sequence; (2) the feature vector sequence of the user's speech is decoded with the Viterbi algorithm based on a weighted finite-state transducer (WFST) Q, yielding the mapping from the feature vector sequence to a phoneme sequence; (3) for each phoneme, the user's pronunciation quality on that phoneme is evaluated by computing the goodness of fit between the feature vector set corresponding to the phoneme and the phoneme's mathematical representation in the acoustic model H. The method has the following advantages: it does not depend on a standard pronunciation; all heavy computation is executed on the server side, so the computation required of the user terminal is minimized, effectively reducing the terminal's load and power consumption; no network connection is needed while the terminal is in use, so no network traffic is consumed; and the phonemes the user pronounces poorly can be identified, so that targeted training can be provided.

Description

Phoneme-level low-power-consumption spoken language evaluation and defect diagnosis method without standard pronunciation
Technical field
The present invention relates to the fields of computer-assisted language learning and speech recognition technology, and specifically to a phoneme-level (phone-level), low-power-consumption spoken language evaluation and defect diagnosis method. Working at the phone level allows scoring and feedback to be refined to individual phonemes: after the user has read a series of texts, the core phonemes that keep the user's pronunciation from being accurate can be identified, so that corresponding training material and targeted practice can be provided. The method is applicable to the study of languages such as English, Chinese, and Spanish, and to the diagnosis and assessment of patients with speech disorders.
Background art
Language learning is largely imitation, especially on the phonetic side. Taking English as an example, the best way to train accurate spoken English is to read along with the pure pronunciation of native English speakers, and many existing courses and teaching materials are built on exactly this idea. Essentially, these materials only provide sample pronunciations: the student must judge the difference between his or her own pronunciation and the standard pronunciation, and then decide how to improve. The limitations of this approach are as follows:
1. Because the sound one hears of one's own voice differs from the sound others hear, the student perceives his or her own voice differently from how others perceive it, and therefore cannot objectively score the quality of his or her own pronunciation.
2. Recording can compensate for this defect, and this is the scheme adopted by various language repeaters, but switching back and forth between recordings for comparison is cumbersome and reduces learning efficiency.
3. Even setting the above factors aside, the evaluation by the student (or even a teacher) remains subjective and qualitative; objective, quantitative assessment is impossible, and the student does not know how to improve. Because users cannot accurately identify their own pronunciation defects, they cannot practice in a targeted way.
Summary of the invention
In view of the above defects of the prior art, the technical problem solved by the present invention is to provide a spoken language evaluation method refined to the phone level, which can evaluate the user's pronunciation phoneme by phoneme and provide a more accurate score.
To solve the above technical problem, the invention first provides a phoneme-level, low-power-consumption spoken language evaluation method without standard pronunciation, suitable for use when no standard pronunciation is available, comprising the following steps:
(1) Acoustic features are extracted from the user's speech, obtaining the feature vector corresponding to each frame and hence the feature vector sequence corresponding to the user's speech;
(2) For the given text, its corresponding phoneme sequence is denoted
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil},
where sil denotes the pause sound. Based on the weighted FST Q, the Viterbi algorithm is used to decode the feature vector sequence corresponding to the user's speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all. The count vector of this alignment α is denoted
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where n_i denotes the number of frames corresponding to the i-th non-pause phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of non-pause phonemes corresponding to the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector;
Here Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), where min denotes the minimization operation on weighted FSTs, det denotes the determinization operation, the symbol ∘ denotes the composition operation on weighted FSTs, and π_ε denotes the operation of removing epsilon symbols from a weighted FST;
The acoustic model H, the pronunciation dictionary model L, and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of WFST-based large-vocabulary continuous speech recognition; for the given text, the corresponding language model G is produced, and from it the weighted FST Q corresponding to that text;
(3) For each phoneme, the user's pronunciation quality on that phoneme can be evaluated by computing the goodness of fit between its corresponding feature vector (or feature vector group) and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
Further, the present invention also provides a phoneme-level, low-power-consumption spoken defect diagnosis method without standard pronunciation, characterized in that the user's speech is first evaluated with the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of any one of claims 1-9, and the method then further comprises the step of determining the phonemes in which the user has pronunciation defects: according to the sequences of first phoneme quality scores or second phoneme quality scores obtained when the user reads aloud a plurality of pronunciation units, the phoneme with the lowest score, or the several phonemes with the lowest scores, are taken as the phonemes with pronunciation defects.
The phoneme-level, low-power-consumption spoken evaluation and defect diagnosis methods without standard pronunciation of the present invention have the following beneficial effects:
1. No standard pronunciation is required, so the methods can be widely applied to evaluation and defect diagnosis scenarios without standard pronunciation, for example the evaluation of a language learner's dialogue or statements in natural settings, and the diagnosis of the language proficiency of speech-impaired persons in natural situations.
2. All heavy computation is performed on the server side, including producing the weighted FSTs H, C, L and producing the weighted FST Q on which the phoneme alignment depends.
3. Only the computation concerning the user's speech is executed on the user terminal, which effectively reduces the terminal's load and power consumption and lowers the hardware requirements of the terminal.
4. The user needs no network connection at all while using the terminal, avoiding network traffic consumption.
5. When a new example is added, it is first processed in the cloud to produce the weighted FST Q corresponding to the new text, which is then downloaded to the terminal. Because the weighted FST Q is small, the download is small and completes quickly; even when the user updates data examples over the network, the update finishes fast. Because the minimization technique for weighted FSTs is used, the download size is minimized.
Brief description of the drawings
Fig. 1 is a schematic diagram of the speech processing procedure.
Fig. 2 is a schematic diagram of the process of producing the weighted FSTs H, C, L by speech recognition training.
Fig. 3 is a schematic diagram of decoding a speech file based on the weighted FST Q in the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of an embodiment of the present invention.
Fig. 4 is a flowchart of the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of an embodiment of the present invention.
Fig. 5 is a schematic diagram of the hidden Markov model most commonly used as a phoneme model.
Fig. 6 is a structural block diagram of the connection between the device implementing the spoken evaluation method of the present invention (the terminal in the figure) and the server, in an embodiment of the present invention.
Fig. 7 is a structural schematic diagram of the connection between the device implementing the spoken evaluation method of the present invention and the server, in an embodiment of the present invention.
Specific embodiments
Before introducing embodiments of the invention, the relevant techniques of the speech processing field are introduced, in order to make the present invention easier to understand.
As shown in Fig. 1, the process by which speech processing extracts feature vectors is roughly divided into the following three steps:
i. The originally acquired sound is waveform data of amplitude over time;
ii. A fixed time length (e.g. 25 ms) is defined as one frame, and each frame is shifted forward by another time interval (e.g. 10 ms), so that consecutive frames overlap to a certain extent (e.g. 15 ms);
iii. Signal processing is performed on each frame to obtain each frame's feature vector; for example, the more common practice in industry today is to use MFCC (Mel-frequency cepstral coefficient) features together with their first- and second-order differences, 39 dimensions in total. The MFCC algorithm is a known technique in the art; see the Chinese invention patent application with publication number CN1763843A, which is not repeated here. Alternatively, linear prediction cepstral coefficients (LPCC) together with their first- and second-order differences can be used as feature vectors; LPCC is likewise a conventional technical means and is not repeated here.
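The framing scheme in step ii can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the 16 kHz sample rate and the use of NumPy are assumptions, and a real front end would follow this with pre-emphasis, windowing, and the MFCC transform (e.g. via a library such as librosa).

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames: 25 ms window, 10 ms hop,
    so consecutive frames overlap by 15 ms, as described above."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# one second of silence at 16 kHz yields 98 frames of 400 samples each
frames = frame_signal(np.zeros(16000))
```

Each frame would then be mapped to one 39-dimensional feature vector, giving the feature vector sequence used by the decoding step.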
Next, WFST-based large-vocabulary continuous speech recognition (LVCSR) is introduced. For the detailed process of LVCSR, refer to the Chinese invention patent application with publication number CN102779508A, entitled "Speech library generation device and method, and speech synthesis system and method"; LVCSR is an ordinary technical means in the art and is likewise not repeated here. For the method of obtaining the weighted FSTs H, C, L through the LVCSR training process, see "Speech Recognition with Weighted Finite-State Transducers" by Mehryar Mohri et al. (available on the New York University website at http://www.cs.nyu.edu/~mohri/pub/hbka.pdf). Specifically, with reference to Fig. 2, the weighted FSTs H, C, L can be obtained from a large corpus together with a pronunciation dictionary through the training process of speech recognition. In fact, the weighted FSTs H, C, L are respectively the acoustic mathematical model of phonemes, the context-dependent phoneme (in some documents, "phone") model, and the pronunciation dictionary model.
The weighted FSTs H, C, L are obtained on the server side through the training process of a speech recognition program. Once trained, they are reusable and need not be regenerated, unless retraining is required because the corpus has grown or changed, or because the language is different (a dialect is also considered a different language here).
The language model G is briefly described below. For a given text, its corresponding language model G (the FST concerning grammar) is simply a finite state automaton (FSA) of transitions from one word to the next, each with transition probability 1; a finite state automaton is a special kind of FST. In general, for a large corpus, the language model G is more complicated, but for a given text its language model G is determined, because for a given text the relations between phonemes, and between words (characters, in the case of Chinese), are determined.
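Under the assumption above — that G for a fixed text is a linear chain of probability-1 transitions — such an acceptor can be sketched as follows. The dict representation is purely illustrative; a production system would build G with an FST toolkit such as OpenFst.

```python
def linear_g(words):
    """Build the linear-chain acceptor G for a fixed text:
    state i --word_i--> state i+1 with probability 1
    (weight 0.0 in the -log semiring)."""
    arcs = [(i, i + 1, w, 0.0) for i, w in enumerate(words)]  # (src, dst, label, weight)
    return {"arcs": arcs, "start": 0, "final": len(words)}

# G for the three-word text "how are you": 4 states, 3 arcs
g = linear_g(["how", "are", "you"])
```

Because G is this small and fully determined by the text, the composed FST Q specific to the text is also small, which is what keeps the download to the terminal small.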
The language model G (the FST concerning grammar) and the weighted FSTs H, C, L are combined into one weighted FST Q for this given text (and the corresponding pronunciation unit), where Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))): min (minimization) denotes the minimization operation on weighted FSTs, det (determinization) denotes the determinization operation, the symbol ∘ denotes the composition operation on weighted FSTs, and π_ε denotes the operation of removing epsilon symbols from a weighted FST. For the relevant techniques of processing speech with weighted FSTs, refer to "Construction of a Speech Recognition System Based on Finite-State Directed Graphs" (author Xiao Ji, Tsinghua University master's thesis, published May 2011). Note that that thesis is written from the perspective of speech recognition, so the G it uses is more complicated, whereas the G and Q used in the present invention are for a specific text and therefore occupy little storage space, so the download is small.
Before describing the embodiments of the invention, it should be stated that a traditional speech recognition program composes H, C, L with a general language model, such as a general n-gram language model, to produce a general-purpose speech recognizer, which requires huge storage space and a complicated decoding algorithm (hence naturally high power consumption). In the present invention, by contrast, we compose the (determined) G corresponding to each given pronunciation unit (corresponding to a known text) with H, C, L, producing a speech recognizer used only to recognize this given pronunciation unit, namely the weighted FST Q mentioned above, and on this basis the alignment between the speech feature vector sequence and the phoneme sequence mentioned above is produced. The characteristic of this weighted FST Q is that it is produced for this given text (pronunciation unit), is specific to this pronunciation unit, and can only recognize this pronunciation unit; therefore it occupies less storage space, is convenient to store and use, and its corresponding decoding algorithm is simple (hence naturally low power consumption). Even if what the user actually reads is other speech, what the decoding algorithm based on the weighted FST Q produces is still the given text. That is to say, the text, the pronunciation unit, and the weighted FST Q are in one-to-one correspondence.
In addition, the abbreviations used in the claims and description of this application are explained once:
HMM: hidden Markov model
GMM: Gaussian mixture model
Viterbi algorithm: the problem this algorithm solves is to estimate the most probable hidden sequence behind a sequence of observations.
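The Viterbi algorithm just described can be illustrated with a minimal sketch for a discrete-observation HMM. This is a textbook implementation, not the patent's decoder, and all model numbers in the toy example are invented.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most probable hidden state sequence for a discrete-observation HMM.
    log_init[s]: log P(state s at t=0); log_trans[s, s2]: log P(s2 | s);
    log_emit[s, o]: log P(observation o | state s)."""
    T, S = len(obs), len(log_init)
    delta = np.full((T, S), -np.inf)     # best log score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# toy 2-state model: state 0 prefers symbol 0, state 1 prefers symbol 1
log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1, 1], log_init, log_trans, log_emit)  # → [0, 0, 1, 1]
```

In the method of this patent, the same dynamic program runs over the HMM states encoded in the weighted FST Q, producing both the decoded phoneme sequence and the frame-to-state alignment.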
In the claims and description of this application, a "pronunciation unit" generally refers to a sentence; of course, a "pronunciation unit" may also be a word, a phrase, a paragraph, or even an entire article. The only difference is that a word or phrase can be regarded as a short sentence, and a paragraph or article as a combination of several sentences. Therefore, the specific embodiments of the invention are generally described with the sentence as the unit, and apply equally to the evaluation of words, phrases, paragraphs, and even entire articles.
The specific embodiments of the present invention are described below with reference to the accompanying drawings.
The feature of the present embodiment is that no standard pronunciation is used: for a given text, the user's speech is evaluated directly. In one embodiment, the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation performs the following steps in order:
(1) Acoustic features are extracted from the user's speech, obtaining the feature vector corresponding to each frame and hence the feature vector sequence corresponding to the user's speech;
(2) For the given text, without considering pauses, the phoneme sequence it contains is determined, namely {p_1, p_2, …, p_M}; if pauses are considered, its corresponding phoneme sequence is denoted
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil},
where sil denotes the pause sound. Based on the weighted FST Q, the Viterbi algorithm is used to decode the feature vector sequence corresponding to the user's speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all.
The count vector of this alignment α is denoted
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where n_i denotes the number of frames corresponding to the i-th non-pause phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of non-pause phonemes corresponding to the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector. Here Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), where min denotes the minimization operation on weighted FSTs, det denotes the determinization operation, ∘ denotes the composition operation, and π_ε denotes the operation of removing epsilon symbols from a weighted FST;
The acoustic model H, the pronunciation dictionary model L, and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of WFST-based large-vocabulary continuous speech recognition; for the given text, the corresponding language model G is produced, and from it the weighted FST Q corresponding to that text;
(3) For each phoneme, the user's pronunciation quality on that phoneme can be evaluated by computing the goodness of fit between its corresponding feature vector (or feature vector group) and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
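The count vector β of step (2) can be illustrated with a small sketch: given hypothetical per-frame phoneme labels from an alignment α and the non-pause phoneme sequence, collect ns_0, n_1, ns_1, …, n_M, ns_M. The labels and phonemes below are invented, and this simplified walk assumes no two identical phonemes are adjacent without an intervening pause.

```python
def count_vector(frame_labels, phones):
    """Compute beta = [ns_0, n_1, ns_1, ..., n_M, ns_M] from per-frame
    phoneme labels. Expected label order is sil p_1 sil p_2 ... p_M sil;
    a pause segment of length zero simply contributes ns_i = 0."""
    expected = ["sil"]
    for p in phones:
        expected += [p, "sil"]
    beta, idx = [], 0
    for sym in expected:
        n = 0
        while idx < len(frame_labels) and frame_labels[idx] == sym:
            n += 1
            idx += 1
        beta.append(n)
    return beta

# hypothetical alignment: 2 pause frames, 3 frames of AH, 1 pause,
# 2 frames of B, and no final pause
beta = count_vector(["sil", "sil", "AH", "AH", "AH", "sil", "B", "B"],
                    ["AH", "B"])
# → [2, 3, 1, 2, 0]
```

The even positions of β (the ns_i) are what the fluency measure later operates on; the odd positions (the n_i) give each phoneme's frame span for scoring.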
Please refer to Fig. 3. When the speech file of the user's speech is processed, because the weighted FST Q corresponding to the given text is used, what decoding produces is exactly the given text (each weighted FST Q corresponds one-to-one with a text). The decoding process based on the weighted FST Q and the Viterbi algorithm is known prior art, and the Viterbi algorithm is a mature algorithm in the field of speech recognition, so neither is described further. The weighted FST Q in Fig. 3 is generated in advance from the given text and then stored in the terminal device, or downloaded to the terminal device over the network, and can be stored in correspondence with the text. In fact, because the weighted FST Q already contains the information of the text, it is possible to store only the weighted FST Q.
Referring to Fig. 4 for an understanding of embodiments of the invention: during decoding, the alignment between the phoneme sequence and the feature vector sequence is produced at the same time. By evaluating the goodness of fit between the feature vectors (or feature vector groups) in this alignment and their mathematical representations in the acoustic model H, the user's pronunciation quality on each phoneme can be evaluated; the higher the goodness of fit, the better the pronunciation quality.
The acoustic model H, pronunciation dictionary model L, and context-dependent phoneme model C are obtained by training; they only need to be trained once on the server side and are then reusable. The process of generating the weighted FST Q can be performed either on the server side or on the user terminal; it is preferably performed on the server side, to reduce the computational demand on the user terminal. As a preferred embodiment, in this embodiment the weighted FST Q is generated on the server side and then either stored directly in the terminal or downloaded to the terminal over the network. For a specified text, the weighted FST Q is small and the download is small, and the heavy computation is all carried out on the server side, minimizing the burden on the user terminal. Of course, if the computing power of the user terminal is not a concern, these computations can also be performed on the user terminal. User terminals include common smart devices such as desktop computers, notebook computers, tablet computers, and even smartphones, and can also be smart devices with computing power such as learning machines, language repeaters, and guided-reading machines.
Specifically, in step (3), to measure the user's pronunciation quality on each phoneme, the goodness of fit between each phoneme's corresponding feature vector (or feature vector group) and its mathematical representation can be used; the goodness of fit can be evaluated with the likelihood probability P(O_i | p_i). When a GMM-HMM model is used (see Fig. 5 for the graphical model of the GMM-HMM), the likelihood P(O_i | p_i) represents the probability that the feature vector group is generated by the HMM of its corresponding phoneme:
P(O_i | p_i) ≈ P(O_i | p_i, S_i) = ∏_{t=1}^{T_i} b_{s_t}(o_t) · a_{s_t s_{t+1}}
where the approximate equality follows the conventional Viterbi approximation; S_i = s_1, s_2, …, s_{T_i} is the HMM state sequence corresponding to phoneme p_i's feature vector sequence O_i = o_1, o_2, …, o_{T_i};
s_{T_i+1} is the exit state of phoneme p_i;
a_{s_t s_{t+1}} denotes the transition probability between states s_t and s_{t+1};
b_{s_t}(o_t) denotes the Gaussian mixture model (the output density) of state s_t.
Usually the likelihood P(O_i | p_i) can be used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i. In practice, however, to prevent numerical underflow, the logarithm of this likelihood, ln(P(O_i | p_i)), is usually used as the first phoneme quality score.
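Working in the log domain turns the product of probabilities into a sum, which is the underflow-avoiding trick just mentioned. A minimal sketch of the first phoneme quality score, with all per-frame numbers invented for illustration:

```python
def phone_log_likelihood(log_emits, log_trans):
    """First phoneme quality score ln P(O_i | p_i) under the Viterbi
    approximation: the sum of the per-frame log emission scores
    ln b_{s_t}(o_t) and the log transition probabilities
    ln a_{s_t s_{t+1}} along the aligned state path."""
    return sum(log_emits) + sum(log_trans)

# made-up log scores for a 3-frame phoneme segment
score = phone_log_likelihood([-2.1, -1.8, -2.4], [-0.1, -0.3, -0.2])  # ≈ -6.9
```

Multiplying, say, 300 frame probabilities each around 0.1 would underflow any floating-point type, whereas the corresponding log sum is simply about -690.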
In addition, the user's pronunciation quality can be evaluated from another dimension; continue to refer to the GMM-HMM schematic shown in Fig. 5. The goodness of fit between each phoneme and its corresponding feature vector (or feature vector group) is measured with Pr(p_i | O_i), the posterior probability that O_i belongs to its corresponding phoneme p_i. This posterior Pr(p_i | O_i) is computed from the likelihood P(O_i | p_i) and the prior probability Pr(p_i) by Bayes' formula, and serves as the second phoneme quality score of the user's pronunciation quality. The prior probability Pr(p_i) is obtained from large-scale statistics, a routine technical means in the prior art, and is not described further.
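The Bayes computation of the second score can be sketched as follows; the candidate phoneme set, likelihoods, and priors in the example are all invented, and normalizing over a set of competing phonemes is one common way to evaluate the denominator of Bayes' formula.

```python
import math

def phone_posterior(log_lik, log_prior, target):
    """Second phoneme quality score Pr(p | O) by Bayes' formula:
    Pr(p | O) = P(O | p) Pr(p) / sum_q P(O | q) Pr(q),
    normalized over a candidate phoneme set; log-sum-exp keeps it stable."""
    joint = {p: log_lik[p] + log_prior[p] for p in log_lik}
    m = max(joint.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return math.exp(joint[target] - log_z)

# made-up likelihoods and uniform priors over two candidate phonemes
log_lik = {"AH": math.log(0.6), "EH": math.log(0.3)}
log_prior = {"AH": math.log(0.5), "EH": math.log(0.5)}
post = phone_posterior(log_lik, log_prior, "AH")  # → 0.6/(0.6+0.3) = 2/3
```

Unlike the raw likelihood, the posterior is bounded in [0, 1], which makes it easier to compare across phonemes of different lengths.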
The above describes measuring the pronunciation quality of an individual phoneme (monophone or triphone) at the phone level; in addition, the overall pronunciation quality of a pronunciation unit composed into sentences can be evaluated. Because words and phrases can be regarded as short sentences, and a paragraph as a combination of several sentences, the sentence is used throughout as the pronunciation unit to be measured; the cases of words, phrases, and paragraphs follow by analogy with the sentence. The comprehensive phoneme score of the pronunciation unit read aloud by the user is computed with the following formula:
score = Σ_i (ω_{1,i} · υ_{1,i} + ω_{2,i} · υ_{2,i}),
where υ_{1,i} is the first phoneme quality score of the i-th phoneme in the user's speech, υ_{2,i} is the second phoneme quality score of the i-th phoneme in the user's speech, and ω_{1,i} and ω_{2,i} are the corresponding weights.
The weights ω_{1,i} and ω_{2,i} can be set by hand. They can also be obtained by machine learning: choose a plurality of texts corresponding to different pronunciation units, have different users read them aloud, have experts evaluate the quality of each user's reading and manually give the corresponding phone-level comprehensive scores, and obtain the optimal weight sequence by a machine learning method.
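One simple instance of the machine-learning fitting mentioned above is a least-squares fit of the weights to expert scores. This sketch assumes a single weight pair shared across phonemes for brevity (per-phoneme weights would fit one pair per phoneme), and the toy data are invented.

```python
import numpy as np

def fit_score_weights(v1, v2, expert):
    """Fit (omega_1, omega_2) so that omega_1 * v1 + omega_2 * v2
    best matches the expert scores in the least-squares sense."""
    X = np.stack([np.asarray(v1, float), np.asarray(v2, float)], axis=1)
    w, *_ = np.linalg.lstsq(X, np.asarray(expert, float), rcond=None)
    return w

# toy data: the expert scores happen to equal 2*v1 + 3*v2 exactly
w = fit_score_weights([1.0, 2.0, 3.0], [1.0, 1.0, 2.0], [5.0, 7.0, 12.0])  # ≈ [2, 3]
```

With more utterances than weights, the same call returns the least-squares optimum even when no exact fit exists, which is the usual situation with noisy expert ratings.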
In the first embodiment of the invention, in step (1), Mel-frequency cepstral coefficients (MFCC) together with their first- and second-order differences, or linear prediction cepstral coefficients (LPCC) together with their first- and second-order differences, are used as the feature vectors.
All of the above evaluates the pronunciation quality of individual phonemes of the user. In addition, the user's fluency also needs to be evaluated, which is done by quantitatively evaluating the pause sounds sil in the user's speech: the number of feature vectors corresponding to the i-th pause sil is ns_i, and the more counts with ns_i > 0 there are, the more pauses there are and the worse the pronunciation quality. Alternatively, the pronunciation quality of the pronunciation unit can be measured by computing the proportion of pause sound, namely the ratio of the number of feature vectors corresponding to pauses to the total number of feature vectors of the whole pronunciation unit. Specifically, the following formula can be used:
ratio = Σ_{i=0}^{M} ns_i / (Σ_{i=1}^{M} n_i + Σ_{i=0}^{M} ns_i).
The larger this ratio, the more pauses there are and the worse the fluency. This ratio should lie in a reasonable range: too large means there are too many pauses, and too small means there is no pause where there should be one; the reasonable interval can be determined from large-scale statistics.
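The pause-based fluency measure operates directly on the count vector β from the alignment; the sample β below is invented for illustration.

```python
def pause_ratio(beta):
    """Fluency measure: fraction of all frames that belong to pauses.
    beta = [ns_0, n_1, ns_1, ..., n_M, ns_M]; the even indices hold the
    pause counts ns_i and the odd indices the phoneme counts n_i."""
    sil_frames = sum(beta[0::2])
    total_frames = sum(beta)
    return sil_frames / total_frames if total_frames else 0.0

# sample beta: pauses of 2, 1, 0 frames around phonemes of 3 and 2 frames
ratio = pause_ratio([2, 3, 1, 2, 0])  # → 3 / 8 = 0.375
```

Whether 0.375 is "too many pauses" would be judged against the statistically determined reasonable interval the text mentions.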
To give the user targeted guidance once pronunciation defects are identified, on the basis of the above spoken evaluation method the present invention further comprises the step of determining the user's pronunciation defects:
According to the sequences of first phoneme quality scores or second phoneme quality scores obtained when the user reads aloud a plurality of pronunciation units, the phoneme with the lowest score, or the several phonemes with lower scores, are determined by statistics and regarded as the phonemes with pronunciation defects. The one or several lowest-scoring phonemes can be selected and prompted to the user. Further, pronunciation units containing the phonemes in which the user has defects can be selected from a database for the user to practice. This effectively solves the prior art's inability to accurately evaluate the user's pronunciation quality on a particular phoneme or phonemes, and makes it possible to give targeted prompts and provide targeted practice material.
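The defect-diagnosis statistics can be sketched as a simple aggregation: average each phoneme's quality score over everything the user read, and flag the lowest-scoring phonemes. The phoneme labels and scores below are invented.

```python
from collections import defaultdict

def weakest_phonemes(phone_scores, k=3):
    """Average each phoneme's quality score across all read-aloud units
    and return the k lowest-scoring phonemes -- the candidates flagged
    as pronunciation defects. phone_scores: iterable of (phoneme, score)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for phone, score in phone_scores:
        sums[phone] += score
        counts[phone] += 1
    mean = {p: sums[p] / counts[p] for p in sums}
    return sorted(mean, key=mean.get)[:k]

# hypothetical per-phoneme log-likelihood scores pooled over several sentences
worst = weakest_phonemes([("AH", -5.0), ("AH", -7.0),
                          ("TH", -20.0), ("S", -10.0)], k=2)
# → ["TH", "S"]
```

The flagged phonemes would then drive the selection of practice material containing those phonemes from the database.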
Please refer to Fig. 6 and Fig. 7 for the system architecture implementing the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation of the present invention. The terminal device comprises:
an audio processing unit, for receiving the user's speech and performing acoustic feature extraction, obtaining the feature vector sequence corresponding to the user's speech;
a storage unit, for storing the weighted FST Q, which corresponds to the specified text and is used for decoding the user's speech;
a decoding unit: the feature vector sequence produced by the audio processing unit and the weighted FST Q from the storage unit are both delivered to the decoding unit, which uses the weighted FST Q and the Viterbi algorithm to decode the feature vector sequence corresponding to the user's speech; taking the pause sound sil into account, it obtains the alignment α from the feature vector sequence to the pause-inclusive phoneme sequence of the text from which the weighted FST Q was produced. The phoneme sequence obtained by this decoding is {sil, p_1, sil, p_2, sil, p_3, sil, …, p_(M-1), sil, p_M, sil}, and the count vector of this alignment α is
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, …, n_(M-1), ns_(M-1), n_M, ns_M},
where sil denotes the pause sound, n_i denotes the number of frames corresponding to the i-th phoneme, ns_i denotes the number of frames corresponding to the (i+1)-th pause, and M is the number of phonemes contained in the sample text. Besides the alignment, the decoding process based on the weighted FST Q and the Viterbi algorithm also gives the HMM state corresponding to each feature vector;
a pronunciation quality evaluation unit, which computes the goodness of fit between each phoneme and its corresponding feature vector (or feature vector group) to evaluate the phoneme quality score of the user's pronunciation quality on each phoneme.
For its working process, see the description above of the phoneme-level, low-power-consumption spoken evaluation method without standard pronunciation.
With reference to Fig. 6, text 1 corresponds to weighted FST Q_1, text 2 corresponds to weighted FST Q_2, and so on, so that each text has its corresponding Q. The weighted FSTs H, C, L are obtained through training in advance; because the minimization and determinization operations have been applied, the Q generated for each given text is small and can easily be downloaded to or stored on the terminal. Regarding Fig. 6, it should be noted that for different texts the weighted FSTs H, C, L used are pre-trained and identical; the figure symbolically shows only two terminals, but in fact the number of terminals is not limited.
The present invention has the following beneficial effects:
(1) No standard pronunciation is required, so the method can be widely applied to evaluation and defect-diagnosis scenarios where no standard pronunciation exists, for example, the evaluation of dialogues or statements produced by a language learner in a natural setting, and the diagnosis of the language proficiency of persons with language disorders in natural situations.
(2) All heavy computation is performed at the server end, including producing the weighted FSTs H, C, L and producing the alignment-dependent weighted FST Q; this advantage of the present invention can be understood with reference to Fig. 6.
(3) Only the computation concerning the user speech is performed on the user terminal, which effectively reduces the load and power consumption of the terminal and lowers its hardware requirements.
(4) The user needs no network access at all while using the terminal, avoiding network traffic consumption.
(5) When a new example is added, it is first processed in the cloud to produce the weighted FST Q corresponding to the new text, which is then downloaded to the terminal. Because the weighted FST Q is small, the download is small and completes quickly. Even when the user goes online to update example data, the update finishes fast. Because the minimization technique for weighted FSTs is used, the download size is minimized.
(6) The evaluation of the user's pronunciation is accurate to the phoneme level and takes the pronunciation quality of context-dependent phonemes into account; the method can identify the phonemes the user pronounces poorly and provide corresponding materials (containing the phonemes with lower pronunciation quality) for targeted practice.
The present invention is applicable to the learning of languages such as English, Chinese and Spanish, and to the diagnosis and evaluation of patients with speech disorders.
Of course, the above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also considered within the protection scope of the present invention.

Claims (11)

1. A phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation, characterized by comprising the following steps:
(1) performing acoustic feature extraction on the user speech to obtain the feature vector corresponding to each frame, and thereby the feature vector sequence corresponding to the user speech;
(2) for a given text, denoting its corresponding phoneme sequence as
P_all = {sil, p_1, sil, p_2, sil, p_3, sil, ..., p_(M-1), sil, p_M, sil}, where sil represents the pause sound; based on the weighted FST Q, using the Viterbi algorithm to perform a decoding operation on the feature vector sequence corresponding to the user speech, obtaining the alignment α from the feature vector sequence to the above phoneme sequence P_all;
the count vector of this alignment α is denoted as
β = {ns_0, n_1, ns_1, n_2, ns_2, n_3, ns_3, ..., n_(M-1), ns_(M-1), n_M, ns_M},
where n_i represents the number of frames corresponding to the i-th non-pause phoneme, ns_i represents the number of frames corresponding to the (i+1)-th pause sound, and M is the number of non-pause phonemes corresponding to the sample text; the above decoding process based on the weighted FST Q and the Viterbi algorithm, in addition to providing the alignment, also provides the HMM state corresponding to each feature vector;
wherein Q = π_ε(min(det(H ∘ det(C ∘ det(L ∘ G))))), in which min represents the minimization operation on weighted FSTs, det represents the determinization operation on weighted FSTs, the symbol ∘ represents the composition operation on weighted FSTs, and π_ε represents the operation of removing epsilon symbols from a weighted FST;
the acoustic model H, the pronunciation dictionary model L and the context-dependent phoneme model C are all weighted FSTs, obtained by the training process of weighted-FST-based large-vocabulary continuous speech recognition; for a given text, the language model G corresponding to that text is produced, thereby producing the weighted FST Q corresponding to the text;
(3) for each phoneme, evaluating the pronunciation quality of the user on that phoneme by calculating the goodness of fit between its corresponding feature vector or feature vector group and its mathematical representation in the acoustic model H; the higher the goodness of fit, the better the pronunciation quality.
2. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 1, characterized in that, in step (3), the goodness of fit between each phoneme's corresponding feature vector or feature vector group and its mathematical representation is evaluated using the likelihood probability P(O_i|p_i); when the GMM-HMM model is used, the likelihood probability P(O_i|p_i) is the probability that this feature vector group is generated by the HMM of its corresponding phoneme,
P(O_i|p_i) ≈ P(O_i|p_i, S_i) = ∏_{t=1..T_i} b_{s_t}(o_t) · a_{s_t s_{t+1}}
where the approximate equality follows the conventional Viterbi approximation technique; S_i = s_1, s_2, ..., s_{T_i}, s_{T_i+1} is the HMM state sequence corresponding to the feature vector sequence O_i = o_1, o_2, ..., o_{T_i} of phoneme p_i;
s_{T_i+1} is the exit state of phoneme p_i;
a_{s_t s_{t+1}} represents the transition probability between states s_t and s_{t+1};
b_{s_t}(o_t) represents the Gaussian mixture model output probability of state s_t for observation o_t;
the likelihood probability P(O_i|p_i) is used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i;
or the logarithm of the above likelihood probability, ln(P(O_i|p_i)), is used as the first phoneme quality score evaluating the user's pronunciation quality on phoneme p_i.
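The Viterbi-approximated likelihood of claim 2 can be sketched in the log domain, where the product of output and transition probabilities becomes a sum. The sketch below is illustrative only; the function names, the diagonal-covariance GMM parameterization, and the "exit" transition key are assumptions, not the patent's notation:

```python
import math

def log_gauss(o, mean, var):
    """Log-density of a diagonal-covariance Gaussian at observation o."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def log_gmm(o, weights, means, vars_):
    """Log of the GMM output probability b_s(o) = sum_k w_k N(o; mu_k, var_k)."""
    comps = [math.log(w) + log_gauss(o, m, v)
             for w, m, v in zip(weights, means, vars_)]
    mx = max(comps)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

def phoneme_log_likelihood(frames, states, gmms, log_trans):
    """Viterbi-approximated ln P(O_i | p_i): sum, over the aligned path, of the
    state output log-probability plus the transition log-probability; the last
    state transitions to the phoneme's exit state (keyed "exit" here)."""
    total = 0.0
    for t, (o, s) in enumerate(zip(frames, states)):
        weights, means, vars_ = gmms[s]
        total += log_gmm(o, weights, means, vars_)
        nxt = states[t + 1] if t + 1 < len(states) else "exit"
        total += log_trans[(s, nxt)]
    return total
```

With a single one-dimensional Gaussian component of mean 0 and variance 1, one zero-valued frame, and a free exit transition, the result reduces to the standard normal log-density at 0.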
3. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 1, characterized in that the goodness of fit in step (3) between each phoneme and its corresponding feature vector or feature vector group is measured using Pr(p_i|O_i); said Pr(p_i|O_i) is the posterior probability that the feature vector group O_i belongs to its corresponding phoneme p_i, calculated from the likelihood probability P(O_i|p_i) and the prior probability Pr(p_i) by the Bayesian formula, and used as the second phoneme quality score evaluating the user's pronunciation quality.
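The posterior of claim 3 follows from Bayes' rule once a set of competing phonemes is fixed; normalizing over all phonemes of the language is one common choice, assumed here purely for illustration (the function name is also an assumption):

```python
import math

def phoneme_posterior(log_likelihoods, priors, target):
    """Pr(target | O) = P(O|target) Pr(target) / sum_p P(O|p) Pr(p),
    computed in the log domain for numerical stability.

    log_likelihoods: {phoneme: ln P(O | phoneme)}
    priors:          {phoneme: Pr(phoneme)}
    """
    scores = {p: ll + math.log(priors[p]) for p, ll in log_likelihoods.items()}
    mx = max(scores.values())
    # log of the Bayes denominator via log-sum-exp
    denom = mx + math.log(sum(math.exp(s - mx) for s in scores.values()))
    return math.exp(scores[target] - denom)
```

With equal priors, the posterior is simply each likelihood divided by the sum of the competing likelihoods.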
4. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 2 or 3, characterized in that the following formula is used to calculate the comprehensive phoneme-level score of the user reading aloud one pronunciation unit:
where υ_{1,i} is the first phoneme quality score of the i-th phoneme in the user speech, υ_{2,i} is the second phoneme quality score of the i-th phoneme in the user speech, and ω_{1,i} and ω_{2,i} are the corresponding weights.
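The formula referenced in claim 4 is rendered as an image in the source and is not reproduced here. Purely for illustration, and assuming the combination is a per-phoneme weighted sum averaged over the unit (an assumption, not the claimed formula), the computation could look like:

```python
def combined_score(v1, v2, w1, w2):
    """Hypothetical comprehensive phoneme-level score for one pronunciation
    unit: the average over phonemes of w1_i * v1_i + w2_i * v2_i.
    (The exact formula of claim 4 is an image in the source; this weighted
    sum is an assumed form, not the patent's.)"""
    assert len(v1) == len(v2) == len(w1) == len(w2)
    return sum(a * x + b * y for x, y, a, b in zip(v1, v2, w1, w2)) / len(v1)
```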
5. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 4, characterized in that
the weights ω_{1,i} and ω_{2,i} are set manually;
or the weights ω_{1,i} and ω_{2,i} are obtained by machine learning: a plurality of texts corresponding to different pronunciation units are chosen and read aloud by different users; experts evaluate the quality of each user's reading and manually give the corresponding comprehensive phoneme-level score; the optimal weight sequence is then obtained by a machine-learning method.
6. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-5, characterized in that in step (1) Mel-frequency cepstral coefficients and their first- and second-order differences, or linear prediction cepstral coefficients and their first- and second-order differences, are used as the feature vectors.
7. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-5, characterized in that the weighted FST Q is generated at the server end and is either stored directly on the terminal or downloaded to the terminal over the network.
8. The phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to claim 7, characterized in that the acoustic model H, the pronunciation dictionary model L and the context-dependent phoneme model C are all generated at the server end.
9. The low-power-consumption spoken language evaluation method according to any one of claims 1-5, characterized by further comprising a step of evaluating the fluency of the user speech: the user's fluency is evaluated by counting the pause sounds between non-pause phonemes in the user speech; the more entries with ns_i > 0 there are, the more pauses there are and the poorer the fluency.
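The fluency measure of claim 9 reduces to counting the positive internal pause entries of β. A hypothetical sketch, assuming β is stored as a flat list [ns_0, n_1, ns_1, ..., n_M, ns_M] (the function name and layout are illustrative assumptions):

```python
def fluency_pause_count(beta):
    """Count the internal pauses in the count vector
    beta = [ns_0, n_1, ns_1, ..., n_M, ns_M]: the entries ns_1 .. ns_{M-1}
    (pauses between non-pause phonemes) that are greater than zero.
    More such pauses indicates poorer fluency."""
    internal_pauses = beta[2:-1:2]  # picks out ns_1 .. ns_{M-1}
    return sum(1 for ns in internal_pauses if ns > 0)
```

For β = [2, 3, 1, 2, 0, 4, 1] (M = 3), the internal pauses are ns_1 = 1 and ns_2 = 0, giving a count of 1.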
10. A phoneme-level low-power-consumption spoken language defect diagnosis method without standard pronunciation, characterized in that the user speech is first processed using the phoneme-level low-power-consumption spoken language evaluation method without standard pronunciation according to any one of claims 1-9, and a step of determining the phonemes for which the user has pronunciation defects is then performed: according to the sequences of first phoneme quality scores or second phoneme quality scores obtained from the user reading aloud a plurality of pronunciation units, the phoneme with the user's lowest score, or the several phonemes with lower scores, are identified as the phonemes with pronunciation defects.
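Claim 10's selection of defective phonemes can be sketched as ranking phonemes by score across many pronunciation units. Averaging repeated observations of a phoneme and taking a fixed cutoff k are illustrative assumptions not specified by the claim:

```python
from collections import defaultdict

def defective_phonemes(scored_phonemes, k=3):
    """From (phoneme, score) pairs gathered across many pronunciation units,
    average each phoneme's scores and return the k lowest-scoring phonemes
    as likely pronunciation defects. The averaging and the cutoff k are
    assumptions made for this sketch, not part of claim 10."""
    scores_by_phoneme = defaultdict(list)
    for phoneme, score in scored_phonemes:
        scores_by_phoneme[phoneme].append(score)
    means = {p: sum(s) / len(s) for p, s in scores_by_phoneme.items()}
    return sorted(means, key=means.get)[:k]
```

The returned phonemes can then drive claim 11's selection of practice material containing those phonemes.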
11. The phoneme-level low-power-consumption spoken language defect diagnosis method without standard pronunciation according to claim 10, characterized in that pronunciation units containing the phonemes with said pronunciation defects are selected from a database for the user to practise.
CN201410229186.8A 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation Pending CN103985391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410229186.8A CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201410151506.2 2014-04-16
CN201410151506 2014-04-16
CN201410229186.8A CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Publications (1)

Publication Number Publication Date
CN103985391A true CN103985391A (en) 2014-08-13

Family

ID=51277334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410229186.8A Pending CN103985391A (en) 2014-04-16 2014-05-28 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation

Country Status (1)

Country Link
CN (1) CN103985391A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN107077863A (en) * 2014-08-15 2017-08-18 智能-枢纽私人有限公司 Method and system for the auxiliary improvement user speech in appointed language
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN109300345A (en) * 2018-11-20 2019-02-01 深圳市神经科学研究院 A kind of shorthand nomenclature training method and device
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN110010123A (en) * 2018-01-16 2019-07-12 上海异构网络科技有限公司 English phonetic word pronunciation learning evaluation system and method
CN110047466A (en) * 2019-04-16 2019-07-23 深圳市数字星河科技有限公司 A kind of method of open creation massage voice reading standard reference model
CN110111779A (en) * 2018-01-29 2019-08-09 阿里巴巴集团控股有限公司 Syntactic model generation method and device, audio recognition method and device
CN110136747A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112908361A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation system based on small granularity
CN112906369A (en) * 2021-02-19 2021-06-04 脸萌有限公司 Lyric file generation method and device
CN113421587A (en) * 2021-06-02 2021-09-21 网易有道信息技术(北京)有限公司 Voice evaluation method and device, computing equipment and storage medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008040035A (en) * 2006-08-04 2008-02-21 Advanced Telecommunication Research Institute International Pronunciation evaluation apparatus and program
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101887725A (en) * 2010-04-30 2010-11-17 中国科学院声学研究所 Phoneme confusion network-based phoneme posterior probability calculation method
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008040035A (en) * 2006-08-04 2008-02-21 Advanced Telecommunication Research Institute International Pronunciation evaluation apparatus and program
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101645271B (en) * 2008-12-23 2011-12-07 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101887725A (en) * 2010-04-30 2010-11-17 中国科学院声学研究所 Phoneme confusion network-based phoneme posterior probability calculation method
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiao Ji: "Construction of a Speech Recognition System Based on Finite-State Graphs", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077863A (en) * 2014-08-15 2017-08-18 智能-枢纽私人有限公司 Method and system for the auxiliary improvement user speech in appointed language
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107644638B (en) * 2017-10-17 2019-01-04 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer readable storage medium
CN110010123A (en) * 2018-01-16 2019-07-12 上海异构网络科技有限公司 English phonetic word pronunciation learning evaluation system and method
CN110111779B (en) * 2018-01-29 2023-12-26 阿里巴巴集团控股有限公司 Grammar model generation method and device and voice recognition method and device
CN110111779A (en) * 2018-01-29 2019-08-09 阿里巴巴集团控股有限公司 Syntactic model generation method and device, audio recognition method and device
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN110148427B (en) * 2018-08-22 2024-04-19 腾讯数码(天津)有限公司 Audio processing method, device, system, storage medium, terminal and server
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN109300345A (en) * 2018-11-20 2019-02-01 深圳市神经科学研究院 A kind of shorthand nomenclature training method and device
CN110047466B (en) * 2019-04-16 2021-04-13 深圳市数字星河科技有限公司 Method for openly creating voice reading standard reference model
CN110047466A (en) * 2019-04-16 2019-07-23 深圳市数字星河科技有限公司 A kind of method of open creation massage voice reading standard reference model
CN110136747A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
CN112466335B (en) * 2020-11-04 2023-09-29 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112397056B (en) * 2021-01-20 2021-04-09 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112397056A (en) * 2021-01-20 2021-02-23 北京世纪好未来教育科技有限公司 Voice evaluation method and computer storage medium
CN112908361A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation system based on small granularity
CN112906369A (en) * 2021-02-19 2021-06-04 脸萌有限公司 Lyric file generation method and device
CN113421587A (en) * 2021-06-02 2021-09-21 网易有道信息技术(北京)有限公司 Voice evaluation method and device, computing equipment and storage medium
CN113421587B (en) * 2021-06-02 2023-10-13 网易有道信息技术(北京)有限公司 Voice evaluation method, device, computing equipment and storage medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103985391A (en) Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN103985392A (en) Phoneme-level low-power consumption spoken language assessment and defect diagnosis method
US7840404B2 (en) Method and system for using automatic generation of speech features to provide diagnostic feedback
US7392187B2 (en) Method and system for the automatic generation of speech features for scoring high entropy speech
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
US20110123965A1 (en) Speech Processing and Learning
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
JPH10222190A (en) Sounding measuring device and method
Cheng Automatic assessment of prosody in high-stakes English tests.
CN103594087A (en) Method and system for improving oral evaluation performance
Cucchiarini et al. Automatic speech recognition for second language pronunciation training
CN109326281A (en) Prosodic labeling method, apparatus and equipment
Duan et al. A Preliminary study on ASR-based detection of Chinese mispronunciation by Japanese learners
CN112802456A (en) Voice evaluation scoring method and device, electronic equipment and storage medium
CN109697975B (en) Voice evaluation method and device
Peabody et al. Towards automatic tone correction in non-native mandarin
Zechner et al. Automatic scoring of children’s read-aloud text passages and word lists
US8768697B2 (en) Method for measuring speech characteristics
Nakagawa et al. A statistical method of evaluating pronunciation proficiency for English words spoken by Japanese
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Bang et al. An automatic feedback system for English speaking integrating pronunciation and prosody assessments
Li et al. English sentence pronunciation evaluation using rhythm and intonation
CN111508523A (en) Voice training prompting method and system
Li et al. Speech interaction of educational robot based on Ekho and Sphinx
Bao et al. An Auxiliary Teaching System for Spoken English Based on Speech Recognition Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140813

WD01 Invention patent application deemed withdrawn after publication