CN1763843A - Pronunciation quality evaluating method for language learning machine - Google Patents

Pronunciation quality evaluating method for language learning machine

Info

Publication number
CN1763843A
Authority
CN
China
Prior art keywords
model
voice
pronunciation
received pronunciation
phonetic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005101148488A
Other languages
Chinese (zh)
Other versions
CN100411011C (en)
Inventor
梁维谦
董明
丁玉国
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2005101148488A priority Critical patent/CN100411011C/en
Publication of CN1763843A publication Critical patent/CN1763843A/en
Application granted granted Critical
Publication of CN100411011C publication Critical patent/CN100411011C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a pronunciation quality evaluation method for a language learning machine, in the field of computer-assisted language learning and speech technology. The method comprises: extracting speech features for training; training a standard pronunciation model; generating a standard pronunciation network; detecting speech endpoints; extracting speech features for evaluation; searching for the optimal path; and calculating a pronunciation quality score. The method gives objective and stable evaluations, and supports an embedded English learning system offering interactive human-machine teaching and spoken-English self-assessment.

Description

Pronunciation quality evaluation method for a language learning machine
Technical field
The invention belongs to the field of computer-assisted language learning and speech technology, and in particular relates to a pronunciation quality evaluation method implemented on a 16-bit or wider digital signal processing chip.
Background technology
In recent years, embedded language learning products have developed rapidly at home and abroad. The earliest were language repeaters: add-on devices for analog tape recorders that digitized and stored a short segment of speech so it could be played back repeatedly, helping the learner listen, repeat, and memorize. The mainstream products today are second-generation devices built around digital signal processing (DSP) chips. Their hardware typically comprises a microcontroller (Micro Control Unit, MCU), a DSP chip, a codec (CODEC), ROM, SRAM, flash memory, a universal serial bus (USB) interface, a keyboard, and a liquid crystal display (LCD). The MCU serves as the main control chip, executing device drivers and system control programs such as the program scheduler, while the DSP runs the application algorithms. The applications include basic modules such as recording, playback, and speech-rate adjustment; some products also include an MP3 module. Functionally these machines offer repeat playback, read-along, read-along comparison, synchronized text display, content retrieval, and rate-adjustable playback, and most can download and update learning material over the Internet. The Haojixing (好记星) digital English learning machine from Shenzhen is a typical representative of the second generation.
The key to learning a language, and spoken language in particular, is interaction: the teacher gives timely, targeted feedback and guidance during the learning process. In traditional teacher-centered language learning this cannot be achieved for every learner because qualified teachers are in short supply, and existing language learning machines lack the ability to evaluate the learner's pronunciation.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a pronunciation quality evaluation method for a language learning machine: a method that achieves high-performance pronunciation quality evaluation, independent of text and speaker, on an embedded language learning machine, with moderate complexity, high evaluation accuracy, and good robustness. In particular, its evaluation accuracy for speakers with a Chinese accent reaches or even exceeds the current international state of the art.
The pronunciation quality evaluation method for a language learning machine proposed by the invention comprises the following parts: speech feature extraction for training, standard pronunciation model training, standard pronunciation network generation, speech endpoint detection, speech feature extraction for evaluation, optimal path search, and pronunciation quality score calculation. It is characterized in that the implementation of each part comprises the following steps:
A. Speech feature extraction for training:
(1) Build in advance a training database containing a large amount of read speech;
(2) Apply pre-emphasis, framing, and windowing to the digital speech in each file of the training database, obtaining quasi-stationary speech frames;
(3) Extract speech features from the speech frames; the features are cepstral coefficients.
B. Standard pronunciation model training:
(1) Use the speech features of step A to train a phoneme-based standard pronunciation model;
(2) Adapt the standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, optimizing its evaluation performance for Chinese speakers.
C. Standard pronunciation network generation:
Segment the given text into words, look each word up in a pronunciation dictionary to obtain its phoneme transcription, and use the phoneme-based standard pronunciation model to build a linear standard pronunciation network whose nodes are HMM states.
D. Speech endpoint detection:
(1) Convert the analog speech signal to digital speech by A/D conversion;
(2) Apply pre-emphasis, framing, and windowing to the digital speech, obtaining quasi-stationary speech frames;
(3) Compute the time-domain log energy of the speech frames;
(4) Apply a moving-average filter to the time-domain log energy to obtain the feature used for endpoint detection (hereafter the endpoint feature);
(5) Apply upper/lower dual thresholds combined with a finite state machine to the endpoint feature to detect the start and end points of the speech.
E. Speech feature extraction for evaluation:
Extract speech features from the speech frames of step D, using the same process as step A(3).
F. Optimal path search:
(1) Force-align the speech features of step E with the standard pronunciation network of step C, obtaining all possible path information in the network;
(2) Using this path information, trace back from the permitted terminal node of the network to obtain the optimal path.
G. Pronunciation quality score calculation:
(1) Use the optimal path information from step F to compute a confidence score for each frame of speech features;
(2) Use the optimal path information from step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score;
(3) Map the sentence-level confidence score into the subjective scoring range with a mapping function to obtain the final pronunciation quality score.
The cepstral coefficients in step A may be Mel-frequency cepstral coefficients (MFCC, Mel-Frequency Cepstral Coefficients), which exploit the frequency resolution characteristics of the human ear.
The standard pronunciation model of step B(1) is a phoneme-based hidden Markov model (HMM, Hidden Markov Model). Its training process is: initialize a single Gaussian model from all the speech features, copy this model to create all the phoneme models, and train the models repeatedly with the Baum-Welch algorithm; then progressively increase the number of Gaussian components of each phoneme model, performing further Baum-Welch training after each increase.
The accent adaptation of the standard pronunciation model to Chinese speakers in step B(2) is implemented by applying maximum likelihood linear regression (MLLR, Maximum Likelihood Linear Regression) and maximum a posteriori (MAP, Maximum A Posteriori) accent adaptation to the trained standard pronunciation model, yielding the final standard pronunciation model.
The standard pronunciation network of step C may be a linear network with definite start and end nodes whose nodes are HMM states, without grammar, in which each node depends only on its predecessor node.
The optimal path search of step F uses frame-synchronous Viterbi beam search.
To realize the pronunciation quality evaluation method of the invention within the limited memory of an embedded system, steps D, E, F, and G are performed in segments of a predefined fixed number of frames. This greatly reduces the demand on system resources and lets the embedded learning system handle long utterances.
The pronunciation quality evaluation method of the invention gives the language learning machine an interactive capability. An embedded English learning system implemented with this method has achieved good performance in practical use.
The invention has the following features:
(1) High evaluation accuracy, good robustness, and low system resource overhead;
(2) The phoneme-based standard pronunciation model lets the embedded learning system change courseware content easily, without retraining;
(3) The influence of the mother-tongue accent is taken into account: the phoneme models are accent-adapted to Chinese-accented English pronunciation;
(4) Real-time endpoint detection using a moving-average filter and a finite state machine improves the accuracy and robustness of English speech endpoint detection;
(5) The method suits DSP-based embedded language learning systems, which are small, lightweight, low-power, and low-cost;
(6) Combined with rich courseware, the pronunciation quality evaluation method of the invention can change the traditional learning machine mode of operation and the classroom teaching pattern.
Description of drawings
Fig. 1 is an overall flow diagram of the method of the embodiment of the invention.
Fig. 2 is a flow diagram of standard pronunciation model training in the embodiment; Fig. 2(a) shows the overall training process, and Fig. 2(b) the training of one specific hidden Markov model.
Fig. 3 shows the topology of the standard pronunciation model of the embodiment; Fig. 3(a) shows the pause model, and Fig. 3(b) the phoneme and silence models.
Fig. 4 is a flow diagram of hidden Markov model accent adaptation in the embodiment.
Fig. 5 is a schematic diagram of the standard pronunciation network topology of the embodiment; Fig. 5(a) shows the sentence as a linear network of word nodes, and Fig. 5(b) each word as a linear network of phoneme nodes.
Fig. 6 is a schematic diagram of the generation process of the standard pronunciation network of the embodiment.
Fig. 7 is a detailed flow chart of the pronunciation quality evaluation method of the embodiment as realized on an embedded platform.
Embodiment
An embodiment of the pronunciation quality evaluation method for a language learning machine proposed by the invention is described in detail below with reference to the figures:
As shown in Fig. 1, the overall flow of the embodiment is divided into: A, speech feature extraction for training; B, standard pronunciation model training; C, standard pronunciation network generation (these steps can be completed in advance on a computer); D, speech endpoint detection; E, speech feature extraction for evaluation; F, optimal path search; G, pronunciation quality score calculation and output (these steps are performed on the embedded platform). Each step is described in detail below.
A. Speech feature extraction for training:
(1) Build in advance a training database containing a large amount of read English speech (the content must give each phoneme a degree of coverage);
(2) Apply pre-emphasis to the digital speech in each file of the training database, with the pre-emphasis filter H(z) = 1 - 0.9375 z^{-1}; then frame and window the pre-emphasized speech (using a Hamming window), with a frame length of 32 ms and a frame shift of 16 ms, obtaining quasi-stationary speech frames;
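As a minimal sketch of step (2), the pre-emphasis filter and the 32 ms / 16 ms Hamming-windowed framing can be written in Python (NumPy); the 8 kHz sampling rate matches the endpoint-detection section, and the rest follows the parameters given above:

```python
import numpy as np

def preemphasize(x, a=0.9375):
    """Apply H(z) = 1 - a*z^-1, i.e. y[n] = x[n] - a*x[n-1]."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

def frame_signal(x, fs=8000, frame_ms=32, shift_ms=16):
    """Split into overlapping Hamming-windowed frames (32 ms / 16 ms shift)."""
    flen = int(fs * frame_ms / 1000)      # 256 samples at 8 kHz
    fshift = int(fs * shift_ms / 1000)    # 128 samples
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    win = np.hamming(flen)
    return np.stack([x[i * fshift: i * fshift + flen] * win
                     for i in range(n_frames)])
```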
(3) Extract Mel-frequency cepstral coefficients (MFCC) from the speech frames as the speech features. The short-time frequency-domain characteristics of speech accurately describe its variation; MFCC is a feature vector computed according to the frequency resolution of the human auditory system, built on Fourier spectrum analysis. MFCC is computed as follows: first apply a fast Fourier transform (FFT, Fast Fourier Transformation) to each speech frame to obtain its short-time spectrum; next partition the short-time spectrum into several bandpass groups spaced on the Mel scale, each with a triangular frequency response; then compute the signal energy in each filter of the bank; finally compute the cepstral coefficients with a discrete cosine transform;
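Assuming a 256-point FFT (one 32 ms frame at 8 kHz), an illustrative 24-filter bank, 12 cepstra, and the standard mel formula — none of which the patent states explicitly — the FFT → mel filterbank → log → DCT pipeline can be sketched as:

```python
import numpy as np

def mel(f):
    # Standard mel mapping (assumed; the patent only cites the ear's
    # frequency resolution, not a formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filt=24, n_fft=256, fs=8000):
    """Triangular bandpass filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frames, n_ceps=12, n_filt=24):
    """Per frame: FFT -> mel filterbank energies -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frames, n=256, axis=1)) ** 2
    fb = mel_filterbank(n_filt=n_filt)
    log_e = np.log(np.maximum(spec @ fb.T, 1e-10))
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filt)
    return log_e @ dct.T
```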
The MFCC features mainly reflect the static characteristics of the speech; the dynamic characteristics of the speech signal can be described by the first- and second-order differences of the static feature trajectories. The complete speech feature vector consists of the MFCC parameters, their first- and second-order difference coefficients, and the normalized energy coefficient with its first- and second-order differences, giving 39 dimensions per frame;
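The 39-dimensional vector (13 static dimensions plus first- and second-order differences) can be assembled as below; the regression-style delta formula is a common choice, not one the patent specifies:

```python
import numpy as np

def deltas(feat, w=2):
    # Regression first-order differences over a +/-w frame window
    # (standard formula; the patent only asks for first/second differences)
    T = feat.shape[0]
    padded = np.pad(feat, ((w, w), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    return sum(k * (padded[w + k:w + k + T] - padded[w - k:w - k + T])
               for k in range(1, w + 1)) / denom

def full_feature(mfcc12, log_energy):
    """12 MFCC + normalized energy = 13 static dims; deltas and
    double-deltas bring the total to 39 per frame."""
    static = np.hstack([mfcc12, log_energy[:, None]])
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```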
B. Standard pronunciation model training:
(1) The process of training the phoneme-based standard pronunciation model from the speech features of step A, shown in Fig. 2, is:
a. Build a prototype multivariate Gaussian distribution with a diagonal covariance matrix whose dimensionality matches the single-stream speech features, and estimate its mean vector and covariance matrix from all the speech data.
b. Fix the pronunciation dictionary and phonetic symbol set, and produce phoneme-level transcriptions for all the speech. The phonetic symbol set of this embodiment comprises 40 phonemes plus 1 silence symbol and 1 pause symbol.
c. This embodiment adopts phoneme-based hidden Markov models (HMM) as the standard pronunciation model; the HMM is the statistical speech recognition model most widely adopted today, and its left-to-right state transition structure describes the characteristics of speech production well. The phoneme and silence models of the invention are 3-state HMMs, and the pause model is a single-state HMM that can be skipped; their topologies are shown in Fig. 3, where q_i denotes an HMM state, a_{ij} an HMM transition probability, and b_j(O_t) the multi-stream Gaussian-mixture output density of HMM state j, given by formula (1):
b_j(O_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, N(O_{st}; \mu_{jsm}, \Phi_{jsm}) \right]^{\gamma_s}    (1)
where S is the number of data streams, M_s is the number of Gaussian mixture components in stream s, and N is the multivariate Gaussian distribution of formula (2):
N(o; \mu, \Phi) = \frac{1}{\sqrt{(2\pi)^n |\Phi|}} \, e^{-\frac{1}{2}(o-\mu)^T \Phi^{-1} (o-\mu)}    (2)
The standard pronunciation model of this embodiment comprises 40 phoneme HMMs, one silence HMM, and one pause HMM. The Gaussian prototype is copied into each HMM, and each HMM is then re-estimated several times with the Baum-Welch algorithm; the number of re-estimations may be 5;
d. Progressively increase the number of Gaussian components in each HMM and retrain the resulting models with Baum-Welch. The number of Gaussian components is increased through 2, 4, 6, and 8; when it reaches 8, training is repeated 10 times and the training process ends.
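The patent does not say how the Gaussian components are added; a common scheme (HTK-style "mix-up", assumed here) splits the heaviest component by perturbing its mean by ±0.2 standard deviations:

```python
import numpy as np

def mixup(weights, means, variances, target):
    """Grow a diagonal-covariance GMM to `target` components by repeatedly
    splitting the heaviest one (an assumed HTK-style scheme; the patent
    does not specify its incrementing method)."""
    weights, means, variances = list(weights), list(means), list(variances)
    while len(weights) < target:
        i = int(np.argmax(weights))
        offset = 0.2 * np.sqrt(variances[i])
        w = weights[i] / 2.0
        weights[i] = w
        means.append(means[i] - offset)     # one copy shifted down
        means[i] = means[i] + offset        # one copy shifted up
        weights.append(w)
        variances.append(variances[i].copy())
    return np.array(weights), np.array(means), np.array(variances)
```

After each split, the grown mixture would be re-estimated with Baum-Welch before splitting again, matching the 2, 4, 6, 8 schedule above.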
(2) For the accent adaptation of the standard pronunciation model to Chinese speakers, the embodiment adopts global MLLR followed by MAP adaptation in series, with the number of adaptation iterations set to 4; the flow is shown in Fig. 4.
a. MLLR is an adaptation algorithm based on model transformation. Its basic assumption is that phones that are close in the speaker-independent model space remain close in the adapted speaker's space, so the transformation between the two spaces can be estimated from the phones present in the adaptation speech and then applied to map the models of unseen phones from the speaker-independent space into the adapted speaker's space, thereby achieving adaptation. The model space is divided into R classes according to some metric (such as Euclidean distance or likelihood); the transform of class r is T_r(\cdot), its adaptation speech set is X_r, r = 1, 2, \ldots, R, and its model parameters are \lambda_r, r = 1, 2, \ldots, R. Adaptive training then satisfies:
T_r = \arg\max_T P(X_r \mid T(\lambda_r)), \quad r = 1, 2, \ldots, R    (3)
The adapted parameters \hat\lambda_r, r = 1, 2, \ldots, R satisfy

\hat\lambda_r = T_r(\lambda_r), \quad r = 1, 2, \ldots, R    (4)
Because this class of algorithm fully exploits the relationships among phones, several models share one transform and the parameters to be estimated are just the coefficients of each transform; enough data to estimate them accumulates easily, so the adaptation takes effect even with little adaptation data and is correspondingly fast. The embodiment adopts unclassified global MLLR adaptation.
b. The basic criterion of the MAP algorithm is maximization of the posterior probability, so it is theoretically optimal:
\hat\theta_i = \arg\max_{\theta_i} P(\theta_i \mid x)    (5)
The mean-vector estimation formula of the standard MAP algorithm is:
\hat\mu = \frac{\sum_{t=1}^{T} L_t}{\sum_{t=1}^{T} L_t + \tau} \, \bar\mu + \frac{\tau}{\sum_{t=1}^{T} L_t + \tau} \, \mu    (6)
where L_t is the occupation probability of the observation vector at time t for this Gaussian mixture component, \tau is the prior weight of the adaptation speech data, \bar\mu is the mean vector of the adaptation speech, and \mu is the mean vector of the speaker-independent model. It can be seen that when the adaptation data is abundant, the adapted mean vector \hat\mu tends toward the speaker-dependent mean. The embodiment applies MAP adaptation after MLLR adaptation in order to exploit the adaptation speech data fully and further improve the accent adaptation effect.
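Formula (6) is a direct interpolation and can be checked numerically; below, mu_si, mu_adapt, occ_sum, and tau name \mu, \bar\mu, \sum_t L_t, and \tau respectively:

```python
import numpy as np

def map_update_mean(mu_si, mu_adapt, occ_sum, tau=10.0):
    """Formula (6): interpolate the speaker-independent mean toward the
    adaptation-data mean as the occupation count occ_sum grows; tau
    weights the prior (tau=10 is an illustrative value)."""
    a = occ_sum / (occ_sum + tau)
    return a * mu_adapt + (tau / (occ_sum + tau)) * mu_si
```

With no adaptation data the prior mean is returned unchanged; with abundant data the estimate converges to the adaptation mean, as the text observes.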
The final standard pronunciation model is stored in the external memory of the embedded system.
C. Standard pronunciation network generation:
The standard pronunciation network of the embodiment is shown in Fig. 5, where (a) is a linear network of word nodes whose start node is the initial "sil" and whose end node is the final "sil", and (b) shows each word as a linear network of phoneme nodes; each phoneme is internally a network of state nodes as in Fig. 3. The network generation process is shown in Fig. 6: first the original text is segmented into words, giving Fig. 5(a); then each word is looked up in the pronunciation dictionary, giving Fig. 5(b). To handle words with multiple pronunciations, and to save storage space and improve search efficiency, the embodiment aligns the phoneme strings of a word's alternative pronunciations by dynamic programming and merges the multiple phoneme sequences into a single phoneme-node network in which identical phonemes are shared across pronunciations. Finally the network is expanded into a state-node network using the phoneme HMMs; each state node records its state identifier, phoneme identifier, word identifier, and the number and identifiers of its predecessor nodes. This yields the standard pronunciation network of the embodiment: a network of HMM state nodes with definite start node P and end node T, without grammar, in which each node depends only on its predecessor nodes.
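Ignoring multiple pronunciations and state-level expansion, the word-segmentation and dictionary-lookup stage can be sketched as follows; the lexicon entries are invented examples, not the patent's 40-phoneme dictionary:

```python
def build_linear_network(text, lexicon):
    """Step C sketch: segment text into words, look up each word's
    phonemes, and chain them with "sil" at both ends into a linear
    node list plus predecessor arcs (no multi-pronunciation merging)."""
    nodes = ["sil"]
    for word in text.lower().split():
        nodes.extend(lexicon[word])
    nodes.append("sil")
    arcs = [(i - 1, i) for i in range(1, len(nodes))]  # each node's predecessor
    return nodes, arcs

# Invented example entries for illustration only:
lexicon = {"good": ["g", "uh", "d"],
           "morning": ["m", "ao", "r", "n", "ih", "ng"]}
```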
The standard pronunciation network is likewise stored in the external memory of the embedded system.
D. Speech endpoint detection:
(1) The speech signal is first low-pass filtered, then sampled and quantized by a 16-bit linear A/D converter into digital speech; the sampling frequency is 8 kHz;
(2) Apply pre-emphasis, framing, and windowing to the digital speech, obtaining quasi-stationary speech frames; the method is the same as step A(2);
(3) Compute the short-time log energy of the speech frames;
(4) Apply a moving-average filter to the time-domain log energy to obtain the endpoint feature. Endpoint detection runs in real time, and a real-time endpoint detection method must: a. give consistent output across different background noise levels; b. detect both start and end points; c. have short delay; d. have a bounded response interval; e. maximize the signal-to-noise ratio at the endpoints; f. locate endpoints accurately; g. suppress detection errors as far as possible. An objective function defined from these requirements turns out to be very similar to the edge detection functions (moving-average filters) commonly used in image processing. The moving-average filter is given by formula (7), where g(\cdot) is the time-domain log energy, t is the current frame index, and h(\cdot) is the moving-average filter of formula (8); h(\cdot) is an odd-symmetric function, W may be taken as 13, and f(\cdot) is given by formula (9) with parameters A = 0.2208, s = 0.5383, [K_1, \ldots, K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56].
F(t) = \sum_{i=-W}^{W} h(i) \, g(t+i)    (7)
h(i) = \begin{cases} -f(-i) & -W \le i < 0 \\ f(i) & 0 \le i \le W \end{cases}    (8)
f(x) = e^{Ax}[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx}    (9)
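Formulas (7)-(9) with the constants quoted above can be transcribed directly; note that h(0) = K_2 + K_4 + K_5 + K_6 = 0, so together with the odd symmetry the filter gives (near-)zero response to constant log energy:

```python
import math

A, S_EXP = 0.2208, 0.5383                     # A and s from the text
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]
W = 13

def f(x):
    """Formula (9)."""
    return (math.exp(A * x) * (K[0] * math.sin(A * x) + K[1] * math.cos(A * x))
            + math.exp(-A * x) * (K[2] * math.sin(A * x) + K[3] * math.cos(A * x))
            + K[4] + K[5] * math.exp(S_EXP * x))

def h(i):
    """Formula (8): odd-symmetric filter taps."""
    return -f(-i) if i < 0 else f(i)

def edge_feature(log_energy, t):
    """Formula (7): filter the log-energy contour around frame t."""
    return sum(h(i) * log_energy[t + i] for i in range(-W, W + 1))
```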
(5) Apply upper/lower dual thresholds combined with a finite state machine to the endpoint feature to find the start and end points of the speech. The endpoint feature F(t) is positive at the start of speech, negative at the end of speech, and close to zero in silence. According to preset upper and lower thresholds and a minimum speech duration, each frame triggers transitions among the speech, silence, and leaving-speech states. The machine starts in the silence state; when F(t) reaches the upper threshold it outputs the start point of the speech and enters the speech state. In the speech state, when F(t) reaches the lower threshold the machine enters the leaving-speech state. When the time spent in the leaving-speech state reaches a preset threshold, the machine outputs the end point of the speech, closes the recording channel, and endpoint detection ends.
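A minimal version of the dual-threshold state machine might look like this; the threshold and hangover values are placeholders, since the patent leaves them as presets:

```python
SILENCE, SPEECH, LEAVING = 0, 1, 2

def detect_endpoints(F, upper=5.0, lower=-2.0, hangover=5):
    """Dual-threshold FSM over the endpoint feature F(t): rising past
    `upper` marks the start, falling past `lower` enters leaving-speech,
    and `hangover` frames of leaving-speech close the utterance.
    Thresholds here are illustrative, not the patent's presets."""
    state, start, end, leave_count = SILENCE, None, None, 0
    for t, v in enumerate(F):
        if state == SILENCE and v >= upper:
            state, start = SPEECH, t            # start point found
        elif state == SPEECH and v <= lower:
            state, leave_count = LEAVING, 1
        elif state == LEAVING:
            if v >= upper:                      # speech resumed
                state = SPEECH
            else:
                leave_count += 1
                if leave_count >= hangover:     # end point found
                    end = t
                    break
    if start is not None and end is None:
        end = len(F) - 1
    return start, end
```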
E. Speech feature extraction for evaluation:
Extract speech features from the speech frames of step D, using the same process as step A(3).
F. Optimal path search:
(1) Force-align the speech features of step E with the standard pronunciation network of step C to obtain all possible path information in the network. Because the standard pronunciation network of the embodiment is a left-to-right linear network (Fig. 5), the frame-synchronous Viterbi beam search algorithm can be used to find the optimal path. Given the HMM model \Phi and the observation vector sequence O = \{o_1, \ldots, o_T\}, we seek the optimal state sequence S = \{s_1, \ldots, s_T\} producing this observation sequence, i.e.
\hat S = \arg\max_S P(S, O \mid \Phi)    (10)
In the Viterbi algorithm, the likelihood of the optimal path ending in state i at time t is defined as

V_i(t) = P(o_1, \ldots, o_t, s_1, \ldots, s_{t-1}, s_t = i \mid \Phi)    (11)
In a linear network the optimal path at any time depends only on the current frame and the information of the previous frame; that is, it satisfies the no-aftereffect (Markov) property. Therefore, if the globally optimal path passes through node i at time t, then its portion between times 0 and t must be optimal among all paths ending at node i at time t. If we only want the optimal path, it suffices at each time t to keep just one path ending at each node i.
Based on this principle, the search algorithm of the embodiment is:
Definitions: PreNode(i) is the set of predecessor nodes of node i. BestPre(t, i) is the optimal predecessor of node i at time t. L(t, i) is the likelihood score of node i for the speech frame at time t. L_Path(-1, i) and L_Path(0, i) are the optimal-path likelihood scores of paths ending at node i for the previous frame and the current frame, respectively.
Step 1: at time t = 0,

L\_Path(-1, i) = \begin{cases} L(0, i) & i \in Entry \\ 0 & i \notin Entry \end{cases}    (12)

where i \in Entry means that i is a start node.
Step 2: at time t, obtain the current-frame likelihood score L(t, i) for every node i; the optimal path score for the current frame is then

L\_Path(0, i) = \max_j \big( L\_Path(-1, j) \big) + L(t, i), \quad \forall j \in PreNode(i)    (13)

Record the optimal predecessor in BestPre(t, i), then swap the data of L_Path(-1, i) and L_Path(0, i) in preparation for the next frame.
Step 3: if t < T, go to Step 2; otherwise, finish.
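Steps 1-3 can be sketched for a strictly linear network in which each node's predecessors are itself (a self-loop) and the previous node; this topology and the per-frame score array are illustrative assumptions:

```python
NEG_INF = float("-inf")

def viterbi_linear(log_like, n_nodes):
    """Frame-synchronous Viterbi on a linear network: PreNode(i) = {i-1, i},
    node 0 is the entry, node n_nodes-1 the exit. log_like[t][i] plays the
    role of L(t, i); best_pre records BestPre(t, i) for traceback."""
    T = len(log_like)
    path = [[NEG_INF] * n_nodes for _ in range(T)]
    best_pre = [[0] * n_nodes for _ in range(T)]
    path[0][0] = log_like[0][0]           # Step 1: only the entry node starts
    for t in range(1, T):                 # Step 2: extend every node
        for i in range(n_nodes):
            cands = [(path[t - 1][i], i)]
            if i > 0:
                cands.append((path[t - 1][i - 1], i - 1))
            best, pre = max(cands)
            if best > NEG_INF:
                path[t][i] = best + log_like[t][i]
                best_pre[t][i] = pre
    states = [n_nodes - 1]                # trace back from the exit node
    for t in range(T - 1, 0, -1):
        states.append(best_pre[t][states[-1]])
    states.reverse()
    return states, path[T - 1][n_nodes - 1]
```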
(2) When the speech ends, trace back through BestPre(t, i) from the permitted terminal node of the network to obtain the optimal state path of the forced alignment;
G. Pronunciation quality score calculation:
(1) Use the optimal path information from step F to compute the confidence score of each frame of speech features, as in formula (14), where s_i is the state to which frame j is aligned on the optimal path:

C_j = \log p(O_j \mid s_i) - \log \sum_i p(O_j \mid s_i)    (14)
(2) Use the optimal path information from step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score, as in formula (15), where N is the number of states on the optimal path and js and je are the first and last frame indices aligned to each state:

C = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{je - js} \sum_{j=js}^{je} C_j \right)    (15)
(3) Map the sentence-level confidence score into the subjective scoring range with a mapping function. The directly computed confidence score usually lies in an interval (-\infty, a], where a is a constant, which does not match the subjective scoring range; the embodiment therefore maps it to the subjective score range with the piecewise linear function of formula (16), where a and b are determined experimentally and \alpha is a scaling factor:

S = \begin{cases} \alpha C & a \le C \le b \\ 100 & C > b \\ 0 & C < a \end{cases}    (16)
The resulting score S can further be quantized into pronunciation quality grades of excellent, good, fair, and poor.
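Formula (16) and the grade quantization can be sketched as follows; a, b, \alpha, and the grade boundaries are illustrative placeholders, since the patent determines a and b experimentally:

```python
def map_score(C, a=0.0, b=10.0, alpha=10.0):
    """Formula (16): piecewise-linear map of the sentence confidence C
    into 0-100. The values a=0, b=10, alpha=10 are placeholders chosen
    so the map is continuous; the patent fits a and b by experiment."""
    if C > b:
        return 100.0
    if C < a:
        return 0.0
    return alpha * C

def grade(score):
    """Quantize the 0-100 score into excellent/good/fair/poor bands
    (band edges are assumptions, not given in the patent)."""
    if score >= 85:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 60:
        return "fair"
    return "poor"
```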
Considering the memory constraints, steps D, E, F, and G of the embodiment are executed in segments of a predefined fixed number of frames; each segment may be 40 frames.
Based on the above method, the embodiment has implemented an embedded English learning system built around pronunciation quality evaluation. Learning content can be updated automatically at any time according to teaching requirements. The pronunciation quality evaluation technology enables interactive human-machine learning, greatly reduces the workload of classroom spoken-English teaching, eases the shortage of teachers, and enables autonomous learning and automatic testing of spoken English. The invention can evaluate the English pronunciation quality of standard-Mandarin-speaking Chinese learners; with a 4-grade scale (excellent, good, fair, poor), the method's correlation with subjective evaluation of Chinese speakers' English pronunciation quality reaches 0.74.

Claims (5)

1, a kind of pronunciation quality evaluating method that is used for language learner, comprise that the phonetic feature that is used to train extracts, the Received Pronunciation model training, the generation of Received Pronunciation network, sound end detects, the phonetic feature that is used to estimate extracts, optimum route search, and the calculating each several part of voice quality mark; It is characterized in that the implementation method of each several part specifically may further comprise the steps:
A. Speech feature extraction for training:
(1) build in advance a training database containing a large amount of read-aloud speech;
(2) apply pre-emphasis, framing and windowing to the digital speech in each file of the training database, yielding framed speech that is approximately stationary;
(3) extract speech features from the framed speech, the features being cepstral coefficients;
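Steps A(2)-(3) — pre-emphasis, framing and windowing before cepstral analysis — can be sketched with NumPy. The frame length, frame shift and pre-emphasis coefficient below are common defaults, not values stated in the patent:

```python
import numpy as np

def frame_speech(signal, frame_len=400, frame_shift=160, pre_emph=0.97):
    """Pre-emphasize, split into overlapping frames and apply a Hamming
    window, yielding quasi-stationary framed speech ready for cepstral
    feature extraction."""
    # Pre-emphasis boosts high frequencies: s'[n] = s[n] - k * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```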
B. Standard pronunciation model training:
(1) use the speech features of step A to train a phoneme-based standard pronunciation model;
(2) adapt the standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, optimizing the model's evaluation performance for Chinese speakers;
C. Standard pronunciation network generation:
Segment the given text into words, look up a pronunciation dictionary to obtain the phoneme transcription, and use the phoneme-based standard pronunciation model to obtain a linear standard pronunciation network whose nodes are HMM states;
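The linear network of step C — words expanded to phonemes via dictionary lookup, phonemes expanded to a chain of HMM states — can be sketched as follows. The toy dictionary and the 3-states-per-phoneme topology are illustrative assumptions:

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def build_linear_network(text, states_per_phone=3):
    """Expand a sentence into a linear chain of HMM-state nodes with
    definite start and terminal nodes, as in the standard pronunciation
    network of step C."""
    nodes = ["<start>"]
    for word in text.lower().split():
        for phone in LEXICON[word]:
            nodes.extend(f"{phone}_s{k}" for k in range(states_per_phone))
    nodes.append("<end>")
    return nodes

net = build_linear_network("hello world")
```

Because the network is linear, decoding against it amounts to forced alignment rather than free recognition.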
D. Speech endpoint detection:
(1) convert the analog speech signal into digital speech by A/D conversion;
(2) apply pre-emphasis, framing and windowing to the digital speech, yielding framed speech that is approximately stationary;
(3) compute the time-domain log energy of the framed speech;
(4) smooth the time-domain log energy with a moving-average filter to obtain the endpoint-detection feature;
(5) apply a combination of upper/lower dual thresholds and a finite state machine to the endpoint-detection feature to obtain the start and end points of the speech;
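Steps D(3)-(5) — log energy, moving-average smoothing, and a dual-threshold state machine — could look like the sketch below. The threshold values and the smoothing window length are illustrative assumptions:

```python
import numpy as np

def detect_endpoints(frames, hi=2.0, lo=1.0):
    """Return (start, end) frame indices of speech using smoothed
    time-domain log energy and an upper/lower dual-threshold finite
    state machine: enter speech above hi, leave speech below lo."""
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)       # per-frame log energy
    smooth = np.convolve(log_e, np.ones(5) / 5, mode="same")  # moving average
    state, start, end = "silence", None, None
    for i, e in enumerate(smooth):
        if state == "silence" and e > hi:      # crossed the upper threshold
            state, start = "speech", i
        elif state == "speech" and e < lo:     # fell below the lower threshold
            state, end = "silence", i
    if state == "speech":                      # speech ran to the end
        end = len(smooth)
    return start, end
```

Using two thresholds with hysteresis keeps short energy dips inside a word from prematurely ending the speech segment.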
E. Speech feature extraction for evaluation:
Extract speech features from the framed speech of step D, using the same process as step A(3).
F. Optimal path search:
(1) force-match the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network;
(2) using the path information, backtrack from the permitted terminal nodes of the network to obtain the optimal path;
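The forced matching of step F is essentially a Viterbi alignment of the feature frames against the linear state network, followed by backtracking from the terminal node. A minimal dynamic-programming sketch, with a hypothetical per-frame/state log-score function supplied by the caller:

```python
import numpy as np

def force_align(score, n_frames, n_states):
    """Viterbi-align frames to a left-to-right linear network.
    score(t, s) returns the log score of frame t in state s.
    Returns the optimal state index for each frame, recovered by
    backtracking from the terminal state."""
    NEG = -1e30
    delta = np.full((n_frames, n_states), NEG)  # best cumulative score
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0, 0] = score(0, 0)
    for t in range(1, n_frames):
        for s in range(n_states):
            # linear topology: stay in state s, or advance from s-1
            stay = delta[t - 1, s]
            adv = delta[t - 1, s - 1] if s > 0 else NEG
            if adv > stay:
                delta[t, s], back[t, s] = adv + score(t, s), s - 1
            else:
                delta[t, s], back[t, s] = stay + score(t, s), s
    path = [n_states - 1]                        # end at the terminal state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```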
G. Pronunciation quality score calculation:
(1) use the optimal path information of step F to compute the confidence score of each frame of speech features;
(2) use the optimal path information of step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the whole-sentence confidence score;
(3) map the whole-sentence confidence score onto the subjective scoring interval with a mapping function to obtain the final pronunciation quality score.
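Steps G(1)-(2) group the frame confidences by the state occupying them and average twice — within each state, then over the states on the path. A sketch (how each frame confidence itself is computed, e.g. as a log-posterior, is left to the caller):

```python
def sentence_confidence(frame_conf, path):
    """Average frame confidences within each state on the optimal path,
    then average the per-state scores to get the whole-sentence
    confidence, so long states do not dominate the sentence score."""
    by_state = {}
    for conf, state in zip(frame_conf, path):
        by_state.setdefault(state, []).append(conf)
    state_scores = [sum(v) / len(v) for v in by_state.values()]
    return sum(state_scores) / len(state_scores)
```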
2. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the cepstral coefficients of step A are Mel-frequency cepstral coefficients, which exploit the frequency-resolution characteristics of the human ear.
3. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation model of step B(1) is a phoneme-based hidden Markov model, trained as follows: initialize a single Gaussian model from all speech features, copy this model to every phoneme model, and train the models repeatedly by the Baum-Welch method; then repeatedly increase the number of Gaussian components of each phoneme model and retrain by Baum-Welch.
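The training schedule of claim 3 — one global Gaussian copied into every phoneme model, Baum-Welch re-estimation alternating with mixture growth — can be sketched structurally. The Baum-Welch update itself is stubbed out here, and the component-splitting perturbation is a common heuristic rather than something the patent specifies:

```python
import numpy as np

def init_phone_models(features, phones, max_mix=4, n_iter=3):
    """Claim-3 schedule: a single global Gaussian initializes every
    phoneme model; Baum-Welch re-estimation (stub below) alternates
    with doubling the number of Gaussian components."""
    mean, var = features.mean(0), features.var(0)
    # copy the one global Gaussian into every phoneme model
    models = {p: {"means": [mean.copy()], "vars": [var.copy()]} for p in phones}
    n_mix = 1
    while n_mix < max_mix:
        for _ in range(n_iter):
            baum_welch_reestimate(models, features)
        for m in models.values():  # split each component into two
            m["means"] = [mu + eps for mu in m["means"]
                          for eps in (0.2 * np.sqrt(var), -0.2 * np.sqrt(var))]
            m["vars"] = [v.copy() for v in m["vars"] for _ in range(2)]
        n_mix *= 2
    return models

def baum_welch_reestimate(models, features):
    """Placeholder for the Baum-Welch (forward-backward) update,
    omitted here for brevity."""
    pass
```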
4. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the accent adaptation of step B(2) is implemented by applying maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) accent adaptation to the trained standard pronunciation model, yielding the final standard pronunciation model.
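Claim 4 combines MLLR and MAP accent adaptation. The standard MAP mean update, for instance, interpolates the native-model prior mean with the adaptation-data statistics; a sketch, where the prior weight τ is a conventional assumption rather than a value from the patent:

```python
import numpy as np

def map_adapt_mean(prior_mean, adapt_frames, tau=10.0):
    """MAP update of a Gaussian mean: mu = (tau*mu0 + sum(x)) / (tau + n).
    With little adaptation data the estimate stays near the
    standard-pronunciation prior; with much data it follows the data."""
    n = len(adapt_frames)
    return (tau * prior_mean + adapt_frames.sum(axis=0)) / (tau + n)
```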
5. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation network of step C is a linear network with definite start and terminal nodes, whose nodes are HMM states; it does not consider grammar, and the current node depends only on its predecessor node.
CNB2005101148488A 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine Expired - Fee Related CN100411011C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Publications (2)

Publication Number Publication Date
CN1763843A true CN1763843A (en) 2006-04-26
CN100411011C CN100411011C (en) 2008-08-13

Family

ID=36747941

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101148488A Expired - Fee Related CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Country Status (1)

Country Link
CN (1) CN100411011C (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101105894B (en) * 2006-07-12 2011-08-10 陈修志 Multifunctional language learning machine
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN102253976A (en) * 2011-06-17 2011-11-23 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
WO2012055113A1 (en) * 2010-10-29 2012-05-03 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
US8385221B2 (en) 2010-02-28 2013-02-26 International Business Machines Corporation System and method for monitoring of user quality-of-experience on a wireless network
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN105529030A (en) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN106803424A (en) * 2015-11-26 2017-06-06 北京奥鹏远程教育中心有限公司 A kind of Chinese proficiency measuring technology
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN107767858A (en) * 2017-09-08 2018-03-06 科大讯飞股份有限公司 Pronunciation dictionary generation method and device, storage medium, electronic equipment
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Steady language identification method and system
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN111128181A (en) * 2019-12-09 2020-05-08 科大讯飞股份有限公司 Recitation question evaluation method, device and equipment
CN111710332A (en) * 2020-06-30 2020-09-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN112530455A (en) * 2020-11-24 2021-03-19 东风汽车集团有限公司 Automobile door closing sound quality evaluation method and evaluation system based on MFCC

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1236928A (en) * 1998-05-25 1999-12-01 郭巧 Computer aided Chinese intelligent education system and its implementation method
CN1123863C (en) * 2000-11-10 2003-10-08 清华大学 Information check method based on speed recognition

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101105894B (en) * 2006-07-12 2011-08-10 陈修志 Multifunctional language learning machine
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
US8385221B2 (en) 2010-02-28 2013-02-26 International Business Machines Corporation System and method for monitoring of user quality-of-experience on a wireless network
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
KR101417975B1 (en) * 2010-10-29 2014-07-09 안후이 유에스티씨 아이플라이텍 캄파니 리미티드 Method and system for endpoint automatic detection of audio record
US9330667B2 (en) 2010-10-29 2016-05-03 Iflytek Co., Ltd. Method and system for endpoint automatic detection of audio record
WO2012055113A1 (en) * 2010-10-29 2012-05-03 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102253976B (en) * 2011-06-17 2013-05-15 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN102253976A (en) * 2011-06-17 2011-11-23 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN102568475B (en) * 2011-12-31 2014-11-26 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN102982811B (en) * 2012-11-24 2015-01-14 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN103177733B (en) * 2013-03-11 2015-09-09 哈尔滨师范大学 Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN106803424A (en) * 2015-11-26 2017-06-06 北京奥鹏远程教育中心有限公司 A kind of Chinese proficiency measuring technology
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN105261246B (en) * 2015-12-02 2018-06-05 武汉慧人信息科技有限公司 A kind of Oral English Practice error correction system based on big data digging technology
CN105529030A (en) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106328123B (en) * 2016-08-25 2020-03-20 苏州大学 Method for recognizing middle ear voice in normal voice stream under condition of small database
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN106558308B (en) * 2016-12-02 2020-05-15 深圳撒哈拉数据科技有限公司 Internet audio data quality automatic scoring system and method
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Steady language identification method and system
CN109313892B (en) * 2017-05-17 2023-02-21 北京嘀嘀无限科技发展有限公司 Robust speech recognition method and system
CN107767858A (en) * 2017-09-08 2018-03-06 科大讯飞股份有限公司 Pronunciation dictionary generation method and device, storage medium, electronic equipment
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN111128181A (en) * 2019-12-09 2020-05-08 科大讯飞股份有限公司 Recitation question evaluation method, device and equipment
CN111710332A (en) * 2020-06-30 2020-09-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN112530455A (en) * 2020-11-24 2021-03-19 东风汽车集团有限公司 Automobile door closing sound quality evaluation method and evaluation system based on MFCC

Also Published As

Publication number Publication date
CN100411011C (en) 2008-08-13

Similar Documents

Publication Publication Date Title
CN1763843A (en) Pronunciation quality evaluating method for language learning machine
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
US9672816B1 (en) Annotating maps with user-contributed pronunciations
Gao et al. A study on robust detection of pronunciation erroneous tendency based on deep neural network.
CN101645271B (en) Rapid confidence-calculation method in pronunciation quality evaluation system
CN103928023A (en) Voice scoring method and system
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
CN101246685A (en) Pronunciation quality evaluation method of computer auxiliary language learning system
CN109377981B (en) Phoneme alignment method and device
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
CN109243460A (en) A method of automatically generating news or interrogation record based on the local dialect
CN110047474A (en) A kind of English phonetic pronunciation intelligent training system and training method
CN1787070A (en) Chip upper system for language learner
Suyanto et al. End-to-End speech recognition models for a low-resourced Indonesian Language
Sinclair et al. A semi-markov model for speech segmentation with an utterance-break prior
Xu English speech recognition and evaluation of pronunciation quality using deep learning
Rasanen Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level
Mary et al. Searching speech databases: features, techniques and evaluation measures
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN112767961B (en) Accent correction method based on cloud computing
Liu et al. Deriving disyllabic word variants from a Chinese conversational speech corpus
Yang et al. Landmark-based pronunciation error identification on Chinese learning
US8768697B2 (en) Method for measuring speech characteristics
Zheng An analysis and research on Chinese college students’ psychological barriers in oral English output from a cross-cultural perspective
Jin Design of Students' Spoken English Pronunciation Training System Based on Computer VB Platform.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080813

Termination date: 20191118