CN102214462A

CN102214462A - Method and system for estimating pronunciation

Info

Publication number: CN102214462A
Application number: CN2011101527653A
Authority: CN
Inventors: 赵璇; 王鹰; 黄玩惠
Original assignee: BEIJING AISHUOBA TECHNOLOGY CO LTD
Current assignee: BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority date: 2011-06-08
Filing date: 2011-06-08
Publication date: 2011-10-12
Anticipated expiration: 2031-06-08
Also published as: CN102214462B

Abstract

The invention relates to the field of computer aided language teaching, and provides a method for estimating pronunciation. The method comprises the following steps of: receiving an actual measurement sound signal of single language or a plurality of languages; generating an actual measurement audio frame signal according to the actual measurement sound signal; and comparing the actual measurement audio frame signal with a standard audio frame signal to estimate the quality of the actual measurement sound signal. The invention further discloses a system for estimating pronunciation. With the method and the system for estimating pronunciation provided by the invention, the quality of pronunciation can be estimated more accurately and effectively in a simple way.

Description

The method and system that is used for pronunciation evaluation

Technical field

The present invention relates to the computer-assisted language learning field, relate more specifically to be used for the method and system of pronunciation evaluation.

Background technology

Language is the human instrument that exchanges, and in internationalization level more and more higher today, grasps multi-door language and is praised highly by more and more people.Under this background condition, utilize the variety of way of area of computer aided verbal learning to arise at the historic moment.

Patent 98103685.6 discloses a kind of pronounce method of quality of phonetic symbol assessment learner of utilizing.Whether accurately this method is specified some common mispronounce patterns according to expertise, obtains score by pronunciation and mode standard contrast with the speaker, can obtain speaker's pronunciation information, thereby speaker's voice quality is assessed.The defective of this method is that error pattern need preestablish, if the mistake of speaker not among predefined error pattern, then can not detect mispronounce probably.

Patent 02160031.7 discloses a kind of method of automatic pronunciation correction.This method is weighed speaker's pronunciation level from pronunciation, pitch, loudness of a sound, four aspects of length.The defective of this method is the pronunciation phonetic symbol that needs every words of artificial mark, needs the work of cost great amount of manpower.This method adopts phonetic symbol to set up model, and carries out the voice quality scoring by model probability, need set up corresponding phonetic symbol model to each languages, so it is unfavorable for carrying out multilingual expansion, sneaks into multilingual situation in more being difficult to be supported in short.

Patent 200510107681.2 discloses a kind of method of utilizing phoneme recognizer assessment voice.Because this method needs in advance each phoneme to be carried out modeling, thereby exists the problem that can't support multilingual pronunciation evaluation equally.

In like manner, patent 200510114848.8, patent 200710145859.1, patent 200810102076.X, patent 200810107118.9, patent 200810168514.2, patent 200810141036.6, patent 20081022675.2, the essence of patent 200810240811.3 all are to adopt the Received Pronunciation model and obtained score by the contrast of evaluation and test voice, thereby assess the pronunciation level of tested voice, the difference on the algorithm that its difference is to count the score.Such method based on the Received Pronunciation model all is difficult to carry out multilingual expansion, can't accurately assess the unknown pronunciation of unknown language.Yet in daily life, the situation that Chinese and English are used with in people's spoken language is more and more general, and sometimes even in short two or more different language are sneaked in the inside.This just makes the pronunciation evaluating method of traditional master pattern based on language-specific become at a loss as to what to do gradually.

All are based on the method for phonetic symbol, and phenomenon is read by company that all can't descriptive language.Carrying out phonetic symbol when mark, connecting and read identically with the mark that does not connect the phonetic symbol of reading, so it can't be assessed some phrases (for example " a lot of ") whether accurately connected and reads.

All all can't accurately pass judgment on the accurate attaching problem of nasal sound in the speech based on the method for phonetic symbol.For example: the pronunciation of " any " is/a-ny/, or/an-y/, or/an-ny/.

In sum, need a kind of new pronunciation evaluation mode, particularly the pronunciation evaluation mode in language learning is assessed voice quality more accurately and effectively in simple mode.

Summary of the invention

At above-mentioned prior art problems, the invention provides the method and system that is used for pronunciation evaluation, can assess voice quality more accurately and effectively in simple mode.

The invention provides a kind of method that is used for pronunciation evaluation, may further comprise the steps:

Receive single languages or multilingual actual measurement voice signal;

According to described actual measurement voice signal, generate actual measurement audio frame signal;

Described actual measurement audio frame signal and standard audio frame signal are compared, to described actual measurement voice signal quality of evaluation.

Preferably, in various embodiments of the present invention, described method further comprises:

Extract the standard audio characteristic information from described standard audio frame signal, described standard audio characteristic information for example is at least a in Mel frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template; With

Extract actual measurement audio frequency characteristics information from described actual measurement audio frame signal, described actual measurement audio frequency characteristics information for example is at least a in Mel frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template;

Wherein, describedly relatively comprise: relatively described actual measurement audio frequency characteristics information and described standard audio characteristic information.

Preferably, in various embodiments of the present invention, describedly relatively comprise: utilize dynamic time consolidation (DTW) algorithm to make described actual measurement audio frame signal corresponding with described standard audio frame signal and compare.

In described actual measurement audio frame signal, form A actual measurement frame piece, comprise one or more actual measurement audio frames in each actual measurement frame piece;

In described standard audio frame signal, form B standard frame piece, comprise one or more standard audio frames in each standard frame piece;

Wherein, described A and B are the integer greater than 1, describedly relatively comprise: by the similarity of more described actual measurement frame piece and described standard frame piece, obtain the similarity of described actual measurement voice signal and described standard voice signal;

Wherein, if A ≠ B then determines the off quality of described actual measurement voice signal, or utilize the DTW algorithm that A described actual measurement frame piece forced to be divided into B actual measurement frame piece and carry out described comparison afterwards;

More preferably, if A 〉=2B or B 〉=2A then determine the off quality of described actual measurement voice signal.

Obtain the time dependent curve of energy of described actual measurement audio frame signal, and energy low ebb place therein with described actual measurement audio frame signal separately, to form described A actual measurement frame piece; And/or

Obtain the time dependent curve of energy of described standard audio frame signal, and energy low ebb place therein with described standard audio frame signal separately, to form described B standard frame piece.

At least a in Mel frequency cepstral coefficient by a plurality of actual measurement audio frames in the described actual measurement frame piece of described actual measurement audio frame signal, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template constitutes actual measurement audio frame characteristic sequence;

At least a in Mel frequency cepstral coefficient by a plurality of standard audio frames in the described standard frame piece of described standard audio frame signal, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template constitutes standard audio frame characteristic sequence;

Wherein, describedly relatively comprise: by the DTW algorithm described actual measurement audio frame characteristic sequence is alignd with described standard audio frame characteristic sequence, carry out similarity relatively for corresponding actual measurement audio frame feature in described actual measurement audio frame characteristic sequence and described standard audio frame characteristic sequence and standard audio frame feature;

More preferably, described similarity is relatively undertaken by at least a mode in related coefficient, support vector machine (SVM), the multi-layer perception (MLP).

Preferably, in various embodiments of the present invention, described quality of evaluation comprises: when the similarity of actual measurement audio frequency characteristics information in the described actual measurement audio frame signal and the standard audio characteristic information in the described standard audio frame signal during less than predetermined threshold, determine that described actual measurement voice signal is inaccurate; Otherwise, determine that described actual measurement voice signal is accurate.

Utilize the quantity of actual measurement frame piece up-to-standard in each described actual measurement frame piece to account for the ratio of the sum of described actual measurement frame piece, obtain the quality score of described actual measurement voice signal; Or

Utilize the quality average of all actual measurement frame pieces in the described actual measurement audio frame signal, obtain the quality score of described actual measurement voice signal.

Preferably, in various embodiments of the present invention, further comprise:

Record and/or output are confirmed as inaccurate part in described actual measurement voice signal; And/or

At in described actual measurement voice signal, being confirmed as inaccurate part, the counterpart of corresponding output in described standard voice signal.

The invention provides a kind of system that is used for pronunciation evaluation, comprising:

Sound receiver is used to receive single languages or multilingual actual measurement voice signal;

The audio frame generating apparatus is used for generating actual measurement audio frame signal according to described actual measurement voice signal;

Apparatus for evaluating is used for described actual measurement audio frame signal and standard audio frame signal relatively with to described actual measurement voice signal quality of evaluation.

By the method and system that is used for pronunciation evaluation provided by the invention, can assess voice quality more accurately and effectively in simple mode.

Description of drawings

In order to be illustrated more clearly in the embodiment of the invention or technical scheme of the prior art, below will do to introduce simply to the accompanying drawing of required use in embodiment or the description of the Prior Art, apparently, accompanying drawing in below describing only is some embodiments of the present invention, for those of ordinary skills, under the prerequisite of not paying creative work, can also obtain other embodiment and accompanying drawing thereof according to these accompanying drawing illustrated embodiments.

Fig. 1 is the indicative flowchart that is used for the method for pronunciation evaluation according to an embodiment of the invention.

Fig. 2 is the indicative flowchart of the method that is used for pronunciation evaluation according to another embodiment of the present invention.

Embodiment

Below with reference to accompanying drawing the technical scheme of various embodiments of the present invention is carried out clear, complete description, obviously, described embodiment only is a part of embodiment of the present invention, rather than whole embodiment.Based on the embodiment among the present invention, those of ordinary skills are resulting all other embodiment under the prerequisite of not making creative work, the scope that all belongs to the present invention and protected.

Receive single languages or multilingual actual measurement voice signal;

What can expect is that the standard audio frame signal can obtain in the canned data from database in advance; Also can obtain in real time, for example,, and form and its actual measurement audio frame signal relatively based on student's pronunciation based on teacher's pronunciation formation standard audio frame signal.

By the method and system that is used for pronunciation evaluation provided by the invention, the acoustic ratio of the audio frame of utilization actual measurement voice signal and standard voice signal, assess the voice quality of actual measurement voice signal accurately and effectively in simple mode, for example, accurately whether the actual measurement voice signal (accuracy reaches predetermined value), and, because this Acoustic assessment mode and text-independent, thereby can easily be applied to single languages and multilingual (promptly, the assessment of actual measurement voice signal multilingual mixing), for example, the assessment of the actual measurement voice signal that Chinese and English is mixed.

From described standard audio frame signal, extract the standard audio characteristic information; With

From described actual measurement audio frame signal, extract actual measurement audio frequency characteristics information;

In various embodiments of the present invention, preferably, can utilize various audio frequency characteristics information to be used for described comparison, for example, described standard audio characteristic information and actual measurement audio frequency characteristics information can be in the following spectrum signature information at least a (that is, can utilize the single audio frequency characteristics information of following type or the combination of a plurality of audio frequency characteristics information:

Mel frequency cepstral coefficient (MFCC, Mel Frequency Cepstrum Coefficient),

Sense of hearing linear predictor coefficient (PLP, Perceptual Linear Prediction),

Line spectral frequencies parameter (LSF, Line Spectral Frequency),

Linear predictor coefficient (LPC, Linear Predictive Coefficient),

Linear prediction cepstrum coefficient (LPCC, Linear Prediction Cepstral Coefficient),

Sequential template (TRAP, TempoRAl Patterns).

More preferably, can adopt PLP or TRAP to be used for described comparison as audio frequency characteristics information.

Preferably, in various embodiments of the present invention, describedly relatively comprise: utilize dynamic time warping (DTW, Dynamic time warping) algorithm that described actual measurement audio frame signal is alignd with described standard audio frame signal (frame piece wherein is corresponding one by one) and compare.

Preferably, if A 〉=2B or B 〉=2A then determine the off quality of described actual measurement voice signal.

A=B that is to say, if then can directly carry out described comparison; Otherwise, can directly determine the off quality of described actual measurement voice signal, perhaps alternately also can utilize the DTW algorithm A actual measurement frame piece forced to be divided into whether carry out described comparison after B the actual measurement frame piece qualified with the quality of definite described actual measurement voice signal.Preferably, in one embodiment, if A 〉=2B or B 〉=2A, can think that then the difference of described actual measurement voice signal and described standard voice signal is excessive or inequality, that is, similarity is crossed low or dissimilar, thereby can directly determine the off quality of described actual measurement voice signal.

In order to realize that pressure described herein divides, at first must form B standard frame piece, under the situation of knowing B value, carry out described pressure division and obtain B and survey the frame piece.Its method is: utilize the DTW algorithm will survey the frame feature and align with the standard frame feature to obtain the frame between the two and the corresponding relation of frame, can determine the border of B actual measurement frame piece then again by the border of B standard frame piece.

The energy that obtains described actual measurement audio frame signal is change curve in time, and energy low ebb place therein with described actual measurement audio frame signal separately, surveys the frame piece to form described A; And/or

The energy that obtains described standard audio frame signal is change curve in time, and energy low ebb place therein with described standard audio frame signal separately, to form described B standard frame piece.

At least a in Mel frequency cepstral coefficient (MFCC) by a plurality of actual measurement audio frames in the described actual measurement frame piece of described actual measurement audio frame signal, sense of hearing linear predictor coefficient (PLP), line spectral frequencies parameter (LSF), linear predictor coefficient (LPC), linear prediction cepstrum coefficient (LPCC), the sequential template (TRAP) constitutes actual measurement audio frame characteristic sequence;

At least a in Mel frequency cepstral coefficient (MFCC) by a plurality of standard audio frames in the described standard frame piece of described standard audio frame signal, sense of hearing linear predictor coefficient (PLP), line spectral frequencies parameter (LSF), linear predictor coefficient (LPC), linear prediction cepstrum coefficient (LPCC), the sequential template (TRAP) constitutes standard audio frame characteristic sequence;

Preferably, described similarity is relatively undertaken by at least a mode in related coefficient, support vector machine (SVM), the multi-layer perception (MLP).When needed, also can utilize gauss hybrid models (GMM) to carry out similarity relatively.

By the DTW algorithm, described actual measurement audio frame characteristic sequence is alignd with described standard audio frame characteristic sequence, thereby make the element in two not isometric sequences that originally may be difficult to comparison have one-to-one relationship.Every stack features that will have an one-to-one relationship is sent into the similarity comparer to (that is, surveying audio frame feature and standard audio frame feature accordingly) and is carried out similarity relatively.

In one embodiment, the similarity comparer can be realized with related coefficient, adopts related coefficient relatively to survey the similarity of audio frame signal and standard audio frame signal, that is:

f (X, Y) = COR (X, Y) = \frac{Σ_{i = 0}^{N} (Xi - \overset{&OverBar;}{X}) (Yi - \overset{&OverBar;}{Y})}{\sqrt{Σ_{i = 0}^{N} {(Xi - \overset{&OverBar;}{x})}^{2} Σ_{i = 0}^{N} {(Yi - \overset{&OverBar;}{Y})}^{2}}}

If f (X, Y) 〉=threshold thinks that then X is identical with Y or have abundant similarity, otherwise thinks that X is different with Y or dissimilar.

In one embodiment,, can adopt at least a in the following sorter in order relatively to survey the similarity of audio frame signal and standard audio frame signal, with final acquisition sound signal quality score:

Support vector machine (SVM, support vector machine),

Multi-layer perception (MLP, multi layer perceptron),

Gauss hybrid models (GMM, Gaussian Mixture Model).

In one embodiment, adopt SVM, that is, f (X, Y)=SVM ([X; Y]) ∈ [1 ,+1], wherein, [X; Y] expression is spliced into a column vector to two column vector X and Y and sends into the svm classifier device.(X Y) 〉=0, thinks that then X is identical with Y or have abundant similarity, otherwise thinks that X is different with Y or dissimilar as if f.

In a preferred embodiment, adopt MLP, that is, f (X, Y)=MLP ([X; Y]) ∈ [0,1], wherein, [X; Y] expression is spliced into a column vector to two column vector X and Y and sends into the MLP sorter.If f (X, Y) 〉=threshold, think that then X is identical with Y or have abundant similarity, otherwise think that X is different with Y or dissimilar.

In another embodiment, adopt GMM, that is,

Wherein, GMM _XThe GMM model that expression is obtained by the X estimation, GMM _X(Y) the probability score of expression Y in the probability model of X, GMM _YThe GMM model that expression is obtained by the Y estimation, GMM _Y(X) the probability score of expression X in the probability model of Y.If f (X, Y) 〉=threshold thinks that then X is identical with Y or have abundant similarity, otherwise thinks that X is different with Y or dissimilar.

Preferably, in various embodiments of the present invention, described quality of evaluation comprises:

When the similarity of actual measurement audio frequency characteristics information in the described actual measurement audio frame signal and the standard audio characteristic information in the described standard audio frame signal during, determine that described actual measurement voice signal is inaccurate less than predetermined threshold; Otherwise, determine that described actual measurement voice signal is accurate.

Like this, the ratio that can utilize accurate (or inaccurate) frame that actual measurement is contained in the audio frame block to account for the totalframes amount quality score that obtains each frame piece and survey voice signal.Also can utilize the quality score of the quality average of each actual measurement audio frame block as the actual measurement voice signal.

In one embodiment, according in described actual measurement voice signal, being confirmed as inaccurate part, can obtaining the true position of cacoepy (for example true frame piece position of cacoepy), and it can be noted.

In one embodiment, at in described actual measurement voice signal, being confirmed as inaccurate part, can the counterpart of corresponding output in described standard voice signal, thereby can carry out the voice comparison to specific syllable, word or phrase as required, with the pronunciation that corrects a mistake promptly, for example can be used for language teaching, this is particularly useful under the situation of correcting individual voice mistake emphatically.

According to the ratio that in described actual measurement voice signal, is confirmed as inaccurate part, determine the quality score of described actual measurement voice signal.

In one embodiment, by calculating true syllable number or word number or the shared ratio of phrase number of cacoepy, obtain the sound signal quality score.

In one embodiment, in A the actual measurement frame piece that forms based on described actual measurement audio frame signal, by calculating accurate/inaccurate number calculated mass score in A the actual measurement frame piece.

In one embodiment, change into earlier audio frequency characteristics frame by frame, relatively align with DTW again, thereby obtain the corresponding relation between actual measurement sound frame and the standard pronunciation frame, each group corresponding audio frame signal (a frame standard pronunciation and corresponding frame actual measurement sound combine) sent into to compare in the neural network obtain the output result, perhaps directly calculate related coefficient and obtain similarity.

At step 101-103, the actual measurement audio frame signal that obtains is divided frame and forms A frame piece (wherein can comprise a plurality of frames in each frame piece), and can therefrom extract actual measurement audio frequency characteristics information (for example MFCC).

At step 104-106, the standard audio frame signal of obtaining is divided frame and formed B frame piece (wherein can comprise a plurality of frames in each frame piece), and can therefrom extract standard audio characteristic information (for example MFCC).

Wherein, described A and B are the integer greater than 1, if A=B (in the embodiment shown in fig. 1), then proceed subsequent step, otherwise think that the actual measurement voice signal is different with the standard voice signal or dissimilar and think that voice quality is defective, also can utilize aforesaid pressure dividing mode to form B certainly and survey frame piece (pressure make new A=B) to carry out the DTW comparison of aliging with B standard frame piece.And step 101-103 and step 104-106 can carry out simultaneously, also can not carry out simultaneously; But, when adopting aforementioned pressure dividing mode, step 104-106 must carry out prior to step 101-103.

Below will obtain the similarity of actual measurement voice signal and standard voice signal by relatively surveying the similarity of frame piece and standard frame piece.

In step 107, the actual measurement audio frame is alignd with the standard audio frame.

In step 108, the actual measurement frame piece of actual measurement audio frame signal is alignd with the standard frame piece of standard audio frame signal.

Under above-mentioned aligned condition, can obtain to survey the frame piece similarity of voice signal and standard voice signal, obtain the score of actual measurement frame piece thus.

In step 109, determine the score of the actual measurement frame piece of actual measurement audio frame signal.

In step 110, determine the quality score of actual measurement voice signal.

Fig. 2 is the indicative flowchart that is used for the method for pronunciation evaluation according to an embodiment of the invention.

In step 201, the standard voice conversion of signals is become the standard audio frame signal of the pulse code modulation (pcm) form of 16k, 16 (BIT).Certainly, in other embodiments, corresponding standard audio frame signal can be (for example being stored in the database for calling) of finishing in advance, then needn't carry out this switch process.

In step 202, the standard voice signal can be divided into the audio frame (window) of 25 milliseconds (ms), and the distance between the adjacent windows can be 10 milliseconds (ms).Certainly, in other embodiments, also can take the distance (for example being 5ms) between different window (for example being 20ms) and/or the adjacent windows.Voice signal is continuous " waveform signal ", can move 10 milliseconds according to 20 milliseconds of frame lengths, frame divides frame handle to obtain described " audio frame signal " waveform signal, then 100 milliseconds voice will become 9 frame audio frame signals, and 1000 milliseconds voice will become 99 frame audio frame signals.Voice are divided according to the energy low ebb, can be divided into several " frame pieces " again,, can be divided into 499 frames, but the inside has only 5 syllables, so be split into 5 frame pieces such as 5 seconds in short.

In step 203, the waveform signal of each audio frame converts the fast Fourier transform (FFT) spectrum after by high boost to, the FFT spectrum is divided into 24 subbands equidistantly according to Mel (MEL) and extracts the sub belt energy (subband that also can be divided into other quantity certainly respectively, for example 36), sub belt energy unit is converted to decibel, remake discrete cosine transform (DCT), obtain MEL frequency cepstral coefficient (MFCC) feature.In another embodiment, also can take alternate manner to extract acoustic feature (for example MFCC); And in another embodiment, also can extract other acoustic feature parameter as a comparison that is different from MFCC.

At step 204-206, the disposal route of actual measurement voice signal is similar in the disposal route of 201-203 to the standard voice signal, obtains the MFCC feature of actual measurement voice signal at last.

Wherein, step 201-203 and step 204-206 can carry out simultaneously, also can not carry out simultaneously.

In step 207, utilize dynamic time consolidation (DTW) algorithm will survey audio frame and the alignment of standard audio frame, obtain the corresponding relation of actual measurement each frame of audio frame and each frame of standard audio frame.

In step 208, extract the energy trace of actual measurement voice signal, at the low ebb place of energy actual measurement voice signal cent is slit into plurality of sections (calling syllable phonetically).

In step 209, the MFCC of some frames in the frame piece of actual measurement voice signal is spliced into sequence of real numbers, the MFCC of the standard voice signal that it is corresponding also is combined into sequence of real numbers, asks the related coefficient and/or the neural network scoring output of two sequences.

In step 210, when related coefficient is lower than predetermined threshold, think that actual measurement voice signal cacoepy is true, forward step 211 to; Otherwise, think that the pronunciation of actual measurement voice signal accurately, forwards step 212 to.

In step 213, statistics is considered to survey accurately the quantity of frame piece in step 212, calculates frame piece shared ratio in actual measurement frame piece total amount accurately.

In step 214, according to the frame piece shared ratio in actual measurement frame piece total amount accurately of pronouncing, the accurate ratio of will pronouncing is converted into mark, and can feed back to the user.In one embodiment, be full marks greater than 90%; Less than 50% is zero; Between 50%-90%, ask mark according to linear interpolation.

The present invention also provides a kind of system that is used for pronunciation evaluation, comprising:

By the technical scheme of embodiments of the invention, overcome the defective of existing pronunciation evaluating method, the similarity of surveying voice signal and standard voice signal from the acoustics assessment is to determine voice quality.Its form is succinct, and is simple to operate, can realize the pronunciation quality assessment that languages are irrelevant, therefore has better generality and ease for use.

Various embodiment provided by the invention can be as required combination mutually in any way, the technical scheme that obtains by this combination, also within the scope of the invention.

Obviously, those skilled in the art can carry out various changes and modification to the present invention and not break away from the spirit and scope of the present invention.Like this, if of the present invention these are revised and modification belongs within the scope of claim of the present invention and equivalent technologies thereof, then the present invention also comprises these changes and modification interior.

Claims

1. a method that is used for pronunciation evaluation is characterized in that, may further comprise the steps:

Receive single languages or multilingual actual measurement voice signal;

2. the method for claim 1 is characterized in that, further comprises:

3. method as claimed in claim 1 or 2 is characterized in that,

Describedly relatively comprise: utilize dynamic time consolidation DTW algorithm to make described actual measurement audio frame signal corresponding with described standard audio frame signal and compare.

4. as the described method of one of claim 1 to 3, it is characterized in that, further comprise:

5. as the described method of one of claim 1 to 4, it is characterized in that, further comprise:

6. as the described method of one of claim 1 to 5, it is characterized in that, further comprise:

Preferably, described similarity is relatively undertaken by at least a mode among related coefficient, support vector machine SVM, the multi-layer perception MLP.

7. as the described method of one of claim 1 to 6, it is characterized in that,

Described quality of evaluation comprises:

8. as the described method of one of claim 1 to 7, it is characterized in that, further comprise:

9. as the described method of one of claim 1 to 8, it is characterized in that, further comprise:

10. a system that is used for pronunciation evaluation is characterized in that, comprising: