CN102214462B - Method and system for estimating pronunciation - Google Patents


Publication number
CN102214462B
CN102214462B · CN2011101527653A
Authority
CN
China
Prior art keywords
actual measurement
standard
signal
audio frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011101527653A
Other languages
Chinese (zh)
Other versions
CN102214462A (en)
Inventor
赵璇
王鹰
黄玩惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING AISHUOBA TECHNOLOGY CO LTD
Original Assignee
BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority to CN2011101527653A
Publication of CN102214462A
Application granted
Publication of CN102214462B

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of computer-aided language teaching, and provides a method for estimating pronunciation. The method comprises the following steps: receiving a measured sound signal in a single language or in a plurality of languages; generating a measured audio frame signal from the measured sound signal; and comparing the measured audio frame signal with a standard audio frame signal to estimate the quality of the measured sound signal. The invention further discloses a system for estimating pronunciation. With the method and system provided by the invention, pronunciation quality can be estimated more accurately and effectively in a simple way.

Description

Method and system for pronunciation evaluation
Technical field
The present invention relates to the field of computer-assisted language learning, and more specifically to a method and system for pronunciation evaluation.
Background technology
Language is the instrument of human communication, and in today's increasingly internationalized world, mastering several languages is valued by more and more people. Against this background, a variety of computer-aided language learning approaches have emerged.
Patent 98103685.6 discloses a method of assessing the quality of a learner's pronunciation by means of phonetic symbols. Based on expert knowledge, the method specifies a number of common mispronunciation patterns; by contrasting the speaker's pronunciation with the standard patterns it obtains a score and information on whether the speaker's pronunciation is accurate, and thereby assesses the speaker's speech quality. The drawback of this method is that the error patterns must be specified in advance: if the speaker's error is not among the predefined patterns, the mispronunciation will most likely not be detected.
Patent 02160031.7 discloses a method of automatic pronunciation correction. The method measures the speaker's pronunciation level in four respects: articulation, pitch, intensity and duration. Its drawback is that the phonetic symbols of every sentence must be labeled manually, which requires a great amount of manpower. Because the method builds its models on phonetic symbols and scores speech quality by model probability, a separate phonetic-symbol model must be built for each language; it is therefore ill-suited to multilingual extension, let alone to sentences in which several languages are mixed.
Patent 200510107681.2 discloses a method of assessing speech with a phoneme recognizer. Because each phoneme must be modeled in advance, this method likewise cannot support multilingual pronunciation evaluation.
Likewise, patents 200510114848.8, 200710145859.1, 200810102076.X, 200810107118.9, 200810168514.2, 200810141036.6, 20081022675.2 and 200810240811.3 all, in essence, adopt a standard pronunciation model and score the speech under evaluation against it, differing only in the scoring algorithm. All such methods based on a standard pronunciation model are difficult to extend to multiple languages and cannot accurately assess unknown pronunciations in unknown languages. Yet in daily life it is increasingly common for Chinese and English to be mixed in spoken language, sometimes with two or more languages mixed within a single sentence. This gradually leaves traditional pronunciation evaluation methods, built on standard models of specific languages, at a loss.
None of the phonetic-symbol-based methods can describe liaison, the linking of sounds. When phonetic labeling is performed, linked and unlinked pronunciations receive identical phonetic labels, so such methods cannot assess whether phrases such as "a lot of" are correctly linked.
Nor can any phonetic-symbol-based method accurately judge how a nasal attaches within a word: for example, whether "any" is pronounced /a-ny/, /an-y/ or /an-ny/.
In summary, a new pronunciation evaluation approach is needed, in particular for language learning, that assesses speech quality more accurately and effectively in a simple way.
Summary of the invention
In view of the above problems in the prior art, the present invention provides a method and system for pronunciation evaluation that can assess speech quality more accurately and effectively in a simple way.
The invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
and if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
Preferably, in various embodiments of the present invention, the method further comprises:
extracting standard audio feature information from said standard audio frame signal, said standard audio feature information being at least one of Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns; and
extracting measured audio feature information from said measured audio frame signal, said measured audio feature information being, for example, at least one of Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns;
wherein said comparing comprises: comparing said measured audio feature information with said standard audio feature information.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining the curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy valleys of that curve to form said A measured frame blocks; and/or
obtaining the curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy valleys of that curve to form said B standard frame blocks.
Preferably, in various embodiments of the present invention, the method further comprises:
constructing a measured audio frame feature sequence from at least one of the Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns of the plurality of measured audio frames in said measured frame blocks of said measured audio frame signal;
constructing a standard audio frame feature sequence from at least one of the Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns of the plurality of standard audio frames in said standard frame blocks of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and performing a similarity comparison between each measured audio frame feature in said measured audio frame feature sequence and the corresponding standard audio frame feature in said standard audio frame feature sequence;
said similarity comparison being performed by at least one of a correlation coefficient, a support vector machine (SVM) and a multilayer perceptron (MLP).
Preferably, in various embodiments of the present invention, said evaluating of quality comprises: when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining a quality score for said measured sound signal from the ratio of the number of measured frame blocks of acceptable quality to the total number of said measured frame blocks; or
obtaining a quality score for said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
Preferably, various embodiments of the present invention further comprise:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
The invention further provides a system for pronunciation evaluation, comprising:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal;
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
a measured frame block generating device for forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
a standard frame block generating device for forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
and if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
Through the method and system for pronunciation evaluation provided by the invention, speech quality can be assessed more accurately and effectively in a simple way.
Description of drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can derive other embodiments and drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
Embodiment
The technical solutions of various embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the protected scope of the invention.
The present invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal.
As can be expected, the standard audio frame signal may be obtained from information stored in advance in a database, or obtained in real time; for example, the standard audio frame signal may be formed from a teacher's pronunciation and compared with a measured audio frame signal formed from a student's pronunciation.
Through the method and system for pronunciation evaluation provided by the invention, the acoustic comparison between the audio frames of the measured sound signal and those of the standard sound signal assesses the quality of the measured sound signal accurately and effectively in a simple way, for example whether the measured sound signal is accurate (whether its accuracy reaches a predetermined value). Moreover, because this acoustic assessment is text-independent, it can easily be applied to single-language and multilingual (that is, mixed-language) measured sound signals, for example a measured sound signal in which Chinese and English are mixed.
Preferably, in various embodiments of the present invention, the method further comprises:
extracting standard audio feature information from said standard audio frame signal; and
extracting measured audio feature information from said measured audio frame signal;
wherein said comparing comprises: comparing said measured audio feature information with said standard audio feature information.
In various embodiments of the present invention, various kinds of audio feature information may preferably be used for said comparison; for example, said standard audio feature information and measured audio feature information may each be at least one of the following kinds of spectral feature information (that is, a single kind of audio feature information of the following types, or a combination of several kinds):
Mel frequency cepstral coefficients (MFCC, Mel Frequency Cepstrum Coefficient),
perceptual linear prediction coefficients (PLP, Perceptual Linear Prediction),
line spectral frequency parameters (LSF, Line Spectral Frequency),
linear prediction coefficients (LPC, Linear Predictive Coefficient),
linear prediction cepstral coefficients (LPCC, Linear Prediction Cepstral Coefficient),
temporal patterns (TRAP, TempoRAl Patterns).
More preferably, PLP or TRAP may be adopted as the audio feature information used for said comparison.
Preferably, in various embodiments of the present invention, said comparing comprises: aligning said measured audio frame signal with said standard audio frame signal (with a one-to-one correspondence between their frame blocks) using the dynamic time warping (DTW, Dynamic Time Warping) algorithm, and comparing them.
Preferably, in various embodiments of the present invention, the method further comprises:
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
preferably, if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
That is to say, if A = B, said comparing can be carried out directly; otherwise, the quality of said measured sound signal can be determined directly to be unacceptable, or, alternatively, the DTW algorithm can be used to forcibly re-divide the A measured frame blocks into B measured frame blocks before said comparing is performed to determine whether the quality of said measured sound signal is acceptable. Preferably, in one embodiment, if A ≥ 2B or B ≥ 2A, the measured sound signal can be considered to differ excessively from, or be entirely unlike, the standard sound signal, that is, their similarity is too low or they are dissimilar; the quality of the measured sound signal can then be determined directly to be unacceptable.
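The block-count rule described here can be sketched as a small decision function. The function name and the returned labels are illustrative only and are not part of the patent:

```python
def block_count_decision(a, b):
    """Decide how to proceed given A measured frame blocks and B standard
    frame blocks, following the rule sketched in the text."""
    if a < 2 or b < 2:
        raise ValueError("A and B must be integers greater than 1")
    if a >= 2 * b or b >= 2 * a:
        # Block counts differ too much: the signals are judged dissimilar
        # and the measured signal's quality is unacceptable.
        return "reject"
    if a != b:
        # Either reject outright, or forcibly re-divide the A measured
        # blocks into B blocks with DTW before comparing.
        return "force-divide-or-reject"
    return "compare"
```

For example, five measured blocks against five standard blocks proceed directly to comparison, while four against eight are rejected outright.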
To implement the forced division described here, the B standard frame blocks must be formed first; once the value of B is known, the forced division is carried out to obtain B measured frame blocks. The method is as follows: the DTW algorithm is used to align the measured frame features with the standard frame features, yielding a frame-to-frame correspondence between the two; the boundaries of the B measured frame blocks can then be determined from the boundaries of the B standard frame blocks.
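The forced division can be sketched with a classic DTW alignment followed by boundary mapping. All names are illustrative, and the per-frame distance is left as a parameter; this is a minimal sketch, not the patent's implementation:

```python
def dtw_path(xs, ys, dist):
    """Classic DTW: returns an optimal alignment path [(i, j), ...]
    between frame sequences xs and ys under the frame distance `dist`."""
    n, m = len(xs), len(ys)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(xs[i - 1], ys[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min([(cost[i - 1][j - 1], i - 1, j - 1),
                       (cost[i - 1][j], i - 1, j),
                       (cost[i][j - 1], i, j - 1)])
    path.reverse()
    return path

def forced_block_boundaries(path, std_boundaries):
    """Map standard-side block boundaries (frame indices in ys) to
    measured-side boundaries (frame indices in xs) via the DTW path."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(j, i)
    return [first_match[j] for j in std_boundaries]
```

With a toy one-dimensional example, aligning [0, 0, 5, 5, 9] against [0, 5, 9] maps the standard boundaries 0, 1, 2 to measured boundaries 0, 2, 4.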
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining the curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy valleys of that curve to form said A measured frame blocks; and/or
obtaining the curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy valleys of that curve to form said B standard frame blocks.
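The energy-valley segmentation can be sketched as follows. The `floor` threshold is an illustrative assumption added here so that small dips inside a syllable do not create spurious blocks; the patent itself only states that the signal is divided at its energy valleys:

```python
def split_at_energy_valleys(energies, floor=0.0):
    """Split a per-frame energy curve into frame blocks at its local
    minima (energy valleys). Returns the start index of each block."""
    starts = [0]
    for i in range(1, len(energies) - 1):
        if (energies[i] <= floor
                and energies[i] < energies[i - 1]
                and energies[i] <= energies[i + 1]):
            starts.append(i)
    return starts
```

Consecutive start indices then delimit the frame blocks: block k spans frames `starts[k]` up to (but not including) `starts[k + 1]`.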
Preferably, in various embodiments of the present invention, the method further comprises:
constructing a measured audio frame feature sequence from at least one of the Mel frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequency parameters (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC) and temporal patterns (TRAP) of the plurality of measured audio frames in said measured frame blocks of said measured audio frame signal;
constructing a standard audio frame feature sequence from at least one of the Mel frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequency parameters (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC) and temporal patterns (TRAP) of the plurality of standard audio frames in said standard frame blocks of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and performing a similarity comparison between each measured audio frame feature in said measured audio frame feature sequence and the corresponding standard audio frame feature in said standard audio frame feature sequence;
preferably, said similarity comparison is performed by at least one of a correlation coefficient, a support vector machine (SVM) and a multilayer perceptron (MLP). When needed, a Gaussian mixture model (GMM) may also be used to perform the similarity comparison.
Through the DTW algorithm, said measured audio frame feature sequence is aligned with said standard audio frame feature sequence, so that the elements of two sequences of unequal length, which would otherwise be difficult to compare, are put into one-to-one correspondence. Each feature pair with such a correspondence (that is, a measured audio frame feature and its corresponding standard audio frame feature) is then fed into a similarity comparator for similarity comparison.
In one embodiment, the similarity comparator can be realized with the correlation coefficient; that is, the correlation coefficient is adopted to compare the similarity between the measured audio frame signal and the standard audio frame signal:

$$f(X, Y) = \mathrm{COR}(X, Y) = \frac{\sum_{i=0}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=0}^{N}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=0}^{N}(Y_i - \bar{Y})^2}}$$

If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
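This comparator can be sketched directly from the formula. The threshold value of 0.8 below is illustrative only, since the patent leaves the threshold unspecified:

```python
from math import sqrt

def correlation_similarity(x, y):
    """Pearson correlation coefficient between two equal-length feature
    vectors, as used by the similarity comparator in the text."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sqrt(sum((xi - mx) ** 2 for xi in x))
           * sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den if den else 0.0

def is_similar(x, y, threshold=0.8):
    """Accept the pair when f(X, Y) >= threshold (0.8 is illustrative)."""
    return correlation_similarity(x, y) >= threshold
```

A perfectly proportional pair scores 1.0; a reversed pair scores negatively and is rejected.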
In one embodiment, to compare the similarity between the measured audio frame signal and the standard audio frame signal and finally obtain a sound signal quality score, at least one of the following classifiers may be adopted:
a support vector machine (SVM, Support Vector Machine),
a multilayer perceptron (MLP, Multi-Layer Perceptron),
a Gaussian mixture model (GMM, Gaussian Mixture Model).
In one embodiment, an SVM is adopted; that is, f(X, Y) = SVM([X; Y]) ∈ [-1, +1], where [X; Y] denotes splicing the two column vectors X and Y into one column vector, which is fed into the SVM classifier. If f(X, Y) ≥ 0, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
In a preferred embodiment, an MLP is adopted; that is, f(X, Y) = MLP([X; Y]) ∈ [0, 1], where [X; Y] denotes splicing the two column vectors X and Y into one column vector, which is fed into the MLP classifier. If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
In another embodiment, a GMM is adopted; that is,

$$f(X, Y) = \frac{\mathrm{GMM}_X(Y) + \mathrm{GMM}_Y(X)}{2}$$

where GMM_X denotes the GMM model estimated from X, GMM_X(Y) denotes the probability score of Y under the probability model of X, GMM_Y denotes the GMM model estimated from Y, and GMM_Y(X) denotes the probability score of X under the probability model of Y. If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
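The symmetric-scoring idea can be illustrated with a single diagonal Gaussian per side, that is, a one-component degenerate "GMM". Both the single-component simplification and the averaging of the two cross scores are assumptions made for this sketch (the patent's original formula appears only as an image); a real implementation would fit a multi-component mixture:

```python
from math import log, pi

def gaussian_fit(frames):
    """Fit a diagonal single-component Gaussian (a degenerate GMM) to a
    list of feature vectors; returns per-dimension (means, variances)."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(d)]
    varis = [max(sum((f[k] - means[k]) ** 2 for f in frames) / n, 1e-6)
             for k in range(d)]
    return means, varis

def avg_log_likelihood(model, frames):
    """Average per-frame log-likelihood of `frames` under `model`."""
    means, varis = model
    total = 0.0
    for f in frames:
        for k in range(len(means)):
            total += -0.5 * (log(2 * pi * varis[k])
                             + (f[k] - means[k]) ** 2 / varis[k])
    return total / len(frames)

def symmetric_score(x_frames, y_frames):
    """f(X, Y) = (score of Y under X's model + score of X under Y's
    model) / 2; the averaging is an assumption of this sketch."""
    gx = gaussian_fit(x_frames)
    gy = gaussian_fit(y_frames)
    return 0.5 * (avg_log_likelihood(gx, y_frames)
                  + avg_log_likelihood(gy, x_frames))
```

Two sets of frames drawn from the same region of feature space score much higher than two sets drawn from distant regions, which is what the threshold test relies on.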
Preferably, in various embodiments of the present invention, said evaluating of quality comprises:
when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining a quality score for said measured sound signal from the ratio of the number of measured frame blocks of acceptable quality to the total number of said measured frame blocks; or
obtaining a quality score for said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
In this way, the ratio of the accurate (or inaccurate) frames contained in a measured audio frame block to its total number of frames can be used to obtain the quality score of each frame block and of the measured sound signal; alternatively, the average quality of the measured audio frame blocks can be used as the quality score of the measured sound signal.
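The two scoring alternatives can be sketched as follows; the pass threshold of 0.8 is illustrative, as the patent does not fix its value:

```python
def score_by_pass_ratio(block_scores, threshold=0.8):
    """Quality score = fraction of frame blocks whose similarity reaches
    the (illustrative) threshold, i.e. blocks of acceptable quality."""
    passed = sum(1 for s in block_scores if s >= threshold)
    return passed / len(block_scores)

def score_by_mean(block_scores):
    """Alternative: quality score = average quality of all frame blocks."""
    return sum(block_scores) / len(block_scores)
```

For block similarities [0.9, 0.95, 0.6, 0.85], the pass-ratio score is 0.75 (three of four blocks pass) and the mean score is 0.825.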
Preferably, in various embodiments of the present invention, the method further comprises:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
In one embodiment, the positions of the mispronunciations (for example the positions of the mispronounced frame blocks) can be obtained from the parts of said measured sound signal determined to be inaccurate, and can be recorded.
In one embodiment, for the parts of said measured sound signal determined to be inaccurate, the counterpart parts of said standard sound signal can be output correspondingly, so that specific syllables, words or phrases can be compared as needed in order to correct mispronunciations. This can be used, for example, in language teaching, and is particularly useful where individual pronunciation errors are to be corrected with emphasis.
Preferably, in various embodiments of the present invention, the method further comprises:
determining the quality score of said measured sound signal from the proportion of said measured sound signal that is determined to be inaccurate.
In one embodiment, the sound signal quality score is obtained by calculating the proportion of mispronounced syllables, words or phrases.
In one embodiment, among the A measured frame blocks formed from said measured audio frame signal, the quality score is calculated from the number of accurate/inaccurate blocks among the A measured frame blocks.
In one embodiment, the frames are first converted into audio features, which are then aligned by DTW comparison to obtain the correspondence between measured sound frames and standard sound frames; each group of corresponding audio frame signals (one frame of standard sound combined with the corresponding frame of measured sound) is fed into a neural network for comparison to obtain an output result, or the correlation coefficient is computed directly to obtain the similarity.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
In steps 101-103, the measured audio frame signal obtained is divided into frames and A frame blocks are formed (each frame block may contain a plurality of frames), from which measured audio feature information (for example MFCC) can be extracted.
In steps 104-106, the standard audio frame signal obtained is divided into frames and B frame blocks are formed (each frame block may contain a plurality of frames), from which standard audio feature information (for example MFCC) can be extracted.
Here said A and B are integers greater than 1. If A = B (as in the embodiment shown in Fig. 1), the subsequent steps proceed; otherwise the measured sound signal is considered different from or dissimilar to the standard sound signal and its quality is considered unacceptable. Of course, the forced division described above may also be used to form B measured frame blocks (forcing a new A = B) for DTW alignment and comparison with the B standard frame blocks. Steps 101-103 and steps 104-106 may or may not be carried out simultaneously; however, when the forced division is adopted, steps 104-106 must be carried out before steps 101-103.
The similarity between the measured sound signal and the standard sound signal is obtained below by comparing the similarity between the measured frame blocks and the standard frame blocks.
In step 107, the measured audio frames are aligned with the standard audio frames.
In step 108, the measured frame blocks of the measured audio frame signal are aligned with the standard frame blocks of the standard audio frame signal.
Under the above alignment, the frame block similarity between the measured sound signal and the standard sound signal can be obtained, and the scores of the measured frame blocks are derived from it.
In step 109, the scores of the measured frame blocks of the measured audio frame signal are determined.
In step 110, the quality score of the measured sound signal is determined.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
In step 201, the standard sound signal is converted into a standard audio frame signal in 16 kHz, 16-bit pulse code modulation (PCM) format. Of course, in other embodiments the corresponding standard audio frame signal may be prepared in advance (for example stored in a database to be called up), in which case this conversion step can be omitted.
In step 202, the standard sound signal can be divided into audio frames (windows) of 25 milliseconds (ms), with a spacing of 10 ms between adjacent windows. Of course, in other embodiments a different window length (for example 20 ms) and/or a different spacing between adjacent windows (for example 5 ms) may be used. The sound signal is a continuous "waveform signal"; dividing it, for example, into frames of 20 ms length with a 10 ms shift yields the "audio frame signal": 100 ms of speech then becomes 9 audio frames, and 1000 ms of speech becomes 99 audio frames. The speech can further be divided at energy troughs into several "frame blocks"; for example, a 5-second sentence may be divided into 499 frames but contain only 5 syllables, so it is split into 5 frame blocks.
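As a minimal illustrative sketch (not part of the patent text itself), the framing of step 202 can be expressed in Python as follows; the function name, the use of NumPy, and the default parameters are assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20, hop_ms=10):
    """Split a continuous waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples per shift
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```

With a 20 ms frame length and a 10 ms shift at 16 kHz, the frame counts match the figures given above: 100 ms of speech yields 9 frames, and 1000 ms yields 99 frames.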
In step 203, the waveform of each audio frame is pre-emphasized (high-frequency boost) and converted into a fast Fourier transform (FFT) spectrum; the FFT spectrum is divided into 24 equally spaced subbands on the Mel scale and the energy of each subband is extracted (of course, another number of subbands, for example 36, may also be used); the subband energies are converted into decibels; and a discrete cosine transform (DCT) is applied to obtain Mel-frequency cepstral coefficient (MFCC) features. In another embodiment, the acoustic feature (for example MFCC) can be extracted in a different way; and in yet another embodiment, an acoustic feature other than MFCC can be extracted as the comparison parameter.
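The per-frame feature extraction of step 203 can be sketched as follows. This is an illustrative simplification, not the patented implementation: the pre-emphasis coefficient, the rectangular (rather than triangular) Mel subbands, and the number of cepstral coefficients retained are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sample_rate=16000, n_bands=24, n_ceps=12):
    """Sketch of step 203: pre-emphasis, FFT power spectrum,
    Mel-spaced subband energies in decibels, then a DCT."""
    # Pre-emphasis (high-frequency boost)
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spectrum = np.abs(np.fft.rfft(emph)) ** 2
    # 24 equally spaced band edges on the Mel scale, 0 .. Nyquist
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = mel2hz(np.linspace(0, hz2mel(sample_rate / 2), n_bands + 1))
    bins = np.floor(edges / (sample_rate / 2) * (len(spectrum) - 1)).astype(int)
    energies = np.array([spectrum[bins[i]:bins[i + 1] + 1].sum()
                         for i in range(n_bands)])
    log_energies = 10 * np.log10(energies + 1e-10)   # convert to decibels
    return dct(log_energies, norm='ortho')[:n_ceps]  # DCT -> cepstral coeffs
```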
In steps 204-206, the measured sound signal is processed in the same way as the standard sound signal in steps 201-203, finally yielding the MFCC features of the measured sound signal.
Steps 201-203 and steps 204-206 may or may not be performed simultaneously.
In step 207, the dynamic time warping (DTW) algorithm is used to align the measured audio frames with the standard audio frames, obtaining the correspondence between each measured audio frame and each standard audio frame.
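A minimal DTW alignment, as used in step 207, can be sketched as follows. The quadratic dynamic-programming formulation is standard; the Euclidean frame distance and the function interface are assumptions for illustration:

```python
import numpy as np

def dtw_align(test_feats, ref_feats):
    """Align two feature sequences (frames x dims) with dynamic time
    warping; return the list of (test_index, ref_index) correspondences."""
    n, m = len(test_feats), len(ref_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test_feats[i - 1] - ref_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack through the accumulated-cost matrix to recover the path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Two identical sequences align along the diagonal, i.e. frame k of the measured signal corresponds to frame k of the standard signal.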
In step 208, the energy contour of the measured sound signal is extracted, and the measured sound signal is divided into several segments (phonetically called syllables) at the energy troughs.
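The trough-based segmentation of step 208 can be sketched as below. The relative energy threshold used to decide what counts as a trough is an illustrative assumption, not a value taken from the patent:

```python
import numpy as np

def split_at_energy_troughs(frames, rel_threshold=0.1):
    """Sketch of step 208: frames whose energy falls below a fraction of
    the peak energy are treated as troughs, and each run of frames between
    troughs becomes one 'frame block' (half-open (start, end) index pair)."""
    energy = (frames ** 2).sum(axis=1)
    voiced = energy >= rel_threshold * energy.max()
    blocks, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                    # a block begins
        elif not v and start is not None:
            blocks.append((start, i))    # a trough ends the block
            start = None
    if start is not None:
        blocks.append((start, len(frames)))
    return blocks
```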
In step 209, the MFCCs of the frames in a frame block of the measured sound signal are spliced into a sequence of real numbers, the MFCCs of the corresponding standard sound signal are likewise combined into a sequence of real numbers, and the correlation coefficient of the two sequences and/or the scoring output of a neural network is computed.
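The correlation-coefficient branch of step 209 can be sketched as follows (the neural-network alternative is omitted); the function name is an assumption:

```python
import numpy as np

def block_similarity(test_mfccs, ref_mfccs):
    """Splice each block's per-frame MFCC vectors into one real-number
    sequence and score the block by the Pearson correlation coefficient."""
    a = np.concatenate([m.ravel() for m in test_mfccs])
    b = np.concatenate([m.ravel() for m in ref_mfccs])
    return np.corrcoef(a, b)[0, 1]
```

A measured block identical to its standard counterpart scores 1.0; the less similar the spliced sequences, the lower the coefficient.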
In step 210, when the correlation coefficient is below a predetermined threshold, the pronunciation of the measured sound signal is considered inaccurate and the method proceeds to step 211; otherwise, the pronunciation of the measured sound signal is considered accurate and the method proceeds to step 212.
In step 213, the number of frame blocks judged accurate in step 212 is counted, and the proportion of accurate frame blocks in the total number of measured frame blocks is calculated.
In step 214, the proportion of accurately pronounced frame blocks in the total is converted into a mark, which can be fed back to the user. In one embodiment, a proportion greater than 90% earns full marks; a proportion less than 50% earns zero; and between 50% and 90%, the mark is obtained by linear interpolation.
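The mapping from accuracy proportion to mark in step 214 can be written out directly; the function name and the full-mark value of 100 are assumptions:

```python
def ratio_to_score(accurate_ratio, full=100):
    """Step 214 mapping: > 90% accurate blocks earns full marks,
    < 50% earns zero, and the mark is linearly interpolated between
    50% and 90%."""
    if accurate_ratio > 0.9:
        return full
    if accurate_ratio < 0.5:
        return 0
    # Linear interpolation over the 50%-90% interval
    return full * (accurate_ratio - 0.5) / 0.4
```

For example, a 70% accuracy proportion lies at the midpoint of the interpolated interval and maps to half marks.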
The present invention also provides a system for pronunciation evaluation, comprising:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal; and
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal.
The technical solutions of the embodiments of the invention overcome the defects of existing pronunciation evaluation methods by acoustically assessing the similarity between the measured sound signal and the standard sound signal to determine pronunciation quality. The approach is concise in form and simple to operate, and achieves language-independent assessment of pronunciation quality, so it offers better generality and ease of use.
The various embodiments provided by the invention may be combined with one another in any manner as required; the technical solutions obtained through such combinations also fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.

Claims (8)

1. A method for pronunciation evaluation, characterized in that it comprises the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity of said measured frame blocks with said standard frame blocks;
wherein, if A ≠ B, either the quality of said measured sound signal is determined to be unqualified, or the DTW algorithm is used to forcibly re-divide the A measured frame blocks into B measured frame blocks before carrying out said comparing; and
if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unqualified.
2. the method for claim 1 is characterized in that, further comprises:
From said standard audio frame signal, extract the standard audio characteristic information, said standard audio characteristic information is at least a in Mei Er frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template; With
From said actual measurement audio frame signal, extract actual measurement audio frequency characteristics information, said actual measurement audio frequency characteristics information for example is at least a in Mei Er frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template;
Wherein, saidly relatively comprise: relatively said actual measurement audio frequency characteristics information and said standard audio characteristic information.
3. The method of claim 1 or 2, characterized in that it further comprises:
obtaining a curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy troughs of the curve to form said A measured frame blocks; and/or
obtaining a curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy troughs of the curve to form said B standard frame blocks.
4. The method of claim 1 or 2, characterized in that it further comprises:
forming a measured audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of measured audio frames in a said measured frame block of said measured audio frame signal; and
forming a standard audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of standard audio frames in a said standard frame block of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and carrying out a similarity comparison between corresponding measured audio frame features and standard audio frame features in said measured audio frame feature sequence and said standard audio frame feature sequence;
said similarity comparison being carried out by at least one of a correlation coefficient, a support vector machine (SVM), and a multi-layer perceptron (MLP).
5. The method of claim 1 or 2, characterized in that
said evaluating the quality comprises:
when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
6. The method of claim 1 or 2, characterized in that it further comprises:
obtaining the quality score of said measured sound signal from the proportion of quality-qualified measured frame blocks in the total number of said measured frame blocks; or
obtaining the quality score of said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
7. The method of claim 1 or 2, characterized in that it further comprises:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
8. A system for pronunciation evaluation, characterized in that it comprises:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal;
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
a measured frame block generating device for forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames; and
a standard frame block generating device for forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity of said measured frame blocks with said standard frame blocks;
wherein, if A ≠ B, either the quality of said measured sound signal is determined to be unqualified, or the DTW algorithm is used to forcibly re-divide the A measured frame blocks into B measured frame blocks before carrying out said comparing; and
if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unqualified.
CN2011101527653A 2011-06-08 2011-06-08 Method and system for estimating pronunciation Expired - Fee Related CN102214462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101527653A CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation


Publications (2)

Publication Number Publication Date
CN102214462A CN102214462A (en) 2011-10-12
CN102214462B true CN102214462B (en) 2012-11-14

Family

ID=44745743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101527653A Expired - Fee Related CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation

Country Status (1)

Country Link
CN (1) CN102214462B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514764A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment system
CN103514765A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment method
CN104050964A (en) * 2014-06-17 2014-09-17 公安部第三研究所 Audio signal reduction degree detecting method and system
CN105609114B (en) * 2014-11-25 2019-11-15 科大讯飞股份有限公司 A kind of pronunciation detection method and device
CN104464726B (en) * 2014-12-30 2017-10-27 北京奇艺世纪科技有限公司 A kind of determination method and device of similar audio
CN105578115B (en) * 2015-12-22 2016-10-26 深圳市鹰硕音频科技有限公司 A kind of Network teaching method with Speech Assessment function and system
CN107368469A (en) * 2017-06-01 2017-11-21 广东外语外贸大学 A kind of Vietnamese teaching methods of marking and its Vietnamese learning platform applied
CN109801193B (en) * 2017-11-17 2020-09-15 深圳市鹰硕教育服务股份有限公司 Follow-up teaching system with voice evaluation function
CN107958673B (en) * 2017-11-28 2021-05-11 北京先声教育科技有限公司 Spoken language scoring method and device
CN108766415B (en) * 2018-05-22 2020-11-24 清华大学 Voice evaluation method
CN109104409A (en) * 2018-06-29 2018-12-28 康美药业股份有限公司 A kind of method for secret protection and system for health consultation platform
CN109493853B (en) * 2018-09-30 2022-03-22 福建星网视易信息系统有限公司 Method for determining audio similarity and terminal
CN109961802B (en) * 2019-03-26 2021-05-18 北京达佳互联信息技术有限公司 Sound quality comparison method, device, electronic equipment and storage medium
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
CN111951827B (en) * 2019-05-16 2022-12-06 上海流利说信息技术有限公司 Continuous reading identification correction method, device, equipment and readable storage medium
CN110211610A (en) * 2019-06-20 2019-09-06 平安科技(深圳)有限公司 Assess the method, apparatus and storage medium of audio signal loss
CN110648566A (en) * 2019-09-16 2020-01-03 中北大学 Singing teaching method and device
CN111986650B (en) * 2020-08-07 2024-02-27 云知声智能科技股份有限公司 Method and system for assisting voice evaluation by means of language identification
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1815522A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Method for testing mandarin level and guiding learning using computer
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
GB2458461A (en) * 2008-03-17 2009-09-23 Kai Yu Spoken language learning system
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech


Also Published As

Publication number Publication date
CN102214462A (en) 2011-10-12

Similar Documents

Publication Publication Date Title
CN102214462B (en) Method and system for estimating pronunciation
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
Shobaki et al. The OGI kids’ speech corpus and recognizers
US20100004931A1 (en) Apparatus and method for speech utterance verification
US20060074655A1 (en) Method and system for the automatic generation of speech features for scoring high entropy speech
Cohen et al. Vocal tract normalization in speech recognition: Compensating for systematic speaker variability
Chen et al. Applying rhythm features to automatically assess non-native speech
CN101785048A (en) hmm-based bilingual (mandarin-english) tts techniques
Wightman et al. The aligner: Text-to-speech alignment using Markov models
CN101246685A (en) Pronunciation quality evaluation method of computer auxiliary language learning system
CN103559892A (en) Method and system for evaluating spoken language
Rao Application of prosody models for developing speech systems in Indian languages
Shah et al. Effectiveness of PLP-based phonetic segmentation for speech synthesis
Middag et al. Robust automatic intelligibility assessment techniques evaluated on speakers treated for head and neck cancer
Droua-Hamdani et al. Speaker-independent ASR for modern standard Arabic: effect of regional accents
CN110047474A (en) A kind of English phonetic pronunciation intelligent training system and training method
Strik et al. Comparing classifiers for pronunciation error detection
Ullmann et al. Objective intelligibility assessment of text-to-speech systems through utterance verification
Cernak et al. On the (UN) importance of the contextual factors in HMM-based speech synthesis and coding
Pucher et al. Phonetic distance measures for speech recognition vocabulary and grammar optimization
Lin et al. Improving L2 English rhythm evaluation with automatic sentence stress detection
Patil et al. Acoustic features for detection of aspirated stops
Kyriakopoulos et al. Automatic characterisation of the pronunciation of non-native English speakers using phone distance features
Barczewska et al. Detection of disfluencies in speech signal
Slaney et al. Pitch-gesture modeling using subband autocorrelation change detection.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121114

Termination date: 20180608