CN102214462B - Method and system for estimating pronunciation - Google Patents


Publication number
CN102214462B
CN102214462B · CN2011101527653A
Authority
CN
China
Prior art keywords
actual measurement
standard
signal
audio frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011101527653A
Other languages
Chinese (zh)
Other versions
CN102214462A (en)
Inventor
赵璇
王鹰
黄玩惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING AISHUOBA TECHNOLOGY CO LTD
Original Assignee
BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority to CN2011101527653A
Publication of CN102214462A
Application granted
Publication of CN102214462B

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of computer-aided language teaching, and provides a method for estimating pronunciation. The method comprises the following steps: receiving a measured sound signal in a single language or in a plurality of languages; generating a measured audio frame signal from the measured sound signal; and comparing the measured audio frame signal with a standard audio frame signal to estimate the quality of the measured sound signal. The invention further discloses a system for estimating pronunciation. With the method and system provided by the invention, pronunciation quality can be estimated more accurately and effectively in a simple way.

Description

Method and system for pronunciation evaluation
Technical field
The present invention relates to the field of computer-assisted language learning, and more specifically to a method and system for pronunciation evaluation.
Background technology
Language is the instrument of human communication, and in today's increasingly internationalized world, mastering several languages is valued by more and more people. Against this background, a variety of computer-aided language learning approaches have emerged.
Patent 98103685.6 discloses a method of assessing the quality of a learner's pronunciation by means of phonetic symbols. Based on expert knowledge, the method specifies a number of common mispronunciation patterns; by contrasting the speaker's pronunciation with the standard patterns it obtains a score and information on whether the speaker's pronunciation is accurate, and thereby assesses the speaker's speech quality. The drawback of this method is that the error patterns must be specified in advance: if the speaker's error is not among the predefined patterns, the mispronunciation will most likely not be detected.
Patent 02160031.7 discloses a method of automatic pronunciation correction. The method measures the speaker's pronunciation level in four respects: articulation, pitch, intensity and duration. Its drawback is that the phonetic symbols of every sentence must be labeled manually, which requires a great amount of manpower. Because the method builds its models on phonetic symbols and scores speech quality by model probability, a separate phonetic-symbol model must be built for each language; it is therefore ill-suited to multilingual extension, let alone to sentences in which several languages are mixed.
Patent 200510107681.2 discloses a method of assessing speech with a phoneme recognizer. Because each phoneme must be modeled in advance, this method likewise cannot support multilingual pronunciation evaluation.
Likewise, patents 200510114848.8, 200710145859.1, 200810102076.X, 200810107118.9, 200810168514.2, 200810141036.6, 20081022675.2 and 200810240811.3 all, in essence, adopt a standard pronunciation model and score the speech under evaluation against it, differing only in the scoring algorithm. All such methods based on a standard pronunciation model are difficult to extend to multiple languages and cannot accurately assess unknown pronunciations in unknown languages. Yet in daily life it is increasingly common for Chinese and English to be mixed in spoken language, sometimes with two or more languages mixed within a single sentence. This gradually leaves traditional pronunciation evaluation methods, built on standard models of specific languages, at a loss.
None of the phonetic-symbol-based methods can describe liaison, the linking of sounds. When phonetic labeling is performed, linked and unlinked pronunciations receive identical phonetic labels, so such methods cannot assess whether phrases such as "a lot of" are correctly linked.
Nor can any phonetic-symbol-based method accurately judge how a nasal attaches within a word: for example, whether "any" is pronounced /a-ny/, /an-y/ or /an-ny/.
In summary, a new pronunciation evaluation approach is needed, in particular for language learning, that assesses speech quality more accurately and effectively in a simple way.
Summary of the invention
In view of the above problems in the prior art, the present invention provides a method and system for pronunciation evaluation that can assess speech quality more accurately and effectively in a simple way.
The invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
and if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
Preferably, in various embodiments of the present invention, the method further comprises:
extracting standard audio feature information from said standard audio frame signal, said standard audio feature information being at least one of Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns; and
extracting measured audio feature information from said measured audio frame signal, said measured audio feature information being, for example, at least one of Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns;
wherein said comparing comprises: comparing said measured audio feature information with said standard audio feature information.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining the curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy valleys of that curve to form said A measured frame blocks; and/or
obtaining the curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy valleys of that curve to form said B standard frame blocks.
Preferably, in various embodiments of the present invention, the method further comprises:
constructing a measured audio frame feature sequence from at least one of the Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns of the plurality of measured audio frames in said measured frame blocks of said measured audio frame signal;
constructing a standard audio frame feature sequence from at least one of the Mel frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients and temporal patterns of the plurality of standard audio frames in said standard frame blocks of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and performing a similarity comparison between each measured audio frame feature in said measured audio frame feature sequence and the corresponding standard audio frame feature in said standard audio frame feature sequence;
said similarity comparison being performed by at least one of a correlation coefficient, a support vector machine (SVM) and a multilayer perceptron (MLP).
Preferably, in various embodiments of the present invention, said evaluating of quality comprises: when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining a quality score for said measured sound signal from the ratio of the number of measured frame blocks of acceptable quality to the total number of said measured frame blocks; or
obtaining a quality score for said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
Preferably, various embodiments of the present invention further comprise:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
The invention further provides a system for pronunciation evaluation, comprising:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal;
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
a measured frame block generating device for forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
a standard frame block generating device for forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
and if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
Through the method and system for pronunciation evaluation provided by the invention, speech quality can be assessed more accurately and effectively in a simple way.
Description of drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can derive other embodiments and drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
Embodiment
The technical solutions of various embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the protected scope of the invention.
The present invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal.
As can be expected, the standard audio frame signal may be obtained from information stored in advance in a database, or obtained in real time; for example, the standard audio frame signal may be formed from a teacher's pronunciation and compared with a measured audio frame signal formed from a student's pronunciation.
Through the method and system for pronunciation evaluation provided by the invention, the acoustic comparison between the audio frames of the measured sound signal and those of the standard sound signal assesses the quality of the measured sound signal accurately and effectively in a simple way, for example whether the measured sound signal is accurate (whether its accuracy reaches a predetermined value). Moreover, because this acoustic assessment is text-independent, it can easily be applied to single-language and multilingual (that is, mixed-language) measured sound signals, for example a measured sound signal in which Chinese and English are mixed.
Preferably, in various embodiments of the present invention, the method further comprises:
extracting standard audio feature information from said standard audio frame signal; and
extracting measured audio feature information from said measured audio frame signal;
wherein said comparing comprises: comparing said measured audio feature information with said standard audio feature information.
In various embodiments of the present invention, various kinds of audio feature information may preferably be used for said comparison; for example, said standard audio feature information and measured audio feature information may each be at least one of the following kinds of spectral feature information (that is, a single kind of audio feature information of the following types, or a combination of several kinds):
Mel frequency cepstral coefficients (MFCC, Mel Frequency Cepstrum Coefficient),
perceptual linear prediction coefficients (PLP, Perceptual Linear Prediction),
line spectral frequency parameters (LSF, Line Spectral Frequency),
linear prediction coefficients (LPC, Linear Predictive Coefficient),
linear prediction cepstral coefficients (LPCC, Linear Prediction Cepstral Coefficient),
temporal patterns (TRAP, TempoRAl Patterns).
More preferably, PLP or TRAP may be adopted as the audio feature information used for said comparison.
Preferably, in various embodiments of the present invention, said comparing comprises: aligning said measured audio frame signal with said standard audio frame signal (with a one-to-one correspondence between their frame blocks) using the dynamic time warping (DTW, Dynamic Time Warping) algorithm, and comparing them.
Preferably, in various embodiments of the present invention, the method further comprises:
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity between said measured frame blocks and said standard frame blocks;
wherein, if A ≠ B, the quality of said measured sound signal is determined to be unacceptable, or the A measured frame blocks are forcibly re-divided into B measured frame blocks using the DTW algorithm before said comparing is performed;
preferably, if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unacceptable.
That is to say, if A = B, said comparing can be carried out directly; otherwise, the quality of said measured sound signal can be determined directly to be unacceptable, or, alternatively, the DTW algorithm can be used to forcibly re-divide the A measured frame blocks into B measured frame blocks before said comparing is performed to determine whether the quality of said measured sound signal is acceptable. Preferably, in one embodiment, if A ≥ 2B or B ≥ 2A, the measured sound signal can be considered to differ excessively from, or be entirely unlike, the standard sound signal, that is, their similarity is too low or they are dissimilar; the quality of the measured sound signal can then be determined directly to be unacceptable.
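The block-count rule described here can be sketched as a small decision function. The function name and the returned labels are illustrative only and are not part of the patent:

```python
def block_count_decision(a, b):
    """Decide how to proceed given A measured frame blocks and B standard
    frame blocks, following the rule sketched in the text."""
    if a < 2 or b < 2:
        raise ValueError("A and B must be integers greater than 1")
    if a >= 2 * b or b >= 2 * a:
        # Block counts differ too much: the signals are judged dissimilar
        # and the measured signal's quality is unacceptable.
        return "reject"
    if a != b:
        # Either reject outright, or forcibly re-divide the A measured
        # blocks into B blocks with DTW before comparing.
        return "force-divide-or-reject"
    return "compare"
```

For example, five measured blocks against five standard blocks proceed directly to comparison, while four against eight are rejected outright.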
To implement the forced division described here, the B standard frame blocks must be formed first; once the value of B is known, the forced division is carried out to obtain B measured frame blocks. The method is as follows: the DTW algorithm is used to align the measured frame features with the standard frame features, yielding a frame-to-frame correspondence between the two; the boundaries of the B measured frame blocks can then be determined from the boundaries of the B standard frame blocks.
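The forced division can be sketched with a classic DTW alignment followed by boundary mapping. All names are illustrative, and the per-frame distance is left as a parameter; this is a minimal sketch, not the patent's implementation:

```python
def dtw_path(xs, ys, dist):
    """Classic DTW: returns an optimal alignment path [(i, j), ...]
    between frame sequences xs and ys under the frame distance `dist`."""
    n, m = len(xs), len(ys)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(xs[i - 1], ys[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min([(cost[i - 1][j - 1], i - 1, j - 1),
                       (cost[i - 1][j], i - 1, j),
                       (cost[i][j - 1], i, j - 1)])
    path.reverse()
    return path

def forced_block_boundaries(path, std_boundaries):
    """Map standard-side block boundaries (frame indices in ys) to
    measured-side boundaries (frame indices in xs) via the DTW path."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(j, i)
    return [first_match[j] for j in std_boundaries]
```

With a toy one-dimensional example, aligning [0, 0, 5, 5, 9] against [0, 5, 9] maps the standard boundaries 0, 1, 2 to measured boundaries 0, 2, 4.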
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining the curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy valleys of that curve to form said A measured frame blocks; and/or
obtaining the curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy valleys of that curve to form said B standard frame blocks.
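The energy-valley segmentation can be sketched as follows. The `floor` threshold is an illustrative assumption added here so that small dips inside a syllable do not create spurious blocks; the patent itself only states that the signal is divided at its energy valleys:

```python
def split_at_energy_valleys(energies, floor=0.0):
    """Split a per-frame energy curve into frame blocks at its local
    minima (energy valleys). Returns the start index of each block."""
    starts = [0]
    for i in range(1, len(energies) - 1):
        if (energies[i] <= floor
                and energies[i] < energies[i - 1]
                and energies[i] <= energies[i + 1]):
            starts.append(i)
    return starts
```

Consecutive start indices then delimit the frame blocks: block k spans frames `starts[k]` up to (but not including) `starts[k + 1]`.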
Preferably, in various embodiments of the present invention, the method further comprises:
constructing a measured audio frame feature sequence from at least one of the Mel frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequency parameters (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC) and temporal patterns (TRAP) of the plurality of measured audio frames in said measured frame blocks of said measured audio frame signal;
constructing a standard audio frame feature sequence from at least one of the Mel frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequency parameters (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC) and temporal patterns (TRAP) of the plurality of standard audio frames in said standard frame blocks of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and performing a similarity comparison between each measured audio frame feature in said measured audio frame feature sequence and the corresponding standard audio frame feature in said standard audio frame feature sequence;
preferably, said similarity comparison is performed by at least one of a correlation coefficient, a support vector machine (SVM) and a multilayer perceptron (MLP). When needed, a Gaussian mixture model (GMM) may also be used to perform the similarity comparison.
Through the DTW algorithm, said measured audio frame feature sequence is aligned with said standard audio frame feature sequence, so that the elements of two sequences of unequal length, which would otherwise be difficult to compare, are put into one-to-one correspondence. Each feature pair with such a correspondence (that is, a measured audio frame feature and its corresponding standard audio frame feature) is then fed into a similarity comparator for similarity comparison.
In one embodiment, the similarity comparator can be realized with the correlation coefficient; that is, the correlation coefficient is adopted to compare the similarity between the measured audio frame signal and the standard audio frame signal:

$$f(X, Y) = \mathrm{COR}(X, Y) = \frac{\sum_{i=0}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=0}^{N}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=0}^{N}(Y_i - \bar{Y})^2}}$$

If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
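This comparator can be sketched directly from the formula. The threshold value of 0.8 below is illustrative only, since the patent leaves the threshold unspecified:

```python
from math import sqrt

def correlation_similarity(x, y):
    """Pearson correlation coefficient between two equal-length feature
    vectors, as used by the similarity comparator in the text."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sqrt(sum((xi - mx) ** 2 for xi in x))
           * sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den if den else 0.0

def is_similar(x, y, threshold=0.8):
    """Accept the pair when f(X, Y) >= threshold (0.8 is illustrative)."""
    return correlation_similarity(x, y) >= threshold
```

A perfectly proportional pair scores 1.0; a reversed pair scores negatively and is rejected.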
In one embodiment, to compare the similarity between the measured audio frame signal and the standard audio frame signal and finally obtain a sound signal quality score, at least one of the following classifiers may be adopted:
a support vector machine (SVM, Support Vector Machine),
a multilayer perceptron (MLP, Multi-Layer Perceptron),
a Gaussian mixture model (GMM, Gaussian Mixture Model).
In one embodiment, an SVM is adopted; that is, f(X, Y) = SVM([X; Y]) ∈ [-1, +1], where [X; Y] denotes splicing the two column vectors X and Y into one column vector, which is fed into the SVM classifier. If f(X, Y) ≥ 0, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
In a preferred embodiment, an MLP is adopted; that is, f(X, Y) = MLP([X; Y]) ∈ [0, 1], where [X; Y] denotes splicing the two column vectors X and Y into one column vector, which is fed into the MLP classifier. If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
In another embodiment, a GMM is adopted; that is,

$$f(X, Y) = \frac{\mathrm{GMM}_X(Y) + \mathrm{GMM}_Y(X)}{2}$$

where GMM_X denotes the GMM model estimated from X, GMM_X(Y) denotes the probability score of Y under the probability model of X, GMM_Y denotes the GMM model estimated from Y, and GMM_Y(X) denotes the probability score of X under the probability model of Y. If f(X, Y) ≥ threshold, X and Y are considered identical or sufficiently similar; otherwise X and Y are considered different or dissimilar.
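The symmetric-scoring idea can be illustrated with a single diagonal Gaussian per side, that is, a one-component degenerate "GMM". Both the single-component simplification and the averaging of the two cross scores are assumptions made for this sketch (the patent's original formula appears only as an image); a real implementation would fit a multi-component mixture:

```python
from math import log, pi

def gaussian_fit(frames):
    """Fit a diagonal single-component Gaussian (a degenerate GMM) to a
    list of feature vectors; returns per-dimension (means, variances)."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(d)]
    varis = [max(sum((f[k] - means[k]) ** 2 for f in frames) / n, 1e-6)
             for k in range(d)]
    return means, varis

def avg_log_likelihood(model, frames):
    """Average per-frame log-likelihood of `frames` under `model`."""
    means, varis = model
    total = 0.0
    for f in frames:
        for k in range(len(means)):
            total += -0.5 * (log(2 * pi * varis[k])
                             + (f[k] - means[k]) ** 2 / varis[k])
    return total / len(frames)

def symmetric_score(x_frames, y_frames):
    """f(X, Y) = (score of Y under X's model + score of X under Y's
    model) / 2; the averaging is an assumption of this sketch."""
    gx = gaussian_fit(x_frames)
    gy = gaussian_fit(y_frames)
    return 0.5 * (avg_log_likelihood(gx, y_frames)
                  + avg_log_likelihood(gy, x_frames))
```

Two sets of frames drawn from the same region of feature space score much higher than two sets drawn from distant regions, which is what the threshold test relies on.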
Preferably, in various embodiments of the present invention, said evaluating of quality comprises:
when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
Preferably, in various embodiments of the present invention, the method further comprises:
obtaining a quality score for said measured sound signal from the ratio of the number of measured frame blocks of acceptable quality to the total number of said measured frame blocks; or
obtaining a quality score for said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
In this way, the ratio of the accurate (or inaccurate) frames contained in a measured audio frame block to its total number of frames can be used to obtain the quality score of each frame block and of the measured sound signal; alternatively, the average quality of the measured audio frame blocks can be used as the quality score of the measured sound signal.
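The two scoring alternatives can be sketched as follows; the pass threshold of 0.8 is illustrative, as the patent does not fix its value:

```python
def score_by_pass_ratio(block_scores, threshold=0.8):
    """Quality score = fraction of frame blocks whose similarity reaches
    the (illustrative) threshold, i.e. blocks of acceptable quality."""
    passed = sum(1 for s in block_scores if s >= threshold)
    return passed / len(block_scores)

def score_by_mean(block_scores):
    """Alternative: quality score = average quality of all frame blocks."""
    return sum(block_scores) / len(block_scores)
```

For block similarities [0.9, 0.95, 0.6, 0.85], the pass-ratio score is 0.75 (three of four blocks pass) and the mean score is 0.825.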
Preferably, in various embodiments of the present invention, the method further comprises:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
In one embodiment, the positions of the mispronunciations (for example the positions of the mispronounced frame blocks) can be obtained from the parts of said measured sound signal determined to be inaccurate, and can be recorded.
In one embodiment, for the parts of said measured sound signal determined to be inaccurate, the counterpart parts of said standard sound signal can be output correspondingly, so that specific syllables, words or phrases can be compared as needed in order to correct mispronunciations. This can be used, for example, in language teaching, and is particularly useful where individual pronunciation errors are to be corrected with emphasis.
Preferably, in various embodiments of the present invention, the method further comprises:
determining the quality score of said measured sound signal from the proportion of said measured sound signal that is determined to be inaccurate.
In one embodiment, the sound signal quality score is obtained by calculating the proportion of mispronounced syllables, words or phrases.
In one embodiment, among the A measured frame blocks formed from said measured audio frame signal, the quality score is calculated from the number of accurate/inaccurate blocks among the A measured frame blocks.
In one embodiment, the frames are first converted into audio features, which are then aligned by DTW comparison to obtain the correspondence between measured sound frames and standard sound frames; each group of corresponding audio frame signals (one frame of standard sound combined with the corresponding frame of measured sound) is fed into a neural network for comparison to obtain an output result, or the correlation coefficient is computed directly to obtain the similarity.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
In steps 101-103, the measured audio frame signal obtained is divided into frames and A frame blocks are formed (each frame block may contain a plurality of frames), from which measured audio feature information (for example MFCC) can be extracted.
In steps 104-106, the standard audio frame signal obtained is divided into frames and B frame blocks are formed (each frame block may contain a plurality of frames), from which standard audio feature information (for example MFCC) can be extracted.
Here said A and B are integers greater than 1. If A = B (as in the embodiment shown in Fig. 1), the subsequent steps proceed; otherwise the measured sound signal is considered different from or dissimilar to the standard sound signal and its quality is considered unacceptable. Of course, the forced division described above may also be used to form B measured frame blocks (forcing a new A = B) for DTW alignment and comparison with the B standard frame blocks. Steps 101-103 and steps 104-106 may or may not be carried out simultaneously; however, when the forced division is adopted, steps 104-106 must be carried out before steps 101-103.
The similarity between the measured sound signal and the standard sound signal is obtained below by comparing the similarity between the measured frame blocks and the standard frame blocks.
In step 107, the measured audio frames are aligned with the standard audio frames.
In step 108, the measured frame blocks of the measured audio frame signal are aligned with the standard frame blocks of the standard audio frame signal.
Under the above alignment, the frame block similarity between the measured sound signal and the standard sound signal can be obtained, and the scores of the measured frame blocks are derived from it.
In step 109, the scores of the measured frame blocks of the measured audio frame signal are determined.
In step 110, the quality score of the measured sound signal is determined.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
In step 201, the standard sound signal is converted into a standard audio frame signal in 16 kHz, 16-bit pulse code modulation (PCM) format. Of course, in other embodiments the corresponding standard audio frame signal may be prepared in advance (for example stored in a database to be called up), in which case this conversion step can be omitted.
In step 202, the standard sound signal can be divided into audio frames (windows) of 25 milliseconds (ms), with a spacing of 10 ms between adjacent windows. Of course, in other embodiments a different window length (for example 20 ms) and/or a different spacing between adjacent windows (for example 5 ms) may be used. The sound signal is a continuous "waveform signal"; dividing it, for example, into frames of 20 ms length with a 10 ms shift yields the "audio frame signal": 100 ms of speech then becomes 9 audio frames, and 1000 ms of speech becomes 99 audio frames. The speech can further be divided at energy troughs into several "frame blocks"; for example, a 5-second sentence may be divided into 499 frames but contain only 5 syllables, so it is split into 5 frame blocks.
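As a minimal illustrative sketch (not part of the patent text itself), the framing of step 202 can be expressed in Python as follows; the function name, the use of NumPy, and the default parameters are assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20, hop_ms=10):
    """Split a continuous waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples per shift
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```

With a 20 ms frame length and a 10 ms shift at 16 kHz, the frame counts match the figures given above: 100 ms of speech yields 9 frames, and 1000 ms yields 99 frames.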
In step 203, the waveform of each audio frame is pre-emphasized (high-frequency boost) and converted into a fast Fourier transform (FFT) spectrum; the FFT spectrum is divided into 24 equally spaced subbands on the Mel scale and the energy of each subband is extracted (of course, another number of subbands, for example 36, may also be used); the subband energies are converted into decibels; and a discrete cosine transform (DCT) is applied to obtain Mel-frequency cepstral coefficient (MFCC) features. In another embodiment, the acoustic feature (for example MFCC) can be extracted in a different way; and in yet another embodiment, an acoustic feature other than MFCC can be extracted as the comparison parameter.
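The per-frame feature extraction of step 203 can be sketched as follows. This is an illustrative simplification, not the patented implementation: the pre-emphasis coefficient, the rectangular (rather than triangular) Mel subbands, and the number of cepstral coefficients retained are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sample_rate=16000, n_bands=24, n_ceps=12):
    """Sketch of step 203: pre-emphasis, FFT power spectrum,
    Mel-spaced subband energies in decibels, then a DCT."""
    # Pre-emphasis (high-frequency boost)
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spectrum = np.abs(np.fft.rfft(emph)) ** 2
    # 24 equally spaced band edges on the Mel scale, 0 .. Nyquist
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = mel2hz(np.linspace(0, hz2mel(sample_rate / 2), n_bands + 1))
    bins = np.floor(edges / (sample_rate / 2) * (len(spectrum) - 1)).astype(int)
    energies = np.array([spectrum[bins[i]:bins[i + 1] + 1].sum()
                         for i in range(n_bands)])
    log_energies = 10 * np.log10(energies + 1e-10)   # convert to decibels
    return dct(log_energies, norm='ortho')[:n_ceps]  # DCT -> cepstral coeffs
```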
In steps 204-206, the measured sound signal is processed in the same way as the standard sound signal in steps 201-203, finally yielding the MFCC features of the measured sound signal.
Steps 201-203 and steps 204-206 may or may not be performed simultaneously.
In step 207, the dynamic time warping (DTW) algorithm is used to align the measured audio frames with the standard audio frames, obtaining the correspondence between each measured audio frame and each standard audio frame.
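A minimal DTW alignment, as used in step 207, can be sketched as follows. The quadratic dynamic-programming formulation is standard; the Euclidean frame distance and the function interface are assumptions for illustration:

```python
import numpy as np

def dtw_align(test_feats, ref_feats):
    """Align two feature sequences (frames x dims) with dynamic time
    warping; return the list of (test_index, ref_index) correspondences."""
    n, m = len(test_feats), len(ref_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test_feats[i - 1] - ref_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack through the accumulated-cost matrix to recover the path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Two identical sequences align along the diagonal, i.e. frame k of the measured signal corresponds to frame k of the standard signal.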
In step 208, the energy contour of the measured sound signal is extracted, and the measured sound signal is divided into several segments (phonetically called syllables) at the energy troughs.
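The trough-based segmentation of step 208 can be sketched as below. The relative energy threshold used to decide what counts as a trough is an illustrative assumption, not a value taken from the patent:

```python
import numpy as np

def split_at_energy_troughs(frames, rel_threshold=0.1):
    """Sketch of step 208: frames whose energy falls below a fraction of
    the peak energy are treated as troughs, and each run of frames between
    troughs becomes one 'frame block' (half-open (start, end) index pair)."""
    energy = (frames ** 2).sum(axis=1)
    voiced = energy >= rel_threshold * energy.max()
    blocks, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                    # a block begins
        elif not v and start is not None:
            blocks.append((start, i))    # a trough ends the block
            start = None
    if start is not None:
        blocks.append((start, len(frames)))
    return blocks
```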
In step 209, the MFCCs of the frames in a frame block of the measured sound signal are spliced into a sequence of real numbers, the MFCCs of the corresponding standard sound signal are likewise combined into a sequence of real numbers, and the correlation coefficient of the two sequences and/or the scoring output of a neural network is computed.
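The correlation-coefficient branch of step 209 can be sketched as follows (the neural-network alternative is omitted); the function name is an assumption:

```python
import numpy as np

def block_similarity(test_mfccs, ref_mfccs):
    """Splice each block's per-frame MFCC vectors into one real-number
    sequence and score the block by the Pearson correlation coefficient."""
    a = np.concatenate([m.ravel() for m in test_mfccs])
    b = np.concatenate([m.ravel() for m in ref_mfccs])
    return np.corrcoef(a, b)[0, 1]
```

A measured block identical to its standard counterpart scores 1.0; the less similar the spliced sequences, the lower the coefficient.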
In step 210, when the correlation coefficient is below a predetermined threshold, the pronunciation of the measured sound signal is considered inaccurate and the method proceeds to step 211; otherwise, the pronunciation of the measured sound signal is considered accurate and the method proceeds to step 212.
In step 213, the number of frame blocks judged accurate in step 212 is counted, and the proportion of accurate frame blocks in the total number of measured frame blocks is calculated.
In step 214, the proportion of accurately pronounced frame blocks in the total is converted into a mark, which can be fed back to the user. In one embodiment, a proportion greater than 90% earns full marks; a proportion less than 50% earns zero; and between 50% and 90%, the mark is obtained by linear interpolation.
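The mapping from accuracy proportion to mark in step 214 can be written out directly; the function name and the full-mark value of 100 are assumptions:

```python
def ratio_to_score(accurate_ratio, full=100):
    """Step 214 mapping: > 90% accurate blocks earns full marks,
    < 50% earns zero, and the mark is linearly interpolated between
    50% and 90%."""
    if accurate_ratio > 0.9:
        return full
    if accurate_ratio < 0.5:
        return 0
    # Linear interpolation over the 50%-90% interval
    return full * (accurate_ratio - 0.5) / 0.4
```

For example, a 70% accuracy proportion lies at the midpoint of the interpolated interval and maps to half marks.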
The present invention also provides a system for pronunciation evaluation, comprising:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal; and
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal.
The technical solutions of the embodiments of the invention overcome the defects of existing pronunciation evaluation methods by acoustically assessing the similarity between the measured sound signal and the standard sound signal to determine pronunciation quality. The approach is concise in form and simple to operate, and achieves language-independent assessment of pronunciation quality, so it offers better generality and ease of use.
The various embodiments provided by the invention may be combined with one another in any manner as required; the technical solutions obtained through such combinations also fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.

Claims (8)

1. A method for pronunciation evaluation, characterized in that it comprises the following steps:
receiving a measured sound signal in a single language or in multiple languages;
generating a measured audio frame signal from said measured sound signal;
comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity of said measured frame blocks with said standard frame blocks;
wherein, if A ≠ B, either the quality of said measured sound signal is determined to be unqualified, or the DTW algorithm is used to forcibly re-divide the A measured frame blocks into B measured frame blocks before carrying out said comparing; and
if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unqualified.
2. the method for claim 1 is characterized in that, further comprises:
From said standard audio frame signal, extract the standard audio characteristic information, said standard audio characteristic information is at least a in Mei Er frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template; With
From said actual measurement audio frame signal, extract actual measurement audio frequency characteristics information, said actual measurement audio frequency characteristics information for example is at least a in Mei Er frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template;
Wherein, saidly relatively comprise: relatively said actual measurement audio frequency characteristics information and said standard audio characteristic information.
3. The method of claim 1 or 2, characterized in that it further comprises:
obtaining a curve of the energy of said measured audio frame signal over time, and dividing said measured audio frame signal at the energy troughs of the curve to form said A measured frame blocks; and/or
obtaining a curve of the energy of said standard audio frame signal over time, and dividing said standard audio frame signal at the energy troughs of the curve to form said B standard frame blocks.
4. The method of claim 1 or 2, characterized in that it further comprises:
forming a measured audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of measured audio frames in a said measured frame block of said measured audio frame signal; and
forming a standard audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of standard audio frames in a said standard frame block of said standard audio frame signal;
wherein said comparing comprises: aligning said measured audio frame feature sequence with said standard audio frame feature sequence by the DTW algorithm, and carrying out a similarity comparison between corresponding measured audio frame features and standard audio frame features in said measured audio frame feature sequence and said standard audio frame feature sequence;
said similarity comparison being carried out by at least one of a correlation coefficient, a support vector machine (SVM), and a multi-layer perceptron (MLP).
5. The method of claim 1 or 2, characterized in that
said evaluating the quality comprises:
when the similarity between the measured audio feature information in said measured audio frame signal and the standard audio feature information in said standard audio frame signal is less than a predetermined threshold, determining that said measured sound signal is inaccurate; otherwise, determining that said measured sound signal is accurate.
6. The method of claim 1 or 2, characterized in that it further comprises:
obtaining the quality score of said measured sound signal from the proportion of quality-qualified measured frame blocks in the total number of said measured frame blocks; or
obtaining the quality score of said measured sound signal from the average quality of all measured frame blocks in said measured audio frame signal.
7. The method of claim 1 or 2, characterized in that it further comprises:
recording and/or outputting the parts of said measured sound signal that are determined to be inaccurate; and/or
for the parts of said measured sound signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of said standard sound signal.
8. A system for pronunciation evaluation, characterized in that it comprises:
a sound receiving device for receiving a measured sound signal in a single language or in multiple languages;
an audio frame generating device for generating a measured audio frame signal from said measured sound signal;
an evaluating device for comparing said measured audio frame signal with a standard audio frame signal to evaluate the quality of said measured sound signal;
a measured frame block generating device for forming A measured frame blocks in said measured audio frame signal, each measured frame block comprising one or more measured audio frames; and
a standard frame block generating device for forming B standard frame blocks in said standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein said A and B are integers greater than 1, and said comparing comprises: obtaining the similarity between said measured sound signal and said standard sound signal by comparing the similarity of said measured frame blocks with said standard frame blocks;
wherein, if A ≠ B, either the quality of said measured sound signal is determined to be unqualified, or the DTW algorithm is used to forcibly re-divide the A measured frame blocks into B measured frame blocks before carrying out said comparing; and
if A ≥ 2B or B ≥ 2A, the quality of said measured sound signal is determined to be unqualified.
CN2011101527653A 2011-06-08 2011-06-08 Method and system for estimating pronunciation Expired - Fee Related CN102214462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101527653A CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation


Publications (2)

Publication Number Publication Date
CN102214462A CN102214462A (en) 2011-10-12
CN102214462B true CN102214462B (en) 2012-11-14

Family

ID=44745743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101527653A Expired - Fee Related CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation

Country Status (1)

Country Link
CN (1) CN102214462B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514764A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment system
CN103514765A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment method
CN104050964A (en) * 2014-06-17 2014-09-17 公安部第三研究所 Audio signal reduction degree detecting method and system
CN105609114B (en) * 2014-11-25 2019-11-15 科大讯飞股份有限公司 A kind of pronunciation detection method and device
CN104464726B (en) * 2014-12-30 2017-10-27 北京奇艺世纪科技有限公司 A kind of determination method and device of similar audio
CN105578115B (en) * 2015-12-22 2016-10-26 深圳市鹰硕音频科技有限公司 A kind of Network teaching method with Speech Assessment function and system
CN107368469A (en) * 2017-06-01 2017-11-21 广东外语外贸大学 A kind of Vietnamese teaching methods of marking and its Vietnamese learning platform applied
CN109801193B (en) * 2017-11-17 2020-09-15 深圳市鹰硕教育服务股份有限公司 Follow-up teaching system with voice evaluation function
CN107958673B (en) * 2017-11-28 2021-05-11 北京先声教育科技有限公司 Spoken language scoring method and device
CN108766415B (en) * 2018-05-22 2020-11-24 清华大学 Voice evaluation method
CN109104409A (en) * 2018-06-29 2018-12-28 康美药业股份有限公司 A kind of method for secret protection and system for health consultation platform
CN109493853B (en) * 2018-09-30 2022-03-22 福建星网视易信息系统有限公司 Method for determining audio similarity and terminal
CN109961802B (en) * 2019-03-26 2021-05-18 北京达佳互联信息技术有限公司 Sound quality comparison method, device, electronic equipment and storage medium
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
CN111951827B (en) * 2019-05-16 2022-12-06 上海流利说信息技术有限公司 Continuous reading identification correction method, device, equipment and readable storage medium
CN110211610A (en) * 2019-06-20 2019-09-06 平安科技(深圳)有限公司 Assess the method, apparatus and storage medium of audio signal loss
CN110648566A (en) * 2019-09-16 2020-01-03 中北大学 Singing teaching method and device
CN111986650B (en) * 2020-08-07 2024-02-27 云知声智能科技股份有限公司 Method and system for assisting voice evaluation by means of language identification
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1815522A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Method for testing mandarin level and guiding learning using computer
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
GB2458461A (en) * 2008-03-17 2009-09-23 Kai Yu Spoken language learning system
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech


Also Published As

Publication number Publication date
CN102214462A (en) 2011-10-12

Similar Documents

Publication Publication Date Title
CN102214462B (en) Method and system for estimating pronunciation
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
Shobaki et al. The OGI kids’ speech corpus and recognizers
US20100004931A1 (en) Apparatus and method for speech utterance verification
US20060074655A1 (en) Method and system for the automatic generation of speech features for scoring high entropy speech
Cohen et al. Vocal tract normalization in speech recognition: Compensating for systematic speaker variability
Chen et al. Applying rhythm features to automatically assess non-native speech
CN101785048A (en) hmm-based bilingual (mandarin-english) tts techniques
Wightman et al. The aligner: Text-to-speech alignment using Markov models
CN101246685A (en) Pronunciation quality evaluation method of computer auxiliary language learning system
CN103559892A (en) Method and system for evaluating spoken language
Rao Application of prosody models for developing speech systems in Indian languages
Shah et al. Effectiveness of PLP-based phonetic segmentation for speech synthesis
Middag et al. Robust automatic intelligibility assessment techniques evaluated on speakers treated for head and neck cancer
Droua-Hamdani et al. Speaker-independent ASR for modern standard Arabic: effect of regional accents
CN110047474A (en) A kind of English phonetic pronunciation intelligent training system and training method
Strik et al. Comparing classifiers for pronunciation error detection
Ullmann et al. Objective intelligibility assessment of text-to-speech systems through utterance verification
Cernak et al. On the (UN) importance of the contextual factors in HMM-based speech synthesis and coding
Pucher et al. Phonetic distance measures for speech recognition vocabulary and grammar optimization
Lin et al. Improving L2 English rhythm evaluation with automatic sentence stress detection
Patil et al. Acoustic features for detection of aspirated stops
Kyriakopoulos et al. Automatic characterisation of the pronunciation of non-native English speakers using phone distance features
Barczewska et al. Detection of disfluencies in speech signal
Slaney et al. Pitch-gesture modeling using subband autocorrelation change detection.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121114

Termination date: 20180608