CN102214462A - Method and system for estimating pronunciation - Google Patents


Publication number
CN102214462A
Authority
CN
China
Prior art keywords
actual measurement
audio frame
standard
signal
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101527653A
Other languages
Chinese (zh)
Other versions
CN102214462B (en)
Inventor
赵璇
王鹰
黄玩惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING AISHUOBA TECHNOLOGY CO LTD
Original Assignee
BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING AISHUOBA TECHNOLOGY CO LTD filed Critical BEIJING AISHUOBA TECHNOLOGY CO LTD
Priority to CN2011101527653A priority Critical patent/CN102214462B/en
Publication of CN102214462A publication Critical patent/CN102214462A/en
Application granted granted Critical
Publication of CN102214462B publication Critical patent/CN102214462B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of computer-aided language teaching and provides a method for evaluating pronunciation. The method comprises the following steps: receiving a measured speech signal in a single language or in multiple languages; generating a measured audio-frame signal from the measured speech signal; and comparing the measured audio-frame signal with a standard audio-frame signal to evaluate the quality of the measured speech signal. The invention further discloses a system for evaluating pronunciation. With the method and system for evaluating pronunciation provided by the invention, the quality of pronunciation can be evaluated more accurately and effectively in a simple way.

Description

Method and system for pronunciation evaluation
Technical field
The present invention relates to the field of computer-assisted language learning, and more specifically to a method and system for pronunciation evaluation.
Background technology
Language is the instrument of human communication. In today's increasingly internationalized world, mastering several languages is valued by more and more people. Against this background, a variety of computer-assisted spoken-language learning approaches have emerged.
Patent 98103685.6 discloses a method of assessing the quality of a learner's pronunciation using phonetic symbols. Based on expert knowledge, the method specifies a number of common mispronunciation patterns; by contrasting the speaker's pronunciation with the standard patterns it obtains a score and determines whether the pronunciation is accurate, thereby assessing the speaker's speech quality. The drawback of this method is that the error patterns must be specified in advance: if the speaker's mistake is not among the predefined error patterns, the mispronunciation will most likely go undetected.
Patent 02160031.7 discloses a method of automatic pronunciation correction. The method measures the speaker's pronunciation level in four respects: articulation, pitch, loudness, and duration. Its drawback is that the phonetic transcription of every sentence must be annotated manually, which costs a great deal of labor. Because the method builds models on phonetic symbols and scores speech quality by model probability, a separate phonetic-symbol model must be built for each language; it therefore does not extend well to multiple languages, let alone support sentences that mix several languages.
Patent 200510107681.2 discloses a method of assessing speech with a phoneme recognizer. Because each phoneme must be modeled in advance, this method likewise cannot support multilingual pronunciation evaluation.
Similarly, patents 200510114848.8, 200710145859.1, 200810102076.X, 200810107118.9, 200810168514.2, 200810141036.6, 20081022675.2 and 200810240811.3 all, in essence, adopt a standard-pronunciation model and obtain a score by contrasting it with the speech under evaluation, thereby assessing the pronunciation level of the tested speech; they differ only in the scoring algorithm. Such methods based on standard-pronunciation models are all difficult to extend to multiple languages and cannot accurately assess unknown pronunciations in unknown languages. Yet in daily life it is increasingly common for Chinese and English to be mixed in people's speech, sometimes with two or more languages mixed within a single sentence. This leaves traditional pronunciation-evaluation methods based on language-specific standard models increasingly at a loss.
Moreover, no phonetic-symbol-based method can describe liaison (linked pronunciation). When phonetic symbols are annotated, a linked reading is annotated identically to an unlinked one, so such methods cannot assess whether phrases such as "a lot of" are linked correctly.
Nor can any phonetic-symbol-based method accurately judge how a nasal attaches within a word. For example, is "any" pronounced /a-ny/, /an-y/, or /an-ny/?
In summary, a new pronunciation-evaluation approach is needed, particularly for language learning, that assesses speech quality more accurately and effectively in a simple way.
Summary of the invention
In view of the above problems in the prior art, the invention provides a method and system for pronunciation evaluation that can assess speech quality more accurately and effectively in a simple way.
The invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured speech signal in a single language or in multiple languages;
generating a measured audio-frame signal from the measured speech signal;
comparing the measured audio-frame signal with a standard audio-frame signal to evaluate the quality of the measured speech signal.
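For illustration, the three steps above might be sketched as follows. The 400-sample frame length and 160-sample shift correspond to 25 ms windows with a 10 ms shift at an assumed 16 kHz sample rate, and `compare` is a placeholder for whichever comparison scheme is chosen; none of these specifics are fixed by the claims.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def evaluate(measured, standard, compare):
    """The three claimed steps: frame both signals, then compare the frames."""
    return compare(frame_signal(measured), frame_signal(standard))
```

Comparing a signal against itself with an absolute-difference `compare` returns zero, as expected.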
Preferably, in various embodiments of the invention, the method further comprises:
extracting standard audio feature information from the standard audio-frame signal, the standard audio feature information being, for example, at least one of Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequencies, linear prediction coefficients, linear prediction cepstral coefficients, and temporal patterns; and
extracting measured audio feature information from the measured audio-frame signal, the measured audio feature information being, for example, at least one of Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequencies, linear prediction coefficients, linear prediction cepstral coefficients, and temporal patterns;
wherein the comparing comprises comparing the measured audio feature information with the standard audio feature information.
Preferably, in various embodiments of the invention, the comparing comprises using a dynamic time warping (DTW) algorithm to put the measured audio-frame signal into correspondence with the standard audio-frame signal and then comparing them.
Preferably, in various embodiments of the invention, the method further comprises:
forming A measured frame blocks in the measured audio-frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in the standard audio-frame signal, each standard frame block comprising one or more standard audio frames;
wherein A and B are integers greater than 1, and the comparing comprises obtaining the similarity of the measured speech signal and the standard speech signal by comparing the similarity of the measured frame blocks and the standard frame blocks;
wherein, if A ≠ B, the quality of the measured speech signal is determined to be unqualified, or the DTW algorithm is used to forcibly divide the A measured frame blocks into B measured frame blocks before carrying out the comparison;
more preferably, if A ≥ 2B or B ≥ 2A, the quality of the measured speech signal is determined to be unqualified.
Preferably, in various embodiments of the invention, the method further comprises:
obtaining the curve of the energy of the measured audio-frame signal over time and splitting the measured audio-frame signal at the energy valleys of that curve to form the A measured frame blocks; and/or
obtaining the curve of the energy of the standard audio-frame signal over time and splitting the standard audio-frame signal at the energy valleys of that curve to form the B standard frame blocks.
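A minimal sketch of this energy-valley segmentation follows, under assumptions the patent does not fix: squared-sample energy per frame, and a relative threshold of 0.1 for deciding which local minima count as valleys.

```python
import numpy as np

def split_at_energy_valleys(frames, rel_threshold=0.1):
    """Split a sequence of audio frames into frame blocks at energy valleys.

    frames: (num_frames, frame_len) array. A frame is treated as a valley
    when its energy is a local minimum and falls below rel_threshold times
    the maximum frame energy. Returns a list of (start, end) frame-index
    ranges, one per frame block.
    """
    energy = (frames.astype(float) ** 2).sum(axis=1)
    floor = rel_threshold * energy.max()
    valleys = [i for i in range(1, len(energy) - 1)
               if energy[i] <= energy[i - 1] and energy[i] <= energy[i + 1]
               and energy[i] < floor]
    bounds = [0] + valleys + [len(energy)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

A loud-quiet-loud frame sequence thus splits into two blocks at the quiet frame.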
Preferably, in various embodiments of the invention, the method further comprises:
forming a measured audio-frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequencies, linear prediction coefficients, linear prediction cepstral coefficients, and temporal patterns of the measured audio frames in the measured frame blocks of the measured audio-frame signal;
forming a standard audio-frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequencies, linear prediction coefficients, linear prediction cepstral coefficients, and temporal patterns of the standard audio frames in the standard frame blocks of the standard audio-frame signal;
wherein the comparing comprises aligning the measured audio-frame feature sequence with the standard audio-frame feature sequence by the DTW algorithm, and carrying out a similarity comparison between corresponding measured and standard audio-frame features in the two sequences;
more preferably, the similarity comparison is carried out by at least one of a correlation coefficient, a support vector machine (SVM), and a multilayer perceptron (MLP).
Preferably, in various embodiments of the invention, the quality evaluation comprises: when the similarity between the measured audio feature information in the measured audio-frame signal and the standard audio feature information in the standard audio-frame signal is less than a predetermined threshold, determining that the measured speech signal is inaccurate; otherwise, determining that the measured speech signal is accurate.
Preferably, in various embodiments of the invention, the method further comprises:
obtaining the quality score of the measured speech signal from the ratio of the number of measured frame blocks of qualified quality to the total number of measured frame blocks; or
obtaining the quality score of the measured speech signal from the mean quality of all measured frame blocks in the measured audio-frame signal.
Preferably, various embodiments of the invention further comprise:
recording and/or outputting the parts of the measured speech signal that are determined to be inaccurate; and/or
for the parts of the measured speech signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of the standard speech signal.
The invention also provides a system for pronunciation evaluation, comprising:
a sound receiving device for receiving a measured speech signal in a single language or in multiple languages;
an audio-frame generating device for generating a measured audio-frame signal from the measured speech signal; and
an evaluating device for comparing the measured audio-frame signal with a standard audio-frame signal to evaluate the quality of the measured speech signal.
With the method and system for pronunciation evaluation provided by the invention, speech quality can be assessed more accurately and effectively in a simple way.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention, or of the prior art, more clearly, the drawings needed in their description are introduced briefly below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art may derive other embodiments and drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
Embodiment
The technical solutions of the various embodiments of the invention are described below clearly and completely with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the scope protected by the invention.
The invention provides a method for pronunciation evaluation, comprising the following steps:
receiving a measured speech signal in a single language or in multiple languages;
generating a measured audio-frame signal from the measured speech signal;
comparing the measured audio-frame signal with a standard audio-frame signal to evaluate the quality of the measured speech signal.
It is conceivable that the standard audio-frame signal may be obtained from information stored in advance in a database, or obtained in real time; for example, the standard audio-frame signal may be formed from a teacher's pronunciation and compared with a measured audio-frame signal formed from a student's pronunciation.
With the method and system for pronunciation evaluation provided by the invention, the acoustic comparison of the audio frames of the measured speech signal and the standard speech signal assesses the speech quality of the measured speech signal accurately, effectively, and simply (for example, whether the measured speech signal is accurate, that is, whether its accuracy reaches a predetermined value). Moreover, because this acoustic assessment is text-independent, it is easily applied to the assessment of single-language and multilingual (that is, mixed-language) measured speech signals, for example a measured speech signal mixing Chinese and English.
Preferably, in various embodiments of the invention, the method further comprises:
extracting standard audio feature information from the standard audio-frame signal; and
extracting measured audio feature information from the measured audio-frame signal;
wherein the comparing comprises comparing the measured audio feature information with the standard audio feature information.
In various embodiments of the invention, various kinds of audio feature information can preferably be used for the comparison. For example, the standard audio feature information and the measured audio feature information can each be at least one of the following spectral features (that is, a single type of audio feature information of the following types, or a combination of several, can be used):
Mel-frequency cepstral coefficients (MFCC, Mel Frequency Cepstrum Coefficient),
perceptual linear prediction coefficients (PLP, Perceptual Linear Prediction),
line spectral frequencies (LSF, Line Spectral Frequency),
linear prediction coefficients (LPC, Linear Predictive Coefficient),
linear prediction cepstral coefficients (LPCC, Linear Prediction Cepstral Coefficient),
temporal patterns (TRAP, TempoRAl Patterns).
More preferably, PLP or TRAP can be adopted as the audio feature information for the comparison.
Preferably, in various embodiments of the invention, the comparing comprises aligning the measured audio-frame signal with the standard audio-frame signal (with their frame blocks in one-to-one correspondence) using a dynamic time warping (DTW, dynamic time warping) algorithm and then comparing them.
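The alignment step can be illustrated with a textbook DTW recursion. The Euclidean frame distance and the standard three-way step pattern are assumptions; the patent does not fix them.

```python
import numpy as np

def dtw_align(a, b):
    """Dynamic time warping between two feature sequences a and b.

    Each row is one frame's feature vector. Returns the total alignment
    cost and the warping path as (i, j) frame pairs, so every measured
    frame is matched to a standard frame before comparison.
    """
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the frame-to-frame correspondence.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return D[m, n], path[::-1]
```

Identical sequences align along the diagonal with zero cost; unequal-length sequences are stretched so every frame gets a partner.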
Preferably, in various embodiments of the invention, the method further comprises:
forming A measured frame blocks in the measured audio-frame signal, each measured frame block comprising one or more measured audio frames;
forming B standard frame blocks in the standard audio-frame signal, each standard frame block comprising one or more standard audio frames;
wherein A and B are integers greater than 1, and the comparing comprises obtaining the similarity of the measured speech signal and the standard speech signal by comparing the similarity of the measured frame blocks and the standard frame blocks;
wherein, if A ≠ B, the quality of the measured speech signal is determined to be unqualified, or the DTW algorithm is used to forcibly divide the A measured frame blocks into B measured frame blocks before carrying out the comparison;
preferably, if A ≥ 2B or B ≥ 2A, the quality of the measured speech signal is determined to be unqualified.
That is to say, if A = B, the comparison can be carried out directly; otherwise, the quality of the measured speech signal can be determined directly to be unqualified, or, alternatively, the DTW algorithm can be used to forcibly divide the A measured frame blocks into B measured frame blocks before carrying out the comparison to determine whether the quality of the measured speech signal is qualified. Preferably, in one embodiment, if A ≥ 2B or B ≥ 2A, the difference between the measured speech signal and the standard speech signal can be considered excessive (that is, the similarity is too low, or the signals are simply different), so the quality of the measured speech signal can be determined directly to be unqualified.
To realize the forced division described here, the B standard frame blocks must be formed first; with the value of B known, the forced division is then carried out to obtain B measured frame blocks. The method is as follows: use the DTW algorithm to align the measured frame features with the standard frame features to obtain the frame-to-frame correspondence between the two; the boundaries of the B measured frame blocks can then be determined from the boundaries of the B standard frame blocks.
Preferably, in various embodiments of the invention, the method further comprises:
obtaining the curve of the energy of the measured audio-frame signal over time and splitting the measured audio-frame signal at the energy valleys of that curve to form the A measured frame blocks; and/or
obtaining the curve of the energy of the standard audio-frame signal over time and splitting the standard audio-frame signal at the energy valleys of that curve to form the B standard frame blocks.
Preferably, in various embodiments of the invention, the method further comprises:
forming a measured audio-frame feature sequence from at least one of the Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequencies (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), and temporal patterns (TRAP) of the measured audio frames in the measured frame blocks of the measured audio-frame signal;
forming a standard audio-frame feature sequence from at least one of the Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), line spectral frequencies (LSF), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), and temporal patterns (TRAP) of the standard audio frames in the standard frame blocks of the standard audio-frame signal;
wherein the comparing comprises aligning the measured audio-frame feature sequence with the standard audio-frame feature sequence by the DTW algorithm, and carrying out a similarity comparison between corresponding measured and standard audio-frame features in the two sequences.
Preferably, the similarity comparison is carried out by at least one of a correlation coefficient, a support vector machine (SVM), and a multilayer perceptron (MLP). When needed, a Gaussian mixture model (GMM) can also be used for the similarity comparison.
By aligning the measured audio-frame feature sequence with the standard audio-frame feature sequence via the DTW algorithm, the elements of two sequences of unequal length, which might otherwise be difficult to compare, are placed in one-to-one correspondence. Each feature pair with such a correspondence (that is, a measured audio-frame feature and its corresponding standard audio-frame feature) is fed into a similarity comparator for comparison.
In one embodiment, the similarity comparator can be realized with a correlation coefficient; the correlation coefficient is adopted to compare the similarity of the measured audio-frame signal and the standard audio-frame signal, that is:

f(X, Y) = \mathrm{COR}(X, Y) = \frac{\sum_{i=0}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=0}^{N}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=0}^{N}(Y_i - \bar{Y})^2}}

If f(X, Y) ≥ threshold, X is considered identical to Y or sufficiently similar; otherwise X is considered different from Y or dissimilar.
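A sketch of this correlation-coefficient comparator follows; the 0.6 threshold is purely illustrative, since the patent leaves the threshold value open.

```python
import numpy as np

def correlation_similarity(x, y, threshold=0.6):
    """Pearson correlation between a measured and a standard feature vector.

    Returns (score, same), where same is True when the vectors are judged
    identical or sufficiently similar. The threshold is an assumed value.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    score = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    return score, score >= threshold
```

Proportional vectors score 1.0 and pass; reversed vectors score -1.0 and fail.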
In one embodiment, to compare the similarity of the measured audio-frame signal and the standard audio-frame signal and finally obtain the speech-signal quality score, at least one of the following classifiers can be adopted:
support vector machine (SVM),
multilayer perceptron (MLP),
Gaussian mixture model (GMM).
In one embodiment an SVM is adopted, that is, f(X, Y) = SVM([X; Y]) ∈ [-1, +1], where [X; Y] denotes splicing the two column vectors X and Y into a single column vector and feeding it into the SVM classifier. If f(X, Y) ≥ 0, X is considered identical to Y or sufficiently similar; otherwise X is considered different from Y or dissimilar.
In a preferred embodiment an MLP is adopted, that is, f(X, Y) = MLP([X; Y]) ∈ [0, 1], where [X; Y] denotes splicing the two column vectors X and Y into a single column vector and feeding it into the MLP classifier. If f(X, Y) ≥ threshold, X is considered identical to Y or sufficiently similar; otherwise X is considered different from Y or dissimilar.
In another embodiment a GMM is adopted, that is:

f(X, Y) = \frac{1}{2}\left(\mathrm{GMM}_X(Y) + \mathrm{GMM}_Y(X)\right)
Here GMM_X denotes the GMM model estimated from X, GMM_X(Y) denotes the probability score of Y under the probability model of X, GMM_Y denotes the GMM model estimated from Y, and GMM_Y(X) denotes the probability score of X under the probability model of Y. If f(X, Y) ≥ threshold, X is considered identical to Y or sufficiently similar; otherwise X is considered different from Y or dissimilar.
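As an illustration only, a single diagonal Gaussian (a one-component GMM) can stand in for the GMM comparator here; the symmetric averaging, the log-likelihood domain (used for numerical stability), and the -50 threshold are all assumptions not fixed by the patent.

```python
import numpy as np

def gauss_loglik(model_frames, test_frames):
    """Average log-likelihood of test_frames under a diagonal Gaussian
    estimated from model_frames (a one-component GMM, for illustration)."""
    mu = model_frames.mean(axis=0)
    var = model_frames.var(axis=0) + 1e-6  # floor to avoid division by zero
    z = (test_frames - mu) ** 2 / var
    ll = -0.5 * (z + np.log(2 * np.pi * var)).sum(axis=1)
    return ll.mean()

def gmm_similarity(x_frames, y_frames, threshold=-50.0):
    """Symmetric score: average of Y under X's model and X under Y's model."""
    f = 0.5 * (gauss_loglik(x_frames, y_frames)
               + gauss_loglik(y_frames, x_frames))
    return f, f >= threshold
```

A frame set compared with itself scores far higher than when compared with a shifted copy, which is the behavior the comparator relies on.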
Preferably, in various embodiments of the invention, the quality evaluation comprises:
when the similarity between the measured audio feature information in the measured audio-frame signal and the standard audio feature information in the standard audio-frame signal is less than a predetermined threshold, determining that the measured speech signal is inaccurate; otherwise, determining that the measured speech signal is accurate.
Preferably, in various embodiments of the invention, the method further comprises:
obtaining the quality score of the measured speech signal from the ratio of the number of measured frame blocks of qualified quality to the total number of measured frame blocks; or
obtaining the quality score of the measured speech signal from the mean quality of all measured frame blocks in the measured audio-frame signal.
In this way, the ratio of accurate (or inaccurate) frames contained in a measured audio frame block to the total number of frames can be used to obtain the quality score of each frame block and of the measured speech signal. The mean quality of the measured audio frame blocks can also be used as the quality score of the measured speech signal.
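The two scoring variants might be sketched as follows; the 0.6 qualification threshold is an assumed value.

```python
def quality_score(block_scores, threshold=0.6, mode="ratio"):
    """Overall quality score of the measured speech signal from per-block scores.

    mode="ratio": fraction of frame blocks whose score reaches threshold.
    mode="mean":  average of all block scores.
    The threshold is illustrative; the patent does not fix a value.
    """
    if mode == "ratio":
        return sum(s >= threshold for s in block_scores) / len(block_scores)
    return sum(block_scores) / len(block_scores)
```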
Preferably, in various embodiments of the invention, the method further comprises:
recording and/or outputting the parts of the measured speech signal that are determined to be inaccurate; and/or
for the parts of the measured speech signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of the standard speech signal.
In one embodiment, from the parts of the measured speech signal determined to be inaccurate, the positions of the mispronunciations (for example, the positions of the mispronounced frame blocks) can be obtained and recorded.
In one embodiment, at in described actual measurement voice signal, being confirmed as inaccurate part, can the counterpart of corresponding output in described standard voice signal, thereby can carry out the voice comparison to specific syllable, word or phrase as required, with the pronunciation that corrects a mistake promptly, for example can be used for language teaching, this is particularly useful under the situation of correcting individual voice mistake emphatically.
Preferably, in various embodiments of the invention, the method further comprises:
determining the quality score of the measured speech signal from the proportion of the measured speech signal that is determined to be inaccurate.
In one embodiment, the speech-signal quality score is obtained by calculating the proportion of mispronounced syllables, words, or phrases.
In one embodiment, among the A measured frame blocks formed from the measured audio-frame signal, the quality score is calculated from the number of accurate/inaccurate blocks among the A measured frame blocks.
In one embodiment, the signals are first converted frame by frame into audio features, which are then aligned and compared by DTW to obtain the correspondence between measured and standard sound frames; each pair of corresponding audio-frame signals (one standard frame combined with its corresponding measured frame) is fed into a neural network for comparison to obtain an output result, or the correlation coefficient is computed directly to obtain the similarity.
Fig. 1 is a schematic flowchart of a method for pronunciation evaluation according to an embodiment of the invention.
In steps 101-103, the obtained measured audio-frame signal is divided into frames and A frame blocks are formed (each frame block may contain several frames), from which measured audio feature information (for example MFCC) can be extracted.
In steps 104-106, the obtained standard audio-frame signal is divided into frames and B frame blocks are formed (each frame block may contain several frames), from which standard audio feature information (for example MFCC) can be extracted.
Here A and B are integers greater than 1. If A = B (as in the embodiment shown in Fig. 1), the subsequent steps proceed; otherwise the measured speech signal is considered different from, or dissimilar to, the standard speech signal and the speech quality is considered unqualified. Alternatively, the forced-division approach described above can be used to form B measured frame blocks (forcing a new A = B) that are DTW-aligned and compared with the B standard frame blocks. Steps 101-103 and 104-106 may or may not be carried out simultaneously; however, when the forced-division approach is adopted, steps 104-106 must be carried out before steps 101-103.
In the following, the similarity of the measured speech signal and the standard speech signal is obtained by comparing the similarity of the measured frame blocks and the standard frame blocks.
In step 107, the measured audio frames are aligned with the standard audio frames.
In step 108, the measured frame blocks of the measured audio-frame signal are aligned with the standard frame blocks of the standard audio-frame signal.
Under these alignment conditions, the frame-block similarity of the measured speech signal and the standard speech signal can be obtained, and from it the scores of the measured frame blocks.
In step 109, the scores of the measured frame blocks of the measured audio-frame signal are determined.
In step 110, the quality score of the measured speech signal is determined.
Fig. 2 is a schematic flowchart of a method for pronunciation evaluation according to another embodiment of the invention.
In step 201, the standard speech signal is converted into a standard audio frame signal in 16 kHz, 16-bit pulse code modulation (PCM) format. Of course, in other embodiments the corresponding standard audio frame signal may be prepared in advance (for example, stored in a database to be retrieved), in which case this conversion step can be omitted.
In step 202, the standard speech signal can be divided into audio frames (windows) of 25 milliseconds (ms), with a spacing of 10 ms between adjacent windows. Of course, in other embodiments a different window length (for example 20 ms) and/or a different spacing between adjacent windows (for example 5 ms) may be used. A speech signal is a continuous waveform signal; dividing that waveform into frames of, say, 20 ms length with a 10 ms shift yields the "audio frame signal", so that 100 ms of speech becomes 9 audio frames and 1000 ms of speech becomes 99 audio frames. The speech can further be divided into several "frame blocks" at its energy valleys; for example, a 5-second utterance may be divided into 499 frames yet contain only 5 syllables, and is therefore split into 5 frame blocks.
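The frame counts in this worked example follow directly from the frame length and shift; a short sketch (using NumPy; `frame_signal` is our illustrative name, not the patent's) reproduces them:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, hop_ms=10):
    """Slice a waveform into overlapping frames: frame_ms window, hop_ms shift."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

# The text's worked example uses 20 ms frames with a 10 ms shift at 16 kHz PCM:
sr = 16000
frames = frame_signal(np.zeros(sr // 10), sr, frame_ms=20, hop_ms=10)  # 100 ms of speech
# frames.shape[0] == 9, matching "100 ms of speech becomes 9 audio frames"
```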
In step 203, the waveform of each audio frame is high-frequency pre-emphasized and converted into a fast Fourier transform (FFT) spectrum. The FFT spectrum is divided into 24 subbands equally spaced on the Mel (MEL) scale (of course, another number of subbands, for example 36, is also possible), the energy of each subband is extracted and converted to decibels, and a discrete cosine transform (DCT) is applied to obtain Mel-frequency cepstral coefficient (MFCC) features. In another embodiment, a different method of extracting the acoustic features (for example MFCC) may be used; and in yet another embodiment, acoustic feature parameters other than MFCC may be extracted for the comparison.
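A simplified rendition of this feature pipeline (pre-emphasis, FFT power spectrum, 24 subband energies equally spaced on the mel scale, conversion to decibels, DCT) might look as follows. This is a rough illustration under our own assumptions (12 cepstral coefficients, a 0.97 pre-emphasis factor, rectangular subband summation), not the patent's exact implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_subbands=24, n_ceps=12, preemph=0.97):
    """One frame's MFCCs: pre-emphasis -> FFT power spectrum ->
    mel-spaced subband energies -> decibels -> DCT."""
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # high-frequency boost
    spec = np.abs(np.fft.rfft(x)) ** 2                         # FFT power spectrum
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    # 24 bands of equal width on the mel scale, each summed into a subband energy
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_subbands + 1))
    energies = np.array([spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12
                         for lo, hi in zip(edges[:-1], edges[1:])])
    log_e = 10.0 * np.log10(energies)                          # convert to decibels
    # Type-II discrete cosine transform of the log subband energies
    n = np.arange(n_subbands)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_subbands)))
                     for k in range(n_ceps)])
```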
In steps 204-206, the measured speech signal is processed similarly to the standard speech signal in steps 201-203, finally yielding the MFCC features of the measured speech signal.
Steps 201-203 and steps 204-206 may, but need not, be performed simultaneously.
In step 207, the dynamic time warping (DTW) algorithm is used to align the measured audio frames with the standard audio frames, obtaining the correspondence between each measured audio frame and each standard audio frame.
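DTW as used in step 207 is a standard algorithm; a minimal sketch that aligns two feature sequences and returns the frame-to-frame correspondence (the function name is ours) might be:

```python
import numpy as np

def dtw_align(ref, test):
    """Dynamic time warping: align two feature sequences (frames x dims)
    and return the list of (ref_frame, test_frame) correspondences."""
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path from the end to the start
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```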
In step 208, the energy track of the measured speech signal is extracted, and the measured speech signal is divided into several segments at the energy valleys (phonetically, these segments are called syllables).
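The energy-valley segmentation can be illustrated with a simple threshold-based sketch. The `valley_ratio` threshold is our assumption; the patent does not specify how a valley is detected:

```python
import numpy as np

def split_at_energy_valleys(frame_energies, valley_ratio=0.1):
    """Split a per-frame energy track into blocks at its low-energy valleys:
    one block per burst of energy (roughly one syllable)."""
    threshold = valley_ratio * np.max(frame_energies)
    active = frame_energies > threshold
    blocks, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                      # a burst of energy begins
        elif not on and start is not None:
            blocks.append((start, i))      # the burst ends at a valley
            start = None
    if start is not None:
        blocks.append((start, len(active)))
    return blocks
```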
In step 209, the MFCCs of the frames within a frame block of the measured speech signal are concatenated into a sequence of real numbers, the MFCCs of the corresponding standard speech signal are likewise concatenated into a sequence of real numbers, and the correlation coefficient of the two sequences and/or a neural network scoring output is computed.
In step 210, when the correlation coefficient is below a predetermined threshold, the pronunciation of the measured speech signal is considered inaccurate and the method proceeds to step 211; otherwise, the pronunciation is considered accurate and the method proceeds to step 212.
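The correlation test of steps 209-210 can be sketched as follows. The function name and the 0.5 default threshold are our assumptions; the patent only says the threshold is predetermined:

```python
import numpy as np

def block_is_accurate(measured_mfccs, standard_mfccs, threshold=0.5):
    """Concatenate the per-frame MFCCs of a measured block and of its aligned
    standard block into two real-number sequences, then compare their
    correlation coefficient against the threshold."""
    a = np.concatenate([np.ravel(m) for m in measured_mfccs])
    b = np.concatenate([np.ravel(m) for m in standard_mfccs])
    r = np.corrcoef(a, b)[0, 1]
    return r >= threshold, r
```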
In step 213, the number of frame blocks considered accurate in step 212 is counted, and the proportion of accurate frame blocks in the total number of measured frame blocks is calculated.
In step 214, according to the proportion of accurately pronounced frame blocks in the total number of measured frame blocks, the accuracy ratio is converted into a score, which can be fed back to the user. In one embodiment, a ratio greater than 90% earns full marks; a ratio less than 50% earns zero; and between 50% and 90% the score is obtained by linear interpolation.
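The example mapping of step 214 can be written out directly (assuming a 100-point scale for "full marks", which the patent does not specify):

```python
def ratio_to_score(accurate_ratio):
    """Map the proportion of accurate frame blocks to a score:
    >90% is full marks, <50% is zero, linear interpolation in between."""
    if accurate_ratio > 0.9:
        return 100.0
    if accurate_ratio < 0.5:
        return 0.0
    return 100.0 * (accurate_ratio - 0.5) / 0.4
```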
The invention further provides a system for pronunciation evaluation, comprising:
a sound receiving device, configured to receive a measured speech signal in a single language or in multiple languages;
an audio frame generating device, configured to generate a measured audio frame signal from the measured speech signal; and
an evaluating device, configured to compare the measured audio frame signal with a standard audio frame signal so as to evaluate the quality of the measured speech signal.
The technical solutions of the embodiments of the invention overcome the defects of existing pronunciation evaluation methods by assessing, acoustically, the similarity between the measured speech signal and the standard speech signal to determine pronunciation quality. The approach is concise in form and simple to operate, can realize language-independent pronunciation quality assessment, and therefore offers better generality and ease of use.
The various embodiments provided by the invention can be combined with one another in any manner as required; the technical solutions obtained by such combination also fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.

Claims (10)

1. A method for pronunciation evaluation, characterized by comprising the following steps:
receiving a measured speech signal in a single language or in multiple languages;
generating a measured audio frame signal from the measured speech signal; and
comparing the measured audio frame signal with a standard audio frame signal to evaluate the quality of the measured speech signal.
2. the method for claim 1 is characterized in that, further comprises:
Extract the standard audio characteristic information from described standard audio frame signal, described standard audio characteristic information for example is at least a in Mel frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template; With
Extract actual measurement audio frequency characteristics information from described actual measurement audio frame signal, described actual measurement audio frequency characteristics information for example is at least a in Mel frequency cepstral coefficient, sense of hearing linear predictor coefficient, line spectral frequencies parameter, linear predictor coefficient, linear prediction cepstrum coefficient, the sequential template;
Wherein, describedly relatively comprise: relatively described actual measurement audio frequency characteristics information and described standard audio characteristic information.
3. The method of claim 1 or 2, characterized in that
the comparing comprises: using the dynamic time warping (DTW) algorithm to bring the measured audio frame signal into correspondence with the standard audio frame signal and to compare them.
4. The method of any one of claims 1 to 3, characterized by further comprising:
forming A measured frame blocks in the measured audio frame signal, each measured frame block comprising one or more measured audio frames; and
forming B standard frame blocks in the standard audio frame signal, each standard frame block comprising one or more standard audio frames;
wherein A and B are integers greater than 1, and the comparing comprises: obtaining the similarity between the measured speech signal and the standard speech signal by comparing the similarity of the measured frame blocks with the standard frame blocks;
wherein, if A ≠ B, the quality of the measured speech signal is determined to be unqualified, or the DTW algorithm is used to forcibly re-divide the A measured frame blocks into B measured frame blocks before performing the comparison;
preferably, if A ≥ 2B or B ≥ 2A, the quality of the measured speech signal is determined to be unqualified.
5. The method of any one of claims 1 to 4, characterized by further comprising:
obtaining a curve of the energy of the measured audio frame signal over time, and dividing the measured audio frame signal at the energy valleys thereof to form the A measured frame blocks; and/or
obtaining a curve of the energy of the standard audio frame signal over time, and dividing the standard audio frame signal at the energy valleys thereof to form the B standard frame blocks.
6. The method of any one of claims 1 to 5, characterized by further comprising:
constituting a measured audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of measured audio frames in a measured frame block of the measured audio frame signal; and
constituting a standard audio frame feature sequence from at least one of the Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, line spectral frequency parameters, linear prediction coefficients, linear prediction cepstral coefficients, and temporal templates of a plurality of standard audio frames in a standard frame block of the standard audio frame signal;
wherein the comparing comprises: aligning the measured audio frame feature sequence with the standard audio frame feature sequence by the DTW algorithm, and performing a similarity comparison between corresponding measured audio frame features and standard audio frame features in the measured audio frame feature sequence and the standard audio frame feature sequence;
preferably, the similarity comparison is performed by at least one of a correlation coefficient, a support vector machine (SVM), and a multi-layer perceptron (MLP).
7. The method of any one of claims 1 to 6, characterized in that
the quality evaluation comprises:
determining that the measured speech signal is inaccurate when the similarity between the measured audio feature information in the measured audio frame signal and the standard audio feature information in the standard audio frame signal is less than a predetermined threshold; and otherwise determining that the measured speech signal is accurate.
8. The method of any one of claims 1 to 7, characterized by further comprising:
obtaining the quality score of the measured speech signal from the ratio of the number of quality-qualified measured frame blocks to the total number of measured frame blocks; or
obtaining the quality score of the measured speech signal from the average quality of all measured frame blocks in the measured audio frame signal.
9. The method of any one of claims 1 to 8, characterized by further comprising:
recording and/or outputting the parts of the measured speech signal that are determined to be inaccurate; and/or
for the parts of the measured speech signal that are determined to be inaccurate, correspondingly outputting the counterpart parts of the standard speech signal.
10. A system for pronunciation evaluation, characterized by comprising:
a sound receiving device, configured to receive a measured speech signal in a single language or in multiple languages;
an audio frame generating device, configured to generate a measured audio frame signal from the measured speech signal; and
an evaluating device, configured to compare the measured audio frame signal with a standard audio frame signal so as to evaluate the quality of the measured speech signal.
CN2011101527653A 2011-06-08 2011-06-08 Method and system for estimating pronunciation Expired - Fee Related CN102214462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101527653A CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation


Publications (2)

Publication Number Publication Date
CN102214462A true CN102214462A (en) 2011-10-12
CN102214462B CN102214462B (en) 2012-11-14

Family

ID=44745743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101527653A Expired - Fee Related CN102214462B (en) 2011-06-08 2011-06-08 Method and system for estimating pronunciation

Country Status (1)

Country Link
CN (1) CN102214462B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514765A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment method
CN103514764A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment system
CN104050964A (en) * 2014-06-17 2014-09-17 公安部第三研究所 Audio signal reduction degree detecting method and system
CN104464726A (en) * 2014-12-30 2015-03-25 北京奇艺世纪科技有限公司 Method and device for determining similar audios
CN105578115A (en) * 2015-12-22 2016-05-11 深圳市鹰硕音频科技有限公司 Network teaching method and system with voice assessment function
CN105609114A (en) * 2014-11-25 2016-05-25 科大讯飞股份有限公司 Method and device for detecting pronunciation
CN107368469A (en) * 2017-06-01 2017-11-21 广东外语外贸大学 A kind of Vietnamese teaching methods of marking and its Vietnamese learning platform applied
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108766415A (en) * 2018-05-22 2018-11-06 清华大学 A kind of voice assessment method
CN109104409A (en) * 2018-06-29 2018-12-28 康美药业股份有限公司 A kind of method for secret protection and system for health consultation platform
CN109493853A (en) * 2018-09-30 2019-03-19 福建星网视易信息系统有限公司 A kind of the determination method and terminal of audio similarity
WO2019095446A1 (en) * 2017-11-17 2019-05-23 深圳市鹰硕音频科技有限公司 Following teaching system having speech evaluation function
CN109961802A (en) * 2019-03-26 2019-07-02 北京达佳互联信息技术有限公司 Sound quality comparative approach, device, electronic equipment and storage medium
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
CN111951827A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Continuous reading identification correction method, device, equipment and readable storage medium
CN111986650A (en) * 2020-08-07 2020-11-24 云知声智能科技股份有限公司 Method and system for assisting speech evaluation by means of language identification
WO2020253054A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Method and apparatus for evaluating audio signal loss, and storage medium
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product
CN117612566A (en) * 2023-11-16 2024-02-27 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1815522A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Method for testing mandarin level and guiding learning using computer
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
GB2458461A (en) * 2008-03-17 2009-09-23 Kai Yu Spoken language learning system
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech



Also Published As

Publication number Publication date
CN102214462B (en) 2012-11-14


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121114

Termination date: 20180608