CN101650886A

CN101650886A - Method for automatically detecting reading errors of language learners

Info

Publication number: CN101650886A
Application number: CN200810224792A
Authority: CN
Inventors: 颜永红; 董滨; 刘常亮
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2008-12-26
Filing date: 2008-12-26
Publication date: 2010-02-17
Anticipated expiration: 2028-12-26
Also published as: CN101650886B

Abstract

The invention relates to a method for automatically detecting reading errors of language learners. The method comprises the following steps: training a multi-pronunciation model on a corpus of read speech; constructing a simplified search space from the reference answer, a pronunciation dictionary and the multi-pronunciation model; constructing a reading language model from the reference answer and linguistic knowledge; preprocessing the input speech, dividing it into frames and extracting acoustic features; using the Viterbi algorithm to search the simplified search space for the pronunciation path with the highest cumulative score under the acoustic model, the language model and the pronunciation model, and taking that path as the recognized pronunciation sequence; and aligning the recognized pronunciation sequence with the pronunciation sequence of the reference answer using a dynamic programming matching algorithm, thereby obtaining the extra-reading, omission and misreading results. Because a hidden Markov model is used as the acoustic model, the method requires no template speech, which greatly improves convenience of use, and it offers better performance and operating speed.

Description

A method for automatically detecting reading errors of language learners

Technical field

The invention belongs to the field of computer-assisted language learning. Specifically, it relates to a method for automatically detecting reading errors in the speech of a language learner reading aloud.

Background technology

In computer-assisted language learning, automatically detecting errors in a user's read speech and giving timely feedback is a vital component. When learners practice reading aloud, the errors that occur most readily are extra reading (insertion), omission, and misreading of characters or words. Earlier detection methods typically match the user's speech against standard speech used as a template and thereby locate the reading errors. The standard speech is usually a teacher's recording made in advance. The biggest drawback of this approach is that it requires standard speech as a template, so the learning material cannot be changed freely, which is very inconvenient in practice.

The development of automatic speech recognition based on hidden Markov models offers a new direction for computer-assisted language learning. Automatic speech recognition builds acoustic models of speech features through statistical analysis of large amounts of speech data. During recognition, the decoder uses the acoustic model to find, within a limited search space, the path that best matches the actual speech; that path is the recognition result. Speech recognition in the general sense solves the problem of converting speech into text, and its final output is a word sequence. Because the content of the speech is entirely unknown, it generally uses a very large vocabulary and a general-purpose language model so as to cover as many candidates as possible. The reading-error detection task, by contrast, must determine whether the user's speech matches the reference answer, and its final output is the position and type of each reading error. Here the content the user reads is known, so ordinary speech recognition methods are not suitable. On this basis, the invention provides a method for automatically detecting reading errors of language learners.

Summary of the invention

The object of the invention is to provide a method for automatically detecting reading errors of language learners. Building on automatic speech recognition with hidden Markov models, the method provides a new way to automatically detect extra reading, omission and misreading in a language learner's read speech. It does not require a teacher's standard speech as a template, which greatly improves convenience in practical use.

To achieve the above object, the reading-error detection method provided by the invention comprises the following steps:

1) Front-end processing: preprocess the input speech and extract MFCC features.

2) Constructing a simplified search space: construct the simplified search space from the reference answer (that is, the content the user is to read aloud), the pronunciation dictionary, the multi-pronunciation model and the acoustic model.

3) Constructing a reading language model: construct the user's reading language model from the reference answer. This language model describes the word contexts the user may produce when reading the reference sentence, together with their probabilities.

4) Search: in the search space, use the acoustic model, the reading language model and the multi-pronunciation model to find the path that best matches the input feature vector stream; this path is taken as the content the user actually read, that is, the recognition result sequence. During the search, the likelihood of each word W is calculated by the following formula:

P(W|O)＝P(O|V)·P(V|W)·P(W)

where O is the feature vector of the current input, P(W) is the language model probability of the current word W, P(V|W) is the probability of a given pronunciation V of the current word W, and P(O|V) is the acoustic model probability of the feature vector O given the pronunciation V.

5) Alignment: align the reference answer with the recognition result to obtain the detection results for the user's extra reading, omission and misreading.

In the above technical scheme, step 2) further comprises the following steps:

a) take the content the user is to read aloud as the reference answer, and segment the reference answer into words;

b) for each word, find all of its possible pronunciations according to the pronunciation dictionary and the multi-pronunciation model, build them into a word network, and add a filler pronunciation;

c) according to the pronunciation dictionary, expand each word in the word network into a phoneme sequence to form a phoneme network;

d) convert each phoneme into its corresponding hidden Markov model, each of which consists of several states;

e) merge forward and backward to form the final simplified state network.

The multi-pronunciation model in step b) describes the possible pronunciations of each word and the corresponding probabilities P(V|W). It is obtained by collecting a large corpus of read speech, having experts annotate the actual pronunciations, aligning the annotations with the reference answer by a dynamic programming algorithm, and calculating the probability of each pronunciation of each word by the following formula:

P (V | W) = \frac{N (W, V)}{N (W)}

where N(W) is the number of occurrences of the word W in the corpus, and N(W, V) is the number of those occurrences whose pronunciation is annotated as V.

In the above technical scheme, the language model in step 3) is a bigram language model. It describes the probability of jumping from the current word W1 to the word W2, where W1 and W2 are words in the reference answer or the filler. The language model specifies that the probability of jumping from the current word to the correct next word is P, and the probability of jumping to any other word in the reference answer or to the filler is (1-P)/N, where N is the total number of words in the reference answer.

In the above technical scheme, the search algorithm used in step 4) is the Viterbi search algorithm.

In the above technical scheme, the alignment in step 5) uses dynamic programming.

The beneficial effect of the method is as follows. Compared with existing methods that detect reading errors by template matching between a teacher's standard speech and the learner's read speech, it uses a hidden Markov model as the acoustic model and no longer needs template speech. The learning material can therefore be changed freely, and a standard template recording is no longer required for each piece of learning material. Moreover, the statistical hidden Markov acoustic model is trained on a large-scale speech corpus, which removes the dependence on a single template recording to the greatest extent. Its robustness is far higher than that of template matching, and it can greatly improve the objectivity and accuracy of reading-error detection.

Description of drawings

Fig. 1 is a flow chart of the method for automatically detecting reading errors of language learners according to the invention.

Fig. 2 is an example of the multi-pronunciation model used in the method.

Fig. 3 is a schematic diagram of the word network in the method.

Fig. 4 is a schematic diagram of the conversion of a word node into a sequence of state nodes in the method.

Fig. 5 is a schematic diagram of the simplified search space (state network) used in the method.

Fig. 6 is a schematic diagram of the reading language model used in the method.

Detailed description

The invention is further described below with reference to the drawings and a specific embodiment:

Embodiment

Fig. 1 is a flow chart of the reading-error detection method of the invention. As shown in Fig. 1, the method comprises the following steps:

1) Front-end processing: preprocess the input speech and extract features;

In one embodiment, the input data is digitized at a 16 kHz sampling rate (other sampling rates, such as 8 kHz or 32 kHz, can also be used), then pre-emphasized, divided into frames and windowed. For each frame, an MFCC (mel-frequency cepstral coefficient) feature vector and its second-order difference components are extracted.
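The preprocessing chain in step 1) can be illustrated with a minimal sketch in plain Python. It covers only pre-emphasis, framing, windowing and a simple second-order difference; the MFCC computation itself is omitted. The pre-emphasis coefficient 0.97, the Hamming window, the 25 ms / 10 ms frame layout and all function names are assumptions made for illustration, since the source does not state these values.

```python
import math

def pre_emphasize(samples, coeff=0.97):
    """High-frequency boost: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_and_window(samples, frame_len=400, hop=160):
    """Cut the signal into overlapping frames and apply a Hamming window.

    At 16 kHz, 400 and 160 samples are 25 ms frames every 10 ms (assumed values).
    """
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[samples[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def second_order_difference(features):
    """Second difference f[t+1] - 2*f[t] + f[t-1] of per-frame vectors, edges clamped."""
    last = len(features) - 1
    result = []
    for t, current in enumerate(features):
        prev, nxt = features[max(t - 1, 0)], features[min(t + 1, last)]
        result.append([b - 2.0 * c + a for a, c, b in zip(prev, current, nxt)])
    return result
```

In a full front end, `second_order_difference` would be applied to the per-frame MFCC vectors rather than to raw samples; production systems often use a regression-based delta instead of this plain difference.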

2) Constructing a simplified search space: take the content the user is to read aloud as the reference answer, and construct the simplified search space from the reference answer, the pronunciation dictionary, the multi-pronunciation model and the acoustic model. For example, this specifically comprises the following steps:

a) Take the content the user is to read aloud as the reference answer, and segment the reference answer into words.

b) For each word, find all of its possible pronunciations according to the pronunciation dictionary and the multi-pronunciation model, build them into the word network shown in Fig. 3, and add a filler pronunciation.

c) According to the pronunciation dictionary, expand each word in the word network of Fig. 3 into a phoneme sequence to form a phoneme network.

d) Convert each phoneme into its corresponding hidden Markov model (HMM), each of which consists of several states. The conversion of a word into states is shown in Fig. 4.

e) Merge forward and backward, as shown in Fig. 5, to form the final simplified, efficient state network.

The multi-pronunciation model mentioned in step b) describes the possible pronunciations of each word and the corresponding probabilities P(V|W). It is obtained from statistics over a read-speech corpus collected in advance, and reflects the distribution of the mispronunciations that recur in a given region or population. A simple example is shown in Fig. 2. Each word has several pronunciations, and each pronunciation has a corresponding probability. For example, the word "长" ("length") has two pronunciations, "chang2" and "zhang3", with probabilities 0.59 and 0.41 respectively. The multi-pronunciation model is trained in advance as follows: collect a large corpus of read speech, have experts annotate the actual pronunciations, align the annotations with the reference answer by a dynamic programming algorithm, and calculate the probability of each pronunciation of each word by the following formula:

P (V | W) = \frac{N (W, V)}{N (W)}
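The relative-frequency estimate above can be sketched as follows. The function name and the input format (a list of already aligned word/pronunciation pairs) are assumptions made for illustration, and the counts of 59 and 41 are hypothetical values chosen only to reproduce the 0.59 / 0.41 example; the source does not give the underlying counts.

```python
from collections import Counter, defaultdict

def train_pronunciation_model(aligned_pairs):
    """Estimate P(V|W) = N(W, V) / N(W) from (word, annotated_pronunciation) pairs."""
    pair_counts = Counter(aligned_pairs)                      # N(W, V)
    word_counts = Counter(word for word, _ in aligned_pairs)  # N(W)
    model = defaultdict(dict)
    for (word, pron), count in pair_counts.items():
        model[word][pron] = count / word_counts[word]
    return dict(model)

# Hypothetical annotated corpus: 100 occurrences of one word.
pairs = [("长", "chang2")] * 59 + [("长", "zhang3")] * 41
model = train_pronunciation_model(pairs)
```

The probabilities for each word sum to 1 by construction, because every occurrence counted in N(W) carries exactly one annotated pronunciation.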

The word network mentioned in step b) is shown in Fig. 3. Each of its edges represents one pronunciation of one word in the reference answer. The network is cyclic, meaning the user may read any sentence composed of the words in it. To recognize content the user reads that lies outside the reference answer, a special pronunciation, the filler, is also added to the search space. The filler is a special pronunciation model in the acoustic model, trained on the pooled corpus data of all pronunciations, and is commonly used to handle the out-of-vocabulary (OOV) problem in speech.
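As a rough illustration of the word network, the sketch below lists one edge per pronunciation of each distinct reference word and appends a filler edge. The edge representation, the `<filler>` label, the uniform fallback for words absent from the multi-pronunciation model, and the choice to create one edge group per distinct word are all assumptions of this sketch, not details given in the source. The cyclic connections between edges are left implicit.

```python
FILLER = "<filler>"  # hypothetical label for the filler pronunciation

def build_word_network(reference_words, lexicon, pron_model):
    """Return edges (word, pronunciation, P(V|W)) plus one filler edge.

    lexicon maps a word to its dictionary pronunciations;
    pron_model maps a word to {pronunciation: probability}.
    """
    edges = []
    for word in dict.fromkeys(reference_words):  # distinct words, original order
        if word in pron_model:
            variants = pron_model[word]
        else:
            # Assumed fallback: dictionary pronunciations with uniform probability.
            canonical = lexicon[word]
            variants = {pron: 1.0 / len(canonical) for pron in canonical}
        for pron, prob in variants.items():
            edges.append((word, pron, prob))
    edges.append((FILLER, FILLER, 1.0))
    return edges
```

Expanding each edge into phonemes and HMM states, and merging shared prefixes and suffixes, would follow steps c) to e) and is not shown here.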

3) Constructing a reading language model: construct the user's reading language model from the reference answer. This language model describes the word contexts the user may produce when reading the reference sentence, together with their probabilities;

The language model here is a bigram language model. It describes the probability of jumping from the current word W1 to the word W2, where W1 and W2 are words in the reference answer or the filler. The language model specifies that the probability of jumping from the current word to the correct next word is P, and the probability of jumping to any other word in the reference answer or to the filler is (1-P)/N, where N is the total number of words in the reference answer.

For example, the reading language model is constructed as shown in Fig. 6. Fig. 6 gives the jump probabilities from the word "vigorously" to the other words: the probability of jumping to the correct next word, "meets", is P, and the probability of jumping to each other word is (1-P)/N. P adjusts the trade-off between the miss rate and the false alarm rate of reading-error detection. It can be set empirically for the application scenario or estimated from a batch of training data. In our system, P=0.8.
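The jump probabilities described here can be written out as a small table. The sketch below assumes that the words of the reference answer are distinct, that a word may jump to itself as one of the "other" words, and that rows for the final word and for the filler are left out, since the source does not specify them. The function name and the `<filler>` label are hypothetical.

```python
FILLER = "<filler>"  # hypothetical label for the filler

def build_reading_lm(reference_words, p_next=0.8):
    """Bigram table lm[w1][w2]: probability of jumping from w1 to w2.

    The correct next word gets p_next; each of the other N targets
    (the remaining reference words and the filler) gets (1 - p_next) / N.
    """
    n = len(reference_words)
    targets = list(reference_words) + [FILLER]
    lm = {}
    for i in range(n - 1):  # the final word has no correct successor
        w1, next_word = reference_words[i], reference_words[i + 1]
        lm[w1] = {w2: (p_next if w2 == next_word else (1.0 - p_next) / n)
                  for w2 in targets}
    return lm

lm = build_reading_lm(["a", "b", "c"], p_next=0.8)
```

Under these assumptions each row sums to P + N·(1-P)/N = 1, because there are N+1 targets and exactly one of them is the correct next word.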

4) Search: in the search space, use the acoustic model, the reading language model and the multi-pronunciation model to find the path that best matches the input feature vector stream; this path is taken as the content the user actually read, that is, the recognition result sequence.

For example, the system uses the Viterbi algorithm. For each frame of speech, it calculates the acoustic likelihood p(x_t|s_j) of the feature vector x_t on each state s_j. Accumulating these likelihoods together with the state transition probabilities gives the acoustic likelihood P(O|V) of each word, and combining the acoustic likelihood with the pronunciation probability and the language model probability gives the total likelihood of a word. During the search, the likelihood of each word W is calculated by the following formula:

P(W|O)＝P(O|V)·P(V|W)·P(W)

where O is the feature vector of the current input, P(W) is the language model probability of the current word W, P(V|W) is the probability of a given pronunciation V of the current word W, and P(O|V) is the acoustic model probability of the feature vector O given the pronunciation V;

The Viterbi algorithm finds the word sequence with the highest accumulated likelihood as the final recognition result. Because the ultimate purpose is to detect reading errors, the recognition result here is expressed as a pronunciation sequence rather than as the word sequence used in general speech recognition.
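A toy version of the search can make the score accumulation concrete. The sketch below is a generic Viterbi decoder over a small set of states, working in the log domain so that products of probabilities become sums; it is not the decoder of the described system, which runs over the merged state network of Fig. 5. All function and parameter names are assumptions for illustration.

```python
import math

NEG_INF = float("-inf")

def word_log_score(log_acoustic, pron_prob, lm_prob):
    """Log of P(O|V) * P(V|W) * P(W), given log P(O|V) and the two probabilities."""
    return log_acoustic + math.log(pron_prob) + math.log(lm_prob)

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_score, best_state_path).

    log_start[s] and log_trans[r][s] are log probabilities (missing entries
    are treated as impossible); log_emit(s, x) returns log p(x | s).
    """
    scores = {s: log_start.get(s, NEG_INF) + log_emit(s, observations[0])
              for s in states}
    backpointers = []
    for x in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda r: scores[r] + log_trans.get(r, {}).get(s, NEG_INF))
            new_scores[s] = (scores[best_prev]
                             + log_trans.get(best_prev, {}).get(s, NEG_INF)
                             + log_emit(s, x))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):  # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path
```

In a word-level decoder, `word_log_score` would be added whenever a path leaves the last state of one word and enters the next, so that the pronunciation and language model probabilities are applied once per word.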

5) Alignment: align the reference answer with the recognition result to obtain the detection results for the user's extra reading, omission and misreading.

Here the recognized pronunciation sequence is aligned with the pronunciation sequence of the reference answer by dynamic programming. After alignment, the extra readings, omissions and misreadings are given by the insertions, deletions and substitutions found in the alignment, respectively.
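One common way to realize such an alignment is a minimum-edit-distance (Levenshtein) alignment with a backtrace, sketched below. The unit costs and the tie-breaking order are assumptions; the source only states that dynamic programming is used and that insertions, deletions and substitutions correspond to the three error types.

```python
def align(reference, recognized):
    """Edit-distance alignment of two pronunciation sequences.

    Returns a list of (op, ref_item, rec_item), where op is "match",
    "substitution" (misreading), "deletion" (omission) or
    "insertion" (extra reading).
    """
    n, m = len(reference), len(recognized)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != recognized[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + (reference[i - 1] != recognized[j - 1])):
            same = reference[i - 1] == recognized[j - 1]
            ops.append(("match" if same else "substitution",
                        reference[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", reference[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

For instance, aligning a reference of `["ni3", "hao3", "ma5"]` against a recognized sequence `["ni3", "hao4", "ma5", "a5"]` marks `hao3`/`hao4` as a substitution and the trailing `a5` as an insertion.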

Detection test:

The proposed reading-error detection method was tested on a data set of 1300 Hong Kong students reading Mandarin aloud. In this data set, each student reads a short passage of about 350 words. Experts annotated the actual pronunciations, which are represented in Hanyu Pinyin. The data set is divided into two parts: the data of 900 speakers is used to train the multi-pronunciation model mentioned in step 2), and the data of the other 400 speakers is used to evaluate reading-error detection performance. Performance is measured by the miss rate DER (Detection Error Rate) and the false alarm rate FAR (False Alarm Rate). DER is the ratio of undetected reading errors to all reading errors; FAR is the ratio of correctly read syllables wrongly flagged as reading errors to all correctly read syllables. The acoustic model used in the test contains 5100 states, each consisting of 48 Gaussian components.
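The two metrics can be computed directly from per-syllable labels, as in the sketch below. The input format (two parallel boolean lists, one marking true reading errors from the expert annotation and one marking syllables flagged by the detector) and the function name are assumptions for illustration; the example values are made up and are not the test data.

```python
def detection_metrics(is_error, is_flagged):
    """Return (DER, FAR) from parallel per-syllable boolean lists.

    DER = undetected reading errors / all reading errors
    FAR = correct syllables flagged as errors / all correct syllables
    """
    missed = sum(1 for e, f in zip(is_error, is_flagged) if e and not f)
    false_alarms = sum(1 for e, f in zip(is_error, is_flagged) if not e and f)
    n_errors = sum(1 for e in is_error if e)
    n_correct = len(is_error) - n_errors
    der = missed / n_errors if n_errors else 0.0
    far = false_alarms / n_correct if n_correct else 0.0
    return der, far

# Made-up example: 2 true errors (1 missed), 4 correct syllables (1 false alarm).
der, far = detection_metrics(
    [True, True, False, False, False, False],
    [True, False, True, False, False, False],
)
```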

The experimental results are shown in the table below.

Miss rate (DER)	False alarm rate (FAR)
37.04%	11.37%

On a computer with a 3.0 GHz Pentium CPU and 2 GB of memory, the measured real-time factor is about 0.4.

These test results show that the detection accuracy can largely meet the requirements of general use, and that the running speed satisfies the demands of real-time detection.

Claims

1. A method for automatically detecting reading errors of language learners, characterized in that it comprises the following steps:

1) front-end processing: preprocess the input speech and perform feature extraction, the extracted features being MFCC feature vectors;

2) constructing a simplified search space: take the content the user is to read aloud as the reference answer, and construct the simplified search space from the reference answer, the pronunciation dictionary, the multi-pronunciation model and the acoustic model;

4) search: in the search space, use the acoustic model, the reading language model and the multi-pronunciation model to find the path that best matches the input feature vector stream, and take this path as the content the user actually read, that is, the recognition result sequence;

2. The method for automatically detecting reading errors of language learners according to claim 1, characterized in that the preprocessing of the input speech in step 1) comprises digitization, pre-emphasis (high-frequency boosting), framing and windowing of the input speech, and the feature extraction comprises extracting the MFCC feature vector and the second-order difference components of each frame of speech.

3. The method for automatically detecting reading errors of language learners according to claim 1, characterized in that step 2) further comprises the following steps:

4. The method for automatically detecting reading errors of language learners according to claim 3, characterized in that the multi-pronunciation model in step b) describes the possible pronunciations of each word and the corresponding probabilities P(V|W), and is obtained by collecting a large corpus of read speech, having experts annotate the actual pronunciations, aligning the annotations with the reference answer by a dynamic programming algorithm, and calculating the probability of each pronunciation of each word by the following formula:

P (V | W) = \frac{N (W, V)}{N (W)}

5. The method for automatically detecting reading errors of language learners according to claim 1, characterized in that the language model in step 3) is a bigram language model describing the probability of jumping from the current word W1 to the word W2, wherein W1 and W2 are words in the reference answer or the filler; the language model specifies that the probability of jumping from the current word to the correct next word is P, and the probability of jumping to any other word in the reference answer or to the filler is (1-P)/N, where N is the total number of words in the reference answer.

6. The method for automatically detecting reading errors of language learners according to claim 1, characterized in that the search in step 4) uses the Viterbi search algorithm, and during execution of the algorithm the likelihood of each word W is calculated by the following formula:

P(W|O)＝P(O|V)·P(V|W)·P(W)

7. The method for automatically detecting reading errors of language learners according to claim 1, characterized in that the alignment algorithm used in step 5) is a dynamic programming matching algorithm.