CN115440193A - Pronunciation evaluation scoring method based on deep learning - Google Patents
Pronunciation evaluation scoring method based on deep learning
- Publication number
- CN115440193A (application CN202211085643.1A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- phonemes
- score
- pronunciation
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning. The method first uses a speech recognition model to recognize the true text of the audio, then obtains the posterior probabilities of the audio from an HMM-DNN model, and finally scores each phoneme with a scoring model. Because the correct text of the audio is recognized before forced alignment, the method avoids the problem that the audio cannot be aligned to the correct positions when it does not match the given text. At the same time, the scoring model is built with a deep neural network, so it can fit multiple kinds of information (posterior probability, vowel/consonant class, part of speech, tone, and pronunciation duration), making phoneme scoring more reasonable and accurate.
Description
Technical Field
The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning.
Background
Spoken language receives increasing attention in language courses. One-on-one interaction between teacher and student is the most effective way to improve spoken English, but it can hardly satisfy the demand of the large number of learners. With the rapid progress of computer technology and pronunciation evaluation technology, various spoken-language evaluation schemes based on artificial intelligence have been deployed. Such systems provide students with additional learning opportunities and rich learning material; they can assist or replace teachers in guiding students through targeted pronunciation exercises, point out pronunciation errors, give effective diagnostic feedback, and evaluate a student's overall pronunciation level, effectively improving learning efficiency and spoken-language ability.
The current mainstream approach to pronunciation evaluation obtains the posterior probabilities of the speech from a hidden Markov model-deep neural network (HMM-DNN) model, performs forced alignment against the evaluation text, and scores with the GOP (Goodness of Pronunciation) method.
Forced alignment can reach high accuracy, but only under one precondition: the given text and the audio must match. If a user reads "I am a teacher" as "I was a teacher", then when the audio segment corresponding to "was" is processed, it may be wrongly matched against the phonemes of "am", and the subsequent "a" and "teacher" are likely not aligned to the correct positions, degrading scoring accuracy.
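The GOP method referenced above is conventionally computed as the average log posterior of the expected phoneme over its aligned frames. A minimal sketch of that conventional measure follows; the posterior matrix and frame span are toy values for illustration, not output of the described system:

```python
import numpy as np

def gop_score(posteriors, phone_idx, start, end):
    """Goodness of Pronunciation: mean log posterior of the
    expected phoneme over its force-aligned frame span."""
    frames = posteriors[start:end, phone_idx]      # posterior of the target phone per frame
    return float(np.mean(np.log(frames + 1e-10)))  # epsilon guards against log(0)

# toy posterior matrix: 4 frames x 3 phonemes, rows sum to 1
post = np.array([[0.7, 0.2, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.9, 0.05, 0.05]])
score = gop_score(post, phone_idx=0, start=0, end=4)
```

A well-pronounced phoneme yields a score near zero; mispronunciations push it strongly negative.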
Disclosure of Invention
In order to solve these problems, the invention provides a speech evaluation scoring method based on deep learning. First, the text of the audio is recognized by a speech recognition model, and the recognized text is then used for forced alignment, making the alignment result more accurate. Finally, phoneme scores are predicted by a scoring model built with a deep neural network, and the scores of words and sentences are computed from the phoneme scores.
The pronunciation evaluation scoring method based on deep learning first uses a speech recognition model to recognize the true text of the audio. Next, an HMM-DNN model produces the posterior probabilities of the audio. Then the recognized text and the posterior probabilities are used for forced alignment to determine the time boundary of each phoneme. Finally, each phoneme is scored by a scoring model.
The specific technical scheme is as follows:
Step one: extract acoustic features of the speech to be evaluated, feed them into a speech recognition model, and recognize the true text of the speech.
Step two: feed the acoustic features extracted in step one into an HMM-DNN model and predict the posterior probability of each frame.
Step three: perform forced alignment using the text recognized in step one and the per-frame posterior probabilities obtained in step two, determining the time boundary of each phoneme.
Step four: compute the average posterior probability of each phoneme from the time boundaries obtained in step three and the per-frame posteriors from step two, concatenate this average with the phoneme's feature information (vowel/consonant class, part of speech, tone, pronunciation duration, and so on), and feed the result into a scoring model to obtain the phoneme score.
Step five: align the phonemes of the text recognized in step one with those of the reference text, determining which phonemes are extra (multi-read) and which are missed.
Step six: compute the final scores, calculating the word score and the whole-sentence score from the multi-read and missed phonemes found in step five.
Advantageous effects
By recognizing the correct text of the audio with a speech recognition model before forced alignment, the invention avoids the problem that the audio cannot be aligned to the correct positions when it does not match the given text. At the same time, the scoring model is built with a deep neural network, so it can fit multiple kinds of information (posterior probability, vowel/consonant class, part of speech, tone, pronunciation duration), making phoneme scoring more reasonable and accurate.
1. Recognizing the correct text of the audio with the speech recognition model avoids the problem that the audio cannot be aligned to the correct positions during forced alignment when the audio does not match the text.
2. Building the scoring model with a deep neural network allows it to fit information such as posterior probability, vowel/consonant class, part of speech, tone and pronunciation duration, making phoneme scoring more reasonable and accurate.
Drawings
FIG. 1 is a schematic flow diagram.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawing.
FIG. 1 is a flow chart of the pronunciation evaluation method based on deep learning according to the present invention. As shown in FIG. 1, the method mainly comprises the following steps:
Step one: extract acoustic features of the speech to be evaluated. The features may be Fbank features; when extracting them, the sampling rate is 16000 Hz, the window length is 25 ms, and the frame shift is 10 ms. The extracted features are fed into a speech recognition model, which may be a wenet model, to recognize the true text of the speech to be evaluated.
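The Fbank extraction described above (16 kHz sampling, 25 ms windows, 10 ms frame shift) can be sketched as a plain-NumPy log mel filterbank. The mel-bin count and FFT size below are illustrative defaults, not values fixed by the description:

```python
import numpy as np

def fbank(signal, sr=16000, win_ms=25, shift_ms=10, n_mels=40, n_fft=512):
    """Log mel filterbank (Fbank) features: 25 ms windows, 10 ms frame shift."""
    win, hop = sr * win_ms // 1000, sr * shift_ms // 1000    # 400 and 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)                        # taper each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # power spectrum per frame
    # triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return np.log(power @ fb.T + 1e-10)                      # (n_frames, n_mels)

feats = fbank(np.random.randn(16000))   # 1 s of noise -> 98 frames of 40-dim Fbank
```

In practice a toolkit routine (e.g. a Kaldi-compatible Fbank implementation) would be used; the sketch only makes the windowing arithmetic concrete.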
Step two: feed the acoustic features of the speech to be evaluated (e.g., the Fbank features) extracted in step one into an HMM-DNN model and predict the posterior probability of each frame. Assuming m frames in total and n phonemes, an m × n posterior probability matrix is finally generated.
Step three: perform forced alignment using the text recognized in step one and the per-frame posterior probabilities obtained in step two. The alignment uses a greedy or Viterbi algorithm to find the maximum-probability path, finally determining the time boundary of each phoneme.
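The Viterbi-style forced alignment of step three can be sketched as a dynamic program over the m × n log-posterior matrix and the phoneme sequence of the recognized text: at each frame the path either stays in the current phoneme or advances to the next one (a left-to-right topology). The toy posteriors below are illustrative:

```python
import numpy as np

def force_align(log_post, phone_seq):
    """Viterbi alignment of a left-to-right phoneme sequence to frames.
    log_post: (m, n) log posteriors; phone_seq: phoneme indices in reading order.
    Returns (start, end) frame boundaries for each phoneme in the sequence."""
    m, k = log_post.shape[0], len(phone_seq)
    score = np.full((m, k), -np.inf)
    back = np.zeros((m, k), dtype=int)          # 0 = stayed in phoneme, 1 = advanced
    score[0, 0] = log_post[0, phone_seq[0]]
    for t in range(1, m):
        for j in range(k):
            stay = score[t - 1, j]
            move = score[t - 1, j - 1] if j > 0 else -np.inf
            back[t, j] = int(move > stay)
            score[t, j] = max(stay, move) + log_post[t, phone_seq[j]]
    # trace the best path back and read off each phoneme's time boundary
    j, path = k - 1, []
    for t in range(m - 1, -1, -1):
        path.append(j)
        if t > 0 and back[t, j]:
            j -= 1
    path.reverse()
    return [(path.index(j), len(path) - path[::-1].index(j)) for j in range(k)]

# toy example: 6 frames, 2 phonemes; phone 0 dominates the first 3 frames
log_post = np.log(np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3))
bounds = force_align(log_post, [0, 1])
```

The returned boundaries partition the frames among the phonemes of the recognized text, which is exactly what step four consumes.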
Step four: compute the average posterior probability of each phoneme from the time boundaries obtained in step three and the per-frame posteriors obtained in step two.
Then the feature information of each phoneme is collected: the vowel/consonant class, part of speech, tone, and pronunciation duration.
1. Vowel/consonant feature: whether the current phoneme is a vowel or a consonant. Scores differ by phoneme type, with vowels carrying more weight.
2. Part-of-speech feature: the part of speech of the word containing the current phoneme. Scoring differs by part-of-speech importance; content words such as verbs and nouns weigh more than function words.
3. Tone feature: whether the current phoneme carries a tone. If tone information is marked in the reference text but no tone is read in the audio, the final score is reduced.
4. Pronunciation duration feature: the duration of the current phoneme, normalized by the duration of the standard phoneme. If the normalized duration is too low or too high, the score is reduced.
Finally, the average posterior probability of the phoneme is concatenated with its feature information (vowel/consonant class, part of speech, tone, pronunciation duration) and fed into the scoring model to obtain the phoneme score.
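Assembling the scoring-model input could look as follows. The exact numeric encoding of the vowel/consonant, part-of-speech, tone and duration features is an assumption for illustration; only the averaging of posteriors over the aligned frames is taken directly from the description:

```python
import numpy as np

def phone_features(posteriors, phone_idx, start, end, is_vowel, pos_weight,
                   has_tone, ref_duration_frames):
    """Concatenate the mean posterior of a phoneme over its aligned frames
    with its categorical and duration features, forming the scorer input."""
    avg_post = posteriors[start:end, phone_idx].mean()
    norm_dur = (end - start) / ref_duration_frames   # duration normalized by the reference length
    return np.array([avg_post, float(is_vowel), pos_weight,
                     float(has_tone), norm_dur])

# toy posterior matrix: 3 frames x 2 phonemes; phoneme 0 spans frames [0, 3)
post = np.array([[0.7, 0.3], [0.8, 0.2], [0.6, 0.4]])
x = phone_features(post, 0, 0, 3, is_vowel=True, pos_weight=1.0,
                   has_tone=False, ref_duration_frames=3)
```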
The scoring model can be a three-layer deep neural network (DNN) with hidden dimension 128 and a final layer of dimension 1 that outputs the scoring result.
Modeling the various phoneme features with a deep neural network fits a complex scoring function, making the phoneme scores more reasonable and accurate.
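The three-layer DNN described above (hidden dimension 128, final dimension 1) can be sketched as a forward pass. The weights here are random placeholders standing in for a trained model, and the sigmoid output squashing is an assumption:

```python
import numpy as np

class ScoringDNN:
    """Three-layer DNN scorer: two 128-dim hidden layers and a 1-dim output,
    matching the dimensions in the description (untrained placeholder weights)."""
    def __init__(self, in_dim=5, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, hidden))
        self.w3 = rng.normal(0, 0.1, (hidden, 1))

    def __call__(self, x):
        h = np.maximum(x @ self.w1, 0)            # ReLU hidden layer 1
        h = np.maximum(h @ self.w2, 0)            # ReLU hidden layer 2
        return 1 / (1 + np.exp(-(h @ self.w3)))   # sigmoid -> score in (0, 1)

model = ScoringDNN()
# input: [avg posterior, vowel flag, POS weight, tone flag, normalized duration]
score = model(np.array([0.7, 1.0, 1.0, 0.0, 1.0]))[0]
```

In deployment the network would be trained on phoneme-score labels; the sketch only fixes the shapes.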
Step five: align the phonemes of the text recognized in step one with those of the reference text, determining which phonemes are extra (multi-read), which are missed, and which are normal.
The alignment can be implemented by computing an edit distance. First convert the recognized text into the corresponding phoneme list X[1, …, n] and the reference text into the phoneme list Y[1, …, m]. Define D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m). It is solved by dynamic programming with the state transition equations:
D(i, 0) = i
D(0, j) = j
D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost(i, j)), with cost(i, j) = 0 if X[i] = Y[j] and 1 otherwise,
wherein the D(i-1, j) + 1 case (insert) represents a multi-read phoneme, the D(i, j-1) + 1 case (delete) represents a missed phoneme, and the zero-cost diagonal case (norm) represents a normal phoneme.
By recording which sub-problem each cell was solved from, the optimal path can be traced back, finally yielding the alignment of the two phoneme strings.
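The edit-distance alignment with backtracking can be sketched as follows. The description names only the insert, delete and norm cases, so the handling of substituted phonemes (labelled `sub` here) is an assumption; the example phoneme strings for "I was" versus "I am" are illustrative:

```python
def align_phones(hyp, ref):
    """Edit-distance alignment of recognized (hyp) vs. reference (ref) phoneme
    lists; labels each position insert (extra), delete (missed), norm, or sub."""
    n, m = len(hyp), len(ref)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i          # base case D(i, 0) = i
    for j in range(m + 1): D[0][j] = j          # base case D(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # backtrack: recover which sub-problem produced each cell
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] and hyp[i - 1] == ref[j - 1]:
            ops.append(('norm', ref[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(('insert', hyp[i - 1])); i -= 1    # extra (multi-read) phoneme
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append(('delete', ref[j - 1])); j -= 1    # missed phoneme
        else:  # mismatched pair: a reference phoneme read as something else
            ops.append(('sub', ref[j - 1])); i, j = i - 1, j - 1
    return ops[::-1]

# "I was" read against reference "I am" (illustrative ARPAbet-style phonemes)
ops = align_phones(['ay', 'w', 'aa', 'z'], ['ay', 'ae', 'm'])
```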
Step six: compute the final scores, calculating the word score and the whole-sentence score from the multi-read and missed phonemes found in step five.
The score of a word depends on the scores of all normal phonemes in the word and on the proportion of multi-read and missed phonemes. The score of a sentence depends not only on the normal-phoneme scores and the multi-read/missed proportion but also on the number of frames occupied by words and the number of silent frames between words; only silences longer than a normal pause are counted.
The score of the word is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read (extra) phonemes, and del_{cnt} denotes the number of missed phonemes.
The score of the sentence is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, and del_{cnt} denotes the number of missed phonemes; frame_{word} denotes the number of frames occupied by all words, and frame_{sil} denotes the number of silent frames between words.
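The text above does not reproduce the word-score formula itself, only its inputs. The sketch below merely illustrates the stated dependence (normal-phoneme scores scaled down by the share of multi-read and missed phonemes); the specific weighting is an assumption, not the patented formula:

```python
def word_score(norm_phone_scores, n_ref, ins_cnt, del_cnt):
    """Illustrative word score: average of the normal-phoneme scores,
    penalized by the share of multi-read (ins) and missed (del) phonemes.
    The exact penalty form is assumed, not taken from the patent."""
    if not norm_phone_scores:
        return 0.0
    avg = sum(norm_phone_scores) / len(norm_phone_scores)
    penalty = (ins_cnt + del_cnt) / n_ref       # proportion of error phonemes
    return avg * max(0.0, 1.0 - penalty)

# 3 normal phonemes scored, 1 missed, out of 4 reference phonemes
s = word_score([0.9, 0.8, 0.85], n_ref=4, ins_cnt=0, del_cnt=1)
```

A sentence-level analogue would additionally fold in the word-frame and inter-word silence counts described above.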
Claims (8)
1. A pronunciation evaluation scoring method based on deep learning is characterized by comprising the following steps:
firstly, identifying a real text result of an audio frequency through a speech recognition model;
secondly, acquiring the posterior probability of the audio through an HMM-DNN model;
then, forcibly aligning the recognition text result of the audio and the posterior probability of the audio to determine the time boundary of each phoneme;
finally, scoring the phoneme through a scoring model;
the method comprises the following specific steps:
extracting acoustic features of a voice to be evaluated, sending the acoustic features into a voice recognition model, and recognizing a real text result of the voice to be evaluated;
step two, the acoustic features of the speech to be evaluated extracted in the step one are sent into an HMM-DNN model, and the posterior probability of each frame is predicted;
step three, performing forced alignment according to the text result identified in the step one and the posterior probability of each frame obtained in the step two, and determining the time boundary of each phoneme;
step four, calculating the average value of the posterior probability of each phoneme according to the time boundary of each phoneme obtained in the step three and the posterior probability of each frame obtained in the step two, splicing the average value of the posterior probability of each phoneme with the characteristic information of the vowel consonant, the part of speech, the tone and the pronunciation duration of the phoneme, and sending the spliced value into a scoring model to obtain the score of the phoneme;
step five, aligning the phonemes of the text recognized in step one with those of the reference text, and determining which phonemes are multi-read and which are missed;
and step six, calculating the final score, and calculating the score of the word and the score of the whole sentence according to the multi-reading and missing-reading conditions in the step five.
2. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the acoustic features extracted in step one are Fbank features.
3. The pronunciation assessment scoring method based on deep learning of claim 1, wherein the speech recognition model in step one uses a wenet model.
4. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the forced alignment in step three uses the text result recognized in step one.
5. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein step three is performed by using greedy or Viterbi algorithm to find a path with the largest probability and finally determine the time boundary of each phoneme.
6. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the characteristics used by the step four scoring model are the average of the posterior probabilities of the phoneme and the vowel, part of speech, tone, pronunciation duration characteristics of the phoneme.
7. The pronunciation assessment scoring method based on deep learning according to claim 1, wherein the phoneme alignment in step five is implemented by calculating an edit distance: firstly converting the recognized text result into the corresponding phoneme list X[1, …, n] and the reference text into the corresponding phoneme list Y[1, …, m], and defining the distance D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m);
and solving by a dynamic programming method, wherein the state transition equations are:
D(i, 0) = i
D(0, j) = j
D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost(i, j)), with cost(i, j) = 0 if X[i] = Y[j] and 1 otherwise;
wherein the D(i-1, j) + 1 case (insert) represents a multi-read phoneme, the D(i, j-1) + 1 case (delete) represents a missed phoneme, and the zero-cost diagonal case (norm) represents a normal phoneme;
and tracing the result back by recording which sub-problem each cell was solved from, finally obtaining the alignment result of the two phoneme strings.
8. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the score of the word is calculated by using the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, and del_{cnt} denotes the number of missed phonemes;
the score of the sentence is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, del_{cnt} denotes the number of missed phonemes, frame_{word} denotes the number of frames occupied by all words, and frame_{sil} denotes the number of silent frames between words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211085643.1A CN115440193A (en) | 2022-09-06 | 2022-09-06 | Pronunciation evaluation scoring method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115440193A true CN115440193A (en) | 2022-12-06 |
Family
ID=84247794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211085643.1A Pending CN115440193A (en) | 2022-09-06 | 2022-09-06 | Pronunciation evaluation scoring method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115440193A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116403604A (en) * | 2023-06-07 | 2023-07-07 | 北京奇趣万物科技有限公司 | Child reading ability evaluation method and system |
CN116403604B (en) * | 2023-06-07 | 2023-11-03 | 北京奇趣万物科技有限公司 | Child reading ability evaluation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |